Pandas interpolation replacing NaNs after the last data point, but not before the first data point

【Question title】: Pandas interpolation replacing NaNs after the last data point, but not before the first data point
【Posted】: 2015-09-28 17:50:11
【Question】:

When using pandas interpolate() to fill NaN values like this:

In [1]: s = pandas.Series([np.nan, np.nan, 1, np.nan, 3, np.nan, np.nan])

In [2]: s.interpolate()
Out[2]: 
0   NaN
1   NaN
2     1
3     2
4     3
5     3
6     3
dtype: float64

In [3]: pandas.version.version
Out[3]: '0.16.2'

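why does pandas replace the values at indices 5 and 6 with 3s, but leave the values at indices 0 and 1 as they are?

Can I change this behavior? I'd like to leave NaN at indices 5 and 6.

(Actually, I'd like it to linearly extrapolate to fill all of 0, 1, 5, and 6, but that's a different question. Bonus points if you answer it too!)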

【Comments】:

Tags: python pandas interpolation

【Answer 1】:

Internally, the interpolate method uses a 'limit' parameter that stops the fill from propagating past a given number of consecutive NaNs.

>>> df = pd.DataFrame([0, np.nan, np.nan, np.nan, np.nan, np.nan, 2])
>>> df
    0
0   0
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6   2
>>> df.interpolate(limit=2)
          0
0  0.000000
1  0.333333
2  0.666667
3       NaN
4       NaN
5       NaN
6  2.000000

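By default, the limit is applied in the forward direction. In the backward direction there is a default limit of zero, which is why the leading NaNs in your series are not filled. The direction can be changed with the 'limit_direction' parameter.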

>>> df.interpolate(limit=2, limit_direction='backward')
          0
0  0.000000
1       NaN
2       NaN
3       NaN
4  1.333333
5  1.666667
6  2.000000
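If the goal is the one stated in the question (fill the gap between 1 and 3 but leave the NaNs outside the data untouched), a minimal sketch follows. It assumes a pandas version that supports the 'limit_area' parameter of interpolate (added after the 0.16.2 release used in the question), so it is not something the original answer could have relied on.

```python
import numpy as np
import pandas as pd

# The series from the question.
s = pd.Series([np.nan, np.nan, 1, np.nan, 3, np.nan, np.nan])

# limit_area='inside' restricts filling to NaNs that are surrounded by
# valid values, so leading and trailing NaNs are left alone.
# Assumption: the installed pandas is recent enough to accept limit_area.
inside_only = s.interpolate(limit_area='inside')
print(inside_only)
```

Only index 3 should be filled here; indices 0, 1, 5 and 6 stay NaN.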

To fill both the leading and the trailing NaNs of your dataframe, you can set 'limit' to a non-zero value and 'limit_direction' to 'both':

>>> df = pd.DataFrame([np.nan, np.nan, 0, np.nan, 2, np.nan, 8, 5, np.nan, np.nan])
>>> df
    0
0 NaN
1 NaN
2   0
3 NaN
4   2
5 NaN
6   8
7   5
8 NaN
9 NaN
>>> df.interpolate(method='spline', order=1, limit=10, limit_direction='both')
          0
0 -3.807382
1 -2.083581
2  0.000000
3  1.364022
4  2.000000
5  4.811625
6  8.000000
7  5.000000
8  4.937632
9  4.138735

This topic has been discussed here
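Note that the extrapolated values in the example above come from the spline method. As a rough sketch of what to expect with the default linear method instead, assuming a reasonably recent pandas, 'limit_direction' set to 'both' fills the outer NaNs by repeating the nearest valid value rather than continuing the slope:

```python
import numpy as np
import pandas as pd

# The series from the question.
s = pd.Series([np.nan, np.nan, 1, np.nan, 3, np.nan, np.nan])

# Default method is 'linear'. With limit_direction='both' the NaNs on
# either side of the data are filled too, but with the edge values
# (1 on the left, 3 on the right), not with extrapolated values.
both = s.interpolate(limit_direction='both')
print(both)
```

So this fills every NaN, but it is edge padding and not linear extrapolation.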

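【Comments】:

Do you think there is a difference between using limit_direction='both' (with limit=None) and using extrapolation, as is done here for instance (stackoverflow.com/questions/22491628/…)?

【Answer 2】:

The interpolate behavior in pandas looks strange. You can use scipy.interpolate.interp1d instead to produce the expected result. For linear extrapolation, a simple function can be written to do the job.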

import pandas as pd
import numpy as np
import scipy as sp
import scipy.interpolate  # make sure the submodule is loaded so sp.interpolate resolves

s = pd.Series([np.nan, np.nan, 1, np.nan, 3, np.nan, np.nan])

# interpolate using scipy
# ===========================================
s_no_nan = s.dropna()
func = sp.interpolate.interp1d(s_no_nan.index.values, s_no_nan.values, kind='linear', bounds_error=False)
s_interpolated = pd.Series(func(s.index), index=s.index)

Out[107]: 
0   NaN
1   NaN
2     1
3     2
4     3
5   NaN
6   NaN
dtype: float64

# extrapolate using user-defined func
# ===========================================
def my_extrapolate_func(scipy_interpolate_func, new_x):
    x1, x2 = scipy_interpolate_func.x[0], scipy_interpolate_func.x[-1]
    y1, y2 = scipy_interpolate_func.y[0], scipy_interpolate_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)

s_extrapolated = pd.Series(my_extrapolate_func(func, s.index.values), index=s.index)

Out[108]: 
0   -1
1    0
2    1
3    2
4    3
5    4
6    5
dtype: float64
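As a possible shortcut to the hand-written function above, here is a sketch that assumes a SciPy version in which interp1d accepts fill_value='extrapolate' (not available in the SciPy releases current when this answer was written). The names known, f and s_filled are illustrative only. Unlike my_extrapolate_func, which draws one line through the first and last known points, this continues the slope of the outermost segment on each side; for this series both give the same result.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# The series from the question.
s = pd.Series([np.nan, np.nan, 1, np.nan, 3, np.nan, np.nan])

# Build the interpolant from the known points only.
known = s.dropna()

# Assumption: this SciPy supports fill_value='extrapolate', which lets
# the linear interpolant be evaluated outside the range of known x values.
f = interp1d(known.index.values, known.values,
             kind='linear', fill_value='extrapolate')

# Evaluate at every index position, including those outside the data.
s_filled = pd.Series(f(s.index.values), index=s.index)
print(s_filled)
```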

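【Comments】:

Thanks. I'm still hoping someone can explain what is going on in pandas. It should just be wrapping scipy...
Wrapping scipy would mean that pandas depends on scipy, which I guess they want to avoid.
Thanks. I recycled your answer here to solve a slightly different problem: stackoverflow.com/a/68917511/6366770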