pandas.Series.interpolate() along "index" shows unexpected results

【Question title】: pandas.Series.interpolate() along "index" shows unexpected results
【Posted】: 2020-07-10 23:27:54
【Question description】:

In my example, a pandas.Series() named "bla" holds wind speeds in m/s as values, indexed by pressure in Pa:

bla
100200.0    2.0
97600.0     NaN
91100.0     NaN
85000.0     3.0
82600.0     NaN
           ... 
6670.0      NaN
5000.0      2.0
4490.0      NaN
3880.0      NaN
3000.0      9.0
Length: 29498, dtype: float64

bla.index
Float64Index([100200.0,  97600.0,  91100.0,  85000.0,  82600.0,  81400.0,
               79200.0,  73200.0,  70000.0,  68600.0,
              ...
               11300.0,  10000.0,   9970.0,   9100.0,   7000.0,   6670.0,
                5000.0,   4490.0,   3880.0,   3000.0],
             dtype='float64', length=29498)

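Since the wind speed values are often NaN, I want to interpolate across the pressure levels so that more wind speed values are available to work with.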

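The docs of interpolate() say there is a method called "index" that takes the index values into account when interpolating, but compared with the initial values the results make no sense: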

bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0     **2.00**
97600.0     10.40
91100.0      8.00
85000.0      **3.00**
82600.0      9.75
            ...  
6670.0       3.00
5000.0       **2.00**
4490.0       9.00
3880.0       5.00
3000.0       **9.00**
Length: 29498, dtype: float64

I marked the original values in bold. I would rather have expected something similar to what "linear" gives:

bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0    **2.000000**
97600.0     2.333333
91100.0     2.666667
85000.0     **3.000000**
82600.0     4.600000
              ...   
6670.0      4.500000
5000.0      **2.000000**
4490.0      4.333333
3880.0      6.666667
3000.0      **9.000000**

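Nevertheless, I would like to use "index" correctly as the interpolation method, since it should be the most accurate one: it uses the pressure levels to measure the "distance" between the wind speed values.

Overall, I would like to understand how interpolating with "index" over the pressure levels can give such counterintuitive results, and how to make them more sensible.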

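【Question comments】:

Length: 29498. That is your problem. You have shown only a small part of those rows; somewhere in those 30,000 rows there are probably non-null values whose index lies between 102K and 85K. Those values are taken into account during the interpolation. (Look at bla.sort_index())
Thanks, your comment made me realize that I need to look at each subset of my multi-index dataframe separately (see my answer).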
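
The effect described in the first comment can be reproduced on a small scale. The following is a minimal sketch, not the asker's data: the two profiles and their values are made up for illustration. It assumes a recent pandas version, where method="index" interpolates against all non-NaN points of the Series ordered by index value, regardless of where they sit in the Series.

```python
import numpy as np
import pandas as pd

# Hypothetical profile A: pressure (Pa) -> wind speed (m/s), one missing value
profile_a = pd.Series([2.0, np.nan, 4.0], index=[1000.0, 900.0, 800.0])

# Hypothetical profile B from another measurement, appended to the same Series
profile_b = pd.Series([10.0, 20.0, 30.0, 40.0],
                      index=[990.0, 910.0, 890.0, 790.0])

# Interpolating profile A on its own uses only its own neighbours
alone = profile_a.interpolate(method="index")

# Interpolating the concatenated Series also uses points from profile B,
# because the closest index values to 900.0 are now 890.0 and 910.0
mixed = pd.concat([profile_a, profile_b]).interpolate(method="index")

print(alone.iloc[1])  # value at 900.0 from profile A only
print(mixed.iloc[1])  # value at 900.0 influenced by profile B
```

Within profile A alone the gap at 900.0 is filled from 1000.0 and 800.0; in the concatenated Series it is filled from the closer index values 890.0 and 910.0, which belong to the other profile.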

Tags: python pandas interpolation series

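【Solution 1】:

Thanks to @ALollz's first comment below my question, I figured out where the problem lies:

My dataframe simply has 2 index levels: the outer one is the unique timestamp of a measurement, and the inner one is a standard range index. I should have looked at each subset associated with a unique timestamp separately. Within those subsets the interpolation makes sense and yields exactly the right results.

Example: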

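# df is assumed to be a DataFrame with a two-level MultiIndex
# (outer level: measurement timestamp, inner level: range index)

# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset; .copy() makes it an independent frame,
    # so the assignment below does not raise SettingWithCopyWarning
    # (df itself is left unchanged)
    df_subset = df.loc[timestamp, :].copy()

    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")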
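
The loop above uses method="linear". To interpolate by pressure distance inside each timestamp, as the question originally intended, one possible approach is sketched below. It is not part of the original answer and rests on assumptions: the column names "pressure" and "wind", the level name "time", and the helper interpolate_by_pressure are hypothetical, and pressure is assumed to be stored as a column. Because pandas only supports method="linear" on a MultiIndex, each group is temporarily re-indexed by pressure.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: outer level "time", inner level a range index
index = pd.MultiIndex.from_product([["t0", "t1"], range(3)],
                                   names=["time", "level"])
df = pd.DataFrame(
    {
        "pressure": [1000.0, 900.0, 700.0, 1000.0, 950.0, 900.0],
        "wind": [2.0, np.nan, 8.0, 10.0, np.nan, 20.0],
    },
    index=index,
)


def interpolate_by_pressure(group):
    """Fill NaN wind values of one timestamp, weighted by pressure distance."""
    # Temporarily use pressure as a flat index so method="index" is allowed
    by_pressure = pd.Series(group["wind"].to_numpy(),
                            index=group["pressure"].to_numpy())
    filled = by_pressure.interpolate(method="index",
                                     limit=1,
                                     limit_direction="both")
    # Restore the original (time, level) index for alignment with df
    return pd.Series(filled.to_numpy(), index=group.index)


# Process every timestamp separately and write the result into a new column
df["wind_filled"] = pd.concat(
    [interpolate_by_pressure(group) for _, group in df.groupby(level="time")]
)
print(df)
```

In the first group the gap at 900 Pa lies closer to 1000 Pa than to 700 Pa, so the filled value is pulled toward the wind speed at 1000 Pa instead of landing at the midpoint that method="linear" would give.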
