Linear extrapolation in dataframes: answers

【Question title】: Linear extrapolation in dataframes
【Posted】: 2021-03-17 05:17:44
【Question】:

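I have a dataset containing the number of households at regional level from 2009 to 2019. The dataset is fairly complete, but some data is missing. For example, I have these two regions, IE01 and IE04: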

n2hn_df.loc['IE01']

    Out[2]: 
    2009    455300.0
    2010    460600.0
    2011    465500.0
    2012         NaN
    2013         NaN
    2014         NaN
    2015         NaN
    2016         NaN
    2017         NaN
    2018         NaN
    2019         NaN
    Name: IE01, dtype: float64



n2hn_df.loc['IE04']
Out[3]: 
2009         NaN
2010         NaN
2011         NaN
2012    320700.0
2013    315300.0
2014    310500.0
2015    307500.0
2016    315400.0
2017    323300.0
2018    329300.0
2019    339700.0
Name: IE04, dtype: float64

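I would like to complete the dataset by linear extrapolation (since the number of households does not change dramatically over the years). I know interpolation is easy, but something like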

n2hn_df.interpolate(method='linear',axis=1,limit_direction='both',inplace=True)

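only fills the dataset with the nearest value found in either direction. I have not found a simple way to extrapolate data in a dataframe, so I would like to ask for your advice on the best approach. I appreciate any help you can offer. Thanks in advance!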
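The behaviour described above can be reproduced on a small example. This is a minimal sketch on a made-up Series (the values and the name `XX01` are hypothetical, standing in for one row such as `n2hn_df.loc['IE01']`):

```python
import numpy as np
import pandas as pd

# Hypothetical row: a clear +10 trend with NaNs at both ends.
row = pd.Series(
    [np.nan, 100.0, 110.0, 120.0, np.nan, np.nan],
    index=[2009, 2010, 2011, 2012, 2013, 2014],
    name="XX01",
)

filled = row.interpolate(method="linear", limit_direction="both")
print(filled)
# The edge NaNs are filled with the nearest valid value (100 and 120),
# not with a continuation of the +10 trend.
```

So `limit_direction='both'` pads the edges rather than extrapolating, which is why a separate step is needed for the leading and trailing NaNs.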

Edit: an example of the dataframe I want to extrapolate (originally posted as a screenshot):

【Comments】:

Tags: python pandas dataframe scipy dataset

【Solution 1】:

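I did something similar a while ago. It isn't super pretty, but maybe you can use it. As an example, I use the following DataFrame (a modified version of your second example):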

         value
year          
2009       NaN
2010       NaN
2011       NaN
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017       NaN
2018       NaN
2019       NaN

Note that year is the index!

Step 1: fill the trailing NaNs:

increment = df.value.diff(1).mean()
idx_max_notna = df.value[df.value.notna()].index.array[-1]
idx = df.index[df.index >= idx_max_notna]
df.loc[idx, 'value'] = df.loc[idx, 'value'].fillna(increment).cumsum()

Result:

         value
year          
2009       NaN
2010       NaN
2011       NaN
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017  314075.0
2018  312750.0
2019  311425.0

As the increment, I used the mean of the existing diffs. If you would rather use the last diff, replace it with:

increment = df.value.diff(1)[df.value.notna()].array[-1]
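It may help to see what the two choices of increment amount to. When the known values have no NaNs between them, the consecutive differences telescope, so their mean depends only on the first and last known value: (last - first) / (n - 1). The sketch below checks this on the known values of the example frame (the variable names are only for illustration):

```python
import pandas as pd

# Known values from the example frame above (2012 to 2016).
values = pd.Series(
    [320700.0, 315300.0, 310500.0, 307500.0, 315400.0],
    index=[2012, 2013, 2014, 2015, 2016],
)

mean_increment = values.diff().mean()          # mean of the diffs
endpoint_slope = (values.iloc[-1] - values.iloc[0]) / (len(values) - 1)
last_increment = values.diff().iloc[-1]        # last diff only

print(mean_increment, endpoint_slope, last_increment)
```

Here the mean of the diffs and the endpoint slope coincide (a decrease of 1325 per year), while the last diff alone is an increase of 7900, so the two options extrapolate in opposite directions for this data.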

Step 2: filling the leading NaNs works more or less the same way, except that the value column is reversed first and reversed back at the end:

df.value = df.value.array[::-1]
increment = df.value.diff(1).mean()
idx_max_notna = df.value[df.value.notna()].index.array[-1]
idx = df.index[df.index >= idx_max_notna]
df.loc[idx, 'value'] = df.loc[idx, 'value'].fillna(increment).cumsum()
df.value = df.value.array[::-1]

Result:

         value
year          
2009  324675.0
2010  323350.0
2011  322025.0
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017  314075.0
2018  312750.0
2019  311425.0

Important: this approach assumes that there are no gaps (missing years) in the index.
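If the data does have gaps, one possible way to prepare it is to make the index contiguous and fill only the interior NaNs before extrapolating the edges. This is a sketch on made-up data, not part of the original answer; it relies on `reindex` and on the `limit_area='inside'` option of `interpolate`:

```python
import numpy as np
import pandas as pd

# Hypothetical series: the year 2014 is absent from the index
# and 2015 is present but NaN.
ser = pd.Series(
    [320700.0, 315300.0, np.nan, 315400.0],
    index=pd.Index([2012, 2013, 2015, 2016], name="year"),
)

# Make the index contiguous, then fill only the gaps that lie
# between valid values; leading and trailing NaNs stay untouched.
full_index = pd.RangeIndex(2009, 2020, name="year")
ser = ser.reindex(full_index).interpolate(method="linear", limit_area="inside")
print(ser)
```

After this step the series has one row per year and NaNs only at the two ends, which is the shape the extrapolation steps expect.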

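As I said, it's not very pretty, but it works for me.

(PS: just to clarify the use of "similar" above: this really is linear extrapolation.)

Edit

Example frame (the first 3 rows of the frame in the screenshot):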

import numpy as np
import pandas as pd

# Years are the columns, regions (NUTS_ID) are the rows
n2hn_df = pd.DataFrame(
    {'2010': [134.024, np.nan, 36.711], '2011': [134.949, np.nan, 41.6533],
     '2012': [128.193, np.nan, 33.4578], '2013': [125.131, np.nan, 33.4578],
     '2014': [122.241, np.nan, 33.6356], '2015': [115.301, np.nan, 35.5919],
     '2016': [108.927, 520.38, 40.1008], '2017': [106.101, 523.389, 41.38],
     '2018': [96.1861, 526.139, 49.0906], '2019': [np.nan, np.nan, np.nan]},
    index=pd.Index(data=['AT', 'BE', 'BG'], name='NUTS_ID')
)

            2010      2011      2012  ...     2017      2018  2019
NUTS_ID                               ...                         
AT       134.024  134.9490  128.1930  ...  106.101   96.1861   NaN
BE           NaN       NaN       NaN  ...  523.389  526.1390   NaN
BG        36.711   41.6533   33.4578  ...   41.380   49.0906   NaN

Extrapolation:

# Transposing frame
n2hn_df = n2hn_df.T
for col in n2hn_df.columns:
    # Extract column
    ser = n2hn_df[col].copy()

    # End piece
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()

    # Start piece
    ser = pd.Series(ser.array[::-1])
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()
    n2hn_df[col] = ser.array[::-1]

# Re-transposing frame
n2hn_df = n2hn_df.T

Result:

            2010      2011      2012  ...     2017      2018        2019
NUTS_ID                               ...                               
AT       134.024  134.9490  128.1930  ...  106.101   96.1861   91.456362
BE       503.103  505.9825  508.8620  ...  523.389  526.1390  529.018500
BG        36.711   41.6533   33.4578  ...   41.380   49.0906   50.638050
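The same idea can also be written as a row-wise function, which avoids transposing and reversing. This is a sketch rather than the answer's code: `extrapolate_edges` is a hypothetical helper, and it assumes there are no NaNs between the first and last valid value of a row (otherwise the increment would be computed from fewer diffs).

```python
import numpy as np
import pandas as pd


def extrapolate_edges(row: pd.Series) -> pd.Series:
    """Fill leading/trailing NaNs using a constant step (mean of the diffs)."""
    out = row.astype(float)
    valid_pos = np.flatnonzero(out.notna().to_numpy())
    if len(valid_pos) < 2:
        return out  # not enough points to estimate an increment
    increment = out.diff().mean()
    first, last = valid_pos[0], valid_pos[-1]
    pos = np.arange(len(out))
    values = out.to_numpy(copy=True)
    before, after = pos < first, pos > last
    # Step backwards from the first valid value, forwards from the last.
    values[before] = values[first] + (pos[before] - first) * increment
    values[after] = values[last] + (pos[after] - last) * increment
    return pd.Series(values, index=row.index, name=row.name)


# Same sample frame as above (years as columns).
n2hn_df = pd.DataFrame(
    {'2010': [134.024, np.nan, 36.711], '2011': [134.949, np.nan, 41.6533],
     '2012': [128.193, np.nan, 33.4578], '2013': [125.131, np.nan, 33.4578],
     '2014': [122.241, np.nan, 33.6356], '2015': [115.301, np.nan, 35.5919],
     '2016': [108.927, 520.38, 40.1008], '2017': [106.101, 523.389, 41.38],
     '2018': [96.1861, 526.139, 49.0906], '2019': [np.nan, np.nan, np.nan]},
    index=pd.Index(data=['AT', 'BE', 'BG'], name='NUTS_ID'),
)

result = n2hn_df.apply(extrapolate_edges, axis=1)
print(result)
```

On this frame the output should agree with the result shown above, since both versions step away from the first and last known value by the mean of the diffs.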

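【Discussion】:

Thanks for your input, and sorry for the late reply. I was wondering about the df.value you use: I get a 'DataFrame' object has no attribute 'value' error. Also, I will edit my post and add the actual dataframe that I want to fill using linear extrapolation. I would really appreciate it if you could help me adapt your code. Thanks!
Thanks for the feedback! I'll do my best and will take a look tonight. Just a few points/observations: (1) It looks like your frame is organized with years as columns? If so, I'd suggest transposing it (n2hn_df = n2hn_df.T). (2) value is the column name in my sample frame, i.e. df.value.diff(1).mean() is df['value'].diff(1).mean(), etc. After transposing the frame you would instead use something like n2hn_df['AT'].diff(1).mean() for the first column... The best approach is probably to loop over the columns and extrapolate each one separately. Will come up with some code tonight.
@JavierSando I've tried to make it work for the frame in the screenshot you posted. I hope it works. (I haven't written any code for over a month, so I'm a bit rusty.) Let me know if you run into any problems.
Thanks! It works great. For a very small number of datasets I first had to fill the gaps between values using df.interpolate. Then I used your code and it did what I expected. I have one question: in your first post you wrote (PS: just to clarify the use of "similar" above: this really is linear extrapolation.). Where did you find the formula you used? It is not the typical linear extrapolation formula, so I was wondering why you did it your way.
@JavierSando Good question. I didn't find the formula anywhere, I just thought about it: for a discrete, equidistant time series, "linear" simply means a constant increment (x_n = x_{n-1} + c for all n). So the only question is how to choose the increment. The mean of the differences (x_n - x_{n-1}) is an obvious choice, but not necessarily the best one. Its suitability depends on what the data actually looks like.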
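To illustrate the point about choosing the increment: one alternative, not proposed anywhere in this thread, is the slope of a least-squares line through the known values, which uses all points rather than only the first and last. The sketch below compares it with the mean of the diffs on the known values of the answer's first example, using `np.polyfit`:

```python
import numpy as np
import pandas as pd

# Known values from the answer's first example (2012 to 2016).
values = pd.Series(
    [320700.0, 315300.0, 310500.0, 307500.0, 315400.0],
    index=[2012, 2013, 2014, 2015, 2016],
)

x = values.index.to_numpy(dtype=float)
x = x - x[0]  # shift years to start at 0 for better numerical conditioning
slope, intercept = np.polyfit(x, values.to_numpy(), deg=1)

mean_increment = values.diff().mean()
print(f"least-squares slope: {slope:.1f}, mean of diffs: {mean_increment:.1f}")
```

The two increments differ for this data, which is consistent with the remark that the best choice depends on what the series actually looks like.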