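I did something similar a while ago. It isn't especially pretty, but you may be able to use it. As an example, I'll use the following DataFrame (a modified version of your second example):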
value
year
2009 NaN
2010 NaN
2011 NaN
2012 320700.0
2013 315300.0
2014 310500.0
2015 307500.0
2016 315400.0
2017 NaN
2018 NaN
2019 NaN
Note that year is the index!
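In case it helps to reproduce this, here is one way the frame above could be built. The name df matches the snippets that follow; the construction itself is just my sketch and not necessarily how the original data was loaded.

```python
import numpy as np
import pandas as pd

# Known values for 2012-2016, padded with three NaNs on each side
df = pd.DataFrame(
    {
        "value": [np.nan] * 3
        + [320700.0, 315300.0, 310500.0, 307500.0, 315400.0]
        + [np.nan] * 3
    },
    # year is the index, not a regular column
    index=pd.Index(range(2009, 2020), name="year"),
)
print(df)
```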
Step 1: fill the trailing NaNs:
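# Average change between consecutive known values
increment = df['value'].diff(1).mean()
# Index label of the last non-NaN value
idx_max_notna = df['value'][df['value'].notna()].index.array[-1]
# That label plus all the trailing NaN labels
idx = df.index[df.index >= idx_max_notna]
# Start from the last known value and add the increment cumulatively
df.loc[idx, 'value'] = df.loc[idx, 'value'].fillna(increment).cumsum()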
Result:
value
year
2009 NaN
2010 NaN
2011 NaN
2012 320700.0
2013 315300.0
2014 310500.0
2015 307500.0
2016 315400.0
2017 314075.0
2018 312750.0
2019 311425.0
As the increment I used the mean of the existing diffs. If you would rather use the last diff, replace that line with:
increment = df.value.diff(1)[df.value.notna()].array[-1]
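The two choices can differ a lot. On the sample data the mean diff is negative while the last diff (2015 to 2016) is positive, so the extrapolated tail would go in opposite directions. A small sketch to compare them; the names s, known, next_with_mean and next_with_last are only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [np.nan] * 3
    + [320700.0, 315300.0, 310500.0, 307500.0, 315400.0]
    + [np.nan] * 3,
    index=pd.Index(range(2009, 2020), name="year"),
)

known = s.dropna()
mean_increment = known.diff(1).mean()    # average over all known diffs
last_increment = known.diff(1).iloc[-1]  # only the 2015 -> 2016 diff

# First extrapolated value (2017) under each choice
next_with_mean = known.iloc[-1] + mean_increment
next_with_last = known.iloc[-1] + last_increment
print(mean_increment, last_increment)
print(next_with_mean, next_with_last)
```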
Step 2: filling the leading NaNs works more or less the same way, except that the value column is reversed first and reversed back at the end:
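# Reverse the column so the leading NaNs become trailing NaNs
df['value'] = df['value'].array[::-1]
increment = df['value'].diff(1).mean()
idx_max_notna = df['value'][df['value'].notna()].index.array[-1]
idx = df.index[df.index >= idx_max_notna]
df.loc[idx, 'value'] = df.loc[idx, 'value'].fillna(increment).cumsum()
# Reverse back to the original order
df['value'] = df['value'].array[::-1]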
Result:
value
year
2009 324675.0
2010 323350.0
2011 322025.0
2012 320700.0
2013 315300.0
2014 310500.0
2015 307500.0
2016 315400.0
2017 314075.0
2018 312750.0
2019 311425.0
Important: this approach assumes the index has no gaps (missing years).
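If you are not sure whether that holds for your data, a check along these lines might help. It is only a sketch with made-up data (2014 is missing on purpose), and it assumes an integer year index:

```python
import numpy as np
import pandas as pd

# Illustrative series where the year 2014 is missing from the index
s = pd.Series(
    [320700.0, 315300.0, 307500.0, 315400.0, np.nan],
    index=pd.Index([2012, 2013, 2015, 2016, 2017], name="year"),
)

# Every year from the first to the last label, without holes
full_index = pd.RangeIndex(
    int(s.index.min()), int(s.index.max()) + 1, name=s.index.name
)
has_gaps = not s.index.equals(full_index)

# Reindexing inserts the missing years as NaN rows
s_full = s.reindex(full_index)
print(has_gaps)
print(s_full)
```

Note that the reindexed series now has a NaN in the middle, which the two steps above do not fill. You would have to handle that separately first, for example with Series.interpolate.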
As I said, not very pretty, but it works for me.
(PS: just to clarify my use of "similar" above: this is indeed linear extrapolation.)
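Since both ends are extended with the same constant slope, the two steps can also be written as plain arithmetic, without fillna/cumsum and without reversing. The helper below is a hypothetical sketch (extrapolate_ends is my name, not a pandas function). It assumes a unique, evenly spaced index without gaps, at least two known values, and no NaNs between the known values. On the sample data it should give the same numbers as the two steps above.

```python
import numpy as np
import pandas as pd


def extrapolate_ends(ser: pd.Series) -> pd.Series:
    """Fill leading and trailing NaNs linearly, using the mean diff as slope."""
    out = ser.copy()
    increment = out.diff(1).mean()
    # Positions of the first and last known value
    pos_first = out.index.get_loc(out.first_valid_index())
    pos_last = out.index.get_loc(out.last_valid_index())
    n_tail = len(out) - 1 - pos_last
    if n_tail:
        # 1, 2, 3, ... steps after the last known value
        steps = np.arange(1, n_tail + 1)
        out.iloc[pos_last + 1:] = out.iloc[pos_last] + increment * steps
    if pos_first:
        # pos_first, ..., 2, 1 steps before the first known value
        steps = np.arange(pos_first, 0, -1)
        out.iloc[:pos_first] = out.iloc[pos_first] - increment * steps
    return out


s = pd.Series(
    [np.nan] * 3
    + [320700.0, 315300.0, 310500.0, 307500.0, 315400.0]
    + [np.nan] * 3,
    index=pd.Index(range(2009, 2020), name="year"),
)
result = extrapolate_ends(s)
print(result)
```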
EDIT
Sample frame (the first 3 rows of the frame in your screenshot):
import numpy as np
import pandas as pd

n2hn_df = pd.DataFrame(
    {'2010': [134.024, np.nan, 36.711], '2011': [134.949, np.nan, 41.6533],
     '2012': [128.193, np.nan, 33.4578], '2013': [125.131, np.nan, 33.4578],
     '2014': [122.241, np.nan, 33.6356], '2015': [115.301, np.nan, 35.5919],
     '2016': [108.927, 520.38, 40.1008], '2017': [106.101, 523.389, 41.38],
     '2018': [96.1861, 526.139, 49.0906], '2019': [np.nan, np.nan, np.nan]},
    index=pd.Index(data=['AT', 'BE', 'BG'], name='NUTS_ID')
)
2010 2011 2012 ... 2017 2018 2019
NUTS_ID ...
AT 134.024 134.9490 128.1930 ... 106.101 96.1861 NaN
BE NaN NaN NaN ... 523.389 526.1390 NaN
BG 36.711 41.6533 33.4578 ... 41.380 49.0906 NaN
Extrapolation:
# Transposing frame
n2hn_df = n2hn_df.T
for col in n2hn_df.columns:
    # Extract column
    ser = n2hn_df[col].copy()
    # End piece
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()
    # Start piece
    ser = pd.Series(ser.array[::-1])
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()
    n2hn_df[col] = ser.array[::-1]
# Re-transposing frame
n2hn_df = n2hn_df.T
Result:
2010 2011 2012 ... 2017 2018 2019
NUTS_ID ...
AT 134.024 134.9490 128.1930 ... 106.101 96.1861 91.456362
BE 503.103 505.9825 508.8620 ... 523.389 526.1390 529.018500
BG 36.711 41.6533 33.4578 ... 41.380 49.0906 50.638050
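A quick way to sanity-check one of these numbers by hand: for AT all years from 2010 to 2018 are known, so the consecutive diffs telescope and their mean is simply (last value - first value) / number of steps. A minimal check of the 2019 entry for AT, using only the values from the sample frame (the variable names are illustrative):

```python
# Values for AT taken from the sample frame
first_2010 = 134.024
last_2018 = 96.1861

# Eight steps between 2010 and 2018
n_steps = 2018 - 2010

# Mean of the consecutive diffs, written as a telescoped sum
increment = (last_2018 - first_2010) / n_steps

# One more step beyond the last known year
value_2019 = last_2018 + increment
print(increment, value_2019)
```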