Pandas - get last n values from a group with an offset: answers

【Question title】: Pandas - get last n values from a group with an offset
【Posted】: 2018-05-28 15:02:00
【Question】:

I have a DataFrame indexed by date (pandas, Python 3.5). electricity_use is the label I need to predict.
For example:

          City Country  electricity_use
DATE                                   
7/1/2014     X       A             1.02
7/1/2014     Y       A             0.25
7/2/2014     X       A             1.21
7/2/2014     Y       A             0.27
7/3/2014     X       A             1.25
7/3/2014     Y       A             0.20
7/4/2014     X       A             0.97
7/4/2014     Y       A             0.43
7/5/2014     X       A             0.54
7/5/2014     Y       A             0.45
7/6/2014     X       A             1.33
7/6/2014     Y       A             0.55
7/7/2014     X       A             2.01
7/7/2014     Y       A             0.21
7/8/2014     X       A             1.11
7/8/2014     Y       A             0.34
7/9/2014     X       A             1.35
7/9/2014     Y       A             0.18
7/10/2014    X       A             1.22
7/10/2014    Y       A             0.27

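Of course, the real data is much larger.
My goal is to add to each row the last 3 electricity_use values from its group ('City', 'Country'), with a gap of 5 days (that is, take the last 3 values dated 5 days earlier or before). The dates may be non-consecutive, but they are sorted.
For example, for the last two rows the result should be: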

          City Country  electricity_use prev_1 prev_2 prev_3
DATE                                                        
7/10/2014    X       A             1.22   0.54   0.97   1.25
7/10/2014    Y       A             0.27   0.45   0.43   0.20

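Because the date is 7/10/2014 and the gap is 5 days, we start looking from 7/5/2014, and these are the last 3 values up to and including that date for each group (in this case, the groups are (X,A) and (Y,A)).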

I implemented this with a loop over each group, but I feel it could be done more efficiently.
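The asker's loop is not shown. As a point of reference, here is a minimal sketch of what a per-group loop could look like, assuming the index is a DatetimeIndex named DATE as in the sample above. The helper name last_n_with_gap and its parameters are hypothetical, and the sample values are copied from the question.

```python
import numpy as np
import pandas as pd

# Sample rebuilt from the question: 10 days, two cities, one country.
days = pd.date_range("2014-07-01", periods=10, freq="D")
use = {
    "X": [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22],
    "Y": [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27],
}
frames = [
    pd.DataFrame({"DATE": days, "City": city, "Country": "A", "electricity_use": vals})
    for city, vals in use.items()
]
df = pd.concat(frames).sort_values(["DATE", "City"]).set_index("DATE")


def last_n_with_gap(df, n=3, gap_days=5):
    """Hypothetical helper: loop over groups, one searchsorted call per group."""
    out = df.reset_index()  # unique integer row labels; DATE becomes a column
    for k in range(1, n + 1):
        out["prev_" + str(k)] = np.nan
    gap = np.timedelta64(gap_days, "D")
    for _, g in out.groupby(["City", "Country"]):
        g = g.sort_values("DATE")
        dates = g["DATE"].to_numpy()
        vals = g["electricity_use"].to_numpy()
        # pos[j] = number of rows in this group dated on or before dates[j] - gap
        pos = np.searchsorted(dates, dates - gap, side="right")
        for k in range(1, n + 1):
            idx = pos - k  # k-th most recent eligible row; negative means no such row
            col = np.where(idx >= 0, vals[np.clip(idx, 0, None)], np.nan)
            out.loc[g.index, "prev_" + str(k)] = col
    return out.set_index("DATE")


result = last_n_with_gap(df)
print(result.tail(2))
```

On the sample data the last two rows get 0.54, 0.97, 1.25 for (X,A) and 0.45, 0.43, 0.20 for (Y,A), which matches the expected result above. Rows without enough history are left as NaN.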

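【Question comments】:

It should be possible to get the other column values by shifting or offsetting the dates and then merging or joining on the shifted date, city, and country.
@Matts, could you write code like that?
@Binyamin I could probably figure it out if I spent the time on it, but it's not something I can do easily.

Tags: python algorithm pandas datetime group-by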

【Solution 1】:

A naive approach is to copy the date index into a column and merge the DataFrame with itself n times, once per offset:

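from datetime import timedelta

# make sure the index is in datetime format
df['date'] = df.index
result = df.copy()

for i in range(3):
    # values of the same group exactly 5+i days earlier (NaN if that exact date is missing),
    # re-dated so they line up with the row that needs them
    prior = df[['City', 'Country', 'date', 'electricity_use']].copy()
    prior['date'] = prior['date'] + timedelta(days=5 + i)
    prior = prior.rename(columns={'electricity_use': 'prev_' + str(i + 1)})
    result = result.merge(prior, on=['City', 'Country', 'date'], how='left')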

A faster approach is to use shift and then overwrite the bogus values:

# assumes the index is already in datetime format
df['date'] = df.index

df.sort_values(by=['City','Country','date'], inplace=True)

# To pick the oldest date of every city + country group
temp = df[['City','Country','date']].groupby(['City','Country']).first()

# the merge result must be assigned back; the overlapping 'date' column becomes 'date_first'
df = df.merge(temp, left_on=['City','Country'], right_index=True, suffixes=('','_first'))

# days elapsed since the first date of the group
df['diff_date'] = (df['date'] - df['date_first']).dt.days

# Do a shift by 5, 6 and 7 rows (this assumes one row per day within each group)
for i in range(5, 8):
    df['days_prior_'+str(i)] = df['electricity_use'].shift(i)
    # The first i values of every City + Country group would be bogus, as they come from the group before it
    df.loc[df['diff_date'] < i, 'days_prior_'+str(i)] = 0
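Both snippets above rely on a row existing for every calendar day, while the question allows non-consecutive dates. One possible alternative, shown here as a sketch rather than as part of the original answer, is pd.merge_asof with a per-group lag table. The sample data below is hypothetical (it deliberately has gaps in the dates), and the column names target and lookup_date are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample with gaps in the dates (not the question's data).
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-07-01", "2014-07-01", "2014-07-02", "2014-07-04",
                            "2014-07-09", "2014-07-10", "2014-07-10"]),
    "City": ["X", "Y", "X", "X", "X", "X", "Y"],
    "Country": "A",
    "electricity_use": [1.0, 10.0, 2.0, 3.0, 4.0, 5.0, 20.0],
})
keys = ["City", "Country"]
# merge_asof needs both sides sorted by their date key.
df = df.sort_values("date", kind="mergesort").reset_index(drop=True)

# Lookup table: each row carries its own value plus the two before it in its group.
by_group = df.groupby(keys)["electricity_use"]
lookup = pd.DataFrame({
    "City": df["City"],
    "Country": df["Country"],
    "lookup_date": df["date"],
    "prev_1": df["electricity_use"],
    "prev_2": by_group.shift(1),
    "prev_3": by_group.shift(2),
})

# For every row, find the latest lookup row of the same group dated <= date - 5 days.
left = df.assign(target=df["date"] - pd.Timedelta(days=5))
out = pd.merge_asof(left, lookup, left_on="target", right_on="lookup_date",
                    by=keys, direction="backward")
out = out.drop(columns=["target", "lookup_date"])
print(out)
```

Here the row for X on 2014-07-10 looks back from 2014-07-05, finds 2014-07-04 as the latest earlier date, and picks up 3.0, 2.0, 1.0. Rows with no eligible history stay NaN instead of being set to 0, which keeps missing values distinguishable from real zero readings.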

【Comments】: