【Posted】: 2020-10-13 01:57:43
【Question】:
I have a pandas DataFrame:
   Date        Party  Status
-------------------------------------------
0  01-01-2018  John   Sent
1  13-01-2018  Lisa   Received
2  15-01-2018  Will   Received
3  19-01-2018  Mark   Sent
4  02-02-2018  Will   Sent
5  28-02-2018  John   Received
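I want to add new columns that behave like .cumsum(), but conditioned on the date: each row should count only the rows that fall within the preceding 30 days. The result would look like this: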
                                Num of Sent      Num of Received
   Date        Party  Status    in Past 30 Days  in Past 30 Days
-----------------------------------------------------------------------------------
0  01-01-2018  John   Sent      1                0
1  13-01-2018  Lisa   Received  1                1
2  15-01-2018  Will   Received  1                2
3  19-01-2018  Mark   Sent      2                2
4  02-02-2018  Will   Sent      2                2
5  28-02-2018  John   Received  1                1
I managed to get what I need with the following code:
def inner_func(date_var, status_var, date_array, status_array):
    # Count the "Sent" and "Received" rows seen so far that fall
    # within 30 days (inclusive) of date_var.
    sent_increment = 0
    received_increment = 0
    for k in range(len(date_array)):
        if (date_var - date_array[k]).days <= 30:
            if status_array[k] == "Sent":
                sent_increment += 1
            elif status_array[k] == "Received":
                received_increment += 1
    return sent_increment, received_increment
import pandas as pd

# The dates are written day-first, so the format is given explicitly.
df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"], format="%d-%m-%Y"),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})
# Sort chronologically and renumber the rows so that df.loc[i, ...] follows date order.
df = df.sort_values("Date").reset_index(drop=True)

date_array = []
status_array = []
for i in range(len(df)):
    date_var = df.loc[i, "Date"]
    date_array.append(date_var)
    status_var = df.loc[i, "Status"]
    status_array.append(status_var)
    # Count over every row seen so far, including the current one.
    sent_count, received_count = inner_func(date_var, status_var, date_array, status_array)
    df.loc[i, "Num of Sent in Past 30 days"] = sent_count
    df.loc[i, "Num of Received in Past 30 days"] = received_count
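However, when df is large this is computationally expensive and slow, because for every row the inner loop rescans all earlier rows, so the work grows quadratically with the number of rows. Is there a more Pythonic way to achieve this without iterating over the DataFrame the way I am doing?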
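One possible vectorized alternative to the loop, sketched here under assumptions: turn Status into 0/1 indicator columns and take a time-based rolling sum over a Date index. The helper name `add_rolling_counts` and the choice of `closed="both"` are assumptions made for this sketch and are not part of the original post. `closed="both"` is used so that a row exactly 30 days earlier is still counted, mirroring the `<= 30` comparison in the loop. The default for an offset window, `closed="right"`, would exclude it.

```python
import pandas as pd


def add_rolling_counts(df, window="30D"):
    """Hypothetical helper: count 'Sent'/'Received' rows in a trailing date window."""
    # Time-based rolling windows need a chronologically sorted index.
    out = df.sort_values("Date").reset_index(drop=True)

    # One indicator column per status, indexed by the row's date.
    flags = pd.DataFrame(
        {
            "Num of Sent in Past 30 days": (out["Status"] == "Sent").astype(float).to_numpy(),
            "Num of Received in Past 30 days": (out["Status"] == "Received").astype(float).to_numpy(),
        },
        index=pd.DatetimeIndex(out["Date"]),
    )

    # closed="both" makes each window [t - 30 days, t], matching `<= 30` in the loop.
    counts = flags.rolling(window, closed="both").sum().astype(int)

    # Copy the counts back by position; `out` and `counts` share the same row order.
    for col in counts.columns:
        out[col] = counts[col].to_numpy()
    return out


# Sample data from the question (dates are day-first).
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"],
        format="%d-%m-%Y",
    ),
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
})
result = add_rolling_counts(df)
print(result)
```

On the six sample rows this should reproduce the two count columns from the desired output above. The rolling sum is computed inside pandas rather than in a Python loop, which is where the speed-up on a large df would be expected to come from.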
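Update 2
Michael provided the solution I was looking for: here. Now suppose I want to apply that solution to a groupby object, for example using the rolling approach to compute the cumulative counts separately for each party: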
                                Sent past 30   Received past 30
   Date        Party  Status    days by party  days by party
-----------------------------------------------------------------------------------
0  01-01-2018  John   Sent      1              0
1  13-01-2018  Lisa   Received  0              1
2  15-01-2018  Will   Received  0              1
3  19-01-2018  Mark   Sent      1              0
4  02-02-2018  Will   Sent      1              1
5  28-02-2018  John   Received  0              1
I tried to reproduce the solution with groupby as follows:
l = []
grp_obj = df.groupby("Party")
grp_obj.rolling('30D', min_periods=1)["dummy"].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)
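But I end up with incorrect values. I understand this is because concat combines the data frames without regard to their index, while groupby orders the rows differently from the original frame. Is there a way to modify what is appended to the list so that it carries the original index, letting me merge/join the value_counts data frames back onto the original index?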
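A different route that sidesteps the index mismatch, given as a sketch and not as a fix of the list-append approach above: compute the rolling sum inside `groupby(...).transform`, which hands each party's rows to a function and aligns the returned values back to the original row labels. The helper names `add_counts_by_party` and `trailing_sum` are hypothetical. The sketch also assumes that the original index is unique and that `closed="both"` (a window of `[t - 30 days, t]`) is the intended boundary.

```python
import pandas as pd


def add_counts_by_party(df, window="30D"):
    """Hypothetical helper: per-party trailing counts, aligned to the original index."""
    # Sort by date so each party's rows are chronological; original labels are kept.
    out = df.sort_values("Date").copy()
    dates = out["Date"]

    def trailing_sum(flags):
        # `flags` holds one party's 0/1 indicators and still carries the original labels.
        timed = pd.Series(
            flags.to_numpy(),
            index=pd.DatetimeIndex(dates.loc[flags.index]),
        )
        rolled = timed.rolling(window, closed="both").sum()
        # Return the result under the original labels so transform can align it.
        return pd.Series(rolled.to_numpy(), index=flags.index)

    for status in ("Sent", "Received"):
        flag = (out["Status"] == status).astype(float)
        out[f"{status} past 30 days by party"] = (
            flag.groupby(out["Party"]).transform(trailing_sum).astype(int)
        )

    # Restore the original row order.
    return out.sort_index()


# Sample data from the question (dates are day-first).
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"],
        format="%d-%m-%Y",
    ),
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
})
by_party = add_counts_by_party(df)
print(by_party)
```

Because every piece is returned under its original row labels, no concat of separately ordered frames is needed. The Python-level work here is one function call per party and status, not one per row.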
【Comments】: