【Posted】: 2020-01-14 23:56:22
【Problem description】:
I am fairly new to Python and data science.
I have these two DataFrames. The first is df:
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
Date A
0 2014-11-21 11:00:00 1
1 2014-11-21 11:00:03 2
2 2014-11-21 11:00:04 5
3 2014-11-21 11:00:05 3
4 2014-11-21 11:00:07 9
5 2014-11-21 11:00:08 6
6 2014-11-21 11:00:10 3
7 2014-11-21 11:00:11 0
8 2014-10-24 10:00:55 8
9 2014-10-24 10:00:59 10
The second is info, which holds the datetime ranges that my final df should cover:
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
Start Stop
0 2014-11-21 11:00:00 2014-11-21 11:00:07
1 2014-11-21 11:00:08 2014-11-21 11:00:11
2 2014-10-24 10:00:55 2014-10-24 10:00:59
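The goal is to compute, for each row of df, a cumulative sum of A over a two-second window, if and only if that row falls inside one of the ranges in info. For example, the cumulative sum carried into the row dated 2014-11-21 11:00:08 should be 0, because that row is at the start of a range. As another example, the row dated 2014-11-21 11:00:07 should have a cumsum of 12 (9+3).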
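To pin down the rule before touching pandas, here is a minimal reference sketch in plain Python. It assumes the window is inclusive on both ends (t - 2s up to t) and never reaches back past the start of the row's own range; both assumptions are read off the expected output, where the 11:00:05 row includes 11:00:03 and the 11:00:08 row equals its own value. The names `parse` and `windowed_sums` are made up for illustration.

```python
from datetime import datetime, timedelta

def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

rows = [
    ("2014-11-21 11:00:00", 1), ("2014-11-21 11:00:03", 2),
    ("2014-11-21 11:00:04", 5), ("2014-11-21 11:00:05", 3),
    ("2014-11-21 11:00:07", 9), ("2014-11-21 11:00:08", 6),
    ("2014-11-21 11:00:10", 3), ("2014-11-21 11:00:11", 0),
    ("2014-10-24 10:00:55", 8), ("2014-10-24 10:00:59", 10),
]
ranges = [
    ("2014-11-21 11:00:00", "2014-11-21 11:00:07"),
    ("2014-11-21 11:00:08", "2014-11-21 11:00:11"),
    ("2014-10-24 10:00:55", "2014-10-24 10:00:59"),
]

def windowed_sums(rows, ranges, window=timedelta(seconds=2)):
    """Return one sum per row, or None for rows outside every range."""
    parsed_rows = [(parse(ts), a) for ts, a in rows]
    parsed_ranges = [(parse(s), parse(e)) for s, e in ranges]
    result = []
    for t, _ in parsed_rows:
        # Find the range containing this row (both ends inclusive).
        containing = next(((s, e) for s, e in parsed_ranges if s <= t <= e), None)
        if containing is None:
            result.append(None)
            continue
        # Look back at most `window`, but never before the range start.
        lower = max(containing[0], t - window)
        result.append(sum(a for u, a in parsed_rows if lower <= u <= t))
    return result

sums = windowed_sums(rows, ranges)
print(sums)  # [1, 2, 7, 10, 12, 6, 9, 3, 8, 10]
```

This is quadratic in the number of rows, so it is only meant as a check of the rule, not as something to run on a large frame.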
Here is what I have so far:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
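# Number each range so rows can be grouped by the range they fall in
info['groupnum'] = info.index
info.Start = pd.to_datetime(info.Start)
info.Stop = pd.to_datetime(info.Stop)
# Series of group numbers indexed by the closed intervals [Start, Stop]
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
# Look up the interval that contains each Date
df['groupnum'] = pd.to_datetime(df.Date).map(cinfo)
# Cumulative sum over the whole group (not limited to a 2-second window)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)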
Expected result:
Date A groupnum cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 2
2 2014-11-21 11:00:04 5 0 7
3 2014-11-21 11:00:05 3 0 10
4 2014-11-21 11:00:07 9 0 12
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 3
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 10
Actual result:
Date A groupnum cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 3
2 2014-11-21 11:00:04 5 0 8
3 2014-11-21 11:00:05 3 0 11
4 2014-11-21 11:00:07 9 0 20
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 9
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 18
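But this accumulates over the whole groupnum group; I have not managed to restrict the sum to a 2-second window.
Is there a proper way to achieve this? Any help would be appreciated.
My English is not very good; I hope I have explained this clearly.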
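One possible direction, offered as a sketch rather than a definitive answer: replace the per-group cumsum with a time-based rolling sum computed inside each group. The labelling step is the one from the question; `closed="both"` is an assumption inferred from the expected output (the 11:00:05 row includes the 11:00:03 row, exactly two seconds earlier).

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2014-11-21 11:00:00", "2014-11-21 11:00:03", "2014-11-21 11:00:04",
             "2014-11-21 11:00:05", "2014-11-21 11:00:07", "2014-11-21 11:00:08",
             "2014-11-21 11:00:10", "2014-11-21 11:00:11", "2014-10-24 10:00:55",
             "2014-10-24 10:00:59"],
    "A": [1, 2, 5, 3, 9, 6, 3, 0, 8, 10],
})
info = pd.DataFrame({
    "Start": ["2014-11-21 11:00:00", "2014-11-21 11:00:08", "2014-10-24 10:00:55"],
    "Stop": ["2014-11-21 11:00:07", "2014-11-21 11:00:11", "2014-10-24 10:00:59"],
})

df["Date"] = pd.to_datetime(df["Date"])
info["Start"] = pd.to_datetime(info["Start"])
info["Stop"] = pd.to_datetime(info["Stop"])

# Same labelling as in the question: map each Date to the range containing it.
info["groupnum"] = info.index
intervals = pd.IntervalIndex.from_arrays(info["Start"], info["Stop"], closed="both")
cinfo = info.set_index(intervals)["groupnum"]
df["groupnum"] = df["Date"].map(cinfo)

# Rolling 2-second sum, computed separately inside each range.
parts = []
for _, group in df.groupby("groupnum"):
    group = group.sort_values("Date")  # time-based rolling needs sorted dates
    rolled = group.set_index("Date")["A"].rolling("2s", closed="both").sum()
    # Re-attach the original row labels so the result aligns with df.
    parts.append(pd.Series(rolled.to_numpy(), index=group.index))

df["cum"] = pd.concat(parts)  # assignment aligns on the original index
print(df)
```

Because groupby skips rows whose groupnum is NaN, rows outside every range would end up with NaN in cum. The loop over groups is used here to keep each step explicit.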
【Discussion】:
-
Could you add the expected result? Otherwise I may not understand your question correctly.
-
@CodeDifferent Oh yes, of course. Sorry, I should have done that from the start.
-
@Arès Do you have any times in df that fall outside every range in info?
-
@Arès What do you want in that case? Drop the rows?
-
@Ben.T Yes, that is the goal.
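As a small illustration of dropping such rows, here is a sketch that builds a boolean mask with `Series.between` and keeps only the rows inside at least one range. The data is hypothetical (the third row is deliberately outside the single range), and `inside` and `kept` are names chosen for the example.

```python
import pandas as pd

# Hypothetical data: the last row lies outside every range in info.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2014-11-21 11:00:00",
                            "2014-11-21 11:00:07",
                            "2014-11-21 11:30:00"]),
    "A": [1, 9, 4],
})
info = pd.DataFrame({
    "Start": pd.to_datetime(["2014-11-21 11:00:00"]),
    "Stop": pd.to_datetime(["2014-11-21 11:00:07"]),
})

# True where the row's Date lies inside at least one [Start, Stop] range.
inside = pd.Series(False, index=df.index)
for start, stop in zip(info["Start"], info["Stop"]):
    inside |= df["Date"].between(start, stop)  # both ends inclusive by default

kept = df[inside]
print(kept)
```

With the labelling from the question, dropping the rows whose groupnum is NaN should give the same selection.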