【Question title】: Cumulative Sum 2s back using range dataset
【Posted】: 2020-01-14 23:56:22
【Question】:

I'm fairly new to Python and data science.

I have these two dataframes. The df dataframe:

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

                  Date   A
0  2014-11-21 11:00:00   1
1  2014-11-21 11:00:03   2
2  2014-11-21 11:00:04   5
3  2014-11-21 11:00:05   3
4  2014-11-21 11:00:07   9
5  2014-11-21 11:00:08   6
6  2014-11-21 11:00:10   3
7  2014-11-21 11:00:11   0
8  2014-10-24 10:00:55   8
9  2014-10-24 10:00:59  10

The info dataframe, which contains the datetime ranges that my final df should cover:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:08:00', '2014-10-24 10:55:00'], "Stop": ['2014-11-21 11:07:00', '2014-11-21 11:11:00', '2014-10-24 10:59:00']})

                 Start                 Stop
0  2014-11-21 11:00:00  2014-11-21 11:00:07
1  2014-11-21 11:00:08  2014-11-21 11:00:11
2  2014-10-24 10:00:55  2014-10-24 10:00:59

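The goal is to compute a cumulative sum over a two-second window in df, if and only if the row in df falls within the range of one of the rows in info. For example, the cumulative sum for the row dated 2014-11-21 11:00:08 should be 0, because it is at the start of a range. As another example, the row dated 2014-11-21 11:00:07 should have a cumsum of 12 (9+3).

This is what I have achieved so far: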

import pandas as pd
import numpy as np

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)

Expected result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    2
2  2014-11-21 11:00:04   5         0    7
3  2014-11-21 11:00:05   3         0   10
4  2014-11-21 11:00:07   9         0   12
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    3
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   10

Actual result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    3
2  2014-11-21 11:00:04   5         0    8
3  2014-11-21 11:00:05   3         0   11
4  2014-11-21 11:00:07   9         0   20
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    9
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   18

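But this computes the cumulative sum over the whole groupnum group; I can't manage to restrict it to the last 2s.

So is there a proper way to achieve this? I would be very grateful.

My English isn't very good; I hope I explained this clearly.

【Comments】:

Can you add the expected result? Otherwise I may not understand your question correctly.
@CodeDifferent Oh yes, of course, sorry, I should have done that from the start.
@Arès Do you have any times in df that fall outside all of the ranges in info?
@Arès What do you want in that case? Drop the rows?
@Ben.T Yes, that's the goal.

Tags: python pandas dataframe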

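【Solution 1】:

This method may not work with a 100M-row dataframe.

To create the groupnum column, you can use ufunc.outer with greater_equal and less_equal to compare each time in df with each Start and Stop in info, then use argmax to get the position of the first True in each row. You can then groupby on this column and apply a rolling window of 2s.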

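# the comparisons below need datetime64 values; in the question's code Date still holds strings
df['Date'] = pd.to_datetime(df['Date'])

# create a boolean array to find which range each row falls in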
arr_bool = ( np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
             & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))

# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)

# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')

# groupby groupnum, do the sum for a closed interval of 2s
df['cum'] = df.groupby('groupnum').rolling('2s', closed = 'both').A.sum()\
              .reset_index(level=0, drop=True) # for index alignment

df = df.reset_index() # get back date as a column
print (df)
                 Date   A  groupnum   cum
0 2014-11-21 11:00:00   1         0   1.0
1 2014-11-21 11:00:03   2         0   2.0
2 2014-11-21 11:00:04   5         0   7.0
3 2014-11-21 11:00:05   3         0  10.0
4 2014-11-21 11:00:07   9         0  12.0
5 2014-11-21 11:00:08   6         1   6.0
6 2014-11-21 11:00:10   3         1   9.0
7 2014-11-21 11:00:11   0         1   3.0
8 2014-10-24 10:00:55   8         2   8.0
9 2014-10-24 10:00:59  10         2  10.0
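
To make the effect of closed='both' concrete, here is a minimal sketch on the first range only (group 0 from the question). The variable names s, both and right are only illustrative. With closed='both' the window for a row at time t covers [t - 2s, t], whereas the default for a time-offset window is closed='right', i.e. (t - 2s, t], which would leave out a row exactly 2 seconds back.

```python
import pandas as pd

# Rows of the first range in the question (group 0), indexed by timestamp
s = pd.Series(
    [1, 2, 5, 3, 9],
    index=pd.to_datetime([
        "2014-11-21 11:00:00",
        "2014-11-21 11:00:03",
        "2014-11-21 11:00:04",
        "2014-11-21 11:00:05",
        "2014-11-21 11:00:07",
    ]),
)

# closed="both": sum of all rows with timestamp in [t - 2s, t]
both = s.rolling("2s", closed="both").sum()

# default for an offset window (closed="right"): rows in (t - 2s, t]
right = s.rolling("2s").sum()

print(pd.DataFrame({"A": s, "both": both, "right": right}))
```

The both column reproduces the expected values 1, 2, 7, 10, 12, while right gives 8 and 9 for the last two rows because the rows at 11:00:03 and 11:00:05 sit exactly on the open edge of their windows.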

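Edit: if arr_bool cannot be created this way, you can try iterating over the rows of info and checking each range independently, i.e. whether the date is at or above start and at or below stop: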

# get once an array of all dates (should be faster)
arr_date = df.Date.to_numpy()

# create groups by sum 
df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop)) 
                         for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1

# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')

# then same for the column cum
df['cum'] = ...
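
Another possible way to get groupnum without building a rows-by-ranges boolean array is to reuse the IntervalIndex from the question and call its get_indexer method. This is only a sketch and not part of the original answer: it assumes the ranges in info do not overlap (get_indexer raises an error for overlapping intervals), and the last row of df below is a made-up date added to show how rows outside every range are handled.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-11-21 11:00:00",
        "2014-11-21 11:00:07",
        "2014-11-21 11:00:08",
        "2014-10-24 10:00:59",
        "2014-12-01 00:00:00",  # hypothetical row that falls outside every range
    ]),
    "A": [1, 9, 6, 10, 4],
})
info = pd.DataFrame({
    "Start": pd.to_datetime(["2014-11-21 11:00:00", "2014-11-21 11:00:08", "2014-10-24 10:00:55"]),
    "Stop": pd.to_datetime(["2014-11-21 11:00:07", "2014-11-21 11:00:11", "2014-10-24 10:00:59"]),
})

# One closed interval per row of info (assumed non-overlapping)
intervals = pd.IntervalIndex.from_arrays(info["Start"], info["Stop"], closed="both")

# Position of the interval containing each date, or -1 if no interval contains it
df["groupnum"] = intervals.get_indexer(pd.DatetimeIndex(df["Date"]))

# Drop the rows that are not in any range
df = df.loc[df["groupnum"].ge(0)]
print(df)
```

The grouped rolling sum from the first part of the answer can then be applied to this groupnum column unchanged.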

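【Comments】:

Thanks man, why do you think it wouldn't work with 100M rows? Is it because of df = df.loc[arr_bool.any(axis=1), :].set_index('Date')?
@Arès No, it's the creation of arr_bool that will be huge: it has as many rows as df and as many columns as there are rows in info (that's why I asked that question earlier). Even though it is a boolean array, it may be too big for your memory.
@Arès I have never used pd.IntervalIndex.from_arrays, so I'm not sure how efficient it is (in time and space). Maybe try both on a 1M-row dataframe and see! I'll add an alternative in case arr_bool cannot be created, but it will iterate over info.
@Arès See my edit, and one remark about loops: iterrows is indeed not the best choice, but if vectorization isn't possible, a list comprehension like the one in the edit is the next best thing (see this answer) ;)
Thanks Ben for all this valuable information, I really appreciate it. It helps me better understand vectorization in pandas and the drawbacks of iteration.

【Solution 2】:

I tried the following: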

from datetime import datetime
import pandas
df = pandas.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
# !!! NOTE: you have typos in your code above
info = pandas.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

df['Date'] = df['Date'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
info['Start'] = info['Start'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
info['Stop'] = info['Stop'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

Now that the dates are properly converted to datetime:

for row in info.iterrows():
    mask = (df['Date']>=row[1]['Start'])&(df['Date']<=row[1]['Stop'])
    df.loc[mask, 'cumsum'] = df[mask]['A'].cumsum()

This adds a new column named cumsum to your dataframe. The result should match what you asked for:

                 Date   A  cumsum
0 2014-11-21 11:00:00   1     1.0
1 2014-11-21 11:00:03   2     3.0
2 2014-11-21 11:00:04   5     8.0
3 2014-11-21 11:00:05   3    11.0
4 2014-11-21 11:00:07   9    20.0
5 2014-11-21 11:00:08   6     6.0
6 2014-11-21 11:00:10   3     9.0
7 2014-11-21 11:00:11   0     9.0
8 2014-10-24 10:00:55   8     8.0
9 2014-10-24 10:00:59  10    18.0

Update 1:

Sorry, I missed one piece. To resample, you can do this:

df.index = df['Date']
df.drop(labels=['Date'], axis=1, inplace=True)
for row in info.iterrows():
    mask = (df.index>=row[1]['Start'])&(df.index<=row[1]['Stop'])
    df.loc[mask, 'cumsum'] = df[mask]['A'].resample('2S').sum()

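But this won't produce the right result either if there is a 2-second interval with no values in it. If you run into that, you may need to do a linear interpolation before resampling ;)

Update 2:

Now, the problem is a mismatch between the timestamps in the original dataframe and the timestamps after resampling. To see what is going on, look at: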

df.index = df['Date']
df.drop(labels=['Date'], axis=1, inplace=True)
res = []
for row in info.iterrows():
    mask = (df.index>=row[1]['Start'])&(df.index<=row[1]['Stop'])
    res.append(df[mask]['A'].resample('2S').sum())

res will contain 3 Series, one for each interval in info:

2014-11-21 11:00:00    1
2014-11-21 11:00:02    2
2014-11-21 11:00:04    8
2014-11-21 11:00:06    9

2014-11-21 11:00:08    6
2014-11-21 11:00:10    3 

2014-10-24 10:00:54     8
2014-10-24 10:00:56     0
2014-10-24 10:00:58    10

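As you can see, your data has been correctly resampled every 2 seconds starting from 0, but the index no longer matches the original one, which is why you see NaN in the cumsum column in Update 1.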
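
The NaN values come from index alignment: when a Series is assigned to a dataframe column, pandas matches values by index label, not by position. A minimal sketch using only the first range of the question (the names idx and resampled are only illustrative):

```python
import pandas as pd

idx = pd.to_datetime([
    "2014-11-21 11:00:00",
    "2014-11-21 11:00:03",
    "2014-11-21 11:00:04",
    "2014-11-21 11:00:05",
    "2014-11-21 11:00:07",
])
df = pd.DataFrame({"A": [1, 2, 5, 3, 9]}, index=idx)

# Resampling creates new labels on a regular 2-second grid: :00, :02, :04, :06
resampled = df["A"].resample("2s").sum()

# Assignment aligns on labels: only :00 and :04 exist in both indexes,
# so every other row of df gets NaN
df["cumsum"] = resampled
print(df)
```

Only the rows at 11:00:00 and 11:00:04 receive a value (1 and 8); the rows at 11:00:03, 11:00:05 and 11:00:07 have no matching label in the resampled index.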

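Now, I think the correct solution to implement is the last one, which samples and sums the data properly and evenly every 2 seconds. In any case, if this is not the result you are after, it should be easy to adapt my solution in the direction you prefer ;)

【Comments】:

Isn't there another, non-"iterative" way? I mean, iterating is very simple, but I'd like to use pandas methods as much as possible to speed this up (in the real case I'll have 100M rows, and doing it with iterrows would be really slow).
@Arès What is the size of info? The iteration is over info, not over df.
@Ben.T Oh yes, my mistake, I thought the iterrows was over df. But do you think loc is also fine for large amounts of data? I'd like to understand how it actually works.
@Arès Boolean indexing with loc is vectorized. In any case, the result of this method is your current result, which differs from your expected result.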