Keeping track of how many observations fall within a fixed time window when time delta is not constant

【Question title】: Keeping track of how many observations fall within a fixed time window when time delta is not constant
【Posted】: 2021-06-04 10:20:51
【Question】:

I have a dataframe of observations indexed by time, but the time delta between observations is not constant.

df
>>>
    TimeStamp              x1        x2
1   2015-03-01 19:05:01    0.812     18.23
2   2015-03-01 19:22:17    0.121     13.91
3   2015-03-01 19:24:34    0.822     15.10
4   2015-03-01 19:28:53    0.093     22.38
5   2015-03-01 21:49:57    0.291     22.90
6   2015-03-01 23:59:01    0.672     23.12
7   2015-03-02 02:30:01    0.421     28.56
8   2015-03-02 02:30:01    0.591     31.72
9   2015-03-02 02:31:17    0.811     21.71
10  2015-03-02 04:37:19    0.142     16.39

For each sample, I want to count the number of observations that fall within a fixed time window after it.

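If my time window is 10 minutes, then I want to get [0, 2, 1, 0, 0, 0, 2, 1, 0], because 0 samples are observed within 10 minutes of the first sample, 2 samples are observed within 10 minutes of the second sample, 1 sample is observed within 10 minutes of the third sample, and so on. Two observations can occur at the same time and still be distinct observations (such as 7 and 8).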

If my time window is 1 hour, then I want to get [3, 2, 1, 0, 0, 0, 2, 1, 0], because 3 samples are observed within 1 hour of the first sample, and so on.

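I have a function that does this, but it has two problems: 1) it is very slow, because it iterates over the data row by row; 2) the returned count is sometimes negative, which I find strange because the timedelta is always >= 0.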

import datetime as dt

import numpy as np
import pandas as pd

def get_count(data: pd.DataFrame, window_hours: int, window_minutes: int) -> np.ndarray:
    # we only want to iterate to the sample that is within window_hours + window_minutes from the end
    last_sample = data["TimeStamp"].iloc[-1] - dt.timedelta(days=0, hours=window_hours, minutes=window_minutes)
    count = np.empty(len(data[data["TimeStamp"] <= last_sample]), dtype=int)
    i = 0
    for index, row in data[data["TimeStamp"] <= last_sample].iterrows():
        idx = np.where(data["TimeStamp"] <= (row["TimeStamp"] + dt.timedelta(days=0, hours=window_hours, minutes=window_minutes)))[0][-1]
        tmp = idx - index
        count[i] = tmp
        i += 1
    return count

Is there a way to do this in pure pandas / numpy (avoiding the for loop), so that it is faster and gives the desired output, which my approach does not seem to do?
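One plausible source of the negative counts, assuming the frame really carries the 1-based index labels shown in the printout: `np.where(...)` returns 0-based positions, while `index` from `iterrows()` is the row label, so `idx - index` mixes two different numbering schemes. The sketch below reproduces this on the first four rows only; `row_label`, `wrong`, and `fixed` are illustrative names, not part of the original function.

```python
import numpy as np
import pandas as pd

# First four rows of the sample data, with the 1-based labels from the printout.
data = pd.DataFrame(
    {"TimeStamp": pd.to_datetime([
        "2015-03-01 19:05:01",
        "2015-03-01 19:22:17",
        "2015-03-01 19:24:34",
        "2015-03-01 19:28:53",
    ])},
    index=[1, 2, 3, 4],
)
window = pd.Timedelta(minutes=10)

row_label = 4  # label of the last row; its 0-based position is 3
t = data.loc[row_label, "TimeStamp"]

# 0-based position of the last timestamp that is <= t + window
last_pos = np.where(data["TimeStamp"] <= t + window)[0][-1]

wrong = last_pos - row_label                       # position minus label
fixed = last_pos - data.index.get_loc(row_label)   # position minus position

print(wrong, fixed)
```

If this is indeed the cause, converting the label to a position (or resetting the index to 0-based) before subtracting would keep the difference non-negative.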

【Comments】:

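One idea is to group by time interval and then simply count the number of entries in each group.
I don't see why there are 2 samples within 10 minutes of the 2nd sample, but only 1 for the 3rd sample?
pandas resample may be the right tool for this job, but your expected output is unclear, so it is hard to work out how to give a good answer.
No, neither resample nor groupby will work. Those methods require unique group membership, i.e. Row1 -> Group A. A calculation like this one, for example, lets Row1 be used in the lookback for both row 6 and row 7. The approaches that do work need a lot of memory (i.e. after cross joining the rows), so they are often infeasible.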
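To make the memory concern in the last comment concrete, here is a minimal sketch of the cross-join idea written with NumPy broadcasting instead of an actual join. It is an illustration rather than the commenter's code: it builds an n x n matrix of pairwise time differences, so memory grows quadratically with the number of rows. The window is taken as open on the left and closed on the right, which is an assumption.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime([
    "2015-03-01 19:05:01", "2015-03-01 19:22:17", "2015-03-01 19:24:34",
    "2015-03-01 19:28:53", "2015-03-01 21:49:57", "2015-03-01 23:59:01",
    "2015-03-02 02:30:01", "2015-03-02 02:30:01", "2015-03-02 02:31:17",
    "2015-03-02 04:37:19",
]).to_numpy()
window = np.timedelta64(10, "m")

# diff[i, j] = stamps[j] - stamps[i]; this is the n x n "cross join"
diff = stamps[None, :] - stamps[:, None]

# Observation j counts for row i when it is strictly later and at most `window` later.
in_window = (diff > np.timedelta64(0, "s")) & (diff <= window)
counts = in_window.sum(axis=1)

print(counts.tolist())
```

For a frame with 100,000 rows the difference matrix alone would hold 10^10 entries, which is why this approach is usually only practical for small inputs.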

Tags: python pandas timedelta

【Solution 1】:

Use a mask, then count()
Flexible: accepts the same keyword arguments as Timedelta

import io

import pandas as pd

df = pd.read_csv(io.StringIO("""   TimeStamp              x1        x2
1   2015-03-01 19:05:01    0.812     18.23
2   2015-03-01 19:22:17    0.121     13.91
3   2015-03-01 19:24:34    0.822     15.10
4   2015-03-01 19:28:53    0.093     22.38
5   2015-03-01 21:49:57    0.291     22.90
6   2015-03-01 23:59:01    0.672     23.12
7   2015-03-02 02:30:01    0.421     28.56
8   2015-03-02 02:30:01    0.591     31.72
9   2015-03-02 02:31:17    0.811     21.71
10  2015-03-02 04:37:19    0.142     16.39"""), sep=r"\s\s+", engine="python")

df.TimeStamp = pd.to_datetime(df.TimeStamp)

def within(dfa, **kwargs):
    return dfa.TimeStamp.apply(lambda t: dfa.loc[dfa.TimeStamp.gt(t) & 
                                                 dfa.TimeStamp.le(t+pd.Timedelta(**kwargs)),
                                                 "TimeStamp"].count())

df["10min"] = within(df, minutes=10)
df["4hour"] = within(df, hours=4)

	TimeStamp	x1	x2	10min	4hour
1	2015-03-01 19:05:01	0.812	18.23	0	4
2	2015-03-01 19:22:17	0.121	13.91	2	3
3	2015-03-01 19:24:34	0.822	15.1	1	2
4	2015-03-01 19:28:53	0.093	22.38	0	1
5	2015-03-01 21:49:57	0.291	22.9	0	1
6	2015-03-01 23:59:01	0.672	23.12	0	3
7	2015-03-02 02:30:01	0.421	28.56	1	2
8	2015-03-02 02:30:01	0.591	31.72	1	2
9	2015-03-02 02:31:17	0.811	21.71	0	1
10	2015-03-02 04:37:19	0.142	16.39	0	0
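The `apply` above still evaluates one mask per row. As a possible loop-free variant, and not part of the original answer, the same counts can be sketched with `np.searchsorted`, assuming the timestamps are already sorted in ascending order. The helper name `count_within` is hypothetical.

```python
import numpy as np
import pandas as pd

def count_within(timestamps: pd.Series, window: pd.Timedelta) -> np.ndarray:
    """Count observations in (t, t + window] for each t.

    Assumes `timestamps` is sorted in ascending order.
    """
    ts = timestamps.to_numpy()
    # One past the last position whose timestamp is <= t + window
    upper = np.searchsorted(ts, ts + window.to_timedelta64(), side="right")
    # One past the last position whose timestamp is <= t; this skips
    # duplicates of t, matching the strict `gt` in the mask-based version
    lower = np.searchsorted(ts, ts, side="right")
    return upper - lower

stamps = pd.Series(pd.to_datetime([
    "2015-03-01 19:05:01", "2015-03-01 19:22:17", "2015-03-01 19:24:34",
    "2015-03-01 19:28:53", "2015-03-01 21:49:57", "2015-03-01 23:59:01",
    "2015-03-02 02:30:01", "2015-03-02 02:30:01", "2015-03-02 02:31:17",
    "2015-03-02 04:37:19",
]))

counts_10min = count_within(stamps, pd.Timedelta(minutes=10))
counts_4hour = count_within(stamps, pd.Timedelta(hours=4))
print(counts_10min.tolist())
print(counts_4hour.tolist())
```

Each `searchsorted` call is a binary search per element, so the cost is roughly O(n log n) with no n x n intermediate. If the frame is not sorted, sorting by `TimeStamp` first would be needed for the result to be meaningful.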

【Comments】: