Answers: Count of Contiguously Preceding Rows with Specific Value within Time Window in Pandas

【Question title】: Count of Contiguously Preceding Rows with Specific Value within Time Window in Pandas
【Posted】: 2021-12-03 21:23:14
【Question description】:

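I'm really struggling with this pandas task for a research project.

I have a DataFrame df with two columns: time (a datetime column) and result (a boolean column). I want to count the streak of contiguous TRUE rows immediately preceding the current row, using a 7-day lookback window.

For example:
If the previous row is FALSE, the count is 0.
If the previous row is TRUE, I want to know how long the streak of contiguous TRUE rows is within the 7 days before the current row.

Example of the expected output below.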

time	result	DESIRED OUTPUT
5/1/21	TRUE	0 (no preceding rows)
5/6/21	TRUE	1
5/8/21	FALSE	2 (immediately preceded by streak of 2 TRUE rows in past 7 days)
5/10/21	FALSE	0
5/11/21	TRUE	0
5/14/21	TRUE	1 (preceding row is TRUE)
5/20/21	TRUE	1 (immediately preceded by streak of one TRUE rows in 1 week window)
5/21/21	TRUE	2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/22/21	TRUE	2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/23/21	FALSE	3 (immediately preceded by streak of three TRUE rows in 1 week window)
5/24/21	TRUE	0 (preceded by FALSE row)
5/26/21	TRUE	1 (immediately preceded by streak of 1 TRUE row)
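
To pin the rule down, here is a minimal brute-force sketch (not a vectorized solution) that walks backwards from each row and stops at the first FALSE row or the first row older than the window. The function name `preceding_true_streak` is made up for illustration, and treating a row exactly 7 days old as still inside the window is an assumption read off the 5/8/21 and 5/21/21 rows of the table above.

```python
import pandas as pd


def preceding_true_streak(df, window="7D"):
    """Count contiguous True rows immediately before each row, within `window`.

    Hypothetical helper: assumes `df` is sorted by `time` with no duplicates.
    """
    limit = pd.Timedelta(window)
    times = list(df["time"])    # pandas Timestamps
    flags = list(df["result"])  # booleans
    counts = []
    for i in range(len(df)):
        n = 0
        j = i - 1
        # Walk back while rows are True and no older than the window (inclusive).
        while j >= 0 and flags[j] and times[i] - times[j] <= limit:
            n += 1
            j -= 1
        counts.append(n)
    return pd.Series(counts, index=df.index, name="streak")


df = pd.DataFrame({
    "time": pd.to_datetime(
        ["5/1/21", "5/6/21", "5/8/21", "5/10/21", "5/11/21", "5/14/21",
         "5/20/21", "5/21/21", "5/22/21", "5/23/21", "5/24/21", "5/26/21"],
        format="%m/%d/%y",
    ),
    "result": [True, True, False, False, True, True,
               True, True, True, False, True, True],
})
df["streak"] = preceding_true_streak(df)
print(df)
```

The loop is quadratic in the worst case, so it is mainly useful as a reference to check faster approaches against.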

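I have been scouring Stack Overflow and racking my brain for days, but I just can't figure out a good way to do this. Shift-and-groupby tricks such as df * (df.groupby((df != df.shift()).cumsum()).cumcount()) work perfectly, except that they don't account for the 7-day lookback window, and the data is irregularly sampled, so I can't assume how many rows fall within any 7-day period.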
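
For readers unfamiliar with that trick, the following small sketch shows what it computes on a plain boolean Series. The sample values and the names `run_id` and `streak` are illustrative only. Note that it ignores time entirely, which is exactly the limitation described above.

```python
import pandas as pd

s = pd.Series([True, True, False, False, True, True, True])

# A new run starts wherever the value differs from the previous row.
run_id = (s != s.shift()).cumsum()
# 0-based position of each row inside its run, zeroed out on False rows.
streak = s * s.groupby(run_id).cumcount()
print(streak.tolist())
```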

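Thank you all very much for your time and help!

【Question discussion】:

Are there any duplicate dates in the "time" column?
No, there aren't. Thanks for asking for clarification.

Tags: pandas dataframe time-series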

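【Solution 1】:

I believe you are on the right track. This solution addresses the idea of resetting a count on a specific condition. I believe the extra piece you are looking for is the flexibility to group by week, which looks something like this: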

df.groupby([pd.Grouper(key='time', freq='W')])['result'].count() # Simple Count Example

The frequency can be replaced with any of the anchored offsets (linked in the original answer).
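
As a rough illustration of what that grouping produces, the sketch below runs the same `pd.Grouper` call on the question's sample data (the DataFrame construction is an assumption about how the data is loaded). The bins are fixed calendar weeks, ending on Sunday for plain `'W'`, which is not the same thing as a 7-day lookback measured from each row.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["5/1/21", "5/6/21", "5/8/21", "5/10/21", "5/11/21", "5/14/21",
         "5/20/21", "5/21/21", "5/22/21", "5/23/21", "5/24/21", "5/26/21"],
        format="%m/%d/%y",
    ),
    "result": [True, True, False, False, True, True,
               True, True, True, False, True, True],
})

# Count rows per calendar week; each bin is labeled by the Sunday that ends it.
weekly = df.groupby(pd.Grouper(key="time", freq="W"))["result"].count()
print(weekly)
```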

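【Discussion】:

Sorry, I don't understand how I would use this. Could you explain a bit more? I'm not sure how to combine it with the code I've already tried.

【Solution 2】:

This seems to meet the requirements: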

import pandas as pd
import numpy as np
data = np.array([['5/1/21', True, '0 (no preceding rows)'],
       ['5/6/21', True, '1'],
       ['5/8/21', False,
        '2 (immediately preceded by streak of 2 TRUE rows in past 7 days)'],
       ['5/10/21', False, '0'],
       ['5/11/21', True, '0'],
       ['5/14/21', True, '1 (preceding row is TRUE)'],
       ['5/20/21', True,
        '1 (immediately preceded by streak of one TRUE rows in 1 week window)'],
       ['5/21/21', True,
        '2 (immediately preceded by streak of two TRUE rows in 1 week window)'],
       ['5/22/21', True,
        '2 (immediately preceded by streak of two TRUE rows in 1 week window)'],
       ['5/23/21', False,
        '3 (immediately preceded by streak of three TRUE rows in 1 week window)'],
       ['5/24/21', True, '0 (preceded by FALSE row)'],
       ['5/26/21', True,
        '1 (immediately preceded by streak of 1 TRUE row)'],
       ['5/27/21', True,
        '2 (immediately preceded by streak of 2 TRUE row)']])
df = pd.DataFrame(data = data, columns = ['time','result','output_check'])

df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
# np.array coerces the mixed-type rows above to strings, so 'result' holds the
# strings 'True'/'False' rather than booleans; map them to 1/0 so the column
# can be summed
df['result_num'] = np.where(df['result'] == 'True', 1, 0)
df['result_num_vice'] = np.where(df['result'].shift(1) == 'False', 1, 0)
# each time the previous row is False we restart the counter, so every restart
# begins a new group; cumsum gives a counter that increases by 1 at each
# restart, and we can use that counter as the group id
df['id'] = df['result_num_vice'].cumsum()

# '8d' with closed='right' covers (t - 8 days, t], so with one row per date a
# row exactly 7 days back is still counted; group_keys=False keeps the original
# time index so the result aligns with df
df['output'] = (
    df.groupby('id', group_keys=False)['result_num'].apply(
        lambda x: x.rolling('8d', closed='right').sum()
    )
)

# each true row after the initial will include itself in the count, so lets just
# subtract one from each row with true 
df['output'] = np.where(
    (df['result_num'] == 1) & (df['output']>0), 
    df['output'] - 1,
    df['output']
)
df = df[['result','output_check','output']]
df

Output:

            result  output_check           output
time            
2021-05-01  True    0 (no preceding rows)   0.0
2021-05-06  True    1   1.0
2021-05-08  False   2 (immediately preceded by streak of 2 TRUE ro...   2.0
2021-05-10  False   0   0.0
2021-05-11  True    0   0.0
2021-05-14  True    1 (preceding row is TRUE)   1.0
2021-05-20  True    1 (immediately preceded by streak of one TRUE ...   1.0
2021-05-21  True    2 (immediately preceded by streak of two TRUE ...   2.0
2021-05-22  True    2 (immediately preceded by streak of two TRUE ...   2.0
2021-05-23  False   3 (immediately preceded by streak of three TRU...   3.0
2021-05-24  True    0 (preceded by FALSE row)   0.0
2021-05-26  True    1 (immediately preceded by streak of 1 TRUE row)    1.0
2021-05-27  True    2 (immediately preceded by streak of 2 TRUE row)    2.0
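
The same idea can be written more compactly when `result` is a real boolean column. This is only a sketch under that assumption, not the answer's exact code: it swaps `apply` for `transform` and uses `rolling("7D", closed="both")`, which covers [t - 7 days, t], in place of the 8-day window. The names `is_true`, `group_id` and `rolled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["5/1/21", "5/6/21", "5/8/21", "5/10/21", "5/11/21", "5/14/21",
         "5/20/21", "5/21/21", "5/22/21", "5/23/21", "5/24/21", "5/26/21"],
        format="%m/%d/%y",
    ),
    "result": [True, True, False, False, True, True,
               True, True, True, False, True, True],
}).set_index("time")

is_true = df["result"].astype(int)
# A new group starts on every row whose previous row is False.
group_id = (~df["result"].shift(1, fill_value=True)).cumsum()
# Per-group rolling count of True rows over [t - 7 days, t].
rolled = is_true.groupby(group_id).transform(
    lambda s: s.rolling("7D", closed="both").sum()
)
# A True row counts itself in its own window, so take it back out.
df["output"] = (rolled - is_true).astype(int)
print(df)
```

Within each group every row is TRUE except possibly the last one, which is why a plain windowed sum is enough to measure the contiguous streak.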

【Discussion】:

【Solution 3】:

I believe you are looking for a rolling aggregation. The documentation is at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html. Here is some code:

import pandas as pd

df = pd.DataFrame(
    [['5/1/21', True],
     ['5/6/21', True],
     ['5/8/21', False],
     ['5/10/21', False],
     ['5/11/21', True],
     ['5/14/21', True],
     ['5/20/21', True],
     ['5/21/21', True],
     ['5/22/21', True],
     ['5/23/21', False],
     ['5/24/21', True],
     ['5/26/21', True]],
    columns=['date', 'result']
)

# parse the date strings and use them as the index for the time-based window
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# number of True values in the 7-day window ending at each row
rolling_result = df.result.rolling('7D').sum()
print(rolling_result)

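And the result (posted as an image in the original answer):

Note that I converted your date column into a datetime index, which I think is required for this to work. If you don't want to convert the whole thing, you can always make a small temporary DataFrame to do it.

Using sum() works because summing a boolean column gives the number of True values in it.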
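
A quick way to see both points is the small sketch below, which uses the first five rows of the question's data. The cast with `astype(int)` is a precaution added here rather than something the answer uses. With the default settings the time window is (t - 7 days, t], so the 5/8/21 row does not see 5/1/21.

```python
import pandas as pd

s = pd.Series(
    [True, True, False, False, True],
    index=pd.to_datetime(
        ["5/1/21", "5/6/21", "5/8/21", "5/10/21", "5/11/21"],
        format="%m/%d/%y",
    ),
)

# Summing booleans counts the True values.
total_true = s.sum()
# Number of True values in the 7-day window ending at each row.
rolling_true = s.astype(int).rolling("7D").sum()
print(total_true)
print(rolling_true)
```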

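【Discussion】:

Sorry, I should have clarified. The output column is what I want to achieve. My DataFrame currently has only the date and result columns. Thanks for trying to answer my question, and for your time and expertise! I have edited my original question so that the output column is called "DESIRED OUTPUT" to make this clearer.
OK - but I think my code still works. I'll adjust the input DataFrame, but rolling can still do the job. The image I posted is my result.
I edited my input.
Thank you, Neha, for trying to help me. The output of your solution still doesn't meet my needs, but I appreciate the attempt. I don't want the sum of True rows in the 7-day window, but rather the streak of contiguous TRUE rows immediately preceding the row, limited to the 7-day lookback window. Sorry, I know this is confusing, and I may not have described it as clearly as I could have.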