【Question Title】: Count of Contiguously Preceding Rows with Specific Value within Time Window in Pandas
【Posted】: 2021-12-03 21:23:14
【Question】:

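I'm really struggling with this pandas task for a research project.

I have a dataframe df with two columns: time (a datetime column) and result (a boolean column). For each row, I want to count the streak of contiguous TRUE rows immediately preceding it, using a 7-day lookback window.

For example:
If the preceding row is FALSE, the count is 0.
If the preceding row is TRUE, I want the length of the unbroken run of TRUE rows immediately before the current row, counting only rows within the previous 7 days.

An example of the expected output is below.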

time result DESIRED OUTPUT
5/1/21 TRUE 0 (no preceding rows)
5/6/21 TRUE 1
5/8/21 FALSE 2 (immediately preceded by streak of 2 TRUE rows in past 7 days)
5/10/21 FALSE 0
5/11/21 TRUE 0
5/14/21 TRUE 1 (preceding row is TRUE)
5/20/21 TRUE 1 (immediately preceded by streak of one TRUE rows in 1 week window)
5/21/21 TRUE 2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/22/21 TRUE 2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/23/21 FALSE 3 (immediately preceded by streak of three TRUE rows in 1 week window)
5/24/21 TRUE 0 (preceded by FALSE row)
5/26/21 TRUE 1 (immediately preceded by streak of 1 TRUE row)
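
Read literally, the rule above can be checked with a plain loop. The following is a minimal reference sketch, not an efficient solution. The helper name preceding_true_streak and its lookback parameter are made up for illustration. It assumes the rows are sorted by time and that a row exactly 7 days old still counts, which is what the 5/8/21 row in the table implies.

```python
import pandas as pd


def preceding_true_streak(times, results, lookback="7D"):
    """Hypothetical helper: for each row, count the unbroken run of True rows
    directly before it, stopping at rows older than `lookback`."""
    times = pd.to_datetime(pd.Series(list(times)))
    results = list(results)
    window = pd.Timedelta(lookback)
    counts = []
    for i in range(len(results)):
        n = 0
        j = i - 1
        # Walk backwards while rows are True and still inside the window.
        while j >= 0 and results[j] and times[i] - times[j] <= window:
            n += 1
            j -= 1
        counts.append(n)
    return counts


# The dates and results from the table above, written in ISO format.
times = ["2021-05-01", "2021-05-06", "2021-05-08", "2021-05-10",
         "2021-05-11", "2021-05-14", "2021-05-20", "2021-05-21",
         "2021-05-22", "2021-05-23", "2021-05-24", "2021-05-26"]
results = [True, True, False, False, True, True,
           True, True, True, False, True, True]

streaks = preceding_true_streak(times, results)
print(streaks)
```

The loop is quadratic in the worst case, so it is better suited to validating a faster approach on a small sample than to running over a large dataframe.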

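I have been scouring Stack Overflow and racking my brain for days, but I just can't figure out a way to do this well. Shift-and-groupby tricks such as df * (df.groupby((df != df.shift()).cumsum()).cumcount()) work perfectly, except that they don't account for the 7-day lookback window, and the data is irregularly sampled, so I can't assume how many rows will fall within any 7-day period.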
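
For readers unfamiliar with the shift-and-groupby trick, here is a small sketch of it on a bare boolean Series. The sample values and the variable names run_id and in_run are illustrative only. It shows that the counter depends purely on row order, which is why it cannot respect a time window by itself.

```python
import pandas as pd

result = pd.Series([True, True, False, False, True, True, True])

# A new run starts wherever the value differs from the previous row.
run_id = (result != result.shift()).cumsum()

# Position of each row within its run (0 for the first row of a run),
# zeroed out on FALSE rows.
in_run = result.groupby(run_id).cumcount().where(result, 0)
print(in_run.tolist())
```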

Thank you all very much for your time and help!

【Comments】:

  • Are there any duplicate dates in the "time" column?
  • No, there aren't. Thanks for asking for clarification.

Tags: pandas dataframe time-series


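【Solution 1】:

I believe you're on the right track. This solution addresses the idea of resetting a count when a specific condition is met. I believe the extra piece you're looking for is the flexibility to group by week, which looks something like this: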

df.groupby([pd.Grouper(key='time', freq='W')])['result'].count() # Simple Count Example

The frequency can be replaced with any of the anchored offsets listed here.
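
To make the behaviour of the Grouper line concrete, here is a small sketch on three made-up rows. Note that freq='W' bins rows into fixed calendar weeks (ending on Sunday by default), which is not the same thing as a lookback window that trails each row.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2021-05-01", "2021-05-06", "2021-05-08"]),
    "result": [True, True, False],
})

# Count rows per calendar week; each bin is labelled by the Sunday that ends it.
weekly = df.groupby(pd.Grouper(key="time", freq="W"))["result"].count()
print(weekly)
```

Here 2021-05-01 falls in the week ending 2021-05-02, while 2021-05-06 and 2021-05-08 fall in the week ending 2021-05-09, even though all three dates lie within 7 days of each other.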

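【Comments】:

  • Sorry, I don't understand how I would use this. Could you explain a bit more? I'm not sure how to combine it with the code I've already tried.
【Solution 2】:

This seems to meet the requirements: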

import pandas as pd
import numpy as np
data = np.array([['5/1/21', True, '0 (no preceding rows)'],
       ['5/6/21', True, '1'],
       ['5/8/21', False,
        '2 (immediately preceded by streak of 2 TRUE rows in past 7 days)'],
       ['5/10/21', False, '0'],
       ['5/11/21', True, '0'],
       ['5/14/21', True, '1 (preceding row is TRUE)'],
       ['5/20/21', True,
        '1 (immediately preceded by streak of one TRUE rows in 1 week window)'],
       ['5/21/21', True,
        '2 (immediately preceded by streak of two TRUE rows in 1 week window)'],
       ['5/22/21', True,
        '2 (immediately preceded by streak of two TRUE rows in 1 week window)'],
       ['5/23/21', False,
        '3 (immediately preceded by streak of three TRUE rows in 1 week window)'],
       ['5/24/21', True, '0 (preceded by FALSE row)'],
       ['5/26/21', True,
        '1 (immediately preceded by streak of 1 TRUE row)'],
       ['5/27/21', True,
        '2 (immediately preceded by streak of 2 TRUE row)']])
df = pd.DataFrame(data = data, columns = ['time','result','output_check'])

df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
# Note: np.array coerces the mixed rows above to strings, so `result` holds
# the strings 'True'/'False' rather than booleans. Compare against those
# strings to build numeric 0/1 helper columns that can be summed.
df['result_num'] = np.where(df['result'] == 'True', 1, 0)
df['result_num_vice'] = np.where(df['result'].shift(1) == 'False', 1, 0)
# A row that follows a False row restarts the counter, so it is the start of
# a new group. A cumulative sum of that flag increases by 1 at every restart,
# so we can use it as the group id.
df['id'] = df['result_num_vice'].transform('cumsum')

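# The '8d' window with closed='right' covers (t - 8 days, t], so a row exactly
# 7 days back is still counted. group_keys=False keeps the original time index
# on the result so it can be assigned back to df.
df['output'] = (
    df.groupby(['id'], group_keys=False)['result_num'].apply(
        lambda x: x.rolling('8d', closed='right').sum()
    )
)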

# Each True row after the first in a group includes itself in the count, so
# subtract one from each True row with a positive count.
df['output'] = np.where(
    (df['result_num'] == 1) & (df['output']>0), 
    df['output'] - 1,
    df['output']
)
df = df[['result','output_check','output']]
df

Output:

            result  output_check           output
time            
2021-05-01  True    0 (no preceding rows)   0.0
2021-05-06  True    1   1.0
2021-05-08  False   2 (immediately preceded by streak of 2 TRUE ro...   2.0
2021-05-10  False   0   0.0
2021-05-11  True    0   0.0
2021-05-14  True    1 (preceding row is TRUE)   1.0
2021-05-20  True    1 (immediately preceded by streak of one TRUE ...   1.0
2021-05-21  True    2 (immediately preceded by streak of two TRUE ...   2.0
2021-05-22  True    2 (immediately preceded by streak of two TRUE ...   2.0
2021-05-23  False   3 (immediately preceded by streak of three TRU...   3.0
2021-05-24  True    0 (preceded by FALSE row)   0.0
2021-05-26  True    1 (immediately preceded by streak of 1 TRUE row)    1.0
2021-05-27  True    2 (immediately preceded by streak of 2 TRUE row)    2.0
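
The grouping step in the code above can be looked at in isolation. This is a minimal sketch on a real boolean Series (not the string column used above), with illustrative names after_false and streak_id. It shows how "previous row was False" turns into a group id.

```python
import pandas as pd

result = pd.Series([True, True, False, False, True, True])

# Flag rows whose previous row is False; the first row has no predecessor.
after_false = (~result).shift(1, fill_value=False)

# The running total of those flags gives one id per streak.
streak_id = after_false.astype(int).cumsum()
print(streak_id.tolist())
```

The first False row stays in the same group as the True rows before it, which is why its rolling sum can still report the streak that preceded it.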

【Comments】:

    【Solution 3】:

    I believe you're looking for a rolling aggregation. The documentation is at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html. Here is some code:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(
        [['5/1/21', True],
         ['5/6/21', True],
         ['5/8/21', False],
         ['5/10/21', False],
         ['5/11/21', True],
         ['5/14/21', True],
         ['5/20/21', True],
         ['5/21/21', True],
         ['5/22/21', True],
         ['5/23/21', False],
         ['5/24/21', True],
         ['5/26/21', True]],
        columns=['date', 'result'],
    )

    # Parse the date strings and use them as the index so that a
    # time-based rolling window ('7D') can be applied.
    df['date'] = pd.to_datetime(df['date'])
    df = df.set_index('date')
    # Number of True values within each trailing 7-day window.
    rolling_result = df.result.rolling('7D').sum()
    print(rolling_result)
    

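    And the result:

    Note that I converted your date column into a datetime index, which I believe is required for this to work. If you don't want to convert the whole thing, you can always build a small temporary dataframe to do this.

    Using sum() works because summing a boolean column gives the number of True values in it.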
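
As a quick check of the point about summing booleans, here is a small sketch on the first four rows of the sample data. The variable names are illustrative, and the cast to int is a cautious assumption, not something the answer requires.

```python
import pandas as pd

s = pd.Series(
    [True, True, False, False],
    index=pd.to_datetime(["2021-05-01", "2021-05-06", "2021-05-08", "2021-05-10"]),
)

# A plain sum counts the True values.
print(s.sum())

# Cast to int first so the window sum does not depend on how a given
# pandas version treats boolean input.
window_total = s.astype(int).rolling("7D").sum()
print(window_total.tolist())
```

The 2021-05-10 row gets a window total of 1 here, while the desired output for that row is 0 because the row before it is FALSE. A window total alone does not reset on FALSE rows.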

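    【Comments】:

    • Sorry, I should have clarified. The output column is what I want to achieve. My dataframe currently has only the date and result columns. Thank you for trying to answer my question, and for your time and expertise! I have edited my original question to call the output column "DESIRED OUTPUT" to make this clearer.
    • OK, but I think my code still works. I will adjust the input dataframe, but rolling can still do the job. The picture I posted is my result.
    • I edited my input.
    • Thank you, Neha, for trying to help me. The output of your solution still doesn't meet my needs, but I appreciate the attempt. I don't want the sum of the True rows in the 7-day window, but rather the streak of contiguous TRUE rows immediately preceding the row, limited to a 7-day lookback window. Sorry, I know this is confusing, and I may not have described it as clearly as I could have.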