【Question title】:How to index a DataFrame based on the number of consecutive days
【Posted】:2016-05-05 12:43:51
【Question】:

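I have a pandas DataFrame with an irregular datetime index. I want to index the DataFrame based on runs of consecutive observations. In other words, I only want to keep values that belong to a run of x or more consecutive daily observations.

For example: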

import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13','2003-04-17','2003-05-02', '2003-05-03', '2003-05-04','2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9,2)),index=idx)
df
              0        1
2003-04-11    0.954287 0.331016    
2003-04-12    0.553477 0.858590    
2003-04-13    0.179510 0.103970     
2003-04-17    0.608664 0.746860     
2003-05-02    0.691829 0.081192     
2003-05-03    0.790748 0.319989     
2003-05-04    0.955903 0.668918     
2003-07-23    0.630201 0.297902     
2003-07-24    0.692403 0.847222 

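2003-04-11 to 13 has 3 consecutive observations, then there is a single observation on 2003-04-17, another 3 consecutive observations on 2003-05-02 to 04, and finally two consecutive observations on 2003-07-23 to 24.

How can I select the observations that are part of a run of 3 or more consecutive days? In this example, it should keep the following observations: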

              0        1
2003-04-11    0.954287 0.331016    
2003-04-12    0.553477 0.858590    
2003-04-13    0.179510 0.103970   
2003-05-02    0.691829 0.081192     
2003-05-03    0.790748 0.319989     
2003-05-04    0.955903 0.668918   

【Comments】:

    Tags: python datetime pandas indexing slice


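    【Solution 1】:

    Although an answer has already been accepted, you can try a different approach:

    # Note: `== 3` keeps runs of exactly three days; use `>= 3` for three or more.
    df1 = df.loc[df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum() ).transform(len).iloc[:, 0] == 3]
    print(df1)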
                       0         1
    2003-04-11  0.350339  0.904514
    2003-04-12  0.903141  0.423335
    2003-04-13  0.394534  0.803299
    2003-05-02  0.158032  0.565684
    2003-05-03  0.715311  0.772509
    2003-05-04  0.136462  0.533705
    

    Step by step:

    print(~(df.index.to_series().diff() == pd.Timedelta(1, unit='d')))
    #2003-04-11     True
    #2003-04-12    False
    #2003-04-13    False
    #2003-04-17     True
    #2003-05-02     True
    #2003-05-03    False
    #2003-05-04    False
    #2003-07-23     True
    #2003-07-24    False
    #dtype: bool
    
    print((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int))
    #2003-04-11    1
    #2003-04-12    0
    #2003-04-13    0
    #2003-04-17    1
    #2003-05-02    1
    #2003-05-03    0
    #2003-05-04    0
    #2003-07-23    1
    #2003-07-24    0
    #dtype: int32
    print((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum())
    #2003-04-11    1
    #2003-04-12    1
    #2003-04-13    1
    #2003-04-17    2
    #2003-05-02    3
    #2003-05-03    3
    #2003-05-04    3
    #2003-07-23    4
    #2003-07-24    4
    #dtype: int32
    
    print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len))
    #            0  1
    #2003-04-11  3  3
    #2003-04-12  3  3
    #2003-04-13  3  3
    #2003-04-17  1  1
    #2003-05-02  3  3
    #2003-05-03  3  3
    #2003-05-04  3  3
    #2003-07-23  2  2
    #2003-07-24  2  2
    print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0])
    #2003-04-11    3
    #2003-04-12    3
    #2003-04-13    3
    #2003-04-17    1
    #2003-05-02    3
    #2003-05-03    3
    #2003-05-04    3
    #2003-07-23    2
    #2003-07-24    2
    #Name: 0, dtype: float64
    
    print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0] == 3)
    #2003-04-11     True
    #2003-04-12     True
    #2003-04-13     True
    #2003-04-17    False
    #2003-05-02     True
    #2003-05-03     True
    #2003-05-04     True
    #2003-07-23    False
    #2003-07-24    False
    #Name: 0, dtype: bool
    print(df.loc[df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0] == 3])
    #                   0         1
    #2003-04-11  0.120301  0.635707
    #2003-04-12  0.747283  0.681601
    #2003-04-13  0.118192  0.777899
    #2003-05-02  0.481396  0.294547
    #2003-05-03  0.619790  0.058048
    #2003-05-04  0.179386  0.348843
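
The steps above can also be written with named intermediate values, which some readers may find easier to follow than the one-liner. This is only a sketch, not the answerer's code: the function name `keep_runs` and the `min_run` parameter are made up for illustration, it uses `transform("size")` in place of `transform(len)`, and it assumes a sorted `DatetimeIndex` with at most one row per day.

```python
import numpy as np
import pandas as pd


def keep_runs(df, min_run=3):
    """Keep rows that belong to a run of at least `min_run` consecutive days.

    Hypothetical helper; assumes a sorted DatetimeIndex with unique daily dates.
    """
    dates = df.index.to_series()
    # True where a row does NOT follow the previous row by exactly one day,
    # i.e. where a new run starts (the first row compares against NaT, so True).
    starts_run = dates.diff() != pd.Timedelta(days=1)
    # The cumulative sum of the start flags gives every run its own integer id.
    run_id = starts_run.cumsum()
    # Broadcast each run's length back onto the rows of that run.
    run_len = run_id.groupby(run_id).transform("size")
    return df.loc[run_len >= min_run]


idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-17',
                        '2003-05-02', '2003-05-03', '2003-05-04',
                        '2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9, 2)), index=idx)

print(keep_runs(df, min_run=3))
```

With the sample index from the question, this should keep the six rows in the two 3-day runs; the values differ on each run because the data is random.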
    

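    【Comments】:

    • Thanks for your answer. It also works for my requirement, which is to detect runs of 3 or more consecutive observations (after changing it to >= 3). It is also somewhat faster, since there is no double for loop. All in all, this is the better answer.
    • Glad I could help! Good luck!
    【Solution 2】: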
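
To illustrate the `== 3` versus `>= 3` point from the comment, here is a small sketch on a made-up index (not from the question) that contains a 4-day run. The variable names are illustrative, and the run lengths are computed with `transform("size")` as an assumption about one convenient way to do it.

```python
import pandas as pd

# Hypothetical index: a 4-day run, a single day, then a 3-day run.
idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-14',
                        '2003-04-20',
                        '2003-05-02', '2003-05-03', '2003-05-04'])
dates = idx.to_series()

# Label each run of consecutive days, then attach the run length to every row.
run_id = (dates.diff() != pd.Timedelta(days=1)).cumsum()
run_len = run_id.groupby(run_id).transform("size")

# `== 3` drops the 4-day run; `>= 3` keeps it.
exactly_three = idx[(run_len == 3).to_numpy()]
three_or_more = idx[(run_len >= 3).to_numpy()]

print(len(exactly_three), len(three_or_more))
```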

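    This assumes the index is sorted and all values are in ascending order. Basically, we subtract from each row label the label 2 rows earlier (using shift) and identify the rows where the difference is exactly 2 days. I then use a list comprehension to generate the date ranges, sort them, and use them to index with loc: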

    In [133]:
    row_labels = df.index[(df.index.to_series() - df.index.to_series().shift(2)) == pd.Timedelta(2, unit='d')]
    rows = [x - pd.Timedelta(n, unit='d') for n in range(0,3) for x in row_labels]
    rows = sorted(rows)
    df.loc[rows]
    
    Out[133]:
                       0         1
    2003-04-11  0.352054  0.228887
    2003-04-12  0.776784  0.594784
    2003-04-13  0.137554  0.852900
    2003-05-02  0.589869  0.574012
    2003-05-03  0.061270  0.590426
    2003-05-04  0.245350  0.340445
    

    You can see the result of the initial calculation:

    In [134]:
    df.index[(df.index.to_series() - df.index.to_series().shift(2)) == pd.Timedelta(2, unit='d')]
    
    Out[134]:
    DatetimeIndex(['2003-04-13', '2003-05-04'], dtype='datetime64[ns]', freq=None)
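
The same shift-based idea can be sketched for an arbitrary run length. Note that when a run is longer than the window, neighbouring windows overlap and the expanded label list contains repeats, so this sketch de-duplicates before calling `loc`. The function name `rows_in_runs` and the parameter `n` are hypothetical, and the code assumes a sorted index of unique whole-day timestamps.

```python
import numpy as np
import pandas as pd


def rows_in_runs(df, n=3):
    """Select rows covered by any window of `n` consecutive days.

    Hypothetical helper; assumes a sorted index of unique whole-day timestamps.
    """
    dates = df.index.to_series()
    # A row closes a window of n consecutive days when the label n-1 rows
    # earlier is exactly n-1 days older.
    window_end = dates[(dates - dates.shift(n - 1)) == pd.Timedelta(days=n - 1)]
    # Expand every window end back into the n dates it covers.
    labels = [end - pd.Timedelta(days=k) for end in window_end for k in range(n)]
    # Overlapping windows repeat labels, so drop duplicates and sort.
    labels = pd.DatetimeIndex(labels).unique().sort_values()
    return df.loc[labels]


idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-17',
                        '2003-05-02', '2003-05-03', '2003-05-04',
                        '2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9, 2)), index=idx)

print(rows_in_runs(df, n=3))
```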
    

    【Comments】:
