【Question title】: Assign group number based on time series data in Python
【Posted】: 2021-02-11 23:12:54
【Question】:

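I want to group `df` by `ID`, and then group the rows that start with `State` 1 and end with the first 0 (if there is no 0 at the end, the 1s are treated as one group, as shown in the expected output below). If there are consecutive 1s, continue to the next value until a 0 is found. The rows from the first 1 up to and including the first 0 belong to one group. Consecutive 0s are not of interest (except the first one, which ends a group). I then want to assign the same group number to every row in a group. In the example `df`, `ID` has 2 values, 32 and 64, and they are treated as independent groups.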

df:

        ID  Timestamp               Value   State
103177  64  2010-09-21 23:13:21.090 21.5    1.0
252019  64  2010-09-22 00:44:14.890 21.5    1.0
271381  64  2010-09-22 00:44:15.890 21.5    0.0
268939  64  2010-09-22 00:44:17.890 23.0    0.0
259875  64  2010-09-22 00:44:18.440 23.0    1.0
18870   64  2010-09-22 00:44:19.890 24.5    1.0
205910  32  2010-09-22 00:44:23.440 24.5    1.0
103865  32  2010-09-22 01:04:33.440 23.5    0.0
152281  32  2010-09-22 01:27:01.790 22.5    1.0
138988  32  2010-09-22 02:18:52.850 21.5    0.0

Reproducible example:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {103177: 64,
  252019: 64,
  271381: 64,
  268939: 64,
  259875: 64,
  18870: 64,
  205910: 32,
  103865: 32,
  152281: 32,
  138988: 32},
 'Timestamp': {103177: Timestamp('2010-09-21 23:13:21.090000'),
  252019: Timestamp('2010-09-22 00:44:14.890000'),
  271381: Timestamp('2010-09-22 00:44:15.890000'),
  268939: Timestamp('2010-09-22 00:44:17.890000'),
  259875: Timestamp('2010-09-22 00:44:18.440000'),
  18870: Timestamp('2010-09-22 00:44:19.890000'),
  205910: Timestamp('2010-09-22 00:44:23.440000'),
  103865: Timestamp('2010-09-22 01:04:33.440000'),
  152281: Timestamp('2010-09-22 01:27:01.790000'),
  138988: Timestamp('2010-09-22 02:18:52.850000')},
 'Value': {103177: 21.5,
  252019: 21.5,
  271381: 21.5,
  268939: 23.0,
  259875: 23.0,
  18870: 24.5,
  205910: 24.5,
  103865: 23.5,
  152281: 22.5,
  138988: 21.5},
 'State': {103177: 1.0,
  252019: 1.0,
  271381: 0.0,
  268939: 0.0,
  259875: 1.0,
  18870: 1.0,
  205910: 1.0,
  103865: 0.0,
  152281: 1.0,
  138988: 0.0}})

df

Expected output:

        ID  Timestamp               Value   State   Group
103177  64  2010-09-21 23:13:21.090 21.5    1.0     1
252019  64  2010-09-22 00:44:14.890 21.5    1.0     1
271381  64  2010-09-22 00:44:15.890 21.5    0.0     1
268939  64  2010-09-22 00:44:17.890 23.0    0.0     -
259875  64  2010-09-22 00:44:18.440 23.0    1.0     2   (* `State` only has `1`, didn't end with `0`.)
18870   64  2010-09-22 00:44:19.890 24.5    1.0     2   (* `State` only has `1`, didn't end with `0`.)
205910  32  2010-09-22 00:44:23.440 24.5    1.0     3   * New `ID`, thus `Group` increases by 1.
103865  32  2010-09-22 01:04:33.440 23.5    0.0     3
152281  32  2010-09-22 01:27:01.790 22.5    1.0     4
138988  32  2010-09-22 02:18:52.850 21.5    0.0     4
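
The rule described above can be written as a plain loop, which is handy as a reference when checking a vectorized solution. This is only a sketch of one reading of the question: the function name `assign_groups` and the use of `None` for rows outside any group are assumptions, not part of the original post.

```python
def assign_groups(ids, states):
    """Return one group label per row; None marks a row outside any group."""
    labels = []
    group = 0            # last group number handed out
    open_group = False   # True while the current group still waits for its closing 0
    prev_id = None
    for row_id, state in zip(ids, states):
        if row_id != prev_id:
            open_group = False   # a new ID never continues the previous ID's group
            prev_id = row_id
        if state == 1:
            if not open_group:   # first 1 after a closed group starts a new group
                group += 1
                open_group = True
            labels.append(group)
        elif open_group:         # the first 0 after some 1s closes the group
            labels.append(group)
            open_group = False
        else:                    # any further 0 belongs to no group
            labels.append(None)
    return labels


# ID and State columns from the example df, in row order
ids = [64, 64, 64, 64, 64, 64, 32, 32, 32, 32]
states = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
labels = assign_groups(ids, states)
print(labels)
```

For the example data this yields 1, 1, 1, None, 2, 2, 3, 3, 4, 4, matching the `Group` column above with `None` in place of `-`.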

【Comments】:

    Tags: python pandas group-by time-series


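    【Solution 1】:

    You can use np.where with a mask that yields 1 wherever the ID changes, or wherever the state is 1 and differs from the previous row's state. Then use cumsum to increment the group number. For the 0 rows where you want `-`, you can use loc with another mask afterwards.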

    import numpy as np

    # the column is named 'State' in the question's df
    df['gr'] = np.cumsum( 
        np.where(df['ID'].ne(df['ID'].shift())  # new ID
                | (df['State'].eq(1)  # state is 1
                   & df['State'].ne(df['State'].shift())),  # previous state is different
                 1, 0))
    
    # I would rather use np.nan than '-' to keep numeric values but up to you
    df.loc[df['State'].eq(0) 
           & df['State'].eq(df['State'].shift()), 'gr'] = '-'
    
    print(df)
            ID               Timestamp  Value  State gr
    103177  64 2010-09-21 23:13:21.090   21.5    1.0  1
    252019  64 2010-09-22 00:44:14.890   21.5    1.0  1
    271381  64 2010-09-22 00:44:15.890   21.5    0.0  1
    268939  64 2010-09-22 00:44:17.890   23.0    0.0  -
    259875  64 2010-09-22 00:44:18.440   23.0    1.0  2
    18870   64 2010-09-22 00:44:19.890   24.5    1.0  2
    205910  32 2010-09-22 00:44:23.440   24.5    1.0  3
    103865  32 2010-09-22 01:04:33.440   23.5    0.0  3
    152281  32 2010-09-22 01:27:01.790   22.5    1.0  4
    138988  32 2010-09-22 02:18:52.850   21.5    0.0  4
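
The code comment above mentions preferring a missing value over `'-'` so the column stays numeric. A minimal sketch of that variant follows, assuming the same two masks and using the pandas nullable `Int64` dtype so the group numbers remain integers. The reduced two-column frame and the names `starts` and `repeated_zero` are illustrative only.

```python
import pandas as pd

# Only the two columns the masks need, in the same row order as the question's df
df = pd.DataFrame({
    "ID":    [64, 64, 64, 64, 64, 64, 32, 32, 32, 32],
    "State": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

prev_state = df["State"].shift()

# A group starts where the ID changes, or where State flips to 1
starts = df["ID"].ne(df["ID"].shift()) | (df["State"].eq(1) & df["State"].ne(prev_state))

# A 0 that directly follows another 0 belongs to no group
repeated_zero = df["State"].eq(0) & prev_state.eq(0)

# Cumulative count of starts gives the group number; masked rows become <NA>
df["gr"] = starts.cumsum().astype("Int64").mask(repeated_zero)
print(df)
```

With `Int64`, the masked row shows `<NA>` and the remaining labels print as 1, 2, 3, 4 rather than as strings or floats.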
    

    【Comments】:

      【Solution 2】:

      Let's try:

      # identify the groups within each ID: count the zeros from the bottom up
      # (the column is named 'State' in the question's df)
      s = (1-df.iloc[::-1].State).groupby(df['ID']).cumsum().iloc[::-1]
      
      # keep every row except the single-zero groups (True = keep):
      keep = s.groupby([df['ID'],s]).transform('size').ne(1)
      
      # use groupby().ngroup() to identify the expected output
      df['Group'] = df[keep].groupby([df['ID'],s], sort=False).ngroup() + 1
      

      Output:

              ID               Timestamp  Value  State  Group
      103177  64 2010-09-21 23:13:21.090   21.5    1.0    1.0
      252019  64 2010-09-22 00:44:14.890   21.5    1.0    1.0
      271381  64 2010-09-22 00:44:15.890   21.5    0.0    1.0
      268939  64 2010-09-22 00:44:17.890   23.0    0.0    NaN
      259875  64 2010-09-22 00:44:18.440   23.0    1.0    2.0
      18870   64 2010-09-22 00:44:19.890   24.5    1.0    2.0
      205910  32 2010-09-22 00:44:23.440   24.5    1.0    3.0
      103865  32 2010-09-22 01:04:33.440   23.5    0.0    3.0
      152281  32 2010-09-22 01:27:01.790   22.5    1.0    4.0
      138988  32 2010-09-22 02:18:52.850   21.5    0.0    4.0
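
The reversed cumulative sum is the least obvious step here, so it may help to look at the intermediate values. The sketch below rebuilds only the `ID` and `State` columns with a default index (an assumption made to keep the example short) and prints the per-row key next to the size of the block it falls in.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [64, 64, 64, 64, 64, 64, 32, 32, 32, 32],
    "State": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

# Walk each ID from its last row to its first and count the zeros met so far.
# Every run of 1s shares a count with the first 0 that closes it.
s = (1 - df["State"].iloc[::-1]).groupby(df["ID"]).cumsum().iloc[::-1]

# Size of each (ID, s) block; a block of size 1 here is a stray repeated zero
sizes = s.groupby([df["ID"], s]).transform("size")

print(pd.DataFrame({"ID": df["ID"], "State": df["State"], "s": s, "size": sizes}))
```

For this data `s` comes out as 2, 2, 2, 1, 0, 0 for ID 64 and 2, 2, 1, 1 for ID 32, and the only block of size 1 is the second consecutive zero, which is the row left as NaN.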
      

      Option 2: a slightly different approach with fewer groupby calls:

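      # the column is named 'State' in the question's df
      groups = df.groupby(['ID', df.iloc[::-1].State.eq(0)
                                   .groupby(df['ID']).cumsum()
                                   .iloc[::-1]],
                          sort=False                    
                         ).ngroup() + 1
      
      # blocks of size 1 are the stray repeated zeros
      single_zero = groups.groupby(groups).transform('size').eq(1)
      
      # shift later labels down by the number of stray zeros seen so far
      df['Group'] = (groups - single_zero.cumsum()).mask(single_zero)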
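
One case the example data does not exercise is an ID that ends with a single 1 and no closing 0. That row also forms a block of size 1, so a size-only test would mask it, whereas the question says trailing 1s count as a group. The sketch below uses hypothetical data to show this and one possible adjustment, which additionally requires `State` to be 0. The names `as_posted`, `stray_zero` and `adjusted` are illustrative.

```python
import pandas as pd

# Hypothetical data: one ID whose last row is a lone 1 with no closing 0
df = pd.DataFrame({"ID": [7, 7, 7], "State": [1.0, 0.0, 1.0]})

# Per ID, count zeros from the bottom up (cast to int before the cumulative sum)
zeros_below = (
    df["State"].iloc[::-1].eq(0).astype(int).groupby(df["ID"]).cumsum().iloc[::-1]
)
groups = df.groupby(["ID", zeros_below], sort=False).ngroup() + 1
size_one = groups.groupby(groups).transform("size").eq(1)

# Size-only mask: the lone trailing 1 is dropped as well
as_posted = (groups - size_one.cumsum()).mask(size_one)

# Adjusted mask: only size-1 blocks that are actually zeros are dropped
stray_zero = size_one & df["State"].eq(0)
adjusted = (groups - stray_zero.cumsum()).mask(stray_zero)

print(pd.DataFrame({"State": df["State"], "as_posted": as_posted, "adjusted": adjusted}))
```

Whether a lone trailing 1 should receive its own group number depends on how the question is read, so treat the adjustment as optional.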
      

      【Comments】:
