Assign group number based on time series data in Python: answers

【Question title】: Assign group number based on time series data in Python
【Posted】: 2021-02-11 23:12:54
【Question】:

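I want to group df by ID, and then group the rows where State starts at 1 and ends at the first 0 (if there is no closing 0, as in the expected output below, the 1s still count as one group). If there are consecutive 1s, keep going until a 0 is found. Rows from the first 1 through the first 0 belong to one group. Consecutive 0s are not of interest (except the first one, which ends a group). I then want to assign the same group number to every row in a group. In the example df, ID has two values, 32 and 64, and they are treated as independent groups.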

df:

        ID  Timestamp               Value   State
103177  64  2010-09-21 23:13:21.090 21.5    1.0
252019  64  2010-09-22 00:44:14.890 21.5    1.0
271381  64  2010-09-22 00:44:15.890 21.5    0.0
268939  64  2010-09-22 00:44:17.890 23.0    0.0
259875  64  2010-09-22 00:44:18.440 23.0    1.0
18870   64  2010-09-22 00:44:19.890 24.5    1.0
205910  32  2010-09-22 00:44:23.440 24.5    1.0
103865  32  2010-09-22 01:04:33.440 23.5    0.0
152281  32  2010-09-22 01:27:01.790 22.5    1.0
138988  32  2010-09-22 02:18:52.850 21.5    0.0

Reproducible example:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {103177: 64,
  252019: 64,
  271381: 64,
  268939: 64,
  259875: 64,
  18870: 64,
  205910: 32,
  103865: 32,
  152281: 32,
  138988: 32},
 'Timestamp': {103177: Timestamp('2010-09-21 23:13:21.090000'),
  252019: Timestamp('2010-09-22 00:44:14.890000'),
  271381: Timestamp('2010-09-22 00:44:15.890000'),
  268939: Timestamp('2010-09-22 00:44:17.890000'),
  259875: Timestamp('2010-09-22 00:44:18.440000'),
  18870: Timestamp('2010-09-22 00:44:19.890000'),
  205910: Timestamp('2010-09-22 00:44:23.440000'),
  103865: Timestamp('2010-09-22 01:04:33.440000'),
  152281: Timestamp('2010-09-22 01:27:01.790000'),
  138988: Timestamp('2010-09-22 02:18:52.850000')},
 'Value': {103177: 21.5,
  252019: 21.5,
  271381: 21.5,
  268939: 23.0,
  259875: 23.0,
  18870: 24.5,
  205910: 24.5,
  103865: 23.5,
  152281: 22.5,
  138988: 21.5},
 'State': {103177: 1.0,
  252019: 1.0,
  271381: 0.0,
  268939: 0.0,
  259875: 1.0,
  18870: 1.0,
  205910: 1.0,
  103865: 0.0,
  152281: 1.0,
  138988: 0.0}})

df

Expected output:

        ID  Timestamp               Value   State   Group
103177  64  2010-09-21 23:13:21.090 21.5    1.0     1
252019  64  2010-09-22 00:44:14.890 21.5    1.0     1
271381  64  2010-09-22 00:44:15.890 21.5    0.0     1
268939  64  2010-09-22 00:44:17.890 23.0    0.0     -
259875  64  2010-09-22 00:44:18.440 23.0    1.0     2   (* `State` only has `1`, didn't end with `0`.)
18870   64  2010-09-22 00:44:19.890 24.5    1.0     2   (* `State` only has `1`, didn't end with `0`.)
205910  32  2010-09-22 00:44:23.440 24.5    1.0     3   * New `ID`, thus `Group` increases by 1.
103865  32  2010-09-22 01:04:33.440 23.5    0.0     3
152281  32  2010-09-22 01:27:01.790 22.5    1.0     4
138988  32  2010-09-22 02:18:52.850 21.5    0.0     4
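
For reference, the rule described above can be spelled out as a plain loop. This is only a minimal sketch to pin down the expected behaviour, not a vectorised solution; the function name assign_groups and the use of None for rows outside any group are assumptions made for illustration.

```python
def assign_groups(ids, states):
    """Return one group label per row; None marks rows outside any group."""
    labels = []
    group = 0          # last group number handed out
    open_run = False   # True while a run of 1s is waiting for its closing 0
    prev_id = None
    for row_id, state in zip(ids, states):
        if row_id != prev_id:
            open_run = False   # a new ID never continues the previous ID's run
            prev_id = row_id
        if state == 1:
            if not open_run:
                group += 1     # first 1 after a 0 (or after an ID change) starts a group
                open_run = True
            labels.append(group)
        elif open_run:
            labels.append(group)  # the first 0 closes the current group
            open_run = False
        else:
            labels.append(None)   # a 0 with no open run belongs to no group
    return labels


# ID and State columns of the example df, in row order
ids = [64, 64, 64, 64, 64, 64, 32, 32, 32, 32]
states = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
labels = assign_groups(ids, states)
print(labels)  # [1, 1, 1, None, 2, 2, 3, 3, 4, 4]
```

The printed labels match the Group column of the expected output, with None in place of the "-".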

【Question discussion】:

Tags: python pandas group-by time-series

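【Solution 1】:

You can use np.where with a mask to get a 1 wherever the ID changes, or wherever the state is 1 and differs from the previous row. Then use cumsum to turn those flags into an increasing group number. For the 0 rows where you want a "-", use loc with another mask afterwards.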

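import numpy as np

# flag the first row of every group, then count the flags
df['gr'] = np.cumsum(
    np.where(df['ID'].ne(df['ID'].shift())  # new ID
            | (df['State'].eq(1)  # state is 1
               & df['State'].ne(df['State'].shift())),  # and differs from the previous row
             1, 0))

# I would rather use np.nan than '-' to keep numeric values but up to you
# (cast to object first: recent pandas versions complain when a string
#  is written into an integer column)
df['gr'] = df['gr'].astype(object)
df.loc[df['State'].eq(0)
       & df['State'].eq(df['State'].shift()), 'gr'] = '-'

print(df)
        ID               Timestamp  Value   State gr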
103177  64 2010-09-21 23:13:21.090   21.5     1.0  1
252019  64 2010-09-22 00:44:14.890   21.5     1.0  1
271381  64 2010-09-22 00:44:15.890   21.5     0.0  1
268939  64 2010-09-22 00:44:17.890   23.0     0.0  -
259875  64 2010-09-22 00:44:18.440   23.0     1.0  2
18870   64 2010-09-22 00:44:19.890   24.5     1.0  2
205910  32 2010-09-22 00:44:23.440   24.5     1.0  3
103865  32 2010-09-22 01:04:33.440   23.5     0.0  3
152281  32 2010-09-22 01:27:01.790   22.5     1.0  4
138988  32 2010-09-22 02:18:52.850   21.5     0.0  4
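
The code comment above suggests keeping the column numeric instead of writing "-". A minimal sketch of that variant follows, assuming the same ID and State columns rebuilt on a small stand-in frame. It is not the answerer's code: it uses the nullable Int64 dtype for the result, and it also treats a 0 on the first row of an ID as outside any group, which the original mask does not check.

```python
import pandas as pd

# Stand-in for the question's df: only the two columns the logic needs
df = pd.DataFrame({
    "ID":    [64, 64, 64, 64, 64, 64, 32, 32, 32, 32],
    "State": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

new_id = df["ID"].ne(df["ID"].shift())       # first row of each ID block
prev_not_one = df["State"].shift().ne(1)     # previous row is not a 1 (or is missing)

# A group starts on a 1 that follows an ID change or a row that is not 1
starts = df["State"].eq(1) & (new_id | prev_not_one)

# A 0 that does not directly follow a 1 of the same ID closes nothing
orphan_zero = df["State"].eq(0) & (new_id | prev_not_one)

# Count the starts, blank out the orphan zeros, keep integers via nullable Int64
df["gr"] = starts.cumsum().where(~orphan_zero).astype("Int64")
print(df)
```

With this data only the second of the two consecutive zeros ends up as <NA>; every other row gets the same number as in the output above.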

【Discussion】:

【Solution 2】:

Let's try:

# within each ID, count the zeros from the current row to the end of that ID
s = (1-df.iloc[::-1].State).groupby(df['ID']).cumsum().iloc[::-1]

# True for rows to keep: a group made of a single row is a stray zero
keep = s.groupby([df['ID'],s]).transform('size').ne(1)

# use groupby().ngroup() to number the remaining groups in order of appearance
df['Group'] = df[keep].groupby([df['ID'],s], sort=False).ngroup() + 1

Output:

        ID               Timestamp  Value   State  Group
103177  64 2010-09-21 23:13:21.090   21.5     1.0    1.0
252019  64 2010-09-22 00:44:14.890   21.5     1.0    1.0
271381  64 2010-09-22 00:44:15.890   21.5     0.0    1.0
268939  64 2010-09-22 00:44:17.890   23.0     0.0    NaN
259875  64 2010-09-22 00:44:18.440   23.0     1.0    2.0
18870   64 2010-09-22 00:44:19.890   24.5     1.0    2.0
205910  32 2010-09-22 00:44:23.440   24.5     1.0    3.0
103865  32 2010-09-22 01:04:33.440   23.5     0.0    3.0
152281  32 2010-09-22 01:27:01.790   22.5     1.0    4.0
138988  32 2010-09-22 02:18:52.850   21.5     0.0    4.0
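
The reversed cumulative sum is the key step in this answer, so here is a small sketch that prints the intermediate values. It assumes the same ID and State columns on a stand-in frame with a default integer index; the name zeros_to_end is made up for illustration.

```python
import pandas as pd

# Stand-in for the question's df: only the two columns the logic needs
df = pd.DataFrame({
    "ID":    [64, 64, 64, 64, 64, 64, 32, 32, 32, 32],
    "State": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

rev = df.iloc[::-1]  # walk each ID from its last row to its first

# 1 - State is 1 on zeros and 0 on ones, so the running sum counts how many
# zeros lie between the current row and the end of its ID (inclusive)
zeros_to_end = (1 - rev["State"]).groupby(rev["ID"], sort=False).cumsum().iloc[::-1]

print(zeros_to_end.tolist())
# [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0]
```

Rows that share an ID and a count form one candidate group: a run of 1s shares its count with the 0 that closes it, while the second of two consecutive zeros is alone with its count, which is what the size check filters out.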

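Option 2: a slightly different approach, with fewer groupby calls:

# number every (ID, zeros-to-end count) pair in order of appearance
groups = df.groupby(['ID', df.iloc[::-1].State.eq(0)
                             .groupby(df['ID']).cumsum()
                             .iloc[::-1]],
                    sort=False
                   ).ngroup() + 1

# groups made of a single row are the stray zeros
single_zero = groups.groupby(groups).transform('size').eq(1)

# shift later numbers down past each stray zero, then blank the stray zeros
df['Group'] = (groups - single_zero.cumsum()).mask(single_zero)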

【Discussion】: