Group multiple records into one record and assign values in a Python data frame: answers

【Question title】: Group multiple records as one record and assign values in a Python data frame
【Posted】: 2018-01-17 21:42:26
【Question description】:

I have a data frame in Python. Its columns are Id, loc_time, loc_number and status.

The data looks like this:

Id  loc_time    loc_number  status
1   01:25.5     1105        testing on
2   02:25.9     1105        testing off
3   03:28.5     1105        testing off
4   04:25.5     1105        testing off
5   05:25.9     1105        testing on
6   06:25.5     1105        testing on
7   07:25.9     1105        testing off
8   08:25.6     1105        testing off
9   09:25.9     1106        testing on
10  10:25.6     1105        testing on
11  11:26.0     1105        testing off
12  12:25.6     1105        testing off
13  13:26.0     1105        testing on
14  14:25.6     1106        testing on
15  15:26.0     1105        testing off
16  16:25.6     1105        testing off
17  17:26.0     1105        testing on
18  18:25.7     1105        testing on
19  19:26.0     1105        testing off
20  20:25.7     1105        testing off
21  21:26.1     1105        testing on
22  22:25.7     1106        testing on
23  22:33.7     1107        testing on
24  23:26.1     1105        testing off
25  24:25.7     1105        testing off
26  25:26.1     1105        testing on
27  27:25.7     1105        testing on
28  22:35.7     1106        testing off

Now I want to create a new data frame with the columns Id, loc_time, loc_number, status and count.

Id  loc_time    loc_number  status          count
1   03:28.5     1105        testing on      03
2   06:25.5     1105        testing         03
3   10:25.6     1105        testing         03
4   13:26.0     1105        testing         03
5   17:26.0     1105        testing         03
6   20:25.7     1105        testing         03
7   24:25.7     1105        testing         03
8   27:25.7     1105        testing off     02
9   22:25.7     1106        testing on      03
10  22:35.7     1106        testing off     01
11  22:33.7     1107        testing on      01

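I want to group the first ten timestamp records into one record, assign it a testing status, and count the records in it.

I want to do the same for the next ten records and assign them the status testing.

For the last group of data, I want the status to be testing off.

How can I do this?

When timestamps 1 to 10 of the same loc_number are grouped together, the status is testing on.

If, for the same loc_number, more than 10 timestamps follow the first 1 to 10, the status is testing, and so on.

If fewer than 10 timestamps follow the previous group of 10 for the same loc_number, the status is testing off.

The last group of timestamps should get testing off.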

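【Comments】:

Just loop over the old df in steps of 10 and add the values between the steps to a new df, e.g. for i in range(0,len(df),10): df2=pd.DataFrame({"loc_time":np.sum(df["loc_time"][i:i+10])})
Shouldn't the last row of the expected result be testing off?
@Alexander No, because it is a new loc_number, and a new loc_number should start as testing on.
@New_learner, you keep changing the data and the desired output on me.

Tags: python pandas group-by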

【Solution 1】:

This should work now. If you don't want the data frame indexed on that column, you can always drop df2 = df2.set_index('ID') (the last line).

First, I sort the data frame by loc_number and loc_time.
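Sorting on loc_time as a plain string only gives chronological order while every value is zero-padded the same way, as in the sample data. A minimal sketch of a more defensive sort follows. It assumes the values are `mm:ss.f` strings (the question does not state the format), and the helper column `loc_td` and the tiny frame are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the question's data; the values are illustrative only.
df = pd.DataFrame({
    "loc_number": [1105, 1105, 1106, 1105],
    "loc_time": ["10:25.6", "02:25.9", "09:25.9", "01:25.5"],
})

# Assumption: loc_time is "mm:ss.f". Prefixing "00:" gives "hh:mm:ss.f",
# which pd.to_timedelta can parse.
df["loc_td"] = pd.to_timedelta("00:" + df["loc_time"])

# Sort on the parsed duration rather than on the raw string.
df_sorted = df.sort_values(["loc_number", "loc_td"]).reset_index(drop=True)
print(df_sorted[["loc_number", "loc_time"]])
```

If the strings are always zero-padded, sorting on loc_time directly gives the same order and the extra column is unnecessary.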

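Next, I need consecutive block numbers for these unequally sized groups (e.g. 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 2, 2, assuming two loc_numbers). To get them, I group on loc_number and apply a transform that uses a list comprehension to floor-divide each item's position within its group by the group size (e.g. 3).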

transform(lambda group: [i // group_size for i in range(len(group))])
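The same block labels can also be sketched with `cumcount`, which numbers the rows inside each group from 0. This is an alternative to the transform above, not the answer's own code, and the small frame is made up for illustration.

```python
import pandas as pd

group_size = 3  # same value the answer uses

# Illustrative frame: five rows for 1105 and four rows for 1106.
df = pd.DataFrame({"loc_number": [1105] * 5 + [1106] * 4})

# cumcount() numbers the rows inside each loc_number 0, 1, 2, ...;
# floor division by group_size turns that into a zero-based block label.
df["loc_counter"] = df.groupby("loc_number").cumcount() // group_size

print(df["loc_counter"].tolist())
# [0, 0, 0, 1, 1, 0, 0, 0, 1]
```

Because of the floor division the labels start at 0, and the last block of a loc_number is shorter whenever the row count is not a multiple of group_size.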

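Next, I group on loc_number and this new loc_counter to do the rest of the aggregation.

I use list comprehensions to get the first and last item of each loc_number group. Then I use .loc to set the status to testing_off or testing_on as appropriate.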

group_size = 3
# Sort so that blocks follow loc_time order within each loc_number.
df.sort_values(['loc_number', 'loc_time'], inplace=True)
df2 = (
    df
    .assign(
        status='testing',
        # Zero-based block label for every run of group_size rows per loc_number.
        loc_counter=df.groupby('loc_number')['loc_number']
                      .transform(lambda group: [i // group_size for i in range(len(group))]))
    .groupby(['loc_number', 'loc_counter'])
    # Keep the last timestamp of each block and count its rows.
    .agg({'loc_time': 'last', 'loc_number': 'last', 'loc_counter': 'count', 'status': 'last'})
    .rename(columns={'loc_counter': 'count'})
    .reset_index(drop=True)
)

df2['ID'] = range(1, len(df2) + 1)
df2 = df2[['ID', 'loc_time', 'loc_number', 'status', 'count']]

# Row labels of the first and last block of each loc_number.
first_group_items = [group[0] for group in df2.groupby('loc_number').groups.values()]
last_group_items = [group[-1] for group in df2.groupby('loc_number').groups.values()]

# Set testing_off first so a loc_number with a single block ends up as testing_on.
df2.loc[last_group_items, 'status'] = 'testing_off'
df2.loc[first_group_items, 'status'] = 'testing_on'

df2 = df2.set_index('ID')

>>> df2
   loc_time  loc_number       status  count
ID                                         
1   03:28.5        1105   testing_on      3
2   06:25.5        1105      testing      3
3   10:25.6        1105      testing      3
4   13:26.0        1105      testing      3
5   17:26.0        1105      testing      3
6   20:25.7        1105      testing      3
7   24:25.7        1105      testing      3
8   27:25.7        1105  testing_off      2
9   22:25.7        1106   testing_on      3
10  22:35.7        1106  testing_off      1
11  22:33.7        1107   testing_on      1
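
For Python 3 and a recent pandas, the same steps can be sketched with `cumcount`, named aggregation and `duplicated` in place of the transform, the dict aggregation and the `groups` lookup. This is an alternative sketch rather than the answer's code. It assumes pandas 0.25 or later (for named aggregation) and rebuilds the sample frame from the question, leaving out the input status column because the output status is derived from block position alone.

```python
import pandas as pd

group_size = 3

# (loc_time, loc_number) pairs copied from the question's sample data.
rows = [
    ("01:25.5", 1105), ("02:25.9", 1105), ("03:28.5", 1105), ("04:25.5", 1105),
    ("05:25.9", 1105), ("06:25.5", 1105), ("07:25.9", 1105), ("08:25.6", 1105),
    ("09:25.9", 1106), ("10:25.6", 1105), ("11:26.0", 1105), ("12:25.6", 1105),
    ("13:26.0", 1105), ("14:25.6", 1106), ("15:26.0", 1105), ("16:25.6", 1105),
    ("17:26.0", 1105), ("18:25.7", 1105), ("19:26.0", 1105), ("20:25.7", 1105),
    ("21:26.1", 1105), ("22:25.7", 1106), ("22:33.7", 1107), ("23:26.1", 1105),
    ("24:25.7", 1105), ("25:26.1", 1105), ("27:25.7", 1105), ("22:35.7", 1106),
]
df = pd.DataFrame(rows, columns=["loc_time", "loc_number"])
df = df.sort_values(["loc_number", "loc_time"])

df2 = (
    df
    # Zero-based block label per run of group_size rows within a loc_number.
    .assign(loc_counter=df.groupby("loc_number").cumcount() // group_size)
    .groupby(["loc_number", "loc_counter"], as_index=False)
    # Last timestamp of each block and the number of rows in it.
    .agg(loc_time=("loc_time", "last"), count=("loc_time", "count"))
)

# Default status, then mark the last and the first block of every loc_number.
df2["status"] = "testing"
is_first = ~df2["loc_number"].duplicated(keep="first")
is_last = ~df2["loc_number"].duplicated(keep="last")
df2.loc[is_last, "status"] = "testing_off"
# Applied second, so a loc_number with a single block stays "testing_on".
df2.loc[is_first, "status"] = "testing_on"

df2["ID"] = range(1, len(df2) + 1)
df2 = df2[["ID", "loc_time", "loc_number", "status", "count"]].set_index("ID")
print(df2)
```

On the sample data this should reproduce the table above: eight blocks for 1105, two for 1106 and one for 1107.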

【Comments】: