Answers: Perform unique row operation after a groupby

【Question title】: Perform unique row operation after a groupby
【Posted】: 2022-01-23 17:38:15
【Question】:

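I have finished all of my groupby operations and ended up with the dataframe shown below, but I am stuck on the last step, which computes one additional column.

Current dataframe: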

code  industry  category  count  duration
2     Retail    Mobile    4      7
3     Retail    Tab       2      33
3     Health    Mobile    5      103
2     Food      TV        1      88

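Problem: I want an additional column, operation, that gives the share of the "Retail" industry's count within each code.

For example, code 2 has two industry entries, Retail and Food, so its operation value should be 4/(4+1) = 0.8, and likewise for code 3, as shown below.

Expected output (O/P):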

code  industry  category  count  duration  operation
2     Retail    Mobile    4      7         0.8
3     Retail    Tab       2      33        -
3     Health    Mobile    5      103       2/7 = 0.285
2     Food      TV        1      88        -

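Some guidance here would also help: if I only do a groupby, I lose the category and duration information. Also, is there a better way to represent the output df? It can contain multiple industries, while the operation applies only to Retail.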

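【Comments】:

df.groupby("code")["count"].transform(lambda x: x / x.sum())? You could vectorize this further by making code the index and relying on index alignment after computing the sum.
@user3483203 Could you elaborate? Your approach does not take industry into account, and that is an important factor.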
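
A minimal sketch of the two suggestions in the first comment, assuming the sample frame from the question (rebuilt here so the snippet runs by itself). Looking up per-code totals with `Series.map` is only one possible reading of the "index alignment" idea, and the names `share_lambda`, `totals` and `share_mapped` are illustrative. Neither variant filters on `industry` yet, which is the gap the reply points out.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[2, "Retail", "Mobile", 4, 7],
     [3, "Retail", "Tab", 2, 33],
     [3, "Health", "Mobile", 5, 103],
     [2, "Food", "TV", 1, 88]],
    columns=["code", "industry", "category", "count", "duration"],
)

# Variant 1: the lambda from the comment, each count divided by its code's total.
share_lambda = df.groupby("code")["count"].transform(lambda x: x / x.sum())

# Variant 2: compute the totals once (indexed by code), then look them up per row.
totals = df.groupby("code")["count"].sum()
share_mapped = df["count"] / df["code"].map(totals)

print(share_mapped)
```

Both variants keep the original row order and leave `category` and `duration` untouched, because nothing is aggregated away from `df` itself.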

Tags: python-3.x pandas dataframe pandas-groupby

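【Solution 1】:

I can't think of a single operation for this, but going through a dictionary should work. Oh, and for other answerers, here is the code to create the sample dataframe up front.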

import pandas as pd

st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l,
                  columns=['code', 'industry', 'category', 'count', 'duration'])

Now my attempt:

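# Total count per code as a plain dict, e.g. {2: 5, 3: 7}
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
# Each row's count divided by the total for its code
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)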

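【Comments】:

This still doesn't take industry into account? It is an important factor in the calculation.
I'm not quite sure I understood correctly. My solution currently computes the ratio for every industry. If you only want it for Retail and don't want to see the others, you could add something like this (with numpy imported as np): df['operation'] = df.apply(lambda x: x['operation'] if x['industry'] == 'Retail' else np.nan, axis=1)
But that only blanks out the values.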
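
As a hedged sketch, the dictionary approach and the Retail-only restriction from this thread can be combined without a row-wise `apply`. It assumes the sample frame from the question; `ratio` is an illustrative name, and `Series.where` is swapped in for the lambda that blanks out non-Retail rows.

```python
import pandas as pd

# Sample data copied from the question.
st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l,
                  columns=['code', 'industry', 'category', 'count', 'duration'])

# Per-code totals as a dict, as in the answer.
sums = df.groupby('code')['count'].sum().to_dict()

# Look up each row's total by code and divide.
ratio = df['count'] / df['code'].map(sums)

# Keep the ratio only where the industry is Retail; other rows become NaN.
df['operation'] = ratio.where(df['industry'].eq('Retail'))

print(df)
```

The comparison is case-sensitive, so `'Retail'` has to match the spelling used in the `industry` column.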

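【Solution 2】:

You can use groupby.transform() to create a new column holding the total count per code, then use loc to select only the rows whose industry is "Retail" and perform the division: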

df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)

df.drop('total_per_code',axis=1,inplace=True)

This prints:

  code industry category  count  duration  operation
0     2   Retail   Mobile      4         7   0.800000
1     3   Retail      Tab      2        33   0.285714
2     3   Health   Mobile      5       103        NaN
3     2     Food       TV      1        88        NaN
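
The question also mentions that a code can carry several industries. Below is a rough sketch of one way to handle a code with more than one Retail row, under two loudly stated assumptions: the extra row `[2, 'Retail', 'Tab', 5, 10]` is hypothetical and not from the original data, and the desired value is the combined Retail count divided by the code's total. If each Retail row should instead keep its own share, the transform-and-loc approach above already does that.

```python
import pandas as pd

# Question data plus one HYPOTHETICAL extra Retail row for code 2.
df = pd.DataFrame(
    [[2, 'Retail', 'Mobile', 4, 7],
     [3, 'Retail', 'Tab', 2, 33],
     [3, 'Health', 'Mobile', 5, 103],
     [2, 'Food', 'TV', 1, 88],
     [2, 'Retail', 'Tab', 5, 10]],   # assumption, for illustration only
    columns=['code', 'industry', 'category', 'count', 'duration'],
)

is_retail = df['industry'].eq('Retail')

# Retail counts only (0 elsewhere), summed per code and broadcast back to rows.
retail_total = df['count'].where(is_retail, 0).groupby(df['code']).transform('sum')

# All counts summed per code and broadcast back to rows.
code_total = df.groupby('code')['count'].transform('sum')

# Combined Retail share, shown on Retail rows only; other rows stay NaN.
df['operation'] = (retail_total / code_total).where(is_retail)

print(df)
```

With this data, code 2 has a Retail count of 4 + 5 = 9 out of a total of 10, so both of its Retail rows carry 0.9, while code 3 keeps 2/7.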
