Use pandas to group by column and then create a new column based on a condition: answers

【Question title】: Use pandas to group by column and then create a new column based on a condition
【Posted】: 2019-04-06 00:31:19
【Question】:

I need an easy way to reproduce in pandas what this SQL does:

select
    del_month
    , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
    , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month

Here is a small, illustrative pandas DataFrame:

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})

Here is my attempt to reproduce the above SQL in pandas. The first line works. The second line raises an error:

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))

This is the error from the second line:

TypeError: 'function' object is not iterable

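The problem with the lambda function itself was resolved in a previous question of mine. The bigger question is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, because I need to do this kind of thing often. The answer to my previous question suggested using map() inside the lambda function, but the result it gives for the "off0" column is not what I need. The "on1" column is what I want. The value should be the same for every row in a group (i.e. within each "del_month").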

【Comments】:

Tags: python sql pandas lambda pandas-groupby

【Solution 1】:

Simply sum the True values of a conditional expression:

import pandas as pd

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 
                   'off0_on1':[0,0,1,1,0,1,1,1]})

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))    
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))

print(a1)    
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1
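The same per-row counts can also be produced without a Python-level lambda, by building the boolean mask first and grouping it afterwards. The following is a minimal sketch of that idea rather than the answerer's code; the helper name `count_when` is hypothetical and exists only for illustration.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

def count_when(mask, keys):
    """Hypothetical helper mimicking SQL's sum(case when <mask> then 1 else 0 end),
    with the group total broadcast back to every row of the group."""
    # Cast the boolean mask to integers, group by the key values, sum per group.
    return mask.astype(int).groupby(keys).transform('sum')

result = a1.copy()
result['on1'] = count_when(a1['off0_on1'] == 1, a1['del_month'])
result['off0'] = count_when(a1['off0_on1'] == 0, a1['del_month'])
print(result)
```

Because the condition is an ordinary boolean Series, any "case when" test (ranges, string matches, combinations with `&` and `|`) can be passed in the same way.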

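Similarly, you can do the same thing in SQL if the dialect supports summing a boolean expression (MySQL and SQLite do; others, such as PostgreSQL, need the comparison cast to an integer first):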

select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
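As a quick check of the shorthand above, the query can be run against an in-memory SQLite database from Python's standard library. This is only a sketch under the assumption that SQLite is the dialect in use; the table is loaded with the question's sample data.

```python
import sqlite3

# Load the question's sample data into an in-memory SQLite table named a1.
conn = sqlite3.connect(':memory:')
conn.execute('create table a1 (del_month integer, off0_on1 integer)')
rows_in = list(zip([1, 1, 1, 1, 2, 2, 2, 2], [0, 0, 1, 1, 0, 1, 1, 1]))
conn.executemany('insert into a1 values (?, ?)', rows_in)

# In SQLite a comparison evaluates to 0 or 1, so it can be summed directly.
query = """
select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
"""
rows_out = conn.execute(query).fetchall()
conn.close()
print(rows_out)  # [(1, 2, 2), (2, 3, 1)]
```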

To replicate the above SQL in pandas, don't use transform; instead, compute multiple aggregates in a groupby().apply() call:

def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}

    return pd.Series(data)

g = a1.groupby('del_month').apply(aggfunc)

print(g)    
#            on1  off0
# del_month           
# 1            2     2
# 2            3     1
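If a custom function passed to apply() is more than the task needs, one possible alternative is to materialize each condition as a 0/1 column and then sum those columns per group. The sketch below assumes the question's sample data and uses `as_index=False` so that `del_month` comes back as an ordinary column, as in the SQL result.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

# One indicator column per "case when" condition, then a plain grouped sum.
g = (a1.assign(on1=a1['off0_on1'].eq(1).astype(int),
               off0=a1['off0_on1'].eq(0).astype(int))
       .groupby('del_month', as_index=False)[['on1', 'off0']]
       .sum())
print(g)
```

Each additional condition is just one more keyword argument to `assign`, which keeps the pattern close to the shape of the SQL select list.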

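【Comments】:

Beautiful. This is exactly what I was looking for. Thank you very much! Now, is there a way to collapse "del_month" (as in the SQL sample code) without chaining another groupby?
Glad to hear it! To collapse del_month, don't use transform (which is for inline aggregation); instead, just run multiple aggregations on the groupby.
Would you mind typing out an example for me? I'll upvote it. Thank you very much. I'm a newbie. :)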

【Solution 2】:

Using get_dummies needs only a single groupby call, which makes this simpler.

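df = a1[['del_month', 'off0_on1']].copy()  # df is the question's frame; copy it because pop() below removes 'off0_on1' in place
v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform('sum')
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)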

df
   del_month  off0  on1
0          1     2    2
1          1     2    2
2          1     2    2
3          1     2    2
4          2     1    3
5          2     1    3
6          2     1    3
7          2     1    3

Also, for the aggregated case, call sum directly instead of using apply:

# Index the question's frame a1 directly instead of pop(), so nothing is mutated
(pd.get_dummies(a1['off0_on1'])
   .groupby(a1['del_month'])
   .sum()
   .rename({0: 'off0', 1: 'on1'}, axis=1))

           off0  on1
del_month           
1             2    2
2             1    3
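When every condition is a simple equality test on a single column, as it is here, the aggregated table amounts to a frequency table, so `pd.crosstab` may be worth considering as well. The sketch below is an alternative under that assumption and is not part of the original answer; conditions that are not equality tests would still need one of the mask-based approaches.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

# Count how often each off0_on1 value occurs within each del_month,
# then give the value columns the names used in the SQL query.
counts = (pd.crosstab(a1['del_month'], a1['off0_on1'])
            .rename(columns={0: 'off0', 1: 'on1'}))
print(counts)
```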

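【Comments】:

Very interesting solution. You're very creative. I'm not sure whether it is as general as @Parfait's solution, but I'll definitely give it serious thought. Also, I'm a newbie, so I don't know which one is better. :P
@Sean_Calgary You use pd.get_dummies to convert the off0_on1 column into a one-hot encoded DataFrame with 2 columns, then sum them. This does exactly the same thing as Parfait's answer, but gets it done in a single groupby. The second line just does some tidying up to get your column names. If you want to work out which is better, I suggest running both solutions on your data and using whichever is faster for you.
You guys are awesome. This is an elegant and creative solution. Your IQ must be 170! I'm not sure I can use pd.get_dummies() in every case where apply(custom_function) would work, but maybe I just need to try it and think about it some more. Awesome!
@Sean_Calgary Not yet, but you're welcome.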