How to apply an aggregate function with a condition on a pivot table in Pandas?

【Question title】: How to apply an aggregate function with a condition on a pivot table in Pandas?
【Posted】: 2020-02-10 11:15:52
【Question】:

My DataFrame looks "like" this:

index   name     method     values
0.      A       estimated     4874
1.      A       counted        847
2.      A       estimated     1152
3.      B       estimated      276
4.      B       counted       6542
5.      B       counted       1152
6.      B       estimated     3346
7.      C       counted       7622
8.      C       estimated       26
...

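What I want to do is add up the "estimated" values and the "counted" values separately for each "name". I tried to do this with pivot_table, as in the code below, but I can only do it for one method at a time. Is there a way to handle both methods in the same code?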

count = df.groupby('name').apply(
    lambda sub_df: sub_df.pivot_table(
        index='method', values='values',
        # sum only the rows whose method is 'estimated'
        aggfunc={'values': lambda x: x[df.loc[x.index, 'method'] == 'estimated'].sum()},
        margins=True, margins_name='total_estimated'))
count

What I would like to end up with is something like this:

index   name     method       values
0.      A       estimated       4874
1.      A       counted          847
2.      A       estimated       1152
3.      A    total_counted       847
4.      A   total_estimated     6026
5.      B       estimated        276
6.      B       counted         6542
7.      B       counted         1152
8.      B       estimated       3346
9.      B    total_counted      7694
10.     B   total_estimated     3622
11.     C       counted         7622
12.     C       estimated         26
13.     C    total_counted      7622
14.     C   total_estimated       26
...

【Comments】:

Tags: python pandas indexing pivot-table aggregate

【Solution 1】:

Use DataFrame.pivot_table to sum the values per name and method, then attach the totals to the original DataFrame with either DataFrame.stack + DataFrame.join or DataFrame.melt + DataFrame.merge:

# If 'index' is a regular column, move it into the index first:
# df = df.set_index('index')
new_df = (df.join(df.pivot_table(index = 'name',
                                  columns = 'method',
                                  values = 'values',
                                  aggfunc = 'sum')
                    .add_prefix('total_') 
                    .stack()
                    .rename('new_value'),
                  on = ['name','method'],how = 'outer')

            .assign(values = lambda x: x['values'].fillna(x['new_value']))
            .drop(columns = 'new_value')
            .sort_values(['name','method'])
)
print(new_df)

Or:

# If 'index' is a regular column, move it into the index first:
# df = df.set_index('index')
new_df = (df.merge(df.pivot_table(index = 'name',
                                  columns = 'method',
                                  values = 'values',
                                  aggfunc = 'sum')
            .add_prefix('total_')         
            .T
            .reset_index()
            .melt('method',value_name = 'values'),
                   on = ['name','method'],how = 'outer')
            .assign(values = lambda x: x['values_x'].fillna(x['values_y']))
            .loc[:,df.columns]
            .sort_values(['name','method'])
)
print(new_df)

Output:

   name           method  values
2     A          counted   847.0
0     A        estimated  4874.0
1     A        estimated  1152.0
9     A    total_counted   847.0
10    A  total_estimated  6026.0
5     B          counted  6542.0
6     B          counted  1152.0
3     B        estimated   276.0
4     B        estimated  3346.0
11    B    total_counted  7694.0
12    B  total_estimated  3622.0
7     C          counted  7622.0
8     C        estimated    26.0
13    C    total_counted  7622.0
14    C  total_estimated    26.0
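
To see what is being joined back, it can help to look at the intermediate totals on their own. The following is a minimal sketch that rebuilds the sample data from the question; the names `wide` and `totals` are introduced here only for illustration and do not appear in the answer's code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "method": ["estimated", "counted", "estimated", "estimated", "counted",
               "counted", "estimated", "counted", "estimated"],
    "values": [4874, 847, 1152, 276, 6542, 1152, 3346, 7622, 26],
})

# One row per name, one column per method, each cell holding the sum.
wide = df.pivot_table(index="name", columns="method",
                      values="values", aggfunc="sum")

# Rename the columns to total_counted / total_estimated, then stack them
# into a Series indexed by (name, method) so it can be matched on those keys.
totals = wide.add_prefix("total_").stack().rename("new_value")
print(totals)
```

Because no original row has a method such as `total_estimated`, the outer join finds no matches. Each total therefore arrives as an extra row with an empty `values` cell, which `fillna` then fills from `new_value`.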

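But if I were you, I would use DataFrame.add_suffix instead, so that after sorting each total lands directly after the rows it summarizes: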

new_df = (df.join(df.pivot_table(index = 'name',
                                  columns = 'method',
                                  values = 'values',
                                  aggfunc = 'sum')
                    .add_suffix('_total') 
                    .stack()
                    .rename('new_value'),
                  on = ['name','method'],how = 'outer')

            .assign(values = lambda x: x['values'].fillna(x['new_value']))
            .drop(columns = 'new_value')
            .sort_values(['name','method'])
         )
print(new_df)

      name           method  values
index                              
1.0      A          counted   847.0
8.0      A    counted_total   847.0
0.0      A        estimated  4874.0
2.0      A        estimated  1152.0
8.0      A  estimated_total  6026.0
4.0      B          counted  6542.0
5.0      B          counted  1152.0
8.0      B    counted_total  7694.0
3.0      B        estimated   276.0
6.0      B        estimated  3346.0
8.0      B  estimated_total  3622.0
7.0      C          counted  7622.0
8.0      C    counted_total  7622.0
8.0      C        estimated    26.0
8.0      C  estimated_total    26.0
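
The join-based versions turn `values` into floats, because the outer join introduces NaN before `fillna` runs, and the index of the added rows is not meaningful. If that matters, one possible alternative (a different technique, not part of the original answer) is to compute the totals with `groupby` and append them with `pd.concat`. This is only a sketch on the question's sample data; `totals` and `result` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "method": ["estimated", "counted", "estimated", "estimated", "counted",
               "counted", "estimated", "counted", "estimated"],
    "values": [4874, 847, 1152, 276, 6542, 1152, 3346, 7622, 26],
})

# Sum values per (name, method), keeping the keys as ordinary columns.
totals = df.groupby(["name", "method"], as_index=False)["values"].sum()

# Label the summary rows, e.g. "estimated" -> "estimated_total".
totals["method"] = totals["method"] + "_total"

# Append the totals to the original rows, then sort so that each total
# follows the rows it summarizes, and rebuild a clean 0..n-1 index.
result = (pd.concat([df, totals], ignore_index=True)
            .sort_values(["name", "method"])
            .reset_index(drop=True))
print(result)
```

Since no missing values are ever created here, the `values` column keeps its integer dtype.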

【Comments】: