【Question title】: Combine complex aggregation function when using pandas groupby
【Posted】: 2017-08-30 10:05:40
【Question】:

Consider the following table:

import numpy as np
import pandas as pd

np.random.seed(42)
# One row per minute over two weeks
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'active': np.random.choice([0,1], size=ix.shape[0])
    },
    index=ix
)
df.sample(10)

which yields:

                    active   val
2017-01-02 06:05:00 1   0.774654
2017-01-04 08:15:00 1   0.934796
2017-01-13 01:02:00 0   0.792351...

My goal is to compute:

  • the sum of val per day
  • the sum of val over the active rows per day

Sum per day: this is straightforward:

gb = df.groupby(pd.to_datetime(df.index.date))
overall_sum_per_day = gb['val'].sum().rename('overall')

Sum of the active rows per day: this is a bit trickier (see this).

active_sum_per_day = gb.agg(lambda x: x[x.active==1]['val'].sum())['val'].rename('active')

My question is how to combine the two. Using concat

pd.concat([overall_sum_per_day, active_sum_per_day], axis=1)

I can achieve my goal. But I have not managed to do it in one go, applying both aggregations at once. Is that possible? See this comment.

【Comments】:

  • See my answer for a cleaner way to group and apply a function.

Tags: python pandas


【Answer 1】:

You can use GroupBy.apply:

b = gb.apply(lambda x: pd.Series([x['val'].sum(), x.loc[x.active==1, 'val'].sum()], 
                                  index=['overall', 'active']))
print (b)
               overall      active
2017-01-01  715.997165  366.856234
2017-01-02  720.101832  355.100828
2017-01-03  711.247370  335.231948
2017-01-04  713.688122  338.088299
2017-01-05  716.127970  342.889442
2017-01-06  697.319129  338.741027
2017-01-07  708.121948  361.086977
2017-01-08  731.032093  370.697884
2017-01-09  718.386679  342.162494
2017-01-10  709.706473  349.657514
2017-01-11  720.477342  368.407343
2017-01-12  738.286682  378.618305
2017-01-13  735.805583  372.039108
2017-01-14  727.502271  345.612816
2017-01-15    0.613559    0.613559

Another solution:

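# Relies on the returned list being spread positionally over the frame's
# columns (active, val); newer pandas versions may reject a list-valued result.
b = (gb.agg(lambda x: [x['val'].sum(), x.loc[x.active==1, 'val'].sum()])
       .rename(columns={'active': 'overall', 'val': 'active'}))
print (b)
               overall      active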
2017-01-01  715.997165  366.856234
2017-01-02  720.101832  355.100828
2017-01-03  711.247370  335.231948
2017-01-04  713.688122  338.088299
2017-01-05  716.127970  342.889442
2017-01-06  697.319129  338.741027
2017-01-07  708.121948  361.086977
2017-01-08  731.032093  370.697884
2017-01-09  718.386679  342.162494
2017-01-10  709.706473  349.657514
2017-01-11  720.477342  368.407343
2017-01-12  738.286682  378.618305
2017-01-13  735.805583  372.039108
2017-01-14  727.502271  345.612816
2017-01-15    0.613559    0.613559
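
Both aggregations can also be applied in a single agg call if the condition is moved into a column first. The following is only a sketch, not part of the original answer: it assumes pandas 0.25 or later (for named aggregation), and the helper column active_val is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the question.
np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'active': np.random.choice([0, 1], size=ix.shape[0]),
    },
    index=ix,
)

# Hypothetical helper column: val where active == 1, NaN elsewhere.
masked = df.assign(active_val=df['val'].where(df['active'] == 1))

# Named aggregation (assumes pandas >= 0.25); sum skips NaN by default.
result = masked.groupby(masked.index.normalize()).agg(
    overall=('val', 'sum'),
    active=('active_val', 'sum'),
)
print(result.head())
```

Because the filter becomes an ordinary column, no Python-level lambda runs per group, which tends to be faster than apply on larger frames.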

【Comments】:

    【Answer 2】:

    IIUC, we can do this in one step, using your original DF:

    In [105]: df.groupby([df.index.normalize(), 'active'])['val'] \
         ...:   .sum() \
         ...:   .unstack(fill_value=0) \
         ...:   .rename(columns={0:'overall', 1:'active'}) \
         ...:   .assign(overall=lambda x: x['overall'] + x['active'])
    Out[105]:
    active         overall      active
    2017-01-01  715.997165  366.856234
    2017-01-02  720.101832  355.100828
    2017-01-03  711.247370  335.231948
    2017-01-04  713.688122  338.088299
    2017-01-05  716.127970  342.889442
    ...                ...         ...
    2017-01-11  720.477342  368.407343
    2017-01-12  738.286682  378.618305
    2017-01-13  735.805583  372.039108
    2017-01-14  727.502271  345.612816
    2017-01-15    0.613559    0.613559
    
    [15 rows x 2 columns]
    

    Explanation:

    In [64]: df.groupby([df.index.normalize(), 'active'])['val'].sum()
    Out[64]:
                active
    2017-01-01  0         349.140931
                1         366.856234
    2017-01-02  0         365.001004
                1         355.100828
    2017-01-03  0         376.015422
                             ...
    2017-01-13  0         363.766475
                1         372.039108
    2017-01-14  0         381.889455
                1         345.612816
    2017-01-15  1           0.613559
    Name: val, Length: 29, dtype: float64
    
    In [65]: df.groupby([df.index.normalize(), 'active'])['val'].sum().unstack(fill_value=0)
    Out[65]:
    active               0           1
    2017-01-01  349.140931  366.856234
    2017-01-02  365.001004  355.100828
    2017-01-03  376.015422  335.231948
    2017-01-04  375.599823  338.088299
    2017-01-05  373.238528  342.889442
    ...                ...         ...
    2017-01-11  352.069999  368.407343
    2017-01-12  359.668377  378.618305
    2017-01-13  363.766475  372.039108
    2017-01-14  381.889455  345.612816
    2017-01-15    0.000000    0.613559
    
    [15 rows x 2 columns]
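
The fill_value=0 argument matters because the last day has no rows with active == 0, so unstack would otherwise leave a NaN that propagates into the addition used to build overall. Here is a minimal sketch on hypothetical toy data (the frame toy and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: the second day has no inactive (active == 0) rows.
ix = pd.to_datetime([
    '2017-01-01 00:00', '2017-01-01 00:01', '2017-01-01 00:02',
    '2017-01-02 00:00',
])
toy = pd.DataFrame({'val': [1.0, 2.0, 4.0, 8.0], 'active': [0, 1, 1, 1]}, index=ix)

per_flag = toy.groupby([toy.index.normalize(), 'active'])['val'].sum()

without_fill = per_flag.unstack()           # missing (day, 0) pair becomes NaN
with_fill = per_flag.unstack(fill_value=0)  # missing pair becomes 0

# A NaN would propagate through the addition used to build 'overall'.
print(without_fill[0] + without_fill[1])
print(with_fill[0] + with_fill[1])
```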
    

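    【Comments】:

    • You should use .assign with a lambda rather than eval, which is a bit magical
    • @Jeff, OK, thanks for your comment! I'll change it as soon as I'm back at my laptop (writing from my phone)
    • @Jeff, using assign - how would I access a dynamically created column?
    • @jeff this is the question where the resample.apply bug came up
    【Answer 3】:

    I think it is cleaner to group with pd.Grouper, which is built for grouping by datetime. For clarity, you can also define a function.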

    def func(df):
        active = (df['active'] * df['val']).sum()
        overall = df['val'].sum()
        return pd.Series(data=[active, overall], index=['active','overall'])
    
    df.groupby(pd.Grouper(freq='d')).apply(func)
    
                    active     overall
    2017-01-01  366.856234  715.997165
    2017-01-02  355.100828  720.101832
    2017-01-03  335.231948  711.247370
    2017-01-04  338.088299  713.688122
    2017-01-05  342.889442  716.127970
    2017-01-06  338.741027  697.319129
    2017-01-07  361.086977  708.121948
    2017-01-08  370.697884  731.032093
    2017-01-09  342.162494  718.386679
    2017-01-10  349.657514  709.706473
    2017-01-11  368.407343  720.477342
    2017-01-12  378.618305  738.286682
    2017-01-13  372.039108  735.805583
    2017-01-14  345.612816  727.502271
    2017-01-15    0.613559    0.613559
    

    You should be able to do this with resample and apply, but there is a bug.

    df.resample('d').apply(func) # should work but doesn't produce correct output
    
                    active  val
    2017-01-01  366.856234  NaN
    2017-01-02  355.100828  NaN
    2017-01-03  335.231948  NaN
    2017-01-04  338.088299  NaN
    2017-01-05  342.889442  NaN
    2017-01-06  338.741027  NaN
    2017-01-07  361.086977  NaN
    2017-01-08  370.697884  NaN
    2017-01-09  342.162494  NaN
    2017-01-10  349.657514  NaN
    2017-01-11  368.407343  NaN
    2017-01-12  378.618305  NaN
    2017-01-13  372.039108  NaN
    2017-01-14  345.612816  NaN
    2017-01-15    0.613559  NaN
    

    【Comments】:
