Pandas GroupBy on subsets of the same DataFrame

【Question title】: Pandas GroupBy on subsets of same DataFrame
【Posted】: 2014-08-08 19:40:02
【Question】:

This question is an extension of my earlier one. I have a pandas DataFrame:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
                    'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
                    'code' : [random.choice(codes) for i in range(1,N+1)],
                    'colour': [random.choice(colours) for i in range(1,N+1)],
                    'texture': [random.choice(textures) for i in range(1,N+1)],
                    'size': [random.randint(1,100) for i in range(1,N+1)],
                    'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
                   },  columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])

I group it by colour and code and get some statistics on size and scaled_size, as follows:

grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()

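Now, what I want to do is run the above calculation on df several times, for different intervals of weeks_elapsed. Below is a brute-force solution; is there a more concise and faster way to run it? Also, how can I concatenate the results for the different intervals into a single DataFrame?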

cut_offs = [4, 12]
grouped = {c: {} for c in cut_offs}
for c in cut_offs:
    # keep only the rows at or below the cut-off, then aggregate as before
    grouped[c] = df.loc[df.weeks_elapsed <= c].groupby(['code', 'colour']).agg(
                                                 {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
                                                  'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
                                                 }).reset_index()

I am particularly interested in np.average and np.size for the different weeks_elapsed intervals.

【Comments】:

Can you correct your initial df code? 'weeks_elapsed' vs. 'w_elapsed' in the columns, and likewise 'adjust_size' vs. 'scaled_size'.
Sorry, corrected now.

Tags: python pandas group-by conditional-statements dataframe

【Solution 1】:

This is not a fully working answer, but perhaps it can be extended to get you there eventually.

filter = np.array([12, 4])
for f in filter:
    # later (smaller) values overwrite earlier ones, so each row keeps its smallest cut-off
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = f

Now df looks like

>>> df.head()
Out[384]: 
   id  weeks_elapsed   code colour texture  size  adjusted_size  filter
0   1             20    one  white    soft    64            494     NaN
1   2              3  three  white    hard    22            650       4
2   3             22    two  black    hard    41            770     NaN
3   4              2    two  black    hard     4            325       4
4   5              4    two  black    hard    19            536       4

where filter holds the smallest group the row belongs to. The next step would be

>>> df.groupby(['filter', 'code', 'colour']).agg({'size': [np.sum, np.average, np.size, pd.Series.idxmax],
                                    'adjusted_size': [np.sum, np.average, np.size, pd.Series.idxmax]}
).reset_index()
Out[387]: 
    filter   code colour  adjusted_size                            size  \
                                    sum     average  size  idxmax   sum   
0        4    one  black           2195  548.750000     4      45   142   
1        4    one  white            286  286.000000     1      81    58   
2        4  three  black            927  463.500000     2      99   121   
3        4  three  white           5850  585.000000    10      95   511   
4        4    two  black           1102  367.333333     3       4    94   
5        4    two  white            852  852.000000     1      75     2   
6       12    one  white           2499  499.800000     5      72   267   
7       12  three  black           4709  588.625000     8      84   431   
8       12  three  white            569  189.666667     3      97   171   
9       12    two  black           2446  611.500000     4      49   241   
10      12    two  white           2859  714.750000     4      43   203   


      average  size  idxmax  
0   35.500000     4       5  
1   58.000000     1      81  
2   60.500000     2      99  
3   51.100000    10      88  
4   31.333333     3      21  
5    2.000000     1      75  
6   53.400000     5      69  
7   53.875000     8      12  
8   57.000000     3      59  
9   60.250000     4      36  
10  50.750000     4      43

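However, these are not quite the groups you are looking for: observations with filter=4 end up only in the group for 4, not also in the group for filter=12.

I tried looking at expanding_mean, but that only works row by row. So this is incomplete as it stands, but perhaps it helps someone else answer the question.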
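The overlapping groups described above can be sketched with a different technique than the filter column: select the rows once per cut-off, aggregate each subset, and stack the pieces with pd.concat, whose dict keys become an extra index level. This is only a minimal sketch under assumptions. The sample frame, the level name cut_off, and the choice of mean and size as statistics are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the question's df.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "weeks_elapsed": rng.integers(1, 25, size=n),
    "code": rng.choice(["one", "two", "three"], size=n),
    "colour": rng.choice(["black", "white"], size=n),
    "size": rng.integers(1, 101, size=n),
})

cut_offs = [4, 12]

# One aggregated frame per cut-off. Each subset is cumulative, so rows with
# weeks_elapsed <= 4 are counted under both 4 and 12.
pieces = {
    c: df[df["weeks_elapsed"] <= c]
         .groupby(["code", "colour"])["size"]
         .agg(["mean", "size"])
    for c in cut_offs
}

# The dict keys become the outermost index level, here named "cut_off".
combined = pd.concat(pieces, names=["cut_off"])
print(combined)
```

Because the cut-off is an index level, combined.loc[4] returns the statistics for that interval alone, and combined.reset_index() gives a flat frame with one row per cut-off, code and colour.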

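【Discussion】:

【Solution 2】:

OK, here is another option. From my research (I am still learning this myself), the only way to have overlapping groups appears to be TimeGrouper, which is effectively what you want. However, it requires your data to be indexed by time. One way to achieve this is as follows: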

filter = np.array([25, 12, 4]) # we need 25 here so we don't have NaN values later on
for i, f in enumerate(filter):
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = i + 1
# map the bucket numbers 1..3 onto the dates 2014-01-01..03 so rows can be grouped by day
df2 = df.set_index([pd.DatetimeIndex('2014-01-'+df['filter'].astype(int).astype(str))])
# note: later pandas releases replaced pd.TimeGrouper with pd.Grouper(freq='D')
results = df2.groupby(pd.TimeGrouper('D')).apply(lambda x: x.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
    }).reset_index())

Now results contains everything, in an odd format. To change it back:

results.set_index(results.index.get_level_values(0).day, drop=True, inplace=True)
results.set_index(filter[results.index.values - 1], drop=True)
Out[490]: 
     code colour  scaled_size                   scaled_size  size             \
                          sum     average  size      idxmax   sum    average   
25    one  black         4655  517.222222     9  2014-01-01   331  36.777778   
25    one  white         2444  305.500000     8  2014-01-01   292  36.500000   
25  three  black         2068  344.666667     6  2014-01-01   246  41.000000   
25  three  white         2859  571.800000     5  2014-01-01   260  52.000000   
25    two  black         6330  575.454545    11  2014-01-01   599  54.454545   
25    two  white         3200  533.333333     6  2014-01-01   291  48.500000   
12    one  black         4004  667.333333     6  2014-01-02   331  55.166667   
12    one  white         2965  741.250000     4  2014-01-02   130  32.500000   
12  three  black         3040  608.000000     5  2014-01-02   344  68.800000   
12  three  white         3795  474.375000     8  2014-01-02   359  44.875000   
12    two  black         2198  314.000000     7  2014-01-02   323  46.142857   
12    two  white         3427  571.166667     6  2014-01-02   271  45.166667   
4     one  black         1501  500.333333     3  2014-01-03    73  24.333333   
4     one  white         1710  570.000000     3  2014-01-03   210  70.000000   
4   three  black         1461  730.500000     2  2014-01-03    14   7.000000   
4   three  white          961  480.500000     2  2014-01-03    14   7.000000   
4     two  black         1656  552.000000     3  2014-01-03   189  63.000000   
4     two  white         2462  410.333333     6  2014-01-03   352  58.666667   

               size  
    size     idxmax  
25     9 2014-01-01  
25     8 2014-01-01  
25     6 2014-01-01  
25     5 2014-01-01  
25    11 2014-01-01  
25     6 2014-01-01  
12     6 2014-01-02  
12     4 2014-01-02  
12     5 2014-01-02  
12     8 2014-01-02  
12     7 2014-01-02  
12     6 2014-01-02  
4      3 2014-01-03  
4      3 2014-01-03  
4      2 2014-01-03  
4      2 2014-01-03  
4      3 2014-01-03  
4      6 2014-01-03

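【Discussion】:

【Solution 3】:

@FooBar's answer is probably better (I haven't fully digested it yet), but here is another approach.

First, create a function that returns a custom average function based on your filter condition. The inner function takes only a Series; the outer function defines the value to filter on and which DataFrame the Series comes from.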

In [248]: def filter_average(base_df, filter_value, filter_by='weeks_elapsed'):
     ...:     def inner(x):
     ...:         return np.average(x[base_df[filter_by] <= filter_value])
     ...:     inner.__name__ = 'avg<=' + str(filter_value)
     ...:     return inner

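Then, in your groupby operation, build versions of the filtered average function for the different cut-offs with a list comprehension, as below. The __name__ line above is needed so that the headers under size are distinct.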

In [249]: df.groupby(['code','colour']).agg({'size': [filter_average(df, i) 
                                                      for i in cut_offs]})
Out[249]: 
                   size           
                  avg<=4    avg<=12
code  colour                      
one   black   55.166667  56.555556
      white   81.750000  58.583333
three black         NaN  32.000000
      white   40.333333  36.400000
two   black   32.000000  37.714286
      white   95.000000  45.000000

The same approach works for np.size, and it could even be built into a more general decorator.

【Discussion】:
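A rough sketch of that generalisation, assuming a hypothetical factory named filtered that wraps any reducing function (it is not part of the original answer). It looks up the mask through the group's index labels, and the small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

def filtered(func, base_df, filter_value, filter_by="weeks_elapsed"):
    """Return a version of func that only sees rows with filter_by <= filter_value."""
    def inner(x):
        # x is one group's Series; its index labels select the matching rows of base_df.
        mask = base_df.loc[x.index, filter_by] <= filter_value
        return func(x[mask])
    # Distinct names keep the resulting column headers unique.
    inner.__name__ = f"{func.__name__}<={filter_value}"
    return inner

# Tiny made-up frame so the result is easy to check by hand.
df = pd.DataFrame({
    "weeks_elapsed": [2, 10, 3, 20, 4, 11],
    "code": ["one", "one", "one", "two", "two", "two"],
    "size": [10, 20, 30, 40, 50, 60],
})
cut_offs = [4, 12]

# One wrapped function per (statistic, cut-off) pair.
funcs = [filtered(f, df, c) for f in (np.average, np.size) for c in cut_offs]
result = df.groupby("code").agg({"size": funcs})
print(result)
```

Each wrapped function still runs once per group in Python, so this is convenient rather than fast. A group with no rows under a cut-off makes np.average return NaN, which matches the NaN in the output above.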