Data Manipulation - using data frame aggregation function

【Question title】: Data Manipulation - using data frame aggregation function
【Posted】: 2018-05-14 11:21:48
【Question description】:

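You were very helpful with my previous question (see the link below). I am trying to sort an index that has alphanumeric values. I ran this script, which worked until today, but now I get an error: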

/Library/Python/2.7/site-packages/pandas/core/groupby.py:4036: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Traceback (most recent call last)
aggfunc={'sum': np.sum}, fill_value=0)
  File "/Library/Python/2.7/site-packages/pandas/core/reshape/pivot.py", line 136, in pivot_table
    agged = grouped.agg(aggfunc)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 4036, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

It traces back to the pivot:

df = df.pivot_table(index=['customer'], columns=['Duration'],
                    aggfunc={'sum': np.sum}, fill_value=0)

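The only change I made before this error appeared was to compute one of the data frame's columns in pandas instead of running the calculation in the SQL statement.

New calculation: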

df['Duration'] = df['Duration']/30

Old grouping and aggregation:

df = df.pivot_table(index=['customer'], columns=['Duration'],
                    aggfunc={'sum': np.sum}, fill_value=0)
c = df.columns.levels[1]
c = sorted(ns.natsorted(c), key=lambda x: not x.isdigit())
df = df.reindex_axis(pd.MultiIndex.from_product([df.columns.levels[0], c]), axis=1)

New code snippet:

df = df.groupby(['customer', 'Duration']).agg({'sum': np.sum})
c = df.columns.get_level_values(1)
c = sorted(ns.natsorted(c), key=lambda x: not x.isdigit())
df = df.reindex_axis(pd.MultiIndex.from_product([df.columns.levels[0], c]), axis=1)

MultiIndex levels with the new approach:

MultiIndex(levels=[[u'Invoice A', u'Invoice B', u'Invoice C', u'Invoice B'], [u'0', u'1', u'10', u'11', u'2', u'2Y', u'3', u'3Y', u'4', u'4Y', u'5', u'5Y', u'6', u'7', u'8', u'9', u'9Y']], labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]], names=['customer', u'Duration'])

When assigning c = df.columns.get_level_values(1), I get this error message: IndexError: Too many levels: Index has only 1 level, not 2

Sample input:

customer              Duration             sum          
Invoice A                1                 1250
Invoice B                2                 2000
Invoice B                3                 1200
Invoice C                2                 10250
Invoice D                3                 20500
Invoice D                5                 18900
Invoice E                2Y                5000
Invoice F                1                 5000
Invoice F                1Y                12100

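I don't know why, since both levels and names show two levels. The end result should be a data frame sorted by customer, with the columns sorted by Duration, showing the sum for each Duration. Also, the reason I used pivot in the previous version of the code was to keep this output format: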

Duration                            2          2Y         3         3Y   
customer                                                                     
Invoice A                         2550        0.00      0.00       2000   
Invoice B                         5000        2500      1050       0.00
Invoice C                         12500       0.00      1120       2050
Invoice D                         0.00        1500      0.00       8010

Am I on the right track?

Data Manipulation - stackoverflow

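【Comments】:

It is hard to find where the actual question is in your post. Maybe you are looking for this: stackoverflow.com/questions/44635626/…
Also, you are looking up the level in the columns; make sure you use df.index.get_level_values instead.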
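The comment above can be checked with a small sketch. The data here is a made-up subset of the sample input, and the variable names grouped and durations are only for illustration. After groupby(...).agg(...), the group keys sit in the row index as a two-level MultiIndex, while the columns stay a flat, single-level Index, which is why asking the columns for level 1 raises the IndexError:

```python
import pandas as pd

# Hypothetical subset of the sample input from the question
df = pd.DataFrame({
    "customer": ["Invoice A", "Invoice B", "Invoice B", "Invoice C"],
    "Duration": ["1", "2", "3", "2Y"],
    "sum": [1250, 2000, 1200, 5000],
})

grouped = df.groupby(["customer", "Duration"]).agg({"sum": "sum"})

# The group keys become a 2-level MultiIndex on the rows ...
print(grouped.index.nlevels)
# ... while the columns remain a single-level Index containing only 'sum'
print(grouped.columns.nlevels)

# So the Duration labels have to be read from the index, not the columns
durations = grouped.index.get_level_values("Duration")
print(list(durations))

# Asking the flat column Index for level 1 reproduces the reported error
try:
    grouped.columns.get_level_values(1)
except IndexError as exc:
    print("IndexError:", exc)
```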

Tags: pandas sorting indexing aggregate pandas-groupby

【Solution 1】:

Instead of agg, you can use the sum() function and then reshape with unstack:

import natsort as ns

df = df.groupby(['customer', 'Duration'])['sum'].sum().unstack()

c = sorted(ns.natsorted(df.columns), key=lambda x: not x.isdigit())
df = df.reindex(columns=c)
print (df)
Duration        1        2        3        5       1Y      2Y
customer                                                     
Invoice A  1250.0      NaN      NaN      NaN      NaN     NaN
Invoice B     NaN   2000.0   1200.0      NaN      NaN     NaN
Invoice C     NaN  10250.0      NaN      NaN      NaN     NaN
Invoice D     NaN      NaN  20500.0  18900.0      NaN     NaN
Invoice E     NaN      NaN      NaN      NaN      NaN  5000.0
Invoice F  5000.0      NaN      NaN      NaN  12100.0     NaN
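The output above has NaN where the question's desired format shows zeros. A possible variation, offered only as a sketch: pass fill_value=0 to unstack. To keep the example free of the natsort dependency, it uses a hand-written sort key, duration_key, which is a hypothetical helper and not part of the original answer. It is meant to give the same ordering as the two-step natsorted/sorted call for labels like those in the sample (plain numbers first, then labels such as '1Y').

```python
import re
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "customer": ["Invoice A", "Invoice B", "Invoice B", "Invoice C",
                 "Invoice D", "Invoice D", "Invoice E", "Invoice F", "Invoice F"],
    "Duration": ["1", "2", "3", "2", "3", "5", "2Y", "1", "1Y"],
    "sum": [1250, 2000, 1200, 10250, 20500, 18900, 5000, 5000, 12100],
})

def duration_key(label):
    """Hypothetical helper: pure-digit labels first, then mixed ones like '2Y'.

    Digit runs are compared as integers, so '10' sorts after '2'.
    """
    parts = re.split(r"(\d+)", label)
    natural = [int(p) if p.isdigit() else p for p in parts]
    return (not label.isdigit(), natural)

# Sum per (customer, Duration), pivot Duration into columns, fill gaps with 0
wide = df.groupby(["customer", "Duration"])["sum"].sum().unstack(fill_value=0)

# Reorder the columns using the custom key
wide = wide.reindex(columns=sorted(wide.columns, key=duration_key))
print(wide)
```

If natsort is available, the answer's original two-step sort can be kept and only the fill_value=0 argument added.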

【Discussion】:

jezrael - your solution looks good. df.reindex(c, axis=1) gives a strange error, though. File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 2494, in reindex 'argument "{0}"'.format(list(kwargs.keys())[0])) TypeError: reindex() got an unexpected keyword argument "axis"
Perhaps you are not on the latest version of pandas; try df = df.reindex(columns=c)
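A small sketch of the two spellings, using a made-up one-row frame. As far as I know, the axis keyword for reindex was only added in pandas 0.21, so reindex(columns=...) is the form that should work on both older and newer installs; the fallback in the except branch is only there to illustrate that assumption.

```python
import pandas as pd

# Hypothetical frame with columns in the "wrong" order
wide = pd.DataFrame([[1, 2, 3]], columns=["1Y", "2", "1"])
order = ["1", "2", "1Y"]

# Keyword form: accepted by old and new pandas versions alike
by_keyword = wide.reindex(columns=order)

# Positional labels plus axis=1: raises TypeError on pandas versions
# that predate the axis keyword (assumed to be anything before 0.21)
try:
    by_axis = wide.reindex(order, axis=1)
except TypeError:
    by_axis = by_keyword

print(by_keyword)
```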