How to get multiple levels of aggregated sums into time series columns from a dataframe

【Question title】: How to get multiple levels of aggregated sums into time series columns from a dataframe
【Posted】: 2019-09-05 23:52:04
【Question】:

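I have a pandas dataframe that holds monthly counts at several hierarchy levels. It is in long format, and I want to convert it to wide format with columns for every aggregation level.

The format is as follows: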

date | country | state | county | population 
01-01| cc1     | s1    | c1     | 5
01-01| cc1     | s1    | c2     | 4
01-01| cc1     | s2    | c1     | 10
01-01| cc1     | s2    | c2     | 11
02-01| cc1     | s1    | c1     | 6
02-01| cc1     | s1    | c2     | 5
02-01| cc1     | s2    | c1     | 11
02-01| cc1     | s2    | c2     | 12
.
.

Now I want to convert it to the following format:

date | country_pop | s1_pop | s2_pop | .. | s1_c1_pop | s1_c2_pop | s2_c1_pop | s2_c2_pop | ..
01-01| 30          | 9      | 21     | .. | 5         | 4         | 10        | 11        | ..
02-01| 34          | 11     | 23     | .. | 6         | 5         | 11        | 12        | ..
.
.

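There are 4 states in total, s1 ... s4.

The counties in each state can be labelled c1 ... c10 (some states may have fewer, and I want those columns to be zero).

I want a time series for each aggregation level, sorted by date. How do I get this?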

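【Comments】:

Looks like a pivot_table/groupby problem followed by a merge.
Do you mean: create a pivot table at each aggregation level containing date and count_for_that_level, then merge all of those separate pivot tables on date? That seems clunky; is there a cleaner way to do it?

Tags: python pandas time-series hierarchy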
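
For reference, the per-level pivot approach described in the comment above can be sketched roughly as follows. This is only an illustration, not the accepted method: the sample frame is rebuilt from the question's table, it assumes a single country, and the names country, state, county and wide are made up for the example.

```python
import pandas as pd

# Sample data copied from the question (two dates, one country).
df = pd.DataFrame({
    "date":       ["01-01"] * 4 + ["02-01"] * 4,
    "country":    ["cc1"] * 8,
    "state":      ["s1", "s1", "s2", "s2"] * 2,
    "county":     ["c1", "c2", "c1", "c2"] * 2,
    "population": [5, 4, 10, 11, 6, 5, 11, 12],
})

# One table per aggregation level, each indexed by date.
# Assumption: only one country, so the country total is the per-date sum.
country = df.groupby("date")["population"].sum().rename("country_pop")

state = df.pivot_table(index="date", columns="state",
                       values="population", aggfunc="sum")
state.columns = [f"{s}_pop" for s in state.columns]

county = df.pivot_table(index="date", columns=["state", "county"],
                        values="population", aggfunc="sum")
county.columns = [f"{s}_{c}_pop" for s, c in county.columns]

# All three share the same date index, so concatenating along axis=1 aligns them.
wide = pd.concat([country, state, county], axis=1).sort_index()
print(wide)
```

Because every piece is indexed by date, no explicit merge key is needed; the clunky part is mostly the repeated column renaming.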

【Solution 1】:

Let's do this by summing over index levels and then using pd.concat to join all the dataframes.

import pandas as pd

# Aggregate to the lowest level of detail
df_agg = df.groupby(['country', 'date', 'state', 'county'])[['population']].sum()

# Reshape the dataframe and flatten the multiindex column header
df_county = df_agg.unstack([-1, -2])
df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]

# Sum to the next level of detail and reshape
# (DataFrame.sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() is the equivalent)
df_state = df_agg.groupby(level=[0, 1, 2]).sum().unstack()
df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]

# Sum to the country level
df_country = df_agg.groupby(level=[0, 1]).sum()

# Concatenate horizontally with axis=1
df_out = pd.concat([df_country, df_state, df_county], axis=1).reset_index()

Output:

  country   date  population  s1_population  s2_population  s1_c1_population  \
0     cc1  01-01          30              9             21                 5   
1     cc1  02-01          34             11             23                 6   

   s1_c2_population  s2_c1_population  s2_c2_population  
0                 4                10                11  
1                 5                11                12
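
The question also asks for zero-filled columns when a state has fewer counties. A minimal sketch of one way to get that is shown below, using hypothetical data in which state s2 has no county c2. As far as I can tell, unstack only fills holes for combinations that occur somewhere in the data, so the sketch also reindexes the columns against the full state/county product. The names states, counties and all_cols are made up for the example, and the county labels are taken from the observed data.

```python
import pandas as pd

# Hypothetical data: state s2 has no county c2.
df = pd.DataFrame({
    "date":       ["01-01", "01-01", "01-01", "02-01", "02-01", "02-01"],
    "country":    ["cc1"] * 6,
    "state":      ["s1", "s1", "s2", "s1", "s1", "s2"],
    "county":     ["c1", "c2", "c1", "c1", "c2", "c1"],
    "population": [5, 4, 10, 6, 5, 11],
})

# County-level sums, one column per (state, county) pair seen in the data.
county = (
    df.groupby(["date", "state", "county"])["population"]
      .sum()
      .unstack(["state", "county"], fill_value=0)
)

# Build every state/county combination. Assumption: the labels that occur
# in the data are the complete set; pass explicit lists otherwise.
states = sorted(df["state"].unique())
counties = sorted(df["county"].unique())
all_cols = pd.MultiIndex.from_product([states, counties],
                                      names=["state", "county"])

# Add the missing combinations as all-zero columns, then flatten the header.
county = county.reindex(columns=all_cols, fill_value=0)
county.columns = [f"{s}_{c}_pop" for s, c in county.columns]
print(county)
```

If the full list of counties is known up front (c1 ... c10), passing it explicitly instead of deriving it from the data would also create columns for counties that never appear at all.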

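【Comments】:

What is the f in f'{s}_{c}_{p}', in the third line of code?
That uses f-string formatting to rearrange and flatten the multiindex column header.
In the df_county dataframe, after unstack you have three levels of column headers, so in "for p, c, s" the names stand for the population, county and state levels. The f-string swaps the positions of those levels and flattens them into a single level.
This is a Python 3.6+ feature. Sorry, if you are using Python 2 you will need to write it differently.
See this post for how to do it with .format or map: stackoverflow.com/a/43859132/6361531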
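
To illustrate the last few comments, here is a small sketch of the same column flattening written without f-strings. The two-column frame is a made-up stand-in for the county-level result, with column labels ordered (value, county, state); the names with_format and with_join are illustrative only.

```python
import pandas as pd

# A small frame with three-level column labels ordered (value, county, state),
# mirroring what the county-level unstack in the answer produces.
cols = pd.MultiIndex.from_tuples([
    ("population", "c1", "s1"),
    ("population", "c2", "s1"),
])
df_county = pd.DataFrame([[5, 4]], columns=cols)

# Each column label is a plain tuple, so it can be unpacked and reordered.
with_format = ["{}_{}_{}".format(s, c, p) for p, c, s in df_county.columns]

# Equivalent here: reverse each tuple and join its parts with underscores.
with_join = ["_".join(reversed(col)) for col in df_county.columns]

print(with_format)  # ['s1_c1_population', 's1_c2_population']
```

Either list can then be assigned back with df_county.columns = with_format.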