Stacked bar plots from a list of dataframes with groupby: answers

【Question title】: Stacked bar plots from list of dataframes with groupby command
【Posted】: 2020-03-10 20:45:44
【Question description】:

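I want to create a (2x3) grid of stacked bar chart subplots from the results of a groupby.size command. Let me explain. I have a list of dataframes: list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]. A small sample of one of these dataframes: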

...     Create Time          Location       Area Id     Beat    Priority    ... Closed Time

    2011-01-01 00:00:00    ST&SAN PABLO AV    1.0        06X      1.0   ... 2011-01-01 00:28:17

    2011-01-01 00:01:11    ST&HANNAH ST       1.0        07X      1.0   ... 2011-01-01 01:12:56
             .
             .
             .

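(I could only include a few columns because the layout gets messy.) I am using groupby.size to get the incident counts I need from these dataframes, see below: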

list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
for i in list_df:
    print(i.groupby(['Beat', 'Priority']).size())
    print(' ')

which produces:

Beat  Priority
01X   1.0          394
      2.0         1816
02X   1.0          644
      2.0         1970
02Y   1.0          661
      2.0         2309
03X   1.0          857
      2.0         2962
.
.
.

I want to find the top 10 TOTALS per value of the Beat column. So, for example, the totals for the output above would be:

Beat  Priority           Total for Beat
01X   1.0       394         
      2.0       1816         2210
02Y   1.0       661          
      2.0       2309         2970
03X   1.0       857
      2.0       2962         3819
.
.
.
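The per-beat totals described above can be sketched as follows. This is only an illustration: the counts are copied from the groupby output shown earlier, and the variable names (counts, beat_totals, top_beats) are made up for this example.

```python
import pandas as pd

# Counts copied from the groupby(['Beat', 'Priority']).size() output above
index = pd.MultiIndex.from_tuples(
    [("01X", 1.0), ("01X", 2.0), ("02X", 1.0), ("02X", 2.0),
     ("02Y", 1.0), ("02Y", 2.0), ("03X", 1.0), ("03X", 2.0)],
    names=["Beat", "Priority"],
)
counts = pd.Series([394, 1816, 644, 1970, 661, 2309, 857, 2962], index=index)

# Sum over the priorities within each beat to get one total per beat
beat_totals = counts.groupby(level="Beat").sum()

# Keep the beats with the largest totals (2 here; 10 in the real data)
top_beats = beat_totals.nlargest(2)
print(top_beats)
```

Ranking on beat_totals rather than on counts is what makes the top 10 refer to beats instead of individual (Beat, Priority) pairs.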

So far I have called plot directly on my groupby.size result, but that does not sum the counts per beat as I described above. See below:

list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
fig, axes = plt.subplots(2, 3)
for d, i in zip(list_df, range(6)):
    ax = axes.ravel()[i];
    d.groupby(['Beat', 'Priority']).size().nlargest(10).plot(ax=ax, kind='bar', figsize=(15, 7), stacked=True, legend=True)
    ax.set_title(f"Top 10 Beats for {i+ 2011}")
    plt.tight_layout()

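I would like a 2x3 subplot layout, with stacked bars like ones I have produced before.

Thanks in advance. This is harder than I thought it would be!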

【Question comments】:

Tags: python pandas matplotlib plot pandas-groupby

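【Solution 1】:

Each data series to be stacked needs to be its own column, so you probably want to pivot Priority into columns: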

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# create fake input data: one dataframe per year, n_rows incidents each
n_rows = 300
beats = ['{:02d}X'.format(i) for i in range(15)]
list_df = [pd.DataFrame({'Beat': np.random.choice(beats, n_rows),
                         'Priority': np.random.choice(['1', '2'], n_rows),
                         'othercolumn1': range(n_rows),
                         'othercol2': range(n_rows),
                         'year': [yr] * n_rows}) for yr in range(2011, 2017)]

In [22]: print(list_df[0].head(5))
  Beat Priority  othercolumn1  othercol2  year
0  06X        1             0          0  2011
1  05X        1             1          1  2011
2  04X        1             2          2  2011
3  01X        2             3          3  2011
4  00X        1             4          4  2011

fig, axes = plt.subplots(2, 3, figsize=(15, 7))

for i, d in enumerate(list_df):
    ax = axes.flatten()[i]
    # count incidents per Beat, with one column per Priority
    dplot = d[['Beat', 'Priority']].pivot_table(index='Beat', columns='Priority', aggfunc=len)
    # keep the 10 beats with the largest total across priorities
    dplot = (dplot.assign(total=lambda x: x.sum(axis=1))
                  .sort_values('total', ascending=False)
                  .head(10)
                  .drop('total', axis=1))
    dplot.plot.bar(ax=ax, stacked=True, legend=True)
    ax.set_title(f"Top 10 Beats for {2011 + i}")

plt.tight_layout()
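As an alternative sketch (not the code from this answer), the same Beat-by-Priority table can probably be built with groupby(...).size().unstack(), which also turns each Priority into its own column. The helper name top_beats_table and the sample data are hypothetical and exist only for illustration.

```python
import pandas as pd

def top_beats_table(df, n=10):
    """Hypothetical helper: Beat x Priority counts for the n beats with the largest totals."""
    # One row per Beat, one column per Priority; missing pairs become 0
    table = df.groupby(["Beat", "Priority"]).size().unstack(fill_value=0)
    # Rank beats by their total across priorities and keep the top n
    top = table.sum(axis=1).nlargest(n).index
    return table.loc[top]

# Tiny made-up sample standing in for one year's dataframe
sample = pd.DataFrame({
    "Beat": ["01X", "01X", "01X", "02X", "02X", "03X"],
    "Priority": [1.0, 2.0, 2.0, 1.0, 2.0, 2.0],
})
table = top_beats_table(sample, n=2)
print(table)
```

Inside the subplot loop, the returned table could then be drawn with table.plot.bar(ax=ax, stacked=True).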

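【Comments】:

I simplified the layout of the dataframes, thinking it best to show only those two columns since they are the ones I want to use. Now I think I have made the result more complicated. Please check the edit.
Edited into a complete example.
Awesome! You are an absolute legend! This does exactly what I wanted, thank you so much.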