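[Posted]: 2018-08-12 15:14:42
[Question]:
I am trying to create a calendar that summarizes the information in a project catalog and organizes it chronologically and by project type. I have been using Pandas, but I cannot get the basic structure right. For example, given this dataset: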
Type Name Health Month Year
0 Marketing ProjectA OK Jan 2018
1 Science ProjectB Warning Apr 2018
2 Marketing ProjectC OK Mar 2018
3 Development ProjectD OK Feb 2018
4 Marketing ProjectE OK Jan 2018
5 Development ProjectF Warning Feb 2018
6 Development ProjectG Trouble May 2018
7 Marketing ProjectH Trouble May 2018
8 Development ProjectI Warning Feb 2018
9 Marketing ProjectJ OK May 2018
10 Science ProjectK Warning Apr 2018
Using the technique shown in "Remove none values from dataframe", I can create fields that track the rank order of each project within the final table:
df['aggval'] = df['Year'].map(str) + df['Month'] + df['Type']
df['index'] = df.groupby(['aggval']).cumcount()
This produces 2 additional columns:
Type Name Health Month Year aggval index
0 Marketing ProjectA OK Jan 2018 2018JanMarketing 0
1 Science ProjectB Warning Apr 2018 2018AprScience 0
2 Marketing ProjectC OK Mar 2018 2018MarMarketing 0
3 Development ProjectD OK Feb 2018 2018FebDevelopment 0
4 Marketing ProjectE OK Jan 2018 2018JanMarketing 1
5 Development ProjectF Warning Feb 2018 2018FebDevelopment 1
6 Development ProjectG Trouble May 2018 2018MayDevelopment 0
7 Marketing ProjectH Trouble May 2018 2018MayMarketing 0
8 Development ProjectI Warning Feb 2018 2018FebDevelopment 2
9 Marketing ProjectJ OK May 2018 2018MayMarketing 1
10 Science ProjectK Warning Apr 2018 2018AprScience 1
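The concatenated aggval key is a convenience rather than a requirement for this step. As a small sketch (using only the first five rows of the sample data, and with the variable name alt introduced here purely for illustration), groupby can take the three columns directly, and cumcount yields the same per-group counter:

```python
import pandas as pd

# First five rows of the sample data; only the grouping columns are needed.
df = pd.DataFrame({
    "Type": ["Marketing", "Science", "Marketing", "Development", "Marketing"],
    "Month": ["Jan", "Apr", "Mar", "Feb", "Jan"],
    "Year": [2018] * 5,
})

# Approach from the question: build a string key, then number rows per key.
df["aggval"] = df["Year"].map(str) + df["Month"] + df["Type"]
df["index"] = df.groupby("aggval").cumcount()

# Alternative (illustrative name): group on the columns themselves.
alt = df.groupby(["Year", "Month", "Type"]).cumcount()

print(df[["Type", "Month", "Year", "index"]])
```

Only the two Marketing projects in Jan 2018 share a group here, so the second of them is the only row that gets a counter above 0.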
With these derived columns, we can now pivot to create an initial version of the project summary table:
pv1 = pd.pivot_table(df, values='Name', index=['Type', 'index'], columns=['Year', 'Month'], aggfunc=lambda x: "".join(x)).fillna('')
pv1 = pv1.reindex(columns = zip(12 * [2018], ['Jan', 'Feb', 'Mar', 'Apr', 'May']))
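This generates the report below, which is basically correct: it collects and lists the projects, shows their names, and organizes them by type (swimlane) and chronologically: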
Year              2018
Month             Jan       Feb       Mar       Apr       May
Type        index
Development 0               ProjectD                      ProjectG
            1               ProjectF
            2               ProjectI
Marketing   0     ProjectA            ProjectC            ProjectH
            1     ProjectE                                ProjectJ
Science     0                                   ProjectB
            1                                   ProjectK
I am now struggling to extend this model to show both the name and the health of each project.
I can add the Health field as a second pivot table value:
pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
# pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
which produces:
                  Health                                       Name
Year              2018                                         2018
Month             Apr      Feb      Jan      Mar      May      Apr       Feb       Jan       Mar       May
Type        index
Development 0              OK                         Trouble            ProjectD                      ProjectG
            1              Warning                                       ProjectF
            2              Warning                                       ProjectI
Marketing   0                       OK       OK       Trouble                      ProjectA  ProjectC  ProjectH
            1                       OK                OK                           ProjectE            ProjectJ
Science     0     Warning                                      ProjectB
            1     Warning                                      ProjectK
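This is the right idea: each project's Health and Name appear in the correct Month and in the correct Type swimlane, but I would like them displayed side by side for each project. Reindexing the columns gives the right result at the header level, but wipes out the cells, leaving only NaN values: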
pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
which produces:
                  2018
Year              Jan               Feb               Mar               Apr               May
Month             Health   Name     Health   Name     Health   Name     Health   Name     Health   Name
Type        index
Development 0     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
            1     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
            2     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
Marketing   0     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
            1     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
Science     0     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
            1     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
Again, the structure is now correct, but the cells no longer show the project-specific data. What am I missing?
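One likely explanation, offered as a sketch rather than a confirmed answer: the pivot's column levels are ordered (value, Year, Month), while the tuples passed to reindex are ordered (Year, Month, value). reindex matches whole tuples, so none of the requested labels exists and every cell comes back as NaN. Assuming that diagnosis holds, moving the value level to the end before reindexing should keep the data. The names fixed, months and wanted below are introduced here for illustration and do not come from the original post.

```python
import pandas as pd

rows = [
    ("Marketing", "ProjectA", "OK", "Jan", 2018),
    ("Science", "ProjectB", "Warning", "Apr", 2018),
    ("Marketing", "ProjectC", "OK", "Mar", 2018),
    ("Development", "ProjectD", "OK", "Feb", 2018),
    ("Marketing", "ProjectE", "OK", "Jan", 2018),
    ("Development", "ProjectF", "Warning", "Feb", 2018),
    ("Development", "ProjectG", "Trouble", "May", 2018),
    ("Marketing", "ProjectH", "Trouble", "May", 2018),
    ("Development", "ProjectI", "Warning", "Feb", 2018),
    ("Marketing", "ProjectJ", "OK", "May", 2018),
    ("Science", "ProjectK", "Warning", "Apr", 2018),
]
df = pd.DataFrame(rows, columns=["Type", "Name", "Health", "Month", "Year"])
df["index"] = df.groupby(["Year", "Month", "Type"]).cumcount()

# Same pivot as in the question.
pv2 = pd.pivot_table(
    df,
    values=["Name", "Health"],
    index=["Type", "index"],
    columns=["Year", "Month"],
    aggfunc={"Name": lambda x: "|".join(x), "Health": lambda x: ":".join(x)},
).fillna("")

# The existing column labels start with the value name, not the year.
print(("Health", 2018, "Jan") in pv2.columns)  # label as stored by the pivot
print((2018, "Jan", "Health") in pv2.columns)  # label requested by the reindex

# Move the value level to the end: (value, Year, Month) -> (Year, Month, value).
fixed = pv2.reorder_levels([1, 2, 0], axis=1)

# Now reindex with labels in that same order, months in calendar order.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
wanted = pd.MultiIndex.from_product([[2018], months, ["Health", "Name"]])
fixed = fixed.reindex(columns=wanted)

print(fixed)
```

Building the target with MultiIndex.from_product also avoids hand-writing the repeated month and value lists, which is where the original zip call ended up with lists of different lengths.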
[Discussion]:
Tags: pandas pivot pivot-table