Reshape Pandas Dataframe with duplicate Index

【Question title】: Reshape Pandas Dataframe with duplicate Index
【Posted】: 2017-03-05 22:45:41
【Question】:

Current dataframe:

CountryName      IndicatorCode    Year         Value  
Arab World     TX.VAL.MRCH.RS.ZS  1960  1.646954e+01  
Arab World     TX.VAL.MRCH.R1.ZS  1960  2.260207e+00
Arab World     TX.VAL.MRCH.RS.ZS  1961  1.244584e+01
Arab World     TX.VAL.MRCH.R1.ZS  1961  1.860104e+00  
Zimbabwe       DT.DIS.OFFT.CD     2015  8.377700e+07
Zimbabwe       DT.INT.OFFT.CD     2015  2.321300e+07
Zimbabwe       DT.AMT.PROP.CD     2015  6.250000e+05

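I want to turn each distinct value of the IndicatorCode column into its own column, filled with the data from the corresponding rows of the Value column.
For example, after the reshape: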

CountryName Year TX.VAL.MRCH.RS.ZS TX.VAL.MRCH.R1.ZS  
Arab World  1960 1.646954e+01      2.260207e+00
Arab World  1961 1.244584e+01      1.860104e+00

The columns of the final dataframe should be:

[CountryName, Year, TX.VAL.MRCH.RS.ZS, TX.VAL.MRCH.R1.ZS, DT.DIS.OFFT.CD, DT.INT.OFFT.CD, DT.AMT.PROP.CD]

I tried using pivot, but without success. I also can't use the country name as the index, because it is not unique.

temp = indicators_df.pivot(columns='IndicatorCode',  values='Value')

This raises ValueError: negative dimensions are not allowed

【Question comments】:

Tags: python pandas

【Solution 1】:

You can use pivot_table, which accepts multiple columns for the index, values, and columns arguments:

df.pivot_table("Value", ["CountryName", "Year"], "IndicatorCode").reset_index()

Some explanation:

The arguments here are passed positionally, in the order values, index, columns. Written with keywords, the same call is:

df.pivot_table(values = "Value", index = ["CountryName", "Year"], columns = "IndicatorCode").reset_index()

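values is what fills the cells of the resulting dataframe; index lists the columns whose unique combinations become the rows, and which remain as ordinary columns after reset_index(); columns is the variable whose distinct values are spread out into column headers.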
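To make the three arguments concrete, here is a minimal sketch that rebuilds the sample rows from the question and applies the same call. The frame name `df` and the result name `wide` are assumptions for illustration; the values are copied from the question.

```python
import pandas as pd

# Sample rows copied from the question (the name `df` is an assumption).
df = pd.DataFrame({
    "CountryName": ["Arab World"] * 4 + ["Zimbabwe"] * 3,
    "IndicatorCode": [
        "TX.VAL.MRCH.RS.ZS", "TX.VAL.MRCH.R1.ZS",
        "TX.VAL.MRCH.RS.ZS", "TX.VAL.MRCH.R1.ZS",
        "DT.DIS.OFFT.CD", "DT.INT.OFFT.CD", "DT.AMT.PROP.CD",
    ],
    "Year": [1960, 1960, 1961, 1961, 2015, 2015, 2015],
    "Value": [1.646954e+01, 2.260207e+00, 1.244584e+01, 1.860104e+00,
              8.377700e+07, 2.321300e+07, 6.250000e+05],
})

# One row per (CountryName, Year), one column per IndicatorCode.
wide = df.pivot_table(
    values="Value",
    index=["CountryName", "Year"],
    columns="IndicatorCode",
).reset_index()

# Drop the leftover "IndicatorCode" label on the column axis.
wide.columns.name = None
print(wide)
```

Indicators that a country/year pair does not have come out as NaN. Note that pivot_table aggregates with the mean by default, so if the same (CountryName, Year, IndicatorCode) combination appeared more than once, the values would be averaged rather than raising an error.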

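【Comments】:

If possible, could you say what each argument represents (a brief explanation)? I'm reading the documentation now but not quite getting it. Thanks.

【Solution 2】: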

set_index + unstack

s = df.set_index(['CountryName', 'Year', 'IndicatorCode']).Value
s.unstack().reset_index().rename_axis(None, axis=1)
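A self-contained sketch of the same approach on the Arab World rows from the question; the names `df` and `wide` are assumptions for illustration. It passes `axis` by keyword, which recent pandas versions require.

```python
import pandas as pd

# Arab World rows copied from the question (the name `df` is an assumption).
df = pd.DataFrame({
    "CountryName": ["Arab World"] * 4,
    "IndicatorCode": ["TX.VAL.MRCH.RS.ZS", "TX.VAL.MRCH.R1.ZS"] * 2,
    "Year": [1960, 1960, 1961, 1961],
    "Value": [1.646954e+01, 2.260207e+00, 1.244584e+01, 1.860104e+00],
})

# Move the three key columns into a MultiIndex and keep Value as a Series.
s = df.set_index(["CountryName", "Year", "IndicatorCode"])["Value"]

# unstack() pivots the innermost index level (IndicatorCode) into columns;
# reset_index() turns CountryName and Year back into ordinary columns.
wide = s.unstack().reset_index().rename_axis(None, axis=1)
print(wide)
```

Unlike pivot_table, unstack does not aggregate: if a (CountryName, Year, IndicatorCode) combination occurs more than once, it raises a ValueError about duplicate index entries, so this variant fits data where each combination is unique.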

【Comments】: