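【Posted】: 2017-07-24 13:14:43
【Question】:
I am grouping a DataFrame by year (one level of a MultiIndex on the columns), applying a function that pads each group out to 11 columns (adding as many empty columns as needed), and returning the padded DataFrame:
finalFormat = (penultimateFormatNot11Columns.groupby( level = 'Year',
                                                      axis = 1 )
                                            .apply( padDFToXColumns )
              )
But this raises an error:
raise ValueError("cannot reindex from a duplicate axis")
Inside the padding function being applied, the returned paddedDF has no duplicated labels on either axis: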
>>> paddedDF.index.duplicated().any()
False
>>> paddedDF.columns.duplicated().any()
False
>>>
Any ideas where this error is coming from?
The padding function
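For context, one general way this message can arise (not confirmed as the cause here) is that each piece has unique labels on its own, while the combined result does not. The sketch below uses two hypothetical per-group frames with made-up column labels; note that newer pandas versions word the message as "cannot reindex on an axis with duplicate labels".

```python
import pandas as pd

# Two hypothetical per-group results; each has unique column labels on its own.
group_1996 = pd.DataFrame({"a": [1.0], "EmptyColumn2": [float("nan")]})
group_1997 = pd.DataFrame({"b": [2.0], "EmptyColumn2": [float("nan")]})
each_is_unique = not bool(
    group_1996.columns.duplicated().any() or group_1997.columns.duplicated().any()
)

# Placing them side by side repeats the label "EmptyColumn2".
combined = pd.concat([group_1996, group_1997], axis=1)
combined_has_duplicates = bool(combined.columns.duplicated().any())

# Reindexing along an axis with duplicate labels raises ValueError.
try:
    combined.reindex(columns=["a", "b", "EmptyColumn2"])
    reindex_error = None
except ValueError as exc:
    reindex_error = str(exc)

print(each_is_unique, combined_has_duplicates)
print(reindex_error)
```

So checking `.duplicated()` inside the applied function only covers one group at a time; it says nothing about the labels after the groups are put back together.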
import numpy as np
import pandas as pd

def padDFToXColumns( df, TOT_COLUMNS = 11 ):
    """
    Pad out the number of columns in df to TOT_COLUMNS
    (add TOT_COLUMNS - len(df.columns) empty columns)
    """
    numColsInDF = len(df.columns)
    if numColsInDF > TOT_COLUMNS:
        print("ERROR: Number Of Columns (%s) Exceeds Max Columns (%s)" % (numColsInDF, TOT_COLUMNS))
        return  # note: this returns None for the group
    ### Add Empty Columns ###
    numColsToAdd = TOT_COLUMNS - numColsInDF
    # Labels continue the count, e.g. EmptyColumn9 .. EmptyColumn11 for an 8-column group
    columnsToAdd = [ 'EmptyColumn' + str(num) for num in range(numColsInDF + 1, TOT_COLUMNS + 1) ]
    # All-NaN frame with a positional index; join() aligns on index labels
    emptyColumns = pd.DataFrame( columns = columnsToAdd, index = np.arange(len(df.index)) )
    paddedDF = df.join(emptyColumns)
    #paddedDF.reset_index( drop = True, inplace = True )
    return paddedDF
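As a point of comparison, here is a minimal sketch of padding each year's block explicitly and concatenating the blocks, instead of going through groupby(...).apply. It is an illustration under assumptions, not a confirmed fix: the helper name pad_block, the two-level column layout (Year, Gender), and the demo value of 4 total columns are all hypothetical. The padding columns carry the year in their label tuple, so the combined column index stays unique.

```python
import numpy as np
import pandas as pd

TOT_COLUMNS = 4  # small value for the demo; the question uses 11


def pad_block(block, year, total_columns=TOT_COLUMNS):
    """Pad one year's block (hypothetical helper) to total_columns columns."""
    n_cols = block.shape[1]
    if n_cols > total_columns:
        raise ValueError("%d columns exceeds the maximum of %d" % (n_cols, total_columns))
    # Padding labels include the year, so they differ from year to year.
    pad_labels = [(year, "EmptyColumn%d" % i) for i in range(n_cols + 1, total_columns + 1)]
    if not pad_labels:
        return block
    pad_cols = pd.MultiIndex.from_tuples(pad_labels, names=block.columns.names)
    padding = pd.DataFrame(np.nan, index=block.index, columns=pad_cols)
    return pd.concat([block, padding], axis=1)


# Toy frame: assumed layout with column levels (Year, Gender) and Race as the index.
columns = pd.MultiIndex.from_tuples(
    [(1996, "1.Female"), (1996, "2.Male"), (1997, "1.Female")],
    names=["Year", "Gender"],
)
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.Index(["1.White", "2.Black"], name="Race"),
    columns=columns,
)

# Select each year's columns with a boolean mask, pad, and put the blocks side by side.
years = wide.columns.get_level_values("Year")
padded = pd.concat(
    [pad_block(wide.loc[:, years == year], year) for year in years.unique()],
    axis=1,
)
print(padded)
```

With the real data the column index has more levels, so the padding tuples would need one entry per level.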
The DataFrame
>>> mydata.head()
   SurveyYear  Age        Race    Gender  WeightAdjusted
0        1996   39     1.White  1.Female         1039.13
1        1996    9     1.White    2.Male          995.13
2        1996    8     1.White    2.Male          775.66
3        1996   39     1.White    2.Male          404.28
4        1996   33  3.Hispanic  1.Female          404.28
>>> groupbyKeys = ['SurveyYear', 'Age', 'Race', 'Gender']
>>> cellPopulations = mydata.groupby(groupbyKeys).agg( {'WeightAdjusted':'sum'})
>>> cellPopulations.head(20)
                                    WeightAdjusted
SurveyYear Age Race       Gender
1996       0   1.White    1.Female      1204859.60
                          2.Male        1227666.34
               2.Black    1.Female       307495.16
                          2.Male         263571.07
               3.Hispanic 1.Female       320359.68
                          2.Male         392902.80
               4.Asian    1.Female        78615.49
                          2.Male          82341.54
               5.Other    1.Female        16134.33
                          2.Male          19365.76
           1   1.White    1.Female      1195134.70
                          2.Male        1195659.14
               2.Black    1.Female       328376.10
                          2.Male         383293.79
               3.Hispanic 1.Female       322862.58
                          2.Male         404322.04
               4.Asian    1.Female        79499.56
                          2.Male          73783.69
               5.Other    1.Female        20647.55
                          2.Male          24222.52
>>> unstackKey = ['SurveyYear', 'Age', 'Gender']
>>> penultimateFormatNot11Columns = cellPopulations.unstack(unstackKey)
>>> penultimateFormatNot11Columns
            WeightAdjusted                                                                                                          ...
SurveyYear        1996                                                                                                              ...        1997
Age                  0                       1                       2                       3                       4              ...          76                      77                      78                      79                      80
Gender        1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male  ...    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male
Race                                                                                                                                ...
1.White     1204859.60  1227666.34  1195134.70  1195659.14  1197386.21  1288700.89  1251324.65  1307458.14  1236790.33  1374989.75  ...   764103.31   506844.04   702775.64   425705.16   666705.33   423419.49   577674.82   366109.58  3898404.40  2283771.11
2.Black      307495.16   263571.07   328376.10   383293.79   291976.23   326400.85   310870.61   323344.13   301025.43   323199.08  ...    68272.99    43254.98    50082.98    34347.45    50788.70    36772.29    31393.21    20720.47   366569.11   180108.23
3.Hispanic   320359.68   392902.80   322862.58   404322.04   344564.20   340702.86   303325.95   321065.53   382663.64   311911.38  ...    39084.04    17362.56    27507.45    18803.48    17619.95    24060.91    35665.78    23802.81   174972.00   105530.84
4.Asian       78615.49    82341.54    79499.56    73783.69    96289.08    88222.32    96411.97    92029.56    77070.10    90370.15  ...    30196.58    27745.90    18419.49    15406.79     7272.27    17891.33    18116.50     3606.67    57684.54    42662.74
5.Other       16134.33    19365.76    20647.55    24222.52    17469.53    27237.94    11220.90     6996.58    23640.43    14917.77  ...     4441.26         nan     1487.90     2845.89      522.43     2453.52      303.66     2982.57    18870.12     6232.88
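The steps above can be reproduced on a toy input. The sketch below uses a few made-up rows in the same shape as mydata (the values are illustrative, not the survey data) and shows the column level names that unstack produces, plus a count of columns per year, which is the number the padding step has to top up.

```python
import pandas as pd

# Made-up rows shaped like mydata; not the real survey data.
mydata = pd.DataFrame({
    "SurveyYear": [1996, 1996, 1996, 1997, 1997],
    "Age": [39, 39, 8, 39, 8],
    "Race": ["1.White", "1.White", "1.White", "3.Hispanic", "1.White"],
    "Gender": ["1.Female", "1.Female", "2.Male", "1.Female", "2.Male"],
    "WeightAdjusted": [1039.13, 995.13, 775.66, 404.28, 404.28],
})

# Sum the weights within each (SurveyYear, Age, Race, Gender) cell.
groupbyKeys = ["SurveyYear", "Age", "Race", "Gender"]
cellPopulations = mydata.groupby(groupbyKeys).agg({"WeightAdjusted": "sum"})

# Move every key except Race from the row index into the columns.
unstackKey = ["SurveyYear", "Age", "Gender"]
wide = cellPopulations.unstack(unstackKey)

# The first column level is the unnamed level holding "WeightAdjusted".
print(list(wide.columns.names))

# Number of existing columns per year.
cols_per_year = wide.columns.get_level_values("SurveyYear").value_counts()
print(cols_per_year)
```

In this output the year level is named SurveyYear, and the columns have four levels in total.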
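【Discussion】:
-
I think you could add a sample of the data that reproduces the error, thanks.
-
Added more information about the underlying data and how it is produced.
Tags: python pandas pandas-groupby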