【Question title】: Pandas: reshape data frame
【Posted】: 2016-05-03 18:54:41
【Question】:

I have the following data frame:

import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

zz = pd.read_csv(url)
zz.head(5)

    date    feccandid   feccandcfscore.dyn  pacid   paccfscore  cid     catcode     type_x  di  amtsum  state   log_diff_unemployment   party   type_y  bills   years_exp   disposition     billsum
0   2006    S8NV00073   0.496   C00000422   0.330   N00006619   H1100   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     3
1   2006    S8NV00073   0.496   C00375360   0.176   N00006619   H1100   24K     D   4500    NV  -0.024693   Republican  rep     s22-109     12  support     3
2   2006    S8NV00073   0.496   C00113803   0.269   N00006619   H1130   24K     D   2500    NV  -0.024693   Republican  rep     s22-109     12  support     2
3   2006    S8NV00073   0.496   C00249342   0.421   N00006619   H1130   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     2
4   2006    S8NV00073   0.496   C00255752   0.254   N00006619   H1130   24K     D   4000    NV  -0.024693   Republican  rep     s22-109     12  support     2

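I want to reshape it so that the date column is the index, the feccandid values are the column headers (I will later make them a second index so that I can send the frame to a panel), and the other column headers become rows. The desired output looks like this: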

date    feccandid              S8NV00072    S8NV00074   S8NV00075   S8NV00076   S8NV00077
2006    feccandcfscore.dyn        0.496        0.496        0.496     0.496       0.496
2006    pacid                  C00000422    C00375360   C00113803   C00249342   C00255752
2006    paccfscore                  0.33        0.176      0.269         0.421    0.254
2006    cid                    N00006619    N00006619   N00006619   N00006619   N00006619
2006    catcode                  H1100      H1100          H1130    H1130      H1130
2006    type_x                    24K         24K            24K    24K     24K
2006    di                           D          D              D        D       D
2006    amtsum                      5000      4500          2500        5000       4000
2006    state                        NV        NV           NV        NV         NV
2006    log_diff_unemployment   -0.024693   -0.024693   -0.024693   -0.024693   -0.024693
2006    party                     Republican    Republican  Republican  Republican  Republican
2006    type_y                            rep         rep         rep       rep      rep
2006    bills                           s22-109      s22-109    s22-109    s22-109     s22-109
2006    years_exp                             12        12        12       12      12
2006    disposition                      support       support  support support support
2006    billsum                            3               3        2      2       2

Following jezrael's suggestion, I tried the following:
zz=zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)

zz.head()

    feccandcfscore.dyn  ...     billsum
feccandid   H0AL02087   H0AL07060   H0AR01083   H0AR02107   H0AR03055   H0AR04038   H0AZ01259   H0AZ03362   H0CA15148   H0CA19173   ...     S8MI00158   S8MN00438   S8MS00055   S8MT00010   S8NC00239   S8NE00117   S8NM00010   S8NV00073   S8OR00207   S8WI00026
date                                                                                    
2005    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2006    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     2.125   NaN     NaN
2007    NaN     0.016   NaN     NaN     NaN     -0.151  NaN     NaN     -0.777  NaN     ...     1.000000    NaN     1.666667    1.552632    NaN     NaN     2.0     1.000   NaN     2.0
2008    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     1.285714    NaN     NaN     5.431373    NaN     NaN     NaN     NaN     NaN     NaN
2009    NaN     NaN     NaN     NaN     NaN     -0.086  NaN     NaN     -0.790  NaN     ...     NaN     NaN     NaN     2.433333    NaN     NaN     NaN     NaN     3.0     2.8

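This is close to what I want, except that I am trying to have feccandid as the only column header, with the original column headers (the topmost column level in the last example) transposed into rows.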

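【Question comments】:

  • Do you want to pass the modified elements (aggregated by pivot_table) to the panel? If so, you can reorder the labels in the pivot table with zz.columns = zz.columns.reorder_levels((1,0)). After that you can send it to a panel with zz.T.to_panel() and then use swapaxes(). If you want to keep all elements unchanged, that is also doable (I can write some code later), but I'm not sure whether the resulting panel would blow up in size.
  • @ptrj: I just saw your message. I'll get back to you later today. I'm going to try changing a few things with the data.
  • @ptrj: Thanks for your comment! I don't think it's necessary to keep the elements unchanged. I'll work along the lines you suggested.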
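
The level reordering suggested in the first comment can be sketched on a small made-up frame. This is only an illustration under stated assumptions: the sample values are invented rather than taken from the real data, only numeric columns are kept, and aggfunc="mean" is passed as a string. After the reorder, feccandid is the top column level, so one candidate's block can be selected directly.

```python
import pandas as pd

# Hypothetical sample data (not rows from the real file).
zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

# Columns come out as a two-level MultiIndex: (original column, feccandid).
wide = zz.pivot_table(index="date", columns="feccandid", aggfunc="mean")

# Put feccandid on top, then sort so each candidate's columns sit together.
swapped = wide.copy()
swapped.columns = swapped.columns.reorder_levels([1, 0])
swapped = swapped.sort_index(axis=1)

# Selecting on the top level returns a date x variable frame for one candidate.
one_candidate = swapped["S8NV00072"]
print(swapped)
print(one_candidate)
```

Sorting after the reorder is optional, but without it the columns of one candidate stay interleaved with those of the others.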

Tags: python pandas pivot melt


【Solution 1】:

I think you can use pivot_table (the default aggregation function is np.mean):

df = zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean)
# Flatten the two-level column index into single 'column_feccandid' labels
df.columns = ['_'.join(col) for col in df.columns.values]
print(df)

If you need to replace NaN with 0:

print(zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean))
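
To make the effect of fill_value and of the column flattening concrete, here is a minimal sketch on an invented three-row frame in which one (date, feccandid) combination is missing. The sample values are assumptions for illustration, and aggfunc="mean" is passed as a string on numeric columns only.

```python
import pandas as pd

# Hypothetical sample: S8NV00074 has no row for 2002.
zz = pd.DataFrame({
    "date": [2001, 2001, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072"],
    "pacid": [0.3, 0.1, 0.7],
    "billsum": [1, 2, 5],
})

# Without fill_value the missing combination shows up as NaN.
with_nan = zz.pivot_table(index="date", columns="feccandid", aggfunc="mean")

# With fill_value=0 (a number, not the string '0') the gap becomes 0.
filled = zz.pivot_table(index="date", columns="feccandid",
                        aggfunc="mean", fill_value=0)

# Join the two column levels into single labels such as 'pacid_S8NV00074'.
flat = filled.copy()
flat.columns = ["_".join(col) for col in flat.columns]

print(with_nan)
print(flat)
```

Flattening is convenient for export, but it discards the level structure, so it is probably best applied as a last step.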

Edit:

I created a small sample DataFrame. As ptrj said, you can use T and to_panel to create a panel. Then you may need transpose:

import numpy as np
import pandas as pd

zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002}, 
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074', 
                                 2: 'S8NV00072', 3: 'S8NV00074'}, 
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})

print(zz)
   billsum  date  feccandid  pacid
0        1  2001  S8NV00072    0.3
1        2  2001  S8NV00074    0.1
2        5  2002  S8NV00072    0.7
3        6  2002  S8NV00074    0.4

zz = zz.pivot_table(index='date', 
                         columns='feccandid',
                         fill_value=0, 
                         aggfunc=np.mean)
print(zz.T)
date               2001  2002
        feccandid            
billsum S8NV00072   1.0   5.0
        S8NV00074   2.0   6.0
pacid   S8NV00072   0.3   0.7
        S8NV00074   0.1   0.4
# Note: Panel (and to_panel) was removed in pandas 0.25, so this step needs an older pandas
wp = zz.T.to_panel()
print(wp)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074

print(wp.transpose(2, 0, 1))

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid
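
Since Panel was removed in pandas 0.25, the same layout can be approximated with a MultiIndex instead. The sketch below is one possible approach, not the method used in the answer: it reuses the answer's sample values, passes aggfunc="mean" as a string, and introduces the level name "variable" purely for illustration. Stacking the top column level moves the original column names into the rows, which matches the shape of the desired output in the question.

```python
import pandas as pd

# Same sample values as in the answer above.
zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

wide = zz.pivot_table(index="date", columns="feccandid", aggfunc="mean")

# Move the original column names (column level 0) into the row index,
# leaving feccandid as the only column header.
tall = wide.stack(level=0)
tall.index.names = ["date", "variable"]  # "variable" is an assumed label
print(tall)

# Rough counterpart of picking one item from the transposed panel:
# a date x variable frame for a single candidate.
per_candidate = wide.xs("S8NV00072", axis=1, level="feccandid")
print(per_candidate)
```

Note that the mean aggregation only makes sense for numeric columns; text columns such as pacid or party in the real data would need a different aggfunc (for example "first") or separate handling.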

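【Comments】:

  • This almost works! Unfortunately, it makes the feccandid values column headers but places them below the existing column headers. I will post the result as an edit.
  • I just applied your edit, but got the following error: TypeError: Must pass list-like as names.
  • No problem. You get a MultiIndex in the columns. What is the desired output?
  • Desired output: send it to a panel, with feccandid as the Items axis (one panel item per candidate), date as the Major_axis, and all other columns as the minor_axis.
  • I tried adding a sample to my answer. Please check whether it is what you want. By the way, this method is described here.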