Tidy data from multilevel Excel file via pandas: answers

【Question title】: Tidy data from multilevel Excel file via pandas
【Posted】: 2017-03-12 04:47:01
【Question】:

I would like to produce tidy data from an Excel file that looks like this, with three levels of "merged" headers:

Pandas reads the file just fine, with multilevel headers:

# df = pandas.read_excel('test.xlsx', header=[0,1,2])

For reproducibility, you can copy and paste this:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})

I would like to normalize this so that the level headers end up in variable rows, while a, b and c stay as columns:

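Without the multilevel headers, I would use pandas.melt(df, id_vars=['a', 'b', 'c']) to get what I want. pandas.melt(df) gives me the three variable columns I want, but obviously does not keep the a, b and c columns.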

【Question comments】:

Tags: python excel pandas

【Solution 1】:

Split the DF into two parts so that it is easier to melt and re-join.

import pandas as pd

first_half = df.iloc[:, :3]   # id columns a, b, c
second_half = df.iloc[:, 3:]  # value columns under the three header levels

Melt the second piece.

melt_second_half = pd.melt(second_half)

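Repeat the values in the first piece so that they line up with the melted rows. The repeat count is the number of rows in the melted DF divided by the number of rows in the first piece.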

# One copy of the id columns per melted value column
repeats = melt_second_half.shape[0] // first_half.shape[0]
first_reps = pd.concat([first_half] * repeats, ignore_index=True)
# Keep only the innermost header level ('a', 'b', 'c') as column names
col_names = first_reps.columns.get_level_values(2)
melt_first_half = pd.DataFrame(first_reps.values, columns=col_names)

Concatenate the two pieces and return the resulting DF sorted by the value column.

df_concat = pd.concat([melt_first_half, melt_second_half], axis=1)
df_concat.sort_values('value').reset_index(drop=True)
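
For anyone who wants to try the steps above end to end, here is a minimal sketch that runs them on a cut-down copy of the frame from the question (only four of the eight value columns, to keep it short). The name tidy is just an illustration and is not part of the original answer. The alignment relies on pd.melt emitting the values column by column, so the id rows simply cycle.

```python
import pandas as pd

# Cut-down version of the question's frame: three id columns, four value columns.
df = pd.DataFrame({
    ('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'},
    ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'},
    ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'},
    ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10},
    ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20},
    ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50},
    ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60},
})

first_half = df.iloc[:, :3]
second_half = df.iloc[:, 3:]

# With unnamed column levels, melt yields variable_0, variable_1, variable_2, value.
melted = pd.melt(second_half)

# melt walks the value columns one after another, so the id rows repeat
# once per value column in the same order.
repeats = len(melted) // len(first_half)
ids = pd.concat([first_half] * repeats, ignore_index=True)
ids.columns = ids.columns.get_level_values(2)  # 'a', 'b', 'c'

tidy = (
    pd.concat([ids, melted], axis=1)
      .sort_values('value')
      .reset_index(drop=True)
)
print(tidy)
```

Each row of tidy should then carry a, b and c next to the three header levels and the value.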

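【Comments】:

This definitely works, but I'm sure there must be a more direct way. If nothing cleaner shows up later today, I'll accept this.

【Solution 2】:

It should be as simple as: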

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])

long_df = wide_df.stack().stack().stack()

Here is an example with a mock CSV file (note that the 4th row labels the index and the first column labels the header levels):

from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
    num,,,this1,this1,this1,this1,that1,that1,that1,that1
    let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
    animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
    a,b,c,,,,,,,,
    a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
    a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
    a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))


wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()

So wide_df looks like this:

num      this1                  that1                 
let      thisA     thatA        thisB     thatB       
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c                                               
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
      c2    y1  y2    y3     y4    y5  y6    y7     y8
   b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

And long_df:

a   b   c   animal  let    num  
a1  b1  c1  bird    thatA  this1    x3
                    thatB  that1    x7
            cat     thisA  this1    x1
                    thisB  that1    x5
            dog     thisA  this1    x2
                    thisB  that1    x6
            lizard  thatA  this1    x4
                    thatB  that1    x8
        c2  bird    thatA  this1    y3
                    thatB  that1    y7
            cat     thisA  this1    y1
                    thisB  that1    y5
            dog     thisA  this1    y2
                    thisB  that1    y6
            lizard  thatA  this1    y4
                    thatB  that1    y8
    b2  c1  bird    thatA  this1    z3
                    thatB  that1    zy
            cat     thisA  this1    z1
                    thisB  that1    z5
            dog     thisA  this1    z2
                    thisB  that1    6z
            lizard  thatA  this1    z4
                    thatB  that1    z8

Using the literal data shown in the OP, you can get this without modifying anything by doing the following:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
    pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
        .reset_index(level=0, drop=True)
        .rename_axis(index_names, axis='index')
        .rename_axis(col_names, axis='columns')
        .stack()
        .stack()
        .stack()
        .to_frame()
)

I think the tricky part will be inspecting each of your files to figure out what index_names should be.
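
If the frame has already been read with only header=[0, 1, 2] (as in the question), one possible way to avoid inspecting each file by hand is to take index_names from the innermost header level of the leading columns. The sketch below is only an illustration under that assumption: n_id = 3, the cut-down frame, and the names ids, wide and long_df are made up for the example, and the number of id columns still has to be known per file.

```python
import pandas as pd

# Cut-down version of the question's frame, as read with header=[0, 1, 2].
df = pd.DataFrame({
    ('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'},
    ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'},
    ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'},
    ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10},
    ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20},
    ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50},
    ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60},
})

n_id = 3  # assumption: the first three columns are the id columns

# The real names of the id columns sit in the innermost header level.
index_names = [col[-1] for col in df.columns[:n_id]]  # ['a', 'b', 'c']

ids = df.iloc[:, :n_id]
wide = df.iloc[:, n_id:].copy()

# Move the id columns into a row MultiIndex and name the column levels.
wide.index = pd.MultiIndex.from_arrays(
    [ids.iloc[:, i].to_numpy() for i in range(n_id)],
    names=index_names,
)
wide.columns.names = ['Level1', 'Level2', 'Level3']

# Each stack() moves the innermost remaining column level into the row index.
long_df = wide.stack().stack().stack().rename('value').reset_index()
print(long_df)
```

Newer pandas versions may emit a FutureWarning about the stack implementation here, and the row order of long_df can differ between versions, so sort explicitly if the order matters.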

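【Comments】:

I thought of this too, but it messes up the names. long_df no longer seems to contain any mention of a and b.
@chthonicdaemon Maybe the example I added explains it better.
OK, so this solution requires me to edit the Excel file to add extra rows for the index names and label names. That's not a deal breaker, but given that I have a lot of these files, is there a way to use the files as they are?
@chthonicdaemon Added another approach.
I can actually just read the file twice to get the index headers. No problem. I like this approach. Thanks.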