【Question title】:Pandas read_excel with duplicate header values
【Posted】:2021-06-21 14:07:42
【Question description】:

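I have an Excel sheet that I want to read into a pandas DataFrame with multi-index columns. The complication is that the sheet contains duplicate header values. When pandas reads it, it appends the .x suffix to the second-level headers rather than to the first-level ones. Is there a way to have the top-level headers renamed instead of the second-level headers?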

Example Excel file:

Reading script:

from pathlib import Path
import pandas as pd


def main():
    xl_file = Path('.') / 'pandasExample.xlsx'
    df = pd.read_excel(xl_file, sheet_name='Sheet1',
                       header=[0, 1], skiprows=[0])
    print(df)


if __name__ == '__main__':
    main()

Output:

  Rectangle        Ellipse    Rectangle
      Width Height       a  b   Width.1 Height.1 Width.2 Height.2
0        10     20       1  2        20       30      40       50

Desired output:

  Rectangle        Ellipse    Rectangle.1        Rectangle.2       
      Width Height       a  b      Width Height      Width Height
0        10     20       1  2         20     30         40     50

【Question discussion】:

    Tags: python excel pandas


    【Solution 1】:

    Here is a different answer that produces the exact desired output listed in the question.

    from pathlib import Path
    import pandas as pd
    from typing import List
    
    
    def rename_headers(headers: List[str]) -> List[str]:
        # Count how many times each header prefix has been seen so far
        header_counts = {}
        new_headers = []
        for header in headers:
            # Drop any '.N' suffix that pandas appended to a duplicate
            header_prefix = header.split('.')[0]
            header_occurrence = header_counts.get(header_prefix, 0)
            if header_occurrence > 0:
                new_header = f'{header_prefix}.{header_occurrence}'
            else:
                new_header = header_prefix
            new_headers.append(new_header)
            header_counts[header_prefix] = header_occurrence + 1
        return new_headers
    
    def main():
        xl_file = Path('.') / 'pandasExample.xlsx'
    
        # Read first level headers
        header_df = pd.read_excel(xl_file, sheet_name='Sheet1', header=[
            0], skiprows=[0], nrows=1)
        headers = list(filter(lambda x: not x.startswith(
            'Unnamed'), list(header_df.columns)))
    
        # Generate the desired headers
        new_headers = rename_headers(headers)
    
        # Read in the full dataframe
        df = pd.read_excel(xl_file, sheet_name='Sheet1', header=[
            0, 1], skiprows=[0])
    
        # Create a dictionary that identifies the parameters for each unique header
        unique_headers = pd.unique(pd.Index(df.columns.get_level_values(0)))
        parameters = {}
        for header in unique_headers:
            parameters[header] = pd.unique(
                [column.split('.')[0] for column in df[header].columns])
    
    
        unstack_df = df.head(1).stack()
        # Keep order of the original index after stack
        index = df.head(1).unstack().index.get_level_values(1)
        unstack_df = unstack_df.reindex(zip([0] * len(index), index))
        unstack_df = unstack_df.reset_index()
    
        # Create the new level 0 and level 1 headers
        level_0 = []
        for header in new_headers:
            level_0 += [header] * len(parameters[header.split('.')[0]])
        level_1 = [parameter.split('.')[0] for parameter in unstack_df['level_1']]
    
        # Rename level 0 and level 1 columns for the dataframe
        df.columns = pd.MultiIndex.from_tuples(zip(level_0, level_1))
        print(df)
    
    
    if __name__ == '__main__':
        main()
    

    Output:

      Rectangle        Ellipse    Rectangle.1        Rectangle.2       
          Width Height       a  b       Width Height       Width Height
    0        10     20       1  2          20     30          40     50
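
A shorter variant may be possible if one is willing to rely on the way pandas mangled the columns in the question's output, where the second-level names carry the .1 and .2 suffixes. The sketch below moves that suffix from the second level to the first. The helper name move_suffix_to_top_level is hypothetical, the frame is built by hand rather than read from pandasExample.xlsx, and the approach assumes that no genuine second-level name ends in a dot followed by digits.

```python
import re

import pandas as pd

# Matches names such as 'Width.1': a base name plus a numeric suffix.
_SUFFIX = re.compile(r"^(?P<base>.*)\.(?P<n>\d+)$")


def move_suffix_to_top_level(columns: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical helper: move a '.N' suffix from level 1 to level 0."""
    new_tuples = []
    for top, sub in columns:
        match = _SUFFIX.match(str(sub))
        if match:
            # ('Rectangle', 'Width.1') becomes ('Rectangle.1', 'Width')
            new_tuples.append((f"{top}.{match.group('n')}", match.group("base")))
        else:
            new_tuples.append((top, sub))
    return pd.MultiIndex.from_tuples(new_tuples)


# Columns shaped like the question's output (assumed, not read from a file).
columns = pd.MultiIndex.from_tuples([
    ("Rectangle", "Width"), ("Rectangle", "Height"),
    ("Ellipse", "a"), ("Ellipse", "b"),
    ("Rectangle", "Width.1"), ("Rectangle", "Height.1"),
    ("Rectangle", "Width.2"), ("Rectangle", "Height.2"),
])
df = pd.DataFrame([[10, 20, 1, 2, 20, 30, 40, 50]], columns=columns)
df.columns = move_suffix_to_top_level(df.columns)
print(df)
```

Because it only inspects the column labels, this avoids reading the file twice, at the cost of depending on the suffix convention seen in the question.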
    

    【Discussion】:

      【Solution 2】:

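      Stack the dataframe (the code calls df.stack(), although the variable is named unstack_df), reset the index, then reassign level_0 to unique labels. I did this by hand, but you could do it programmatically by adding a suffix for every two columns. Then set the multi-index and stack the result. Each index tuple then holds three values: level 0, level 1, and 0.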

       import pandas as pd

       # Skip the rows above the second-level header so only those names become columns
       df = pd.read_excel('dup_header.xls', skiprows=2, nrows=10)
       unstack_df = df.stack()
       unstack_df = unstack_df.reset_index()
       # Replace the row number in level_0 with unique, hand-written group labels
       unstack_df['level_0'] = ['Rectangle1', 'Rectangle1', 'Ellipse', 'Ellipse',
                                'Rectangle2', 'Rectangle2', 'Rectangle3', 'Rectangle3']
       unstack_df = unstack_df.set_index(['level_0', 'level_1'])
       stack_series = unstack_df.stack()

       df = stack_series.to_frame()
       df.columns = ['value']
       #print(df.index)
       #print(df.values)
       print(df)
      
      

      Output:

                                   value
        level_0     level_1    
        Rectangle1  Width     0    10
                    Height    0    20
        Ellipse     a         0     1
                    b         0     2
        Rectangle2  width     0    20
                    height    0    30
        Rectangle3  width.1   0    40
                    height.1  0    50
      

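      【Discussion】:

      • This does not work with a multi-index header like the one in my example. ValueError: cannot specify names when specifying a multi-index header
      • Skip header level 0 and replace it with unique labels, then set the multi-index on the dataframe, stack the values, and convert the result to a dataframe; see above.
      • I guess that works. I will have to write some extra code to figure out the header values.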
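
The extra code for working out the header values could look roughly like the sketch below. It assumes the top header row has already been read raw into a list and that the cells covered by a merged header show up as missing values (None or NaN). Both the function name unique_group_labels and the sample rows are hypothetical. Repeats are numbered Rectangle.1, Rectangle.2 as in the question's desired output, rather than Rectangle1, Rectangle2 as in the hand-written list.

```python
from typing import List, Optional

import pandas as pd


def unique_group_labels(top_row: List[Optional[str]]) -> List[str]:
    """Hypothetical helper: fill merged-cell gaps and number repeated groups."""
    seen = {}      # how many times each group name has started so far
    labels = []
    current = None
    for cell in top_row:
        if not pd.isna(cell):
            # A named cell starts a new group; number it if the name repeats.
            count = seen.get(cell, 0)
            current = cell if count == 0 else f"{cell}.{count}"
            seen[cell] = count + 1
        if current is None:
            raise ValueError("the header row must start with a named cell")
        # A missing cell belongs to the group that started to its left.
        labels.append(current)
    return labels


# Sample header rows shaped like the question's sheet (assumed values).
top_row = ["Rectangle", None, "Ellipse", None, "Rectangle", None, "Rectangle", None]
sub_row = ["Width", "Height", "a", "b", "Width", "Height", "Width", "Height"]

level_0 = unique_group_labels(top_row)
columns = pd.MultiIndex.from_arrays([level_0, sub_row])
print(columns.tolist())
```

The resulting MultiIndex can then be assigned to df.columns of a frame read without its header rows, provided the column count matches.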