【Question title】: Python Pandas: Create dataframe from Excel file with multi (merged cell) headers
【Posted】: 2020-03-13 19:58:51
【Question】:

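I am fairly new to Python (Pandas) and would like to use it to automate Excel tasks and work more efficiently :)

I am currently working with the Excel sales report below, in which the year is a merged cell.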

           |               2018                          |              2019                      |
| Product  |  January  |  February  |  March  |  April   |  January  |  February |  March | April |
| A        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| B        |        9  |        10  |     65  |     50   |     8     |     63    |   65   |    50 |     
| C        |        7  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| D        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |

I now want to reshape the report into a stacked (long) format, which I can then write back to Excel and use for further analysis:

Product  |  Year  |  Month  |  Values
A        |   2018 | January |       8    
B        |   2018 | February|       9

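My idea was to create a dataframe and use pd.melt().

Unfortunately, I already fail at the first step, creating the dataframe.

The year appears in only 2 of the column headers; all the others show "Unnamed: x".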

import pandas as pd

# change console output
desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

# Read Excel file and create dataframe

df = pd.read_excel("Stackoverflow_example.xlsx")

print(df)




  Unnamed: 0     2018 Unnamed: 2 Unnamed: 3 Unnamed: 4     2019 Unnamed: 6 Unnamed: 7 Unnamed: 8
0    Product  January   February      March      April  January   February      March      April
1          A        8         10         65         50        8         10         65         50
2          B        9         10         65         50        8         63         65         50
3          C        7         10         65         50        8         10         65         50
4          D        8         10         65         50        8         10         65         50

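It would be great if someone could help me with this.

Many thanks.

Edit:

Adding header=[0,1], index_col=[0] works, but I am still struggling to find a way to convert the result into a stacked format: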

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

print(df)

----------------------------------------------------------------------

           2018                         2019                     
Product January February March April January February March April
A             8       10    65    50       8       10    65    50
B             9       10    65    50       8       63    65    50
C             7       10    65    50       8       10    65    50
D             8       10    65    50       8       10    65    50

Stacking works, but it also messes up the column header names (level_0 appears, and "Product" ends up as the header of the month column):


import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

print(df)

-----------------------------------------------------------------------------
   level_0   Product  2018  2019
0        A     April    50    50
1        A  February    10    10
2        A   January     8     8
3        A     March    65    65
4        B     April    50    50
5        B  February    10    63
6        B   January     9     8
7        B     March    65    65
8        C     April    50    50
9        C  February    10    10
10       C   January     7     8
11       C     March    65    65
12       D     April    50    50
13       D  February    10    10
14       D   January     8     8
15       D     March    65    65

I tried renaming the columns and setting "Product" as the index, which leaves an empty row of cells below the "Month 2018 2019" header:

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

df.columns = ["Product", "Month", "2018", "2019"]
df = df.set_index("Product")

print(df)

----------------------------------------------------------

           Month  2018  2019
Product                      
A           April    50    50
A        February    10    10
A         January     8     8
A           March    65    65
B           April    50    50
B        February    10    63
B         January     9     8
B           March    65    65
C           April    50    50
C        February    10    10
C         January     7     8
C           March    65    65
D           April    50    50
D        February    10    10
D         January     8     8
D           March    65    65
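The attempts above stack only the innermost column level (the months), so the years remain as columns. A minimal sketch of stacking both column levels at once is shown here. It assumes a small in-memory frame (the name `wide` and the two-product sample are stand-ins for the Excel sheet, not part of the original post), and row order may differ between pandas versions:

```python
import pandas as pd

# Hypothetical in-memory stand-in for the Excel sheet (first two products only)
columns = pd.MultiIndex.from_product(
    [[2018, 2019], ["January", "February", "March", "April"]]
)
wide = pd.DataFrame(
    [[8, 10, 65, 50, 8, 10, 65, 50],
     [9, 10, 65, 50, 8, 63, 65, 50]],
    index=pd.Index(["A", "B"], name="Product"),
    columns=columns,
)

# Stack both column levels so that year and month both become index levels
stacked = wide.stack([0, 1])

# Name the three index levels, then turn them into ordinary columns
stacked.index = stacked.index.set_names(["Product", "Year", "Month"])
long_df = stacked.reset_index(name="Values")

print(long_df.head())
```

Because every column level is stacked, the result is a Series, and `reset_index(name=...)` gives the value column its name in the same step.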

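【Comments】:

  • Thanks, jezrael - that works, but I am still struggling to convert it into a stacked format :/
  • Can you check the answer?
  • I have worked with this kind of data before (mostly SAP BW!!). Let me know if my answer helps.
  • @SebK - oops, unstack is needed here; answer edited.
  • Thanks a lot, guys! Both solutions work fine :)

Tags: python excel pandas dataframe header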


【Solution 1】:

First pass header=[0,1] to get a MultiIndex in the columns, and index_col=[0] to convert the first column into the index and avoid a MultiIndex there:

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

Then reshape with DataFrame.unstack, change the index level names with Series.rename_axis, and finally convert the Series to columns with Series.reset_index:

df = df.unstack().rename_axis(('Year','Month','Product')).reset_index(name='Value')

# if the order of the columns is important, change it by selecting a subset
df = df[['Product','Year','Month','Value']]
print(df.head())

  Product  Year     Month  Value
0       A  2018   January      8
1       B  2018   January      9
2       C  2018   January      7
3       D  2018   January      8
4       A  2018  February     10
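To try this approach without the Excel file, the same steps can be run on a frame built in memory. This is only a sketch: the variable name `wide` and the hand-typed values (copied from the report in the question) are assumptions standing in for the result of `pd.read_excel(..., header=[0,1], index_col=[0])`.

```python
import pandas as pd

# Stand-in for the frame returned by read_excel with header=[0, 1], index_col=[0]
columns = pd.MultiIndex.from_product(
    [[2018, 2019], ["January", "February", "March", "April"]]
)
wide = pd.DataFrame(
    [[8, 10, 65, 50, 8, 10, 65, 50],
     [9, 10, 65, 50, 8, 63, 65, 50],
     [7, 10, 65, 50, 8, 10, 65, 50],
     [8, 10, 65, 50, 8, 10, 65, 50]],
    index=pd.Index(["A", "B", "C", "D"], name="Product"),
    columns=columns,
)

# unstack() on a frame with a single-level index returns a Series whose index
# is (column level 0, column level 1, row label), i.e. (Year, Month, Product)
long_df = (
    wide.unstack()
        .rename_axis(["Year", "Month", "Product"])
        .reset_index(name="Value")
)

# Reorder the columns into the layout requested in the question
long_df = long_df[["Product", "Year", "Month", "Value"]]
print(long_df.head())
```

A quick sanity check is the row count: 4 products × 2 years × 4 months should give 32 rows.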


    【Solution 2】:

    One approach is to use pd.MultiIndex, stack and melt:

    print(df)
        Unnamed:_0     2018 Unnamed:_2 Unnamed:_3 Unnamed:_4     2019 Unnamed:_6  \
    0    Product  January   February      March      April  January   February   
    1          A        8         10         65         50        8         10   
    2          B        9         10         65         50        8         63   
    3          C        7         10         65         50        8         10   
    4          D        8         10         65         50        8         10   
    
      Unnamed:_7 Unnamed:_8  
    0      March      April  
    1         65         50  
    2         65         50  
    3         65         50  
    4         65         50  
    

    First we need to replace the unnamed column labels and give the Product column a name.

    import numpy as np

    df.columns = pd.Series([np.nan if 'Unnamed:' in x else x for x in df.columns.values]).ffill().values.flatten()
    

    Because we used ffill, the first column label is still NaN; let's call it Product and set it as the index.

    df.rename(columns={np.nan : 'Product'},inplace=True)
    df.set_index('Product',inplace=True)
    

    Then let's create a MultiIndex from the new columns:

    print(df)
            2018      2018   2018   2018     2019      2019   2019   2019
    Product                                                                  
    Product  January  February  March  April  January  February  March  April
    A              8        10     65     50        8        10     65     50
    B              9        10     65     50        8        63     65     50
    C              7        10     65     50        8        10     65     50
    D              8        10     65     50        8        10     65     50
    
    
    df.columns = pd.MultiIndex.from_arrays([df.columns,df.iloc[0].values])
    df_new = df.iloc[1:].stack().reset_index().melt(id_vars=['Product','level_1'])
    print(df_new)
        Product   level_1 variable value
    0        A     April     2018    50
    1        A  February     2018    10
    2        A   January     2018     8
    3        A     March     2018    65
    4        B     April     2018    50
    5        B  February     2018    10
    6        B   January     2018     9
    7        B     March     2018    65
    8        C     April     2018    50
    9        C  February     2018    10
    10       C   January     2018     7
    11       C     March     2018    65
    12       D     April     2018    50
    13       D  February     2018    10
    14       D   January     2018     8
    15       D     March     2018    65
    16       A     April     2019    50
    17       A  February     2019    10
    18       A   January     2019     8
    19       A     March     2019    65
    20       B     April     2019    50
    21       B  February     2019    63
    22       B   January     2019     8
    23       B     March     2019    65
    24       C     April     2019    50
    25       C  February     2019    10
    26       C   January     2019     8
    27       C     March     2019    65
    28       D     April     2019    50
    29       D  February     2019    10
    30       D   January     2019     8
    31       D     March     2019    65
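    The output above lists the months alphabetically and keeps the generic `level_1` / `variable` column names. As a hedged alternative sketch (not part of the original answer), `melt` can be applied directly to a frame whose column levels are already named, and the month column can be made an ordered categorical so that sorting follows the calendar. The frame `wide`, the `MONTHS` list, and the two-product sample are assumptions for illustration:

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April"]

# Hypothetical stand-in for the sheet, with the column levels named up front
columns = pd.MultiIndex.from_product([[2018, 2019], MONTHS], names=["Year", "Month"])
wide = pd.DataFrame(
    [[8, 10, 65, 50, 8, 10, 65, 50],
     [9, 10, 65, 50, 8, 63, 65, 50]],
    index=pd.Index(["A", "B"], name="Product"),
    columns=columns,
)

# ignore_index=False keeps the Product index; the named column levels
# become the "Year" and "Month" columns of the melted frame
long_df = wide.melt(ignore_index=False, value_name="Value").reset_index()

# An ordered categorical makes sort_values follow calendar order
long_df["Month"] = pd.Categorical(long_df["Month"], categories=MONTHS, ordered=True)
long_df = long_df.sort_values(["Product", "Year", "Month"]).reset_index(drop=True)

print(long_df)
```

    Naming the column levels before melting avoids a separate renaming step afterwards.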
    
