Merge two dataframes (both have multi-index): answers

【Question title】: Merge two dataframes (both have multi-index)
【Posted】: 2021-12-04 07:46:33
【Question description】:

I have a question similar to "Merge two dataframes with multi-index".

In:

import pandas as pd
import numpy as np
row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
index_arrays = [np.array(['first', 'first', 'second', 'second']), np.array(['one','two','one','two'])]
df1 = pd.DataFrame([row_x1,row_x2,row_x3,row_x4], columns=list('ABC'), index=index_arrays)
print(df1)

Out:

             A   B   C
first  one  a1  b1  c1
       two  a2  b2  c2
second one  a3  b3  c3
       two  a4  b4  c4

In:

row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']
index_arrays = [np.array(['first','first', 'second',]), np.array(['one','three','two'])]
df2 = pd.DataFrame([row_y1,row_y2,row_y3], columns=list('DEF'), index=index_arrays)
print(df2)

Out:

               D   E   F
first  one    d1  e1  f1
       three  d2  e2  f2
second two    d3  e3  f3

In other words, how can I merge them to get df3 (shown below)?

In:

row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']

row_z1 = row_x1 + row_y1
row_z2 = row_x2 + [np.nan, np.nan, np.nan]
row_z3 = [np.nan, np.nan, np.nan] + row_y2
row_z4 = row_x3 + [np.nan, np.nan, np.nan]
row_z5 = row_x4 + row_y3
index_arrays = [np.array(['first', 'first', 'first', 'second', 'second']), np.array(['one','two','three','one','two'])]
df3 = pd.DataFrame([row_z1,row_z2,row_z3,row_z4,row_z5], columns=list('ABCDEF'), index=index_arrays)
print(df3)

Out:

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       two     a2   b2   c2  NaN  NaN  NaN
       three  NaN  NaN  NaN   d2   e2   f2
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3

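PS: Thanks to @Andreuccio for the question!

Thanks to @Ajay Verma and @EBDS. That does solve the case where the df data is created by hand. But I am confused by the following situation:

I have two dataframes produced from statistical data. I copied the relevant part of each to pass to pd.merge():

In: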

df1 = data1[data1.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

Out:

                         0       1       2       3
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0
               2      38.0    23.0    21.0    29.0
               3      37.0    27.0    32.0    37.0
               4       5.0    14.0    11.0    23.0
               5      31.0    56.0    41.0    65.0
               7     389.0   258.0   337.0   243.0
               NaN  1323.0  1388.0  1307.0  1311.0

In:

df2 = data2[data2.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

Out:

                         0       1       2       3
BASIC_GZAG_TMB 1     207.0   232.0   252.0   223.0
               2      26.0    18.0    19.0    20.0
               3      43.0    41.0    50.0    42.0
               4      35.0    27.0    37.0    15.0
               5      54.0    52.0    78.0    64.0
               6       1.0  1306.0     1.0     4.0
               7     206.0   263.0   227.0   230.0
               NaN  1374.0  1306.0  1282.0  1348.0

Then I merge df1 and df2 with:

df1.merge(df2, left_index=True, right_index=True, how='outer')

Out:

                       0_x     1_x     2_x     3_x     0_y     1_y     2_y  \
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0   207.0   232.0   252.0   
               2      38.0    23.0    21.0    29.0    26.0    18.0    19.0   
               3      37.0    27.0    32.0    37.0    43.0    41.0    50.0   
               4       5.0    14.0    11.0    23.0    35.0    27.0    37.0   
               5      31.0    56.0    41.0    65.0    54.0    52.0    78.0   
               7     389.0   258.0   337.0   243.0   206.0   263.0   227.0   
               NaN  1323.0  1388.0  1307.0  1311.0  1374.0  1306.0  1282.0   

                       3_y  
BASIC_GZAG_TMB 1     223.0  
               2      20.0  
               3      42.0  
               4      15.0  
               5      64.0  
               7     230.0  
               NaN  1348.0

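I am confused that index 6, which exists in df2, disappears from the result.

I know that using df2.merge(df1, ...) could be a workaround. But data1 and data2 are generated dynamically, so I do not know which one has more index values. I just want the union of df1 and df2.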

【Question comments】:

Tags: python pandas dataframe numpy

【Solution 1】:

If you need to sort by number words (one, two, three, ...):

Use the pandas merge command
Sort the index by passing key= a vectorized parse

Code:

from number_parser import parse
dfx = (
    df1.merge(df2,left_index=True,right_index=True,how='outer')
    .sort_index(key=lambda x: np.vectorize(parse)(x).astype(float)) )

Another example:

You may need to install number_parser:

!pip install number_parser

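Update:

Since I do not have the new data, I used the original data to test the "missing 6". I also changed the column names to be the same and added a nan index.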

data1 = df1.copy(deep=True)
data2 = df2.copy(deep=True)
df1 = data1[data1.index.get_level_values(0) == 'first'].copy()
df2 = data2[data2.index.get_level_values(0) == 'first'].copy()

dfx = df1.merge(df2, left_index=True, right_index=True, how='outer').sort_index(
        key=lambda x: np.vectorize(parse)(x)
        )

As you can see, no values are lost. The problem is probably not in the merge step; you need to check the source data that leads to this.
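
One thing worth checking in the source data is whether any index level carries NaN labels, as the BASIC_GZAG_TMB output in the question does. A minimal sketch, assuming a small made-up frame (`demo`, the level names `group` and `code`, and the helper `missing_labels_per_level` are illustrative and not from the thread):

```python
import numpy as np
import pandas as pd


def missing_labels_per_level(df):
    """Count NaN labels in each index level of `df` (hypothetical helper)."""
    # to_frame() turns every index level into a column, so isna() can be applied.
    return df.index.to_frame(index=False).isna().sum()


# Toy data: the second level contains one NaN label.
idx = pd.MultiIndex.from_tuples(
    [("g", 1), ("g", 2), ("g", np.nan)], names=["group", "code"]
)
demo = pd.DataFrame({"v": [1.0, 2.0, 3.0]}, index=idx)

counts = missing_labels_per_level(demo)
print(counts)
```

A non-zero count for a level is a hint that the join keys deserve a closer look before merging.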

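【Comments】:

Thank you very much for your answer. But there is another situation that pd.merge() cannot solve. I have added a description to the original question.
@jiachuan I tried to reproduce the problem you are seeing, but it did not occur. I am not sure what else to do. I would probably need to see your source data and code. See whether Ajay has an answer for you.
The cause may be the default null (NaN) label in the MultiIndex. I replaced the default null label with the string 'null', and then I got the desired result. Thanks @EBDS!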
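
The workaround described in the last comment could look roughly like the sketch below. It is not the asker's actual code: the helpers `_label` and `replace_nan_labels` and the toy frames `left` and `right` are hypothetical, and converting every label to a string (so that each level keeps a single type) is an added assumption.

```python
import numpy as np
import pandas as pd


def _label(value, placeholder):
    # Turn one index label into a string; NaN becomes the placeholder.
    if pd.isna(value):
        return placeholder
    if isinstance(value, (float, np.floating)) and float(value).is_integer():
        return str(int(value))  # e.g. 6.0 -> "6"
    return str(value)


def replace_nan_labels(df, placeholder="null"):
    # Hypothetical helper: rebuild the MultiIndex with string labels only.
    tuples = [tuple(_label(v, placeholder) for v in key) for key in df.index]
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(tuples, names=df.index.names)
    return out


# Toy frames: `right` has a label (6) that `left` lacks; both have a NaN label.
left = pd.DataFrame(
    {"v": [127.0, 389.0, 1323.0]},
    index=pd.MultiIndex.from_tuples([("g", 1), ("g", 7), ("g", np.nan)]),
)
right = pd.DataFrame(
    {"v": [207.0, 1.0, 206.0, 1374.0]},
    index=pd.MultiIndex.from_tuples([("g", 1), ("g", 6), ("g", 7), ("g", np.nan)]),
)

merged = replace_nan_labels(left).merge(
    replace_nan_labels(right), left_index=True, right_index=True, how="outer"
)
print(merged)
```

With the NaN labels replaced, the outer merge is a plain union of string keys, so the row that exists only in `right` is kept and its left-hand column is NaN.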

【Solution 2】:

You can use pandas merge (see the pandas documentation for DataFrame.merge).

df = df1.merge(df2, left_index=True, right_index=True, how='outer')
print(df)

Output

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       three  NaN  NaN  NaN   d2   e2   f2
       two     a2   b2   c2  NaN  NaN  NaN
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3

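【Comments】:

Thank you very much for your answer. But there is another situation that pd.merge() cannot solve. I have added a description to the original question.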