How to join two dataframes where IDs do not match and create a new column showing which dataframe each ID came from? Answers

【Question title】: How to join two dataframes where IDs do not match and create new column to represent what dataframe ID came from?
【Posted】: 2018-11-29 14:34:28
【Question】:

I have two dataframes like this:

df1:

id    column1    column2 
1      30          90
2      1            2

df2:

id    column1    column2 
1      30          90
3      1            2
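
For anyone who wants to reproduce the thread, the two tables above can be built as follows. This is only a sketch: the integer dtypes are an assumption based on how the values are printed, and `only_one` is an illustrative name that does not appear in the original post.

```python
import pandas as pd

# Sample frames matching the tables in the question (dtypes assumed to be int).
df1 = pd.DataFrame({"id": [1, 2], "column1": [30, 1], "column2": [90, 2]})
df2 = pd.DataFrame({"id": [1, 3], "column1": [30, 1], "column2": [90, 2]})

# ids that appear in exactly one of the two frames (symmetric difference)
only_one = set(df1["id"]) ^ set(df2["id"])
print(sorted(only_one))
```

The symmetric difference is a quick way to check which ids the final result is expected to contain.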

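I want to write logic that merges these two dataframes (they have the same column names), keeping only the rows whose IDs do not match, and then add a new column that says which dataframe each ID came from. How can I do this?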

Final merged df:

id    column1    column2    df_name
2      30          90         df1
3      1            2         df2

Edit:

Can the final df pull in all the columns from both dataframes?

id    column1.df1    column2.df1    column1.df2    column2.df2    df_name
2      30             90             30             90             df1
3      1              2              1              2              df2

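【Comments】:

Why does the id change in the final merged df?
The question is unclear: in the final merged df, where does id 2 come from?
@user3471881 The ids in final_merged_df change because I only want the IDs that differ between the two dataframes. Does that help?
@W-B 2 comes from df1. I only want a dataframe containing the IDs that differ between the two dataframes.
@RustyShackleford I just added my solution. Also, your update just repeats the columns twice; are you sure you still want them, given that they will take up memory?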

Tags: python python-3.x pandas dataframe

【Solution 1】:

First, concat the DataFrames together, using keys to record which frame each row came from:

df = (pd.concat([df1, df2],  keys=('df1','df2'))
        .rename_axis(('df_name','idx'))
        .reset_index(level=1, drop=True)
        .reset_index())

print (df)
  df_name  id  column1  column2
0     df1   1       30       90
1     df1   2        1        2
2     df2   1       30       90
3     df2   3        1        2

Then get all the ids that are present in both frames:

a = df1.merge(df2, on='id')['id']

Finally, filter them out with isin:

df = df[~df['id'].isin(a)]
print (df)
  df_name  id  column1  column2
1     df1   2        1        2
3     df2   3        1        2
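
The same result can be sketched in a slightly different way, shown here as a self-contained example. Note that this swaps in `drop_duplicates(subset="id", keep=False)` for the merge plus isin step used above, and it assumes each id occurs at most once within a single frame; `stacked` and `result` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "column1": [30, 1], "column2": [90, 2]})
df2 = pd.DataFrame({"id": [1, 3], "column1": [30, 1], "column2": [90, 2]})

# Stack both frames and record the source frame in a new column.
stacked = pd.concat(
    [df1.assign(df_name="df1"), df2.assign(df_name="df2")],
    ignore_index=True,
)

# keep=False drops every row whose id occurs more than once in the stack,
# so only ids found in exactly one frame survive.
# Assumption: ids are unique within each individual frame.
result = stacked.drop_duplicates(subset="id", keep=False).reset_index(drop=True)
print(result)
```

If an id could repeat inside one frame, the merge plus isin version above is the safer choice, because it compares the frames against each other rather than counting occurrences.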

Edit:

Similar to @W-B's solution, with the parameters on='id' and suffixes added:

df = (df1.merge(df2,indicator=True,how='outer', on='id', suffixes=('_df1','_df2'))
         .query("_merge != 'both'"))
df['_merge'] = df['_merge'].map({'left_only':'df1','right_only':'df2'})

print (df)
   id  column1_df1  column2_df1  column1_df2  column2_df2 _merge
1   2          1.0          2.0          NaN          NaN    df1
2   3          NaN          NaN          1.0          2.0    df2
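
The suffixed columns in the output above are half NaN and have been upcast to float. If the single-column layout from the original question is preferred, one possible follow-up is to collapse each suffixed pair with `combine_first`. This is a sketch and not part of the original answer; the `df_name` column name comes from the question, while `merged` and `result` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "column1": [30, 1], "column2": [90, 2]})
df2 = pd.DataFrame({"id": [1, 3], "column1": [30, 1], "column2": [90, 2]})

merged = df1.merge(df2, on="id", how="outer", indicator=True,
                   suffixes=("_df1", "_df2"))
# Keep only ids found in one frame; copy so later assignments are safe.
merged = merged[merged["_merge"] != "both"].copy()

# Convert the categorical indicator to plain strings before mapping.
merged["df_name"] = merged["_merge"].astype(str).map(
    {"left_only": "df1", "right_only": "df2"}
)

# Each remaining row has values on exactly one side, so take whichever
# side is not NaN and restore the integer dtype.
for col in ("column1", "column2"):
    merged[col] = (
        merged[f"{col}_df1"].combine_first(merged[f"{col}_df2"]).astype("int64")
    )

result = merged[["id", "column1", "column2", "df_name"]].reset_index(drop=True)
print(result)
```

The cast back to `int64` only works because the rows with values on both sides were filtered out first; with 'both' rows kept, the two sides could disagree and a collapse would silently prefer the df1 value.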

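If you want all rows, including those whose id appears in both frames, run the merge without the query filter and map 'both' as well:

df = df1.merge(df2, indicator=True, how='outer', on='id', suffixes=('_df1','_df2'))
df['_merge'] = df['_merge'].map({'left_only':'df1','right_only':'df2', 'both':'df1+df2'})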

print (df)
   id  column1_df1  column2_df1  column1_df2  column2_df2   _merge
0   1         30.0         90.0         30.0         90.0  df1+df2
1   2          1.0          2.0          NaN          NaN      df1
2   3          NaN          NaN          1.0          2.0      df2

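【Comments】:

This is exactly what I was looking for! Is it possible to pull in the columns from both dataframes even though they have the same names?
Nice edit, but I still would not recommend that the OP include the duplicated columns ... :-)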

【Solution 2】:

Let's use merge with indicator=True:

df=df1.merge(df2,indicator = True,how='outer').loc[lambda x : x['_merge'].ne('both')]
df['df_name']=df['_merge'].map({'left_only':'df1','right_only':'df2'})
df
Out[328]: 
   id  column1  column2      _merge df_name
1   2        1        2   left_only     df1
2   3        1        2  right_only     df2
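
One detail worth keeping in mind with the call above: when `on` is omitted, `merge` joins on every column the two frames share (here id, column1 and column2), not on id alone. With the sample data the result is the same, but it differs once two rows share an id and disagree on a value. The sketch below uses a hypothetical variation of df2, named `df2_changed`, to show the difference.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "column1": [30, 1], "column2": [90, 2]})
# Hypothetical variation: id 1 exists in both frames but column1 differs.
df2_changed = pd.DataFrame({"id": [1, 3], "column1": [31, 1], "column2": [90, 2]})

# No `on`: pandas joins on all common columns (id, column1, column2),
# so the two id-1 rows do not match each other and both are kept.
by_all = df1.merge(df2_changed, how="outer", indicator=True)
by_all = by_all[by_all["_merge"] != "both"]

# on="id": rows match on id alone, so id 1 counts as shared and is dropped.
by_id = df1.merge(df2_changed, on="id", how="outer", indicator=True)
by_id = by_id[by_id["_merge"] != "both"]

print(sorted(by_all["id"]))  # ids kept when matching on every shared column
print(sorted(by_id["id"]))   # ids kept when matching on id only
```

Which behaviour is wanted depends on whether "the same row" means the same id or the same id with identical values; passing on='id' explicitly makes that choice visible.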

【Comments】: