Merge 2 Pandas data frames by column without duplicates and select which columns to retain

【Question title】: Merge 2 Pandas data frames by column without duplicates and select which columns to be retained
【Posted】: 2020-08-12 03:53:05
【Question description】:

I have 2 pandas data frames with the same structure:

DF1

 col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-11 07:28:18 2020-08-11 07:28:18 Done
1  A     2 2020-08-11 07:28:18 2020-08-12 07:28:18 In Progress
2  B     3 2020-08-11 07:28:18 2020-08-13 07:28:18 Done
3  B     4 2020-08-11 07:28:18 2020-08-13 07:28:18 In Progress
4  C     5 2020-08-11 07:28:18 2020-08-13 07:28:18 Done

and

DF2

col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-15 07:28:18 2020-08-15 07:28:18 Done
1  A     2 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
2  B     3 2020-08-15 07:28:18 2020-08-15 07:28:18 Done
3  B     6 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
4  C     7 2020-08-15 07:28:18 2020-08-15 07:28:18 Done

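What I need in the end is a data frame that takes columns 1-3 from DF1 and columns 4-5 from DF2, with no duplicates. If a key exists in only one of the data frames, it should still appear in the resulting data frame, for example: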

DFResult

col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-11 07:28:18 2020-08-15 07:28:18 Done
1  A     2 2020-08-11 07:28:18 2020-08-15 07:28:18 In Progress
2  B     3 2020-08-11 07:28:18 2020-08-15 07:28:18 Done
3  B     4 2020-08-11 07:28:18 2020-08-13 07:28:18 In Progress
4  C     5 2020-08-11 07:28:18 2020-08-13 07:28:18 Done
5  B     6 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
6  C     7 2020-08-15 07:28:18 2020-08-15 07:28:18 Done

【Question comments】:

Tags: python pandas dataframe merge

【Solution 1】:

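I would first do an inner merge of the first three columns of the first data frame (as required) with the last two columns of the second data frame. For the second data frame, make sure to also include 'Type' and 'Key', since those are the columns you merge on.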

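Then concat this temp data frame with DF1 and DF2, and drop duplicates based on the subset ['Type','Key'], keeping the first value when dropping. This works because temp is passed as the first data frame in pd.concat, so for keys present in both frames the merged row wins, while keys that exist in only one frame survive from DF1 or DF2.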

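import pandas as pd

# Keys present in both frames: 'Date first found' from DF1,
# 'Date last found' and 'Status' from DF2.
temp = pd.merge(DF1[['Type', 'Key', 'Date first found']],
                DF2[['Type', 'Key', 'Date last found', 'Status']],
                how='inner',
                on=['Type', 'Key'])

# temp comes first, so keep='first' prefers the merged rows over the originals.
DFResult = pd.concat([temp, DF1, DF2]).drop_duplicates(subset=['Type', 'Key'], keep='first')
DFResult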

Out[11]: 
  Type  Key     Date first found      Date last found       Status
0    A    1  2020-08-11 07:28:18  2020-08-15 07:28:18         Done
1    A    2  2020-08-11 07:28:18  2020-08-15 07:28:18  In Progress
2    B    3  2020-08-11 07:28:18  2020-08-15 07:28:18         Done
3    B    4  2020-08-11 07:28:18  2020-08-13 07:28:18  In Progress
4    C    5  2020-08-11 07:28:18  2020-08-13 07:28:18         Done
3    B    6  2020-08-15 07:28:18  2020-08-15 07:28:18  In Progress
4    C    7  2020-08-15 07:28:18  2020-08-15 07:28:18         Done
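
For anyone who wants to run this end to end, here is a self-contained sketch of the same approach using the sample rows from the question. It assumes the real column names are 'Type', 'Key', 'Date first found', 'Date last found' and 'Status' (as the answer's code does) and that the dates are plain strings; the `cols` and `keys` variables and the final `reset_index` are additions for convenience, not part of the original answer.

```python
import pandas as pd

cols = ["Type", "Key", "Date first found", "Date last found", "Status"]

# Sample rows copied from the question (dates kept as strings: an assumption).
DF1 = pd.DataFrame(
    [
        ["A", 1, "2020-08-11 07:28:18", "2020-08-11 07:28:18", "Done"],
        ["A", 2, "2020-08-11 07:28:18", "2020-08-12 07:28:18", "In Progress"],
        ["B", 3, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "Done"],
        ["B", 4, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "In Progress"],
        ["C", 5, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "Done"],
    ],
    columns=cols,
)
DF2 = pd.DataFrame(
    [
        ["A", 1, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
        ["A", 2, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "In Progress"],
        ["B", 3, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
        ["B", 6, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "In Progress"],
        ["C", 7, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
    ],
    columns=cols,
)

keys = ["Type", "Key"]

# Rows whose key exists in both frames, with columns picked per frame.
temp = pd.merge(
    DF1[keys + ["Date first found"]],
    DF2[keys + ["Date last found", "Status"]],
    how="inner",
    on=keys,
)

# temp goes first so its rows win; reset_index gives a clean 0..n-1 index.
DFResult = (
    pd.concat([temp, DF1, DF2])
    .drop_duplicates(subset=keys, keep="first")
    .reset_index(drop=True)
)
print(DFResult)
```

Without the `reset_index(drop=True)` call the result keeps the original row labels from each frame, which is why the output above shows the index values 3 and 4 twice.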

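A possible alternative, not from the answer above, is a single outer merge followed by `combine_first` to fall back to the other frame when a key is missing on one side. This is only a sketch: `merge_prefer` is a hypothetical helper name, the two small frames are trimmed from the question's data, and it assumes every non-key column exists in both frames so that the `_1`/`_2` suffixes are applied.

```python
import pandas as pd


def merge_prefer(df1, df2, keys, from_first, from_second):
    """Hypothetical helper: outer merge on keys, choosing a preferred source per column."""
    merged = df1.merge(df2, on=keys, how="outer", suffixes=("_1", "_2"))
    out = merged[keys].copy()
    for col in from_first:
        # Prefer df1's value; fall back to df2 when the key is missing in df1.
        out[col] = merged[col + "_1"].combine_first(merged[col + "_2"])
    for col in from_second:
        # Prefer df2's value; fall back to df1 when the key is missing in df2.
        out[col] = merged[col + "_2"].combine_first(merged[col + "_1"])
    # Sort explicitly so the row order does not depend on the merge internals.
    return out.sort_values(keys).reset_index(drop=True)


# Two small frames trimmed from the question's sample data.
df1 = pd.DataFrame({
    "Type": ["A", "B"],
    "Key": [1, 4],
    "Date first found": ["2020-08-11 07:28:18", "2020-08-11 07:28:18"],
    "Date last found": ["2020-08-11 07:28:18", "2020-08-13 07:28:18"],
    "Status": ["Done", "In Progress"],
})
df2 = pd.DataFrame({
    "Type": ["A", "B"],
    "Key": [1, 6],
    "Date first found": ["2020-08-15 07:28:18", "2020-08-15 07:28:18"],
    "Date last found": ["2020-08-15 07:28:18", "2020-08-15 07:28:18"],
    "Status": ["Done", "In Progress"],
})

result = merge_prefer(
    df1, df2,
    keys=["Type", "Key"],
    from_first=["Date first found"],
    from_second=["Date last found", "Status"],
)
print(result)
```

Compared with the concat and drop_duplicates approach, this makes the per-column preference explicit, at the cost of requiring both frames to share the same non-key columns.
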
【Comments】: