【Question Title】:Merge 2 Pandas data frames by column without duplicates and select which columns to retain
【Posted】:2020-08-12 03:53:05
【Question Description】:

I have 2 pandas data frames with the same structure: DF1

 col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-11 07:28:18 2020-08-11 07:28:18 Done
1  A     2 2020-08-11 07:28:18 2020-08-12 07:28:18 In Progress
2  B     3 2020-08-11 07:28:18 2020-08-13 07:28:18 Done
3  B     4 2020-08-11 07:28:18 2020-08-13 07:28:18 In Progress
4  C     5 2020-08-11 07:28:18 2020-08-13 07:28:18 Done

DF2

col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-15 07:28:18 2020-08-15 07:28:18 Done
1  A     2 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
2  B     3 2020-08-15 07:28:18 2020-08-15 07:28:18 Done
3  B     6 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
4  C     7 2020-08-15 07:28:18 2020-08-15 07:28:18 Done

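What I ultimately need is a data frame that takes columns 1-3 from DF1 and columns 4-5 from DF2, with no duplicates. If a key exists in only one of the data frames, it should still appear in the result data frame, for example:

DF Result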

col1  col2 col3               col4                col5      
  Type Key Date first found    Date last found     Status
0  A     1 2020-08-11 07:28:18 2020-08-15 07:28:18 Done
1  A     2 2020-08-11 07:28:18 2020-08-15 07:28:18 In Progress
2  B     3 2020-08-11 07:28:18 2020-08-15 07:28:18 Done
3  B     4 2020-08-11 07:28:18 2020-08-13 07:28:18 In Progress
4  C     5 2020-08-11 07:28:18 2020-08-13 07:28:18 Done
5  B     6 2020-08-15 07:28:18 2020-08-15 07:28:18 In Progress
6  C     7 2020-08-15 07:28:18 2020-08-15 07:28:18 Done

【Question Comments】:

    Tags: python pandas dataframe merge


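    【Solution 1】:

    I would first do an inner merge of the first three columns of the first dataframe (as required) with the last two columns of the second dataframe. For the second dataframe, make sure to also include 'Type' and 'Key', since these are the columns you merge on.

    Then concat this temp dataframe with DF1 and DF2, and drop duplicates based on the subset ['Type','Key'], keeping the first value when dropping duplicates. This works because temp is passed as the first dataframe in pd.concat, so for keys present in both inputs the merged row is the one that is kept.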

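    import pandas as pd

    # Inner merge: columns 1-3 from DF1 and columns 4-5 from DF2, for keys present in both
    temp = pd.merge(DF1[['Type', 'Key', 'Date first found']],
                    DF2[['Type','Key', 'Date last found', 'Status']],
                    how='inner',
                    on=['Type','Key'])
    
    # temp is listed first, so keep='first' prefers the merged rows over the DF1/DF2 originals
    DFResult = pd.concat([temp,DF1,DF2]).drop_duplicates(subset=['Type','Key'], keep='first')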
    DFResult
    
    Out[11]: 
      Type  Key     Date first found      Date last found       Status
    0    A    1  2020-08-11 07:28:18  2020-08-15 07:28:18         Done
    1    A    2  2020-08-11 07:28:18  2020-08-15 07:28:18  In Progress
    2    B    3  2020-08-11 07:28:18  2020-08-15 07:28:18         Done
    3    B    4  2020-08-11 07:28:18  2020-08-13 07:28:18  In Progress
    4    C    5  2020-08-11 07:28:18  2020-08-13 07:28:18         Done
    3    B    6  2020-08-15 07:28:18  2020-08-15 07:28:18  In Progress
    4    C    7  2020-08-15 07:28:18  2020-08-15 07:28:18         Done
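
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. It rebuilds DF1 and DF2 from the sample rows in the question; keeping the dates as plain strings and the `cols` and `keys` helper names are assumptions made for illustration. It also appends `reset_index(drop=True)`, which is not part of the answer's code, so the result gets a clean 0-6 index instead of the repeated 3 and 4 labels seen in the output above.

```python
import pandas as pd

cols = ["Type", "Key", "Date first found", "Date last found", "Status"]

# Sample rows copied from the question (dates kept as strings for simplicity).
DF1 = pd.DataFrame(
    [
        ["A", 1, "2020-08-11 07:28:18", "2020-08-11 07:28:18", "Done"],
        ["A", 2, "2020-08-11 07:28:18", "2020-08-12 07:28:18", "In Progress"],
        ["B", 3, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "Done"],
        ["B", 4, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "In Progress"],
        ["C", 5, "2020-08-11 07:28:18", "2020-08-13 07:28:18", "Done"],
    ],
    columns=cols,
)
DF2 = pd.DataFrame(
    [
        ["A", 1, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
        ["A", 2, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "In Progress"],
        ["B", 3, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
        ["B", 6, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "In Progress"],
        ["C", 7, "2020-08-15 07:28:18", "2020-08-15 07:28:18", "Done"],
    ],
    columns=cols,
)

keys = ["Type", "Key"]

# Step 1: inner merge gives the combined rows for keys present in both frames.
temp = pd.merge(
    DF1[keys + ["Date first found"]],
    DF2[keys + ["Date last found", "Status"]],
    how="inner",
    on=keys,
)

# Step 2: stack temp on top of the originals; keep='first' then keeps the
# merged row for shared keys and the single original row for unshared keys.
DFResult = (
    pd.concat([temp, DF1, DF2])
    .drop_duplicates(subset=keys, keep="first")
    .reset_index(drop=True)  # assumption: a fresh 0..n-1 index is wanted
)

print(DFResult)
```

Note that the order of the list passed to `pd.concat` matters here: if `DF1` came before `temp`, `keep='first'` would retain the unmerged DF1 rows for the shared keys.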
    

    【Comments】:
