【问题标题】:Merge two dataframe based on a common column which should be matched by partial string基于应由部分字符串匹配的公共列合并两个数据框
【发布时间】:2020-12-29 07:54:36
【问题描述】:

我有两个数据框df1df2如下图:

df1

Date        ID          Amount    BillNo1
10/08/2020  ABBCSQ1ZA   878       2020/156
10/08/2020  ABBCSQ1ZA   878       2020/157
10/12/2020  AC928Q1ZS   3998      343SONY
10/14/2020  AC9268RE3   198       432
10/16/2020  AA171E1Z0   5490      AIPO325
10/19/2020  BU073C1ZW   3432      IDBI436-Total
10/19/2020  BU073C1ZW   3432      IDBI437-Total

df2

Date        ID          Amount    BillNo2
10/08/2020  ABBCSQ1ZA   878       156
10/11/2020  ATRC95REW   115       265
10/14/2020  AC9268RE3   198       A/432
10/16/2020  AA171E1Z0   5490      325
10/19/2020  BU073C1ZW   3432      436
10/19/2020  BU073C1ZW   3432      437

我的最终答案应该是:

Matched
Date        ID          Amount    BillNo1         BillNo2
10/08/2020  ABBCSQ1ZA   878       2020/156        156             # 156 matches
10/14/2020  AC9268RE3   198       432             A/432           # 432 matches
10/16/2020  AA171E1Z0   5490      AIPO325         325             # 325 matches
10/19/2020  BU073C1ZW   3432      IDBI436-Total   436             # 436 matches
10/19/2020  BU073C1ZW   3432      IDBI437-Total   437             # 437 matches 

Non Matched
Date        ID          Amount    BillNo1     BillNo2
10/08/2020  ABBCSQ1ZA   878       2020/157    NaN
10/12/2020  AC928Q1ZS   3998      343SONY     NaN
10/11/2020  ATRC95REW   115       NaN         265

如何根据Column =['BillNo1','BillNo2']的部分字符串匹配合并两个数据帧?

【问题讨论】:

    标签: python pandas dataframe


    【解决方案1】:

    您可以定义自己的阈值,但以下是一项建议:

    import difflib
    from functools import partial
    #the below function is inspired from https://stackoverflow.com/a/56521804/9840637
    def get_closest_match(x,y):
        """x=possibilities , y = input"""
        f = partial(
        difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)
    
        matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
        return pd.DataFrame([y,matches.rename('BillNo2')]).T
    
    temp = get_closest_match(df2['BillNo2'],df1['BillNo1'])
    temp['BillNo2'] = (temp['BillNo2']
                       .fillna(df1['BillNo1']
                   .str.extract('('+'|'.join(df2['BillNo2'])+')',expand=False)))
    
    
    merged = (df1.assign(BillNo2=df1['BillNo1'].map(dict(temp.values)))
    .merge(df2.drop_duplicates(),on=['Date','ID','Amount','BillNo2']
    ,how='outer',indicator=True))
    
    print(merged)
    
             Date         ID  Amount        BillNo1 BillNo2      _merge
    0  10/08/2020  ABBCSQ1ZA     878       2020/156     156        both
    1  10/08/2020  ABBCSQ1ZA     878       2020/157     NaN   left_only
    2  10/12/2020  AC928Q1ZS    3998        343SONY     NaN   left_only
    3  10/14/2020  AC9268RE3     198            432   A/432        both
    4  10/16/2020  AA171E1Z0    5490        AIPO325     325        both
    5  10/19/2020  BU073C1ZW    3432  IDBI436-Total     436        both
    6  10/19/2020  BU073C1ZW    3432  IDBI437-Total     437        both
    7  10/11/2020  ATRC95REW     115            NaN     265  right_only
    

    一旦你合并了上面的df,你就可以这样做了;

    matched = merged.query("_merge=='both'")
    unmatched = merged.query("_merge!='both'")
    print("Matched Df \n ", matched,'\n\n',"Unmatched Df \n " , unmatched)
    
    Matched Df 
               Date         ID  Amount        BillNo1 BillNo2 _merge
    0  10/08/2020  ABBCSQ1ZA     878       2020/156     156   both
    3  10/14/2020  AC9268RE3     198            432   A/432   both
    4  10/16/2020  AA171E1Z0    5490        AIPO325     325   both
    5  10/19/2020  BU073C1ZW    3432  IDBI436-Total     436   both
    6  10/19/2020  BU073C1ZW    3432  IDBI437-Total     437   both 
    
     Unmatched Df 
               Date         ID  Amount   BillNo1 BillNo2      _merge
    1  10/08/2020  ABBCSQ1ZA     878  2020/157     NaN   left_only
    2  10/12/2020  AC928Q1ZS    3998   343SONY     NaN   left_only
    7  10/11/2020  ATRC95REW     115       NaN     265  right_only
    

    【讨论】:

      猜你喜欢
      • 2021-11-27
      • 2016-04-15
      • 2021-08-22
      • 2021-01-16
      • 1970-01-01
      • 2020-10-15
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多