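[Question title]: Merge pandas dataframes based on several conditions for a large dataset
[Posted]: 2021-07-20 23:58:52
[Question description]:

I want to merge two dataframes based on specific conditions. First, I want to match on the full name only; then, for the entries that did not match, I want to use first name and last name as the matching condition. I have two dataframes as follows: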

df1
first_name  last_name  full_name
John        Shoeb      John Shoeb
John        Shumon     John Md Shumon
Abu         Babu       Abu A Babu
William     Curl       William Curl

df2
givenName  surName  displayName
John       Shoeb    John Shoeb
John       Shumon   John M Shumon
Abu        Babu     Abu Babu
Raju       Kaju     Raju Kaju
Bill       Curl     Bill Curl

I first merge on the full name:

df3 = pd.merge(df1, df2, left_on=df1['full_name'].str.lower(), right_on=df2['displayName'].str.lower(), how='left')

and add the status and log columns:

df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'status'] = True
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'log'] = 'Full Name Matching'

So the resulting dataframe df3 now looks like this:

first_name  last_name  full_name       givenName  surName  displayName  status  log
John        Shoeb      John Shoeb      John       Shoeb    John Shoeb   True    Full Name Matching
John        Shumon     John Md Shumon  NaN        NaN      NaN          NaN     NaN
Abu         Babu       Abu A Babu      NaN        NaN      NaN          NaN     NaN
William     Curl       William Curl    NaN        NaN      NaN          False   NaN

Expected result: I now want to apply a second matching condition based on df1 (first_name and last_name) and df2 (givenName and surName). The final dataframe should look like this:

first_name  last_name  full_name       givenName  surName  displayName    status  log
John        Shoeb      John Shoeb      John       Shoeb    John Shoeb     True    Full Name Matching
John        Shumon     John Md Shumon  John       Shumon   John M Shumon  True    FN LN Matching
Abu         Babu       Abu A Babu      Abu        Babu     Abu Babu       True    FN LN Matching
William     Curl       William Curl    NaN        NaN      NaN            False   NaN

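Problem: For the second part, matching on first name and last name, I was able to do it with the dataframe's itertuples(). However, when the same operation is applied to a huge dataset, it runs seemingly forever. I am looking for an efficient approach that can be applied to a large amount of data.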

[Question discussion]:

    Tags: python-3.x pandas dataframe join merge


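    [Solution 1]:

    You can pass indicator=True to the merges. Then check whether the indicator column of the first merge and of the second merge equals "both" (for example with np.where):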

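    import numpy as np
    import pandas as pd

    # First pass: case-insensitive match on the full name.
    # indicator=True adds a "_merge" column that is "both" where a df2 row matched.
    df3 = (
        pd.merge(
            df1,
            df2,
            left_on=df1["full_name"].str.lower(),
            right_on=df2["displayName"].str.lower(),
            how="left",
            indicator=True,
        )
        .drop(columns="key_0")  # helper key column created for the array-like keys
        .rename(columns={"_merge": "first_merge"})
    )
    
    # Second pass: case-insensitive match on first name + last name.
    # The left keys are built from df1, so this assumes df3 still has exactly
    # df1's rows in df1's order (no extra rows from duplicate keys in df2).
    df3 = pd.merge(
        df3,
        df2,
        left_on=df1["first_name"].str.lower() + " " + df1["last_name"].str.lower(),
        right_on=df2["givenName"].str.lower() + " " + df2["surName"].str.lower(),
        how="left",
        indicator=True,
    )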
    
    df3["log"] = np.where(
        (df3["first_merge"] == "both"),
        "Full Name Matching",
        np.where(df3["_merge"] == "both", "FN LN Matching", None),
    )
    df3["status"] = df3["log"].notna()
    
    df3 = df3[
        [
            "first_name",
            "last_name",
            "full_name",
            "givenName_y",
            "surName_y",
            "displayName_y",
            "status",
            "log",
        ]
    ].rename(
        columns={
            "givenName_y": "givenName",
            "surName_y": "surName",
            "displayName_y": "displayName",
        }
    )
    print(df3)
    

    Prints:

      first_name last_name       full_name givenName surName    displayName  status                 log
    0       John     Shoeb      John Shoeb      John   Shoeb     John Shoeb    True  Full Name Matching
    1       John    Shumon  John Md Shumon      John  Shumon  John M Shumon    True      FN LN Matching
    2        Abu      Babu      Abu A Babu       Abu    Babu       Abu Babu    True      FN LN Matching
    3    William      Curl    William Curl       NaN     NaN            NaN   False                None
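The two-pass idea can also be sketched with explicit key columns, so that the second pass does not depend on df3 staying row-aligned with df1. This is only a sketch under stated assumptions: the helper columns `_full` and `_fnln`, the indicator names, and the policy of keeping the first df2 row per key (`drop_duplicates`) are illustrative choices, not part of the answer above. It reuses the sample frames from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "first_name": ["John", "John", "Abu", "William"],
    "last_name": ["Shoeb", "Shumon", "Babu", "Curl"],
    "full_name": ["John Shoeb", "John Md Shumon", "Abu A Babu", "William Curl"],
})
df2 = pd.DataFrame({
    "givenName": ["John", "John", "Abu", "Raju", "Bill"],
    "surName": ["Shoeb", "Shumon", "Babu", "Kaju", "Curl"],
    "displayName": ["John Shoeb", "John M Shumon", "Abu Babu", "Raju Kaju", "Bill Curl"],
})
df2_cols = ["givenName", "surName", "displayName"]

# Hypothetical helper keys: lower-cased full name and "first last".
left = df1.assign(
    _full=df1["full_name"].str.lower(),
    _fnln=(df1["first_name"] + " " + df1["last_name"]).str.lower(),
)
# Assumption: if a key occurs more than once in df2, keep its first row,
# so neither merge can add rows to the left side.
right_full = df2.assign(_full=df2["displayName"].str.lower()).drop_duplicates(subset="_full")
right_fnln = df2.assign(
    _fnln=(df2["givenName"] + " " + df2["surName"]).str.lower()
).drop_duplicates(subset="_fnln")

# Pass 1: full-name match. Pass 2: first + last name match.
stage1 = left.merge(right_full, on="_full", how="left", indicator="first_merge")
stage2 = stage1.merge(
    right_fnln[["_fnln"] + df2_cols],
    on="_fnln",
    how="left",
    suffixes=("", "_fnln"),
    indicator="second_merge",
)

full_hit = stage2["first_merge"] == "both"
fnln_hit = stage2["second_merge"] == "both"

# Keep the df2 values from pass 1 where it matched, otherwise take pass 2.
for col in df2_cols:
    stage2[col] = stage2[col].where(full_hit, stage2[col + "_fnln"])

stage2["log"] = np.where(
    full_hit,
    "Full Name Matching",
    np.where(fnln_hit, "FN LN Matching", None),
)
stage2["status"] = full_hit | fnln_hit

result = stage2[["first_name", "last_name", "full_name"] + df2_cols + ["status", "log"]]
print(result)
```

Because both right-hand frames are de-duplicated on their key, `result` has one row per row of df1. Whether "keep the first duplicate" is acceptable depends on the data; it may be preferable to inspect duplicated names instead of dropping them.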
    

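    [Discussion]:

    • If df3 contains duplicates, will the second merge raise an error? I saw this when running it on a large dataset.
    • Or should the second merge use df3 instead of df1?
    • @AbuShoeb It depends on your logic, but duplicates generally cause problems in merges. You can use df1 in the second merge, but the key point is to use the indicator=True parameter and to check for the value "both".
    • If df1 is used in the second merge, a KeyError is thrown. Any clue?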
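
On the duplicate concern raised in the comments, one possible way to surface the problem early is the `validate` argument of `merge`, which raises `pandas.errors.MergeError` when the keys are not unique on the side that is supposed to be unique. The frames below are made-up toy data for illustration only, not the thread's data.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy data (hypothetical): key "a" appears twice on the right-hand side.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "c"], "y": [10, 11, 12]})

# Without a check, the duplicated key silently repeats the matching left row,
# so the result has more rows than `left`.
fanned = left.merge(right, on="key", how="left")

# validate="many_to_one" requires the right-hand keys to be unique.
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

# One possible policy (an assumption): keep the first row per key.
deduped = left.merge(
    right.drop_duplicates(subset="key"),
    on="key",
    how="left",
    validate="many_to_one",
)
print(len(fanned), duplicate_detected, len(deduped))
```

A merge that adds rows to df3 would also explain a failure in the second merge when its keys are built from df1, since those keys would then no longer have df3's length.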