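【Posted】: 2021-07-20 23:58:52
【Problem description】:
I want to merge two dataframes based on specific conditions. First, I want to match on full name only; then, for the entries that did not match, I want to fall back to first name and last name as the matching condition. I have two dataframes as follows: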
df1
first_name  last_name  full_name
John        Shoeb      John Shoeb
John        Shumon     John Md Shumon
Abu         Babu       Abu A Babu
William     Curl       William Curl
df2
givenName  surName  displayName
John       Shoeb    John Shoeb
John       Shumon   John M Shumon
Abu        Babu     Abu Babu
Raju       Kaju     Raju Kaju
Bill       Curl     Bill Curl
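For reproducibility, the two frames above can be built roughly like this. It is only a sketch: the values are copied from the tables above, and every column is assumed to hold plain strings.

```python
import pandas as pd

# Sample data copied from the df1 table above
df1 = pd.DataFrame({
    "first_name": ["John", "John", "Abu", "William"],
    "last_name": ["Shoeb", "Shumon", "Babu", "Curl"],
    "full_name": ["John Shoeb", "John Md Shumon", "Abu A Babu", "William Curl"],
})

# Sample data copied from the df2 table above
df2 = pd.DataFrame({
    "givenName": ["John", "John", "Abu", "Raju", "Bill"],
    "surName": ["Shoeb", "Shumon", "Babu", "Kaju", "Curl"],
    "displayName": ["John Shoeb", "John M Shumon", "Abu Babu", "Raju Kaju", "Bill Curl"],
})
```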
I first merge on the full name:
df3 = pd.merge(df1, df2, left_on=df1['full_name'].str.lower(), right_on=df2['displayName'].str.lower(), how='left')
and add the status and log columns:
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'status'] = True
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'log'] = 'Full Name Matching'
So the resulting dataframe df3 now looks like this:
first_name  last_name  full_name       givenName  surName  displayName  status  log
John        Shoeb      John Shoeb      John       Shoeb    John Shoeb   True    Full Name Matching
John        Shumon     John Md Shumon  NaN        NaN      NaN          NaN     NaN
Abu         Babu       Abu A Babu      NaN        NaN      NaN          NaN     NaN
William     Curl       William Curl    NaN        NaN      NaN          False   NaN
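Expected result
Now I want to apply a second matching condition based on df1 (first_name and last_name) and df2 (givenName and surName). The final dataframe should look like this: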
first_name  last_name  full_name       givenName  surName  displayName  status  log
John        Shoeb      John Shoeb      John       Shoeb    John Shoeb   True    Full Name Matching
John        Shumon     John Md Shumon  John       Shumon   John Shumon  True    FN LN Matching
Abu         Babu       Abu A Babu      Abu        Babu     Abu Babu     True    FN LN Matching
William     Curl       William Curl    NaN        NaN      NaN          False   NaN
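Question
For the second part, i.e. the first name and last name matching, I was able to do it using the dataframe's itertuples(). However, when the same operation is applied to a huge dataset, it runs forever. I am looking for an efficient approach that can be applied to a large amount of data.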
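One direction that might avoid the row-by-row loop is to run the second pass as another vectorized merge, restricted to the rows the full-name pass left unmatched. The following is a minimal sketch, not a tested solution for large data. The function name `two_pass_match` and the helper columns `_row`, `_full`, `_fl`, `_key` are hypothetical. It assumes the name columns contain no missing values, and that when a key occurs more than once in df2 it is acceptable to keep only the first occurrence.

```python
import pandas as pd

def two_pass_match(df1, df2):
    """Hypothetical helper: full-name merge first, then first/last-name merge."""
    left = df1.reset_index(drop=True)
    left["_row"] = range(len(left))  # remember the original row order
    # Lower-cased join keys on the df1 side
    left["_full"] = left["full_name"].str.lower()
    left["_fl"] = left["first_name"].str.lower() + " " + left["last_name"].str.lower()

    # Lower-cased join keys on the df2 side; duplicates are dropped so that
    # each df1 row yields at most one output row (an assumption of this sketch)
    right_full = df2.assign(_key=df2["displayName"].str.lower()).drop_duplicates("_key")
    right_fl = df2.assign(
        _key=df2["givenName"].str.lower() + " " + df2["surName"].str.lower()
    ).drop_duplicates("_key")

    # Pass 1: match on full name
    p1 = left.merge(right_full, left_on="_full", right_on="_key",
                    how="left", indicator=True)
    hit1 = p1["_merge"] == "both"
    done = p1[hit1].assign(status=True, log="Full Name Matching")

    # Pass 2: only the rows that pass 1 left unmatched, on first + last name
    rest = left[~hit1.to_numpy()]
    p2 = rest.merge(right_fl, left_on="_fl", right_on="_key",
                    how="left", indicator=True)
    hit2 = p2["_merge"] == "both"
    p2["status"] = hit2
    # Label matched rows; unmatched rows get NaN in the log column
    p2["log"] = pd.Series("FN LN Matching", index=p2.index).where(hit2)

    # Stitch both passes back together in the original df1 order
    out = pd.concat([done, p2], ignore_index=True).sort_values("_row")
    helpers = ["_row", "_full", "_fl", "_key", "_merge"]
    return out.drop(columns=helpers).reset_index(drop=True)
```

Calling `two_pass_match(df1, df2)` on the frames from the question would return one row per df1 row. Note that in this sketch the displayName of a row matched in the second pass comes straight from df2 (for example "John M Shumon"), and rows matching in neither pass get status False.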
【Discussion】:
Tags: python-3.x pandas dataframe join merge