【Question title】: Identify rows in two pandas dataframes that match partially
【Posted】: 2020-08-10 04:39:11
【Question description】:

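I am trying to add a column to the dataframe df1 that says whether each row of df1 also appears in a second dataframe, df2. That would normally be fairly easy, but I want a 4/5 column match to count as well as a 5/5 column match.

That is, the entry in the new df1 column, named "In_df2", should be 1 if some row of df2 matches exactly on the 5 relevant columns (out of 9 in total) or matches on 4 of those 5 columns. Suppose this is df1 (with the irrelevant columns removed).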

import pandas as pd

df1_rows = [['555555555', 'M', 'Mike', 'Smith', '1970-01-01'], ['999999999', 'F', 'Jane', 'Wong', '1980-01-01'], ['111111111', 'M', 'Steve', 'Patel', '1990-01-01']]
df1 = pd.DataFrame(df1_rows, columns = ['SSN', 'sex', 'first_name', 'last_name', 'dob'])
df1

         SSN sex first_name last_name         dob
0  555555555   M       Mike     Smith  1970-01-01
1  999999999   F       Jane      Wong  1980-01-01
2  111111111   M      Steve     Patel  1990-01-01

And say this is df2.

df2_rows = [['222222222', 'F', 'Steve', 'Patel', '1990-01-01'], ['555555555', 'M', 'Mike', 'Smith', '1970-01-01'], ['999999999', 'F', 'Jeff', 'Wong', '1980-01-01']]
df2 = pd.DataFrame(df2_rows, columns = ['SSN', 'sex', 'first_name', 'last_name', 'dob'])
df2

         SSN sex first_name last_name         dob
0  222222222   F      Steve     Patel  1990-01-01
1  555555555   M       Mike     Smith  1970-01-01
2  999999999   F       Jeff      Wong  1980-01-01

Then it should return the following:

df3_rows = [['555555555', 'M', 'Mike', 'Smith', '1970-01-01', 1], ['999999999', 'F', 'Jane', 'Wong', '1980-01-01', 1], ['111111111', 'M', 'Steve', 'Patel', '1990-01-01', 0]]
df3 = pd.DataFrame(df3_rows, columns = ['SSN', 'sex', 'first_name', 'last_name', 'dob', 'In_df2'])
df3

         SSN sex first_name last_name         dob  In_df2
0  555555555   M       Mike     Smith  1970-01-01       1
1  999999999   F       Jane      Wong  1980-01-01       1
2  111111111   M      Steve     Patel  1990-01-01       0

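The "In_df2" column has a 1 in row 0 because df2 contains an exact match for row 0 of df1. It has a 1 in row 1 because df2 contains a row that matches row 1 of df1 on 4 of the 5 columns. It has a 0 in row 2 because the closest row in df2 matches row 2 of df1 on only 3 of the 5 columns.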
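The counting rule described above can be sketched with NumPy broadcasting. This is only an illustration on the example data, assuming both frames hold the same five columns in the same order; the names `match_counts`, `best` and `in_df2` are made up for this sketch and are not part of the original post.

```python
import numpy as np
import pandas as pd

cols = ['SSN', 'sex', 'first_name', 'last_name', 'dob']
df1 = pd.DataFrame(
    [['555555555', 'M', 'Mike', 'Smith', '1970-01-01'],
     ['999999999', 'F', 'Jane', 'Wong', '1980-01-01'],
     ['111111111', 'M', 'Steve', 'Patel', '1990-01-01']],
    columns=cols)
df2 = pd.DataFrame(
    [['222222222', 'F', 'Steve', 'Patel', '1990-01-01'],
     ['555555555', 'M', 'Mike', 'Smith', '1970-01-01'],
     ['999999999', 'F', 'Jeff', 'Wong', '1980-01-01']],
    columns=cols)

a = df1[cols].to_numpy()  # shape (len(df1), 5)
b = df2[cols].to_numpy()  # shape (len(df2), 5)

# match_counts[i, j] = number of columns on which df1 row i equals df2 row j
match_counts = (a[:, None, :] == b[None, :, :]).sum(axis=2)

# Best agreement each df1 row reaches against any df2 row
best = match_counts.max(axis=1)

# A row counts as present in df2 when at least 4 of the 5 columns agree
in_df2 = (best >= 4).astype(int)

print(best.tolist())    # [5, 4, 3]
print(in_df2.tolist())  # [1, 1, 0]
```

The intermediate boolean array has len(df1) × len(df2) × 5 entries, so for large frames this would probably need to be processed in chunks of df1 rows.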

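I have written code that does this manually (see below), but I am something of a novice at coding and, predictably, it is super slow. I have searched and could not find a package that seems to handle this kind of partial matching.

One last thing: the output does not have to be a column like the one I added. I really just want to identify all rows in df1 that do not have a 4/5 or 5/5 partner in df2.

Thanks for any suggestions!

My code: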

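def row_compare(row1, row2):
    # Count how many of the five relevant columns agree between the two rows.
    count = 0

    # The example column is named 'SSN', so row1.ssn would raise AttributeError.
    if row1.SSN == row2.SSN:
        count += 1
    if row1.dob == row2.dob:
        count += 1
    if row1.sex == row2.sex:
        count += 1
    if row1.first_name == row2.first_name:
        count += 1
    if row1.last_name == row2.last_name:
        count += 1

    # A match on at least 4 of the 5 columns counts as a hit.
    if count >= 4:
        out = 1
    else:
        out = 0

    return out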

Followed by:

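def row_to_df_compare(row1, df):
    # Compare row1 against every row of df. Note that this writes a helper
    # column 'In_Other' into df as a side effect.
    df['In_Other'] = df.apply(lambda row2 : row_compare(row1, row2), axis = 1)
    # Sum only the helper column; df.sum() would also reduce every other column.
    if df['In_Other'].sum() > 0:
        out = 1
    else:
        out = 0
    return out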

And finally:

df1['In_df2'] = df1.apply(lambda row : row_to_df_compare(row, df2), axis = 1)

【Question comments】:

Tags: pandas dataframe comparison rows partial

【Solution 1】:

A colleague of mine came up with this answer, which seems to work on the example dataframes I am using:

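from itertools import combinations

import pandas as pd

# Number of 4-column subsets on which each df1 row has a duplicate.
df1['n_duplicates'] = 0

for columns in combinations(df2.columns, 4):
    columns = list(columns)
    # Stack df1 on top of df2, so the first len(df1) rows are df1's rows.
    # This relies on df1 having the default 0..n-1 index.
    df_concat = pd.concat([df1[columns], df2[columns]], ignore_index = True, sort = False)
    # keep = False flags every row that has a duplicate on these 4 columns.
    df1['n_duplicates'] += df_concat.duplicated(keep = False).astype(int).iloc[:df1.shape[0]]

# A 5/5 match scores 5 and a 4/5 match scores at least 1, so only rows with
# no partner score 0. The original threshold of < 4 also kept 4/5 matches.
df1_dedup = df1[df1['n_duplicates'] == 0]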
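One caveat with the loop above: because df1 and df2 are concatenated before `duplicated` is called, a df1 row that merely duplicates another df1 row on four columns is also flagged. A possible variant, sketched here under the assumption that only matches against df2 should count, looks each 4-column subset up in df2 with a left merge instead. The names `has_partner` and `unmatched` are hypothetical and chosen only for this sketch.

```python
from itertools import combinations

import pandas as pd

cols = ['SSN', 'sex', 'first_name', 'last_name', 'dob']
df1 = pd.DataFrame(
    [['555555555', 'M', 'Mike', 'Smith', '1970-01-01'],
     ['999999999', 'F', 'Jane', 'Wong', '1980-01-01'],
     ['111111111', 'M', 'Steve', 'Patel', '1990-01-01']],
    columns=cols)
df2 = pd.DataFrame(
    [['222222222', 'F', 'Steve', 'Patel', '1990-01-01'],
     ['555555555', 'M', 'Mike', 'Smith', '1970-01-01'],
     ['999999999', 'F', 'Jeff', 'Wong', '1980-01-01']],
    columns=cols)

# True once a df1 row has matched some df2 row on any 4-column subset
has_partner = pd.Series(False, index=df1.index)

for subset in combinations(cols, 4):
    subset = list(subset)
    # Distinct key combinations present in df2 for this subset; dropping
    # duplicates keeps the merge result at exactly one row per df1 row.
    keys = df2[subset].drop_duplicates()
    merged = df1[subset].merge(keys, on=subset, how='left', indicator=True)
    # '_merge' is 'both' where the df1 row found a partner in df2
    has_partner |= (merged['_merge'] == 'both').to_numpy()

# Rows of df1 with no 4/5 or 5/5 partner in df2
unmatched = df1[~has_partner]
print(unmatched['SSN'].tolist())  # ['111111111']
```

A 5/5 match is covered automatically, since a row that agrees on all five columns also agrees on every 4-column subset.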

【Comments】: