Compare rows in one df to rows in a different df when the columns and lengths of the dfs differ

【Question title】: Compare rows in df to rows in different df when columns and length of df is different
【Posted】: 2020-10-15 09:09:52
【Question description】:

I have the following data in df1:

   id       date ... paid
0 123 2020-10-14 ... 30.0
1 234 2020-09-23 ... 25.5
2 356 2020-08-25 ... 35.5

And some other information in df2:

   id payment_date amount type ...       other_info
0 568   2020-08-25   15.9 adj1 ...       some_words
1 123   2020-10-14   20.0 adj2 ...       more_words
2 234   2020-09-23   25.5 adj2 ... some_other_words
3 356   2020-08-25   35.5 adj2 ...  some_more_words

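I need to compare each row of df1 with the rows of df2 on the specific columns mentioned. If they match exactly, I want to add a column to df1 holding a boolean result, or a str such as "Yes". The final output should look like this: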

   id       date ... paid new_col
0 123 2020-10-14 ... 30.0   False
1 234 2020-09-23 ... 25.5    True
2 356 2020-08-25 ... 35.5    True

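Note that the index does not matter on either DataFrame, and the two differ in length (df1 has about 100,000 rows and 6 columns; df2 has about 2,000,000 rows and 13 columns). The other columns are irrelevant to the comparison.

I tried using something like: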

df1["new_col"] = ((df1["id"] == df2["id"]) &
                  (df1["date"] == df2["payment_date"]) &
                  (df1["paid"] == df2["amount"]))

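But I get: "ValueError: Can only compare identically-labeled Series objects". I can't use something like merge, because the columns are not the same and df2 is so large that it would take extra time. I also can't use pd.Series.isin(), because each ID has many dates and amounts, and all of them must match exactly. Several rows also have the same date and amount, so the match has to be made on all three columns mentioned together.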
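To make the error concrete, here is a minimal sketch that reproduces it, assuming two toy Series (`a` and `b` are made-up names, not the asker's data). The `==` operator between two Series compares them label by label, so it refuses to run when the indexes differ, as they must when the lengths differ. Even if the lengths happened to agree, a positional comparison would only check row i against row i, not against every row of the other frame.

```python
import pandas as pd

# Hypothetical stand-ins for df1["id"] and df2["id"]: different lengths,
# therefore different indexes (0..2 versus 0..3).
a = pd.Series([123, 234, 356])
b = pd.Series([568, 123, 234, 356])

try:
    a == b  # element-wise comparison requires identically labeled Series
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```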

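I'm looking for a vectorized way to solve this, or simply an efficient approach that doesn't require iterating row by row over both DataFrames.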
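One vectorized option that does not build a merged frame is a tuple-wise membership test with `pd.MultiIndex.isin`. This is a different technique from the merge shown in the answer, and the sketch below is only an illustration: the toy frames mirror the visible columns of the question, the date columns are assumed to share a dtype (plain strings here), and the amounts are assumed to be exactly equal floats. Whether it is faster than merge on 2,000,000 rows would need to be measured.

```python
import pandas as pd

# Toy data mirroring the visible columns of the question (elided columns omitted).
df1 = pd.DataFrame({
    "id": [123, 234, 356],
    "date": ["2020-10-14", "2020-09-23", "2020-08-25"],
    "paid": [30.0, 25.5, 35.5],
})
df2 = pd.DataFrame({
    "id": [568, 123, 234, 356],
    "payment_date": ["2020-08-25", "2020-10-14", "2020-09-23", "2020-08-25"],
    "amount": [15.9, 20.0, 25.5, 35.5],
    "type": ["adj1", "adj2", "adj2", "adj2"],
})

# Turn each (id, date, amount) triple into one MultiIndex entry per row.
left_keys = pd.MultiIndex.from_frame(df1[["id", "date", "paid"]])
right_keys = pd.MultiIndex.from_frame(df2[["id", "payment_date", "amount"]])

# isin returns a boolean array in df1's row order: True where the whole
# triple occurs somewhere in df2, regardless of either frame's index.
df1["new_col"] = left_keys.isin(right_keys)

print(df1)
```

Because the result is a plain positional array, it can be assigned to df1 whatever df1's index looks like, and duplicate triples in df2 do not change the row count.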

Tags: python pandas dataframe vectorization

【Solution 1】:

You can use merge, like this:

In [37]: df1['new_col'] = df1.merge(df2,
             left_on=['id', 'date', 'paid'],
             right_on=['id', 'payment_date', 'amount'],
             how='left', indicator=True)['_merge'].eq('both')

In [38]: df1
Out[38]: 
    id        date  paid  new_col
0  123  2020-10-14  30.0    False
1  234  2020-09-23  25.5     True
2  356  2020-08-25  35.5     True
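The session above can be turned into a self-contained script. The sketch below is a hedged variant rather than the answerer's exact code: it assumes toy data shaped like the question's, and it adds two precautions that are not in the original answer. First, df2 is reduced to its distinct key triples before merging, because a left merge against duplicated keys would return more rows than df1 has. Second, the flag is assigned by position with `to_numpy()`, so it still lines up when df1 does not have a default 0..n-1 index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [123, 234, 356],
    "date": ["2020-10-14", "2020-09-23", "2020-08-25"],
    "paid": [30.0, 25.5, 35.5],
})
# The last row deliberately repeats a key triple to show why de-duplication matters.
df2 = pd.DataFrame({
    "id": [568, 123, 234, 356, 356],
    "payment_date": ["2020-08-25", "2020-10-14", "2020-09-23",
                     "2020-08-25", "2020-08-25"],
    "amount": [15.9, 20.0, 25.5, 35.5, 35.5],
    "type": ["adj1", "adj2", "adj2", "adj2", "adj2"],
})

right_cols = ["id", "payment_date", "amount"]
# Keep only the comparison columns, one row per distinct triple.
keys = df2[right_cols].drop_duplicates()

merged = df1.merge(
    keys,
    left_on=["id", "date", "paid"],
    right_on=right_cols,
    how="left",       # keeps every df1 row, in df1's order
    indicator=True,   # adds a "_merge" column: "both" or "left_only"
)

# Positional assignment: merged has exactly one row per df1 row.
df1["new_col"] = merged["_merge"].eq("both").to_numpy()

print(df1)
```

Selecting only the three key columns from df2 also keeps the intermediate frame small, which may help with the memory and time concerns raised in the question.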
