Marking records that don't match between two dataframes: answers

【Question title】: Marking records that don't match between two dataframes
【Posted】: 2022-10-07 19:38:41
【Question description】:

I have a benchmark dataframe:

my_id    parent_id    attribute_1    attribute_2     attribute_3       attribute_4
  ABC          DEF             A-          378.8          Accept             False
  ABS          DES             A-          388.8          Accept             False
  ABB          DEG             A           908.8          Decline             True
  ABB          DEG             B-          378.8          Accept             False
  APP          DRE             C-          370.8          Accept              True

and a second dataframe:

my_id    parent_id    Attribute_1     attribute2           attr_3        attribute_5
  ABC          DEF             A-          478.8          Decline              StRing
  ABS          DES             A-          388.8          Accept               String
  ABB          DEG             A           908.8          Accept               StrIng
  ABB          DEG             C-          378.8          Accept               String
  APP          DRE             C-          370.8          Accept               STring

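As you can see, errors occasionally show up in attribute_1, attribute_2, or attribute_3 (the column names differ between the two dataframes, but the columns are supposed to hold the same content).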

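How can I flag the faulty records when I check, for every row, whether these three attributes are exactly the same as in the benchmark? I expect output like this: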

faulty_rows = 

    my_id    parent_id    Attribute_1     attribute2           attr_3       faulty_attr 
      ABC          DEF             A-          478.8          Decline       [attribute2, attr_3]                  
      ABB          DEG             A           908.8          Accept        [attr_3]      
      ABB          DEG             C-          378.8          Accept        [Attribute_1]

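What I have done so far is rename the columns and join on one column at a time, which tells me what went wrong. But I would like to check the whole row at once and mark where the error is. Is that possible? Either a PySpark or a Pandas solution is fine; I am mainly interested in the logic.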

【Question comments】:

Is the row order the same in both dataframes?
@PaulS Most likely not.

Tags: python pandas dataframe join pyspark

【Solution 1】:

DeepDiff might be a solution (assuming A refers to your first dictionary and B to your second)?
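The answer does not show how A and B are built. One possible way, sketched here as an assumption rather than the answerer's method, is to align the column names and convert each dataframe to a dict of column lists, which matches the `root['attribute_1'][3]` style of paths in the output shown further down. The names `benchmark_df`, `other_df`, `rename_map`, and `shared` are hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
benchmark_df = pd.DataFrame({
    "my_id": ["ABC", "ABS", "ABB", "ABB", "APP"],
    "parent_id": ["DEF", "DES", "DEG", "DEG", "DRE"],
    "attribute_1": ["A-", "A-", "A", "B-", "C-"],
    "attribute_2": [378.8, 388.8, 908.8, 378.8, 370.8],
    "attribute_3": ["Accept", "Accept", "Decline", "Accept", "Accept"],
    "attribute_4": [False, False, True, False, True],
})
other_df = pd.DataFrame({
    "my_id": ["ABC", "ABS", "ABB", "ABB", "APP"],
    "parent_id": ["DEF", "DES", "DEG", "DEG", "DRE"],
    "Attribute_1": ["A-", "A-", "A", "C-", "C-"],
    "attribute2": [478.8, 388.8, 908.8, 378.8, 370.8],
    "attr_3": ["Decline", "Accept", "Accept", "Accept", "Accept"],
    "attribute_5": ["StRing", "String", "StrIng", "String", "STring"],
})

# Assumed mapping from the second frame's column names to the benchmark's.
rename_map = {
    "Attribute_1": "attribute_1",
    "attribute2": "attribute_2",
    "attr_3": "attribute_3",
}
# Only the columns that are meant to be compared.
shared = ["my_id", "parent_id", "attribute_1", "attribute_2", "attribute_3"]

# Dicts of the form {column name: [values in row order]}.
A = benchmark_df[shared].to_dict(orient="list")
B = other_df.rename(columns=rename_map)[shared].to_dict(orient="list")
```

Note that this compares rows by position, so it only makes sense if both dataframes are sorted the same way first.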

from deepdiff import DeepDiff

print(DeepDiff(A, B, ignore_order=False).pretty())

### Resulting output:
### Value of root['attribute_1'][3] changed from "B-" to "C-".
### Value of root['attribute_2'][0] changed from 378.8 to 478.8.
### Value of root['attribute_3'][0] changed from "Accept" to "Decline".
### Value of root['attribute_3'][2] changed from "Decline" to "Accept".
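For the `faulty_attr` column the question asks for, a pandas-only sketch could look like the following. It is not part of the original answer. The names `benchmark`, `other`, `col_map`, and `occ` are made up for illustration. Because (my_id, parent_id) is not unique in the sample data (ABB/DEG appears twice), the sketch numbers repeated keys and assumes the duplicates appear in the same relative order in both dataframes; without a truly unique key that assumption cannot be avoided.

```python
import pandas as pd

# Sample data copied from the question.
benchmark = pd.DataFrame({
    "my_id": ["ABC", "ABS", "ABB", "ABB", "APP"],
    "parent_id": ["DEF", "DES", "DEG", "DEG", "DRE"],
    "attribute_1": ["A-", "A-", "A", "B-", "C-"],
    "attribute_2": [378.8, 388.8, 908.8, 378.8, 370.8],
    "attribute_3": ["Accept", "Accept", "Decline", "Accept", "Accept"],
    "attribute_4": [False, False, True, False, True],
})
other = pd.DataFrame({
    "my_id": ["ABC", "ABS", "ABB", "ABB", "APP"],
    "parent_id": ["DEF", "DES", "DEG", "DEG", "DRE"],
    "Attribute_1": ["A-", "A-", "A", "C-", "C-"],
    "attribute2": [478.8, 388.8, 908.8, 378.8, 370.8],
    "attr_3": ["Decline", "Accept", "Accept", "Accept", "Accept"],
    "attribute_5": ["StRing", "String", "StrIng", "String", "STring"],
})

# Column in `other` -> column in `benchmark` that should hold the same content.
col_map = {
    "Attribute_1": "attribute_1",
    "attribute2": "attribute_2",
    "attr_3": "attribute_3",
}
keys = ["my_id", "parent_id"]
cols = list(col_map)

# Number repeated keys (0, 1, ...) so the join is one-to-one.
# ASSUMPTION: duplicate keys occur in the same order in both frames.
benchmark = benchmark.assign(occ=benchmark.groupby(keys).cumcount())
other = other.assign(occ=other.groupby(keys).cumcount())

# A single join brings the benchmark values next to the values to check.
merged = other.merge(
    benchmark[keys + ["occ"] + list(col_map.values())],
    on=keys + ["occ"],
    how="left",
)

# One boolean column per checked attribute: True where the values differ.
# Rows with no benchmark match get NaN and are flagged on every attribute.
mismatch = pd.DataFrame({c: merged[c] != merged[b] for c, b in col_map.items()})

# Collect, per row, the names of the columns that differ.
merged["faulty_attr"] = pd.Series(
    [[c for c, bad in zip(cols, flags) if bad] for flags in mismatch.to_numpy()],
    index=merged.index,
)

faulty_rows = merged.loc[mismatch.any(axis=1), keys + cols + ["faulty_attr"]]
print(faulty_rows)
```

The whole row is checked in one pass because the comparison runs on the joined frame, so there is no need to join once per column. The same logic (rename or map the columns, join once, compare column pairs, collect the names that differ) should carry over to PySpark, although the calls would differ.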

【Discussion】: