How to check if df1 is equal to df2 in PySpark? Answers

【Question title】: How to check if df1 is equal to df2 in PySpark?
【Posted】: 2020-08-10 19:26:41
【Question description】:

df1.show()
+---------+
|Data_Type|
+---------+
|   string|
|   string|
|      int|
+---------+
df2.show()
+---------+
|Data_Type|
+---------+
|   string|
|   string|
|      int|
+---------+

I want to compare the column in df1 with the rows in df2["Column_name"] (an equality check).

I tried using joins to compare them, i.e.:

# Attempt: join df1 with df2 twice, then compare the row counts
df3 = df1.join(df2,on="Data_Type",how="left").join(df2,on="Data_Type",how="right")
if df3.count() == df1.count() == df2.count():
    print(True)

But this does not work because the "Data_Type" column contains duplicate values, and after the joins I get a cross-product-like output, as shown below:

+---------+
|Data_Type|
+---------+
|      int|
|   string|
|   string|
|   string|
|   string|
|   string|
|   string|
|   string|
|   string|
+---------+
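
The nine rows above follow from how equi-joins treat duplicate keys: each join pairs every matching row on one side with every matching row on the other. Below is a minimal plain-Python sketch of that arithmetic, with lists standing in for the single-column DataFrames. The helper name `joined_row_counts` is hypothetical, and the sketch only models row counts under the assumption that every key occurs on both sides; it does not run Spark.

```python
from collections import Counter

def joined_row_counts(left_keys, right_keys):
    """Rows per key after chaining two equi-joins against right_keys.

    Models df1.join(df2, ...).join(df2, ...) when every key occurs on
    both sides, so the outer joins behave like inner joins.
    """
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each join multiplies the matching rows: left * right, then * right again.
    return {key: left[key] * right[key] * right[key] for key in left}

df1_keys = ["string", "string", "int"]
df2_keys = ["string", "string", "int"]

counts = joined_row_counts(df1_keys, df2_keys)
total = sum(counts.values())
print(counts)  # {'string': 8, 'int': 1}
print(total)   # 9
```

The two "string" rows on each side give 2 * 2 * 2 = 8 rows, and the single "int" row gives 1, so the joined row count can never match df1.count() once a key is duplicated.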

Is there any other way to do an equality check on DataFrames?

【Discussion】:

Tags: dataframe join pyspark

【Solution 1】:

In Spark, use exceptAll (which preserves duplicates) or subtract (which, like SQL's EXCEPT DISTINCT, ignores duplicates).

df1.show()
#+---------+
#|Data_Type|
#+---------+
#|   string|
#|   string|
#|      int|
#+---------+


df2.show()
#+---------+
#|Data_Type|
#+---------+
#|   string|
#|   string|
#|      int|
#+---------+

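# exceptAll keeps duplicates: both counts are 0 only if the rows and their multiplicities match
df1.exceptAll(df2).count()
df2.exceptAll(df1).count()
#0
# subtract drops duplicates (EXCEPT DISTINCT), so it compares distinct rows only
df1.subtract(df2).count()
df2.subtract(df1).count()
#0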
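
The two exceptAll checks can be wrapped into a single yes/no answer. The following is a minimal sketch, not a definitive implementation: the function name `dataframes_equal` is hypothetical, and it assumes both arguments are PySpark DataFrames on Spark 2.4 or later (where `exceptAll` is available). It uses only `DataFrame.columns`, `DataFrame.exceptAll` and `DataFrame.count`.

```python
def dataframes_equal(df1, df2):
    """Return True if df1 and df2 hold the same rows with the same multiplicities.

    Hypothetical helper. Row order is ignored; duplicates are respected
    because exceptAll does not deduplicate.
    """
    # exceptAll matches columns by position, so compare the column names first.
    if df1.columns != df2.columns:
        return False
    # Both differences must be empty; checking one direction is not enough,
    # since df1 could be a strict subset of df2.
    return df1.exceptAll(df2).count() == 0 and df2.exceptAll(df1).count() == 0
```

With the df1 and df2 shown above, `dataframes_equal(df1, df2)` would be expected to return True. Note that each `count()` triggers a Spark job, so on large DataFrames this check is not free.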

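【Discussion】:

【Solution 2】:

Hope you are doing well in these difficult times!

You can try converting both DataFrames to sets and subtracting one from the other. You can then convert the result back into a DataFrame.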

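    # Rows present in df1 but not in df2 (set() collapses duplicate rows)
    lst = list(set(df1.collect()) - set(df2.collect()))
    # Pass the schema explicitly: it cannot be inferred when lst is empty
    diff_df = spark.createDataFrame(lst, schema=df1.schema)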

Also, since we are using sets, the order of the rows in the two DataFrames does not matter, even if you have multiple columns.
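
Because a set keeps only one copy of each row, the set difference cannot detect a mismatch in how often a row occurs, which matters for data like the Data_Type column in the question. Below is a minimal plain-Python sketch, with tuples standing in for the Row objects returned by collect() and sample data made up for illustration. `collections.Counter` is used here as a swapped-in multiset comparison; it is not part of the answer above.

```python
from collections import Counter

# Tuples stand in for Row objects; rows2 is missing one "string" row.
rows1 = [("string",), ("string",), ("int",)]
rows2 = [("string",), ("int",)]

# Set difference: duplicates are collapsed, so the mismatch is invisible.
set_diff = set(rows1) - set(rows2)
same_set = set(rows1) == set(rows2)

# Counter difference: multiplicities are kept, so the extra row shows up.
multiset_diff = Counter(rows1) - Counter(rows2)
same_multiset = Counter(rows1) == Counter(rows2)

print(set_diff, same_set)            # set() True
print(multiset_diff, same_multiset)  # Counter({('string',): 1}) False
```

Either approach collects every row to the driver, so it is only practical for small DataFrames.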

Hope this helps!

【Discussion】: