PySpark: Flag rows of specific columns in df1 that exist in df2? (Answers)

【Question title】: PySpark: Flag rows of specific columns in df1 that exist in df2?
【Posted】: 2021-08-27 22:31:36
【Question】:

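I am using PySpark. I have two DataFrames, df1 and df2. I want to add a new column to df1 that flags which rows of df1's columns (A, B) do and do not exist in df2's columns (D, E): 1 if the pair exists in df2, 0 otherwise. An example of the transformation: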

df1

A	B	C	Exist
0	0	1	0
0	0	1	1
0	0	1	0

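The columns of interest are A and B in df1, and D and E in df2. Only the second row matches on these columns, so the newly created Exist column in df1 is set to 1 for that row. How can I achieve this?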

【Question comments】:

【Solution 1】:

df1.createOrReplaceTempView("table1")

df2.createOrReplaceTempView("table2")

spark.sql("select A, B, C, case when D is null and E is null then 0 else 1 end as Exist from table1 left outer join table2 on A = D and B = E").show()

【Comments】: