【Posted】: 2021-01-15 20:02:12
【Question】:
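I have two similar Spark DataFrames, df1 and df2, and I want to compare them for changes:
- df1 and df2 share the same columns
- df2 can have more rows than df1, but any extra rows in df2 that are not in df1 can be ignored in the comparison
- the key columns for the comparison are PROGRAM_NAME and ACTION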
df1 = spark.createDataFrame([
["PROG1","ACTION1","10","NEW"],
["PROG2","ACTION2","12","NEW"],
["PROG3","ACTION1","14","NEW"],
["PROG4","ACTION4","16","NEW"]
],["PROGRAM_NAME", "ACTION", "VALUE1", "STATUS"])
df2 = spark.createDataFrame([
["PROG1","ACTION1","11","IN PROGRESS"],
["PROG2","ACTION2","12","NEW"],
["PROG3","ACTION1","20","FINISHED"],
["PROG4","ACTION4","14","IN PROGRESS"],
["PROG5","ACTION1","20","NEW"]
],["PROGRAM_NAME", "ACTION", "VALUE1", "STATUS"])
df1, df2, and the expected result I want after comparing the two DataFrames are shown below.
【Comments】:
Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes