How to compare two tables and replace nulls with values from the other table

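【Question title】: How to compare two tables and replace nulls with values from other table
【Posted】: 2019-08-27 07:11:44
【Question description】:

I am working on a task involving two tables whose columns may be the same or different. If a record in table A has a null in some column, that value must be filled in from table B, and vice versa.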

table A

id | code | type
1  | null | A
2  | null | null
3  | 123  | C

table B

id | code | type
1  | 456  | A
2  | 789  | A1
3  | null | C

What I have done so far:

Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\1199_data\\d1_1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\1199_data\\d2_1.csv");

df1
    .as("a").join(df2.as("b"))
    .where("a.id == b.id")
    .withColumn("a.code",
        functions.when(
            df1.col("code").isNull(),
            df2.col("code")))
    .show();

Desired output:

table C

id | code | type
1  | 456  | A
2  | 789  | A1
3  | 123  | C

【Question discussion】:

Tags: apache-spark apache-spark-sql dataset

【Solution 1】:

You can use the coalesce function, which returns the first of its arguments that is not null:

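import org.apache.spark.sql.functions.coalesce

// Scala: inner join on id, then take each column from df1,
// falling back to df2 wherever the df1 value is null.
df1.join(df2, "id")
   .select(df1("id"),
           coalesce(df1("code"), df2("code")).as("code"),
           coalesce(df1("type"), df2("type")).as("type"))
   .show()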

This gives the output:

+---+----+----+
| id|code|type|
+---+----+----+
|  1| 456|   A|
|  2| 789|  A1|
|  3| 123|   C|
+---+----+----+
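
The question uses the Java API, while the snippet above is Scala (df1("id") is Scala's apply syntax). A rough Java equivalent might look like the sketch below. The class name FillNullsExample, the helper method table, and the in-memory rows standing in for the two CSV files are assumptions made for illustration; the sketch also assumes Spark SQL is on the classpath and runs in local mode.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.coalesce;

public class FillNullsExample {

    // Hypothetical helper: builds an (id, code, type) table from in-memory rows.
    // All columns are strings, as they would be when a CSV is read without schema inference.
    static Dataset<Row> table(SparkSession spark, Row... rows) {
        StructType schema = new StructType()
                .add("id", DataTypes.StringType, false)
                .add("code", DataTypes.StringType, true)
                .add("type", DataTypes.StringType, true);
        return spark.createDataFrame(Arrays.asList(rows), schema);
    }

    // Inner join on id, then take each column from df1 and fall back to df2
    // wherever the df1 value is null. The result is sorted by id for stable output.
    static Dataset<Row> fillNulls(Dataset<Row> df1, Dataset<Row> df2) {
        return df1.join(df2, "id")
                .select(df1.col("id"),
                        coalesce(df1.col("code"), df2.col("code")).as("code"),
                        coalesce(df1.col("type"), df2.col("type")).as("type"))
                .orderBy("id");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("fill-nulls-example")
                .getOrCreate();

        // Sample data copied from table A and table B in the question.
        Dataset<Row> df1 = table(spark,
                RowFactory.create("1", null, "A"),
                RowFactory.create("2", null, null),
                RowFactory.create("3", "123", "C"));
        Dataset<Row> df2 = table(spark,
                RowFactory.create("1", "456", "A"),
                RowFactory.create("2", "789", "A1"),
                RowFactory.create("3", null, "C"));

        fillNulls(df1, df2).show();
        spark.stop();
    }
}
```

On the sample data this should show the same three rows as the output above. Two things to keep in mind: join(df2, "id") is an inner join, so an id that exists in only one of the tables is dropped, and when reading CSV files Spark treats empty fields as null by default, so a file that spells missing values as the literal text null would likely need .option("nullValue", "null") on the reader.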

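【Discussion】:

Is .withColumnRenamed("coalesce(code, code)", "Code") the only way to rename the column, or can it be done inside the select function?
You can rename the column by applying an alias to the column expression afterwards. (See my edited answer.)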
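
To make the two renaming options from this exchange concrete, here is a minimal Java sketch that puts them side by side. The class name RenameCoalescedColumn, the helper table, and the two-column sample rows are hypothetical; as before, it assumes Spark SQL is on the classpath and runs in local mode. The second variant reads the generated column name back from the result rather than hard-coding it, since the exact generated name depends on how the columns are referenced.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.coalesce;

public class RenameCoalescedColumn {

    // Hypothetical helper: builds an (id, code) table of strings from in-memory rows.
    static Dataset<Row> table(SparkSession spark, Row... rows) {
        StructType schema = new StructType()
                .add("id", DataTypes.StringType, false)
                .add("code", DataTypes.StringType, true);
        return spark.createDataFrame(Arrays.asList(rows), schema);
    }

    // Option 1: name the column inside select by calling as(...) on the expression.
    static Dataset<Row> withAlias(Dataset<Row> df1, Dataset<Row> df2) {
        return df1.join(df2, "id")
                .select(df1.col("id"),
                        coalesce(df1.col("code"), df2.col("code")).as("code"));
    }

    // Option 2: let Spark generate a name, then rename that column afterwards.
    static Dataset<Row> withRename(Dataset<Row> df1, Dataset<Row> df2) {
        Dataset<Row> unnamed = df1.join(df2, "id")
                .select(df1.col("id"),
                        coalesce(df1.col("code"), df2.col("code")));
        // Read the generated name back instead of guessing its exact spelling.
        String generatedName = unnamed.columns()[1];
        return unnamed.withColumnRenamed(generatedName, "code");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("rename-coalesced-column")
                .getOrCreate();

        Dataset<Row> df1 = table(spark,
                RowFactory.create("1", null),
                RowFactory.create("3", "123"));
        Dataset<Row> df2 = table(spark,
                RowFactory.create("1", "456"),
                RowFactory.create("3", null));

        // Both variants should end up with the same column names.
        System.out.println(Arrays.toString(withAlias(df1, df2).columns()));
        System.out.println(Arrays.toString(withRename(df1, df2).columns()));
        spark.stop();
    }
}
```

The alias form is usually the simpler choice because it does not depend on the name Spark generates for the unnamed expression.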