How to union DataFrames and add only missing rows?

【Question title】: How to union DataFrames and add only missing rows?
【Posted】: 2017-06-04 04:52:52
【Question】:

I have a DataFrame df1 with the following data:

customer_id   product   Val_id   rule_name
1             A         1        rule1
2             B         X        rule1

I have another DataFrame df2 with the following data:

customer_id   product   Val_id   rule_name
1             A         1        rule2
2             B         X        rule2
3             C         y        rule2

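The rule_name value is always fixed within each of the two DataFrames.

I want a new, unioned DataFrame df3. It should have all customers from df1, plus every customer from df2 that does not exist in df1. So the final df3 should look like this: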

customer_id   product   Val_id   rule_name
1             A         1        rule1
2             B         X        rule1
3             C         y        rule2

Can anyone help me achieve this result? Any help would be appreciated.

【Comments】:

Tags: scala apache-spark apache-spark-sql

【Solution 1】:

Given the following datasets:

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

and the requirement:

It should have all customers from df1, plus every customer from df2 that does not exist in df1.

my first solution could be as follows:

val missingCustomers = df2.
  join(df1, Seq("customer_id"), "leftanti").
  select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))
val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
|          1|      A|     1|    rule1|
|          2|      B|     X|    rule1|
|          3|      C|     y|    rule2|
+-----------+-------+------+---------+
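
As a side note, here is a minimal standalone sketch of the same anti-join idea, in case you want to run it outside spark-shell. The object name `UnionMissingRows`, the helper `unionMissing`, and the local `SparkSession` settings are my own assumptions for illustration, not part of the original answer. It relies on two things: a left anti join only returns the left side's columns (so the explicit select can be dropped), and `unionByName` (available from Spark 2.3) matches columns by name rather than by position.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionMissingRows {
  // Hypothetical helper: all rows of `base`, plus the rows of `extra`
  // whose `key` value does not occur in `base`.
  def unionMissing(base: DataFrame, extra: DataFrame, key: String): DataFrame = {
    // A left anti join keeps only rows of `extra` with no match in `base`
    // and returns only the columns of `extra`.
    val missing = extra.join(base, Seq(key), "leftanti")
    // unionByName (Spark 2.3+) aligns columns by name, not by position.
    base.unionByName(missing)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-missing-rows")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      (1, "A", "1", "rule1"),
      (2, "B", "X", "rule1")
    ).toDF("customer_id", "product", "Val_id", "rule_name")

    val df2 = Seq(
      (1, "A", "1", "rule2"),
      (2, "B", "X", "rule2"),
      (3, "C", "y", "rule2")
    ).toDF("customer_id", "product", "Val_id", "rule_name")

    unionMissing(df1, df2, "customer_id").orderBy("customer_id").show()

    spark.stop()
  }
}
```

Since everything stays a DataFrame operation, nothing is collected to the driver, which is the main difference from the second approach below.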

Another (possibly slower) solution is as follows:

// find missing ids, i.e. ids in df2 that are not in df1
// BE EXTRA CAREFUL: "Downloading" all missing ids to the driver
val missingIds = df2.
  select("customer_id").
  except(df1.select("customer_id")).
  as[Int].
  collect

// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))

scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
|          1|      A|     1|    rule1|
|          2|      B|     X|    rule1|
|          3|      C|     y|    rule2|
+-----------+-------+------+---------+

【Discussion】:

That won't work here. If the data is not erroneous, you have to join.