Aggregate data from a dataframe column based on another column in Scala: answers

【Question title】: Aggregate data from a dataframe column based on another column in Scala
【Posted】: 2019-12-18 15:57:53
【Question】:

I have a table (dataframe) as follows:

    scala> df1.printSchema
      root
       |-- id: string (nullable = true)
       |-- col1: string (nullable = true)
       |-- col2: array (nullable = true)
       |    |-- element: string (containsNull = true)

For each element of col2, I need to build an array of the col1 values it appears with, as follows:

     scala> df2.printSchema
      root
       |-- id: string (nullable = true)
       |-- c1: array (nullable = true)
       |    |-- element: string (containsNull = true)
       |-- c2: string (nullable = true)

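df2.c2 holds each individual element of df1.col2, and df2.c1 is the array of df1.col1 values whose col2 array (within the same id) contains that element.

A solution in SQL (Hive) or Spark/Scala would be helpful.

Further explanation: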

df1:

  +----------------------------+
  | id | col1 |       col2     |
  +----------------------------+
  | 1  |  q1  |[i1, i2]        |
  | 1  |  q2  |[i1, i3]        |
  | 1  |  q3  |[i2, i4]        |
  | 2  |  q4  |[i5]            |
  | 2  |  q5  |[i6]            |
  | 3  |  q6  |[i7,i1,i2]      |
  | 3  |  q7  |[i1]            |
  +----------------------------+

df2:

  +----------------------------+
  | id |    c1      |    c2    |
  +----------------------------+
  | 1  |  [q1, q2]  |    i1    |
  | 1  |  [q1, q3]  |    i2    |
  | 1  |  [q2]      |    i3    |
  | 1  |  [q3]      |    i4    |
  | 2  |  [q4]      |    i5    |
  | 2  |  [q5]      |    i6    |
  | 3  |  [q6]      |    i7    |
  | 3  |  [q6, q7]  |    i1    |
  | 3  |  [q6]      |    i2    |
  +----------------------------+

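【Comments】:

Could you add some sample data and your expected output? That would make the requirement easier to understand. I am a bit confused by this line: "if the df1.col2 elements are in the same row, df1.col1 will be added to df2.c2". Thanks
I removed it because it was confusing. It was an extra explanation that was not needed; basically, I need to collect all the col1 entries into a set for each element of the arrays in col2.
Is this question a duplicate of stackoverflow.com/q/57447955/2700344 ?
No. What I did was: first explode col2, then group by id and col1. Is that correct?
You need to aggregate col1, grouping by id and the exploded col2. See my answer

Tags: sql scala dataframe join hive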

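【Solution 1】:

First explode col2 so that each array element gets its own row, then group by id and the exploded element and aggregate col1 into an array with collect_set: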

select  d.id, collect_set(d.col1) as c1, s.c2
   from df1 d lateral view explode(d.col2) s as c2
group by d.id, s.c2
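
For readers who prefer the DataFrame API, the same explode-then-aggregate idea can be sketched in Spark/Scala as below. This is only an illustration and not part of the original answer: the local SparkSession, the app name, and the hand-built df1 (copied from the sample table in the question) are assumptions made so the snippet can be pasted into spark-shell or run as a script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists
// and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-collect-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the df1 table in the question.
val df1 = Seq(
  ("1", "q1", Seq("i1", "i2")),
  ("1", "q2", Seq("i1", "i3")),
  ("1", "q3", Seq("i2", "i4")),
  ("2", "q4", Seq("i5")),
  ("2", "q5", Seq("i6")),
  ("3", "q6", Seq("i7", "i1", "i2")),
  ("3", "q7", Seq("i1"))
).toDF("id", "col1", "col2")

val df2 = df1
  .withColumn("c2", explode(col("col2"))) // one row per element of col2
  .groupBy("id", "c2")                    // group by id and the exploded element
  .agg(collect_set("col1").as("c1"))      // distinct col1 values per group
  .select("id", "c1", "c2")               // column order of the target schema

df2.show(false)
```

Two things to keep in mind: collect_set removes duplicates and does not guarantee the order of elements inside c1, and explode emits no rows for a null or empty col2, so such rows disappear from the result (explode_outer keeps them with a null c2, if that is what you need).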

【Discussion】: