Pivot Spark multilevel Dataset: answers

【Question title】: Pivot Spark multilevel Dataset
【Posted】: 2017-06-08 19:54:10
【Question description】:

I have a Dataset in Spark with this schema:

root
 |-- from: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v1: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v2: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v3: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- to: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)

How can I turn this Dataset into a table with only three columns (id, name, tags) in Scala?

【Question comments】:

Tags: scala apache-spark

【Solution 1】:

Just combine all the columns into an array, explode it, and select all the nested fields:

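import org.apache.spark.sql.functions.{array, col, explode}
import spark.implicits._  // needed for toDF; `spark` is the active SparkSession (predefined in spark-shell)

case class Vertex(id: String, name: String, tags: String)

// One row with five struct columns, mirroring the schema in the question
val df = Seq((
  Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
  Vertex("4", "v3", "d"), Vertex("5", "to", "e")
)).toDF("from", "v1", "v2", "v3", "to")

// Pack all columns into one array, explode it into one row per struct,
// then expand the struct fields into top-level columns
df.select(explode(array(df.columns.map(col): _*)).alias("col")).select("col.*").show()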

The result is:

+---+----+----+
| id|name|tags|
+---+----+----+
|  1|from|   a|
|  2|  V1|   b|
|  3|  V2|   c|
|  4|  v3|   d|
|  5|  to|   e|
+---+----+----+
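
To see what each stage of that one-liner contributes, the same pipeline can be split into three named steps. This is only a sketch: the object name `FlattenStructs`, the helper `flatten`, and the intermediate column names `vertices` and `vertex` are made up for illustration, and it assumes a local SparkSession and that every column of the input is a struct of the same type (`array` rejects columns whose types differ).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode}

object FlattenStructs {
  case class Vertex(id: String, name: String, tags: String)

  // Hypothetical helper: assumes all columns of `df` are structs of one common type.
  def flatten(df: DataFrame): DataFrame = {
    // Step 1: one row, one column holding an array of all the structs
    val packed = df.select(array(df.columns.map(col): _*).alias("vertices"))
    // Step 2: one row per array element, each holding a single struct
    val exploded = packed.select(explode(col("vertices")).alias("vertex"))
    // Step 3: promote the struct fields (id, name, tags) to top-level columns
    exploded.select("vertex.*")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-structs")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((
      Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
      Vertex("4", "v3", "d"), Vertex("5", "to", "e")
    )).toDF("from", "v1", "v2", "v3", "to")

    val flat = flatten(df)
    flat.printSchema() // three top-level string columns: id, name, tags
    flat.show()

    spark.stop()
  }
}
```

Calling `printSchema()` on the intermediate `packed` and `exploded` values inside `flatten` is a quick way to check how the shape changes at each step.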

【Comments】: