【Question Title】: Pivot spark multilevel Dataset
【Posted】: 2017-06-08 19:54:10
【Question Description】:

I have a Dataset in Spark with the following schema:

root
 |-- from: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v1: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v2: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- v3: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)
 |-- to: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- tags: string (nullable = true)

How can I turn this Dataset into a table with only three columns (id, name, tags) in Scala?

【Question Comments】:

    Tags: scala apache-spark


    【Solution 1】:

    Just combine all the columns into an array, explode it, and select all the nested fields:

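    import org.apache.spark.sql.functions.{array, col, explode}
    // Needed for toDF; assumes a SparkSession named `spark` is in scope (as in spark-shell)
    import spark.implicits._

    case class Vertex(id: String, name: String, tags: String)

    // One sample row with five struct columns that share the same schema
    val df = Seq((
      Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
      Vertex("4", "v3", "d"), Vertex("5", "to", "e")
    )).toDF("from", "v1", "v2", "v3", "to")

    // Pack the structs into one array, emit one row per element, then expand the struct fields
    df.select(explode(array(df.columns.map(col): _*)).alias("col")).select("col.*")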
    

    The result is:

    +---+----+----+
    | id|name|tags|
    +---+----+----+
    |  1|from|   a|
    |  2|  V1|   b|
    |  3|  V2|   c|
    |  4|  v3|   d|
    |  5|  to|   e|
    +---+----+----+
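
    The approach above relies on every column being a struct of the same type, because `array` requires all of its inputs to share one data type. If the Dataset also carried columns of other types, one option would be to pick out only the struct columns before exploding. The following is a minimal sketch under that assumption; the `flattenStructs` helper, the `FlattenStructColumns` object, and the extra `weight` column are hypothetical names introduced for illustration and are not part of the original question or answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode}
import org.apache.spark.sql.types.StructType

object FlattenStructColumns {
  case class Vertex(id: String, name: String, tags: String)

  // Hypothetical helper: explode only the struct-typed columns.
  // Assumes all struct columns share the same schema; otherwise `array` fails at analysis time.
  def flattenStructs(df: DataFrame): DataFrame = {
    val structCols = df.schema.fields
      .filter(_.dataType.isInstanceOf[StructType])
      .map(f => col(f.name))
    df.select(explode(array(structCols: _*)).alias("col")).select("col.*")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-structs")
      .getOrCreate()
    import spark.implicits._

    // Two struct columns plus an illustrative non-struct column ("weight")
    val df = Seq(
      (Vertex("1", "from", "a"), Vertex("2", "V1", "b"), 42)
    ).toDF("from", "v1", "weight")

    // The integer column is skipped; only the two Vertex structs become rows
    flattenStructs(df).show()

    spark.stop()
  }
}
```

    With the question's schema, where all five columns are structs of the same shape, this should behave the same as exploding `df.columns` directly.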
    

    【Comments】:
