【问题标题】:Explode multiple columns of same type with different lengths分解多个不同长度的相同类型的列
【发布时间】:2019-02-23 00:05:41
【问题描述】:

我有一个以下格式的 spark 数据框需要分解。我检查了其他解决方案,例如this one。但是,就我而言,beforeafter 可以是不同长度的数组。

root
 |-- id: string (nullable = true)
 |-- before: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- start_time: string (nullable = true)
 |    |    |-- end_time: string (nullable = true)
 |    |    |-- area: string (nullable = true)
 |-- after: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- start_time: string (nullable = true)
 |    |    |-- end_time: string (nullable = true)
 |    |    |-- area: string (nullable = true)

例如,如果数据框只有一行,before 是一个大小为 2 的数组,after 是一个大小为 3 的数组,分解后的版本应该有 5 行,具有以下架构:

root
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- start_time: integer (nullable = false)
 |-- end_time: string (nullable = true)
 |-- area: string (nullable = true)

其中type 是一个新列,可以是"before" 或“之后”

我可以在两个单独的爆炸中执行此操作,我在每个爆炸中创建 type 列,然后在 union 中创建。

val dfSummary1 = df.withColumn("before_exp", 
explode($"before")).withColumn("type", 
lit("before")).withColumn(
"start_time", $"before_exp.start_time").withColumn(
"end_time", $"before_exp.end_time").withColumn(
"area", $"before_exp.area").drop("before_exp", "before")

val dfSummary2 = df.withColumn("after_exp", 
explode($"after")).withColumn("type", 
lit("after")).withColumn(
"start_time", $"after_exp.start_time").withColumn(
"end_time", $"after_exp.end_time").withColumn(
"area", $"after_exp.area").drop("after_exp", "after")

val dfResult = dfSumamry1.unionAll(dfSummary2)

但是,我想知道是否有更优雅的方法来做到这一点。谢谢。

【问题讨论】:

  • 分开做然后联合或加入 id 就是我所做的。

标签: scala apache-spark apache-spark-sql explode


【解决方案1】:

您也可以在没有联合的情况下实现此目的。用数据:

case class Area(start_time: String, end_time: String, area: String)

val df = Seq((
  "1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
  Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")

你可以的

df
  .select($"id",
    explode(
      array(
        struct(lit("before").as("type"), $"before".as("data")),
        struct(lit("after").as("type"), $"after".as("data"))
      )
    ).as("step1")
  )
 .select($"id",$"step1.type", explode($"step1.data").as("step2"))
 .select($"id",$"type", $"step2.*")
 .show()

+---+------+----------+--------+----+
| id|  type|start_time|end_time|area|
+---+------+----------+--------+----+
|  1|before|     01:00|   01:30|  10|
|  1|before|     02:00|   02:30|  20|
|  1| after|     07:00|   07:30|  70|
|  1| after|     08:00|   08:30|  80|
|  1| after|     09:00|   09:30|  90|
+---+------+----------+--------+----+

【讨论】:

    【解决方案2】:

    我认为exploding 两列分别后跟union 是一种相当直接的方法。您可以稍微简化 StructField 元素的选择,并为重复的 explode 过程创建一个简单的方法,如下所示:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.DataFrame
    
    case class Area(start_time: String, end_time: String, area: String)
    
    val df = Seq((
      "1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")
    
    def explodeCol(df: DataFrame, colName: String): DataFrame = {
      val expColName = colName + "_exp"
      df.
        withColumn("type", lit(colName)).
        withColumn(expColName, explode(col(colName))).
        select("id", "type", expColName + ".*")
    }
    
    val dfResult = explodeCol(df, "before") union explodeCol(df, "after")
    
    dfResult.show
    // +---+------+----------+--------+----+
    // | id|  type|start_time|end_time|area|
    // +---+------+----------+--------+----+
    // |  1|before|     01:00|   01:30|  10|
    // |  1|before|     02:00|   02:30|  20|
    // |  1| after|     07:00|   07:30|  70|
    // |  1| after|     08:00|   08:30|  80|
    // |  1| after|     09:00|   09:30|  90|
    // +---+------+----------+--------+----+
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2021-08-29
      • 2020-03-01
      • 1970-01-01
      • 2018-05-27
      • 1970-01-01
      • 2018-12-11
      • 2021-06-20
      • 2014-07-04
      相关资源
      最近更新 更多