【Question Title】: Combine all dataframe columns into one column as a JSON with JSON types preserved
【Posted】: 2021-12-28 22:40:03
【Question Description】:
val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

someDF.show(false)

+------+---------------------------------------------------------------+
|number|word                                                           |
+------+---------------------------------------------------------------+
|8     |{"details":{"decision":"ACCEPT","source":"Rules"}}             |
|64    |{"details":{"decision":"ACCEPT","source":"Rules"}}             |
+------+---------------------------------------------------------------+

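Problem statement: I want to combine all columns into a single column, with the JSON types preserved in that output column, i.e. without the quotes being escaped, which is what I get in the attempts below.

What I tried: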

someDF.toJSON.toDF.show(false)

// this escaped the quotes, which I don't want
+------------------------------------------------------------------------------------------------+
|value                                                                                           |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"}              |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"}             |
+------------------------------------------------------------------------------------------------+

someDF.select(to_json(struct(col("*"))).alias("value")) has the same problem.

What I want:

+------------------------------------------------------------------------------------------------+
|value                                                                                           |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                          |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                         |
+------------------------------------------------------------------------------------------------+

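Is there a way to do this?

Update: Although I use a simple DataFrame here, in reality I have hundreds of columns, so a manually defined schema does not work for me.

【Question Comments】:

    Tags: scala apache-spark apache-spark-sql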


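    【Solution 1】:

    The "word" column in someDF is of string type, so to_json treats it as an ordinary string and escapes its quotes. The key is to convert the "word" column to a struct type with from_json before calling to_json.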

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import spark.implicits._ // needed for toDF and $"..."; already in scope in spark-shell
    
    val someDF = Seq(
      (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
      (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
    ).toDF("number", "word")
    
    // Schema of the JSON held in the "word" column
    val schema = StructType(Seq(StructField("details", StructType(Seq(
      StructField("decision", StringType),
      StructField("source", StringType)
    )))))
    // Parse "word" into a struct first, then serialize the whole row
    someDF.select(
      to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")
    ).show(false)
    

    Result:

    +-----------------------------------------------------------------------+
    |value                                                                  |
    +-----------------------------------------------------------------------+
    |{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
    |{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
    +-----------------------------------------------------------------------+
    

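    【Comments】:

    • Thanks @memoryz, this works, but in reality I am not using someDF here; I have a DataFrame with hundreds of columns, most of which are JSON fields. Is there a way to combine all of these columns into one column without specifying the columns/schema by hand? Could the schema be generated from the DataFrame and then used?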
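    One possible way to address the follow-up in the comment above is to let Spark infer each JSON column's schema from the data and then reuse Solution 1's from_json/to_json approach. The sketch below is not part of the original answer. `combineAsJson` is a hypothetical helper name, and it assumes the listed columns hold top-level JSON objects, that a local SparkSession is acceptable, and that one extra scan per JSON column is affordable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.types.StructType

// Assumption: a local session, created here only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").appName("json-columns").getOrCreate()
import spark.implicits._

val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

// Hypothetical helper: parses each listed string column with a schema inferred
// from that column's own values, then serializes every column into one JSON string.
def combineAsJson(df: DataFrame, jsonColumns: Seq[String]): DataFrame = {
  val parsed = jsonColumns.foldLeft(df) { (acc, name) =>
    // spark.read.json on a Dataset[String] scans the values to infer a schema;
    // nulls are filtered out so that only real JSON text is inspected.
    val values = df.select(col(name)).where(col(name).isNotNull).as[String]
    val inferred: StructType = spark.read.json(values).schema
    // Replacing the column under its own name keeps the original column order.
    acc.withColumn(name, from_json(col(name), inferred))
  }
  parsed.select(to_json(struct(parsed.columns.map(col): _*)).alias("value"))
}

// Usage: only "word" holds JSON in this small example.
combineAsJson(someDF, Seq("word")).show(false)
```

    For the sample data this should yield the unescaped rows from the question's desired output. Note that inferred schemas list fields alphabetically, so key order inside nested objects may differ from the input text, and malformed rows may come back as null or with a corrupt-record field.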
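    【Solution 2】:

    You can retrieve the list of columns with the columns method on the DataFrame, then build the JSON string manually using a combination of the concat and concat_ws built-in functions: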

    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}
    
    val result = someDF.select(
      concat(
        lit("{"),
        concat_ws(
          ",",
          // One "name":value fragment per column; each value is inserted verbatim
          someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
        ),
        lit("}")
      ).as("value")
    )
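    Because the snippet above inserts every value verbatim, a plain (non-JSON) string column would end up unquoted in the output. The following is a rough sketch of one way to handle mixed columns, not something taken from the original answer. `combineRaw`, `mixedDF`, and the `label` column are hypothetical names; columns listed in `rawJsonColumns` are assumed to already contain valid JSON text, and all other columns are rendered through to_json so that strings get quoted and escaped.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, regexp_replace, struct, to_json, when}

// Assumption: a local session, created here only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").appName("json-concat").getOrCreate()
import spark.implicits._

// Hypothetical sample with one numeric, one plain-string and one JSON-string column.
val mixedDF = Seq(
  (8, "alpha", """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, "beta", """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "label", "word")

// Hypothetical helper: builds one "name":value fragment per column and joins them.
def combineRaw(df: DataFrame, rawJsonColumns: Set[String]): DataFrame = {
  val fragments: Seq[Column] = df.columns.toSeq.map { name =>
    val fragment =
      if (rawJsonColumns.contains(name))
        // JSON text is embedded as-is, so its quotes are not escaped.
        concat(lit("\"" + name + "\":"), col(name))
      else
        // to_json(struct(c)) gives {"name":value}; strip the outer braces.
        regexp_replace(to_json(struct(col(name))), "^\\{|\\}$", "")
    // A null value yields a null fragment, which concat_ws skips.
    when(col(name).isNotNull, fragment)
  }
  df.select(concat(lit("{"), concat_ws(",", fragments: _*), lit("}")).alias("value"))
}

combineRaw(mixedDF, Set("word")).show(false)
```

    Column names are not JSON-escaped in this sketch, so it only suits names without quotes or backslashes.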
    

    【Comments】:
