【Question title】: Convert or flatten a JSON having nested data with struct/array to columns
【Posted】: 2021-10-11 14:59:24
【Question description】:

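The following JSON has a nested attribute named "result". It holds an array of arrays, and each inner array is a list of key/value/type objects describing one record.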

{
"result": [
    [
        {
            "key": "projects.name",
            "value": "Project 1",
            "type": "TEXT"
        },
        {
            "key": "projects.status",
            "value": "Archived",
            "type": "ENUM"
        },
        {
            "key": "user_tasks.start_date",
            "value": "2021-07-08 11:59:34",
            "type": "DATETIME"
        },
        {
            "key": "user_tasks.name",
            "value": "Section 1",
            "type": "TEXT"
        },
        {
            "key": "track_user.duration",
            "value": "00:40:02",
            "type": "INT"
        },
        {
            "key": "project_sections.question_count",
            "value": "24",
            "type": "SMALLINT"
        },
        {
            "key": "project_sections.assigned_to_users",
            "value": "test1@abc.com",
            "type": "JSON"
        }
    ],
    [
        {
            "key": "projects.name",
            "value": "Project 2",
            "type": "TEXT"
        },
        {
            "key": "projects.status",
            "value": "Archived",
            "type": "ENUM"
        },
        {
            "key": "user_tasks.start_date",
            "value": "2021-07-08 11:59:34",
            "type": "DATETIME"
        },
        {
            "key": "user_tasks.name",
            "value": "Section 2",
            "type": "TEXT"
        },
        {
            "key": "track_user.duration",
            "value": "00:40:02",
            "type": "INT"
        },
        {
            "key": "project_sections.question_count",
            "value": "23",
            "type": "SMALLINT"
        },
        {
            "key": "project_sections.assigned_to_users",
            "value": "test1@abc.com",
            "type": "JSON"
        }
    ],
    [
        {
            "key": "projects.name",
            "value": "Project 3",
            "type": "TEXT"
        },
        {
            "key": "projects.status",
            "value": "Archived",
            "type": "ENUM"
        },
        {
            "key": "user_tasks.start_date",
            "value": "2021-07-20 21:30:00",
            "type": "DATETIME"
        },
        {
            "key": "user_tasks.name",
            "value": "Internal Due Date",
            "type": "TEXT"
        },
        {
            "key": "track_user.duration",
            "value": "21:22:49",
            "type": "INT"
        },
        {
            "key": "project_sections.question_count",
            "value": "0",
            "type": "SMALLINT"
        },
        {
            "key": "project_sections.assigned_to_users",
            "value": "test1@abc.com",
            "type": "JSON"
        }
    ]
]
}

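I want to expand this JSON with Spark SQL / Scala so that every key in the nested arrays becomes a column, with one row per inner array, as in the "expected output" section below:

I tried the explode and pivot functions, but could not get them to work.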

【Comments on the question】:

  • If my answer solved your problem, you can mark it as "accepted".

Tags: json scala apache-spark apache-spark-sql


【Solution 1】:

I tried your problem; here is a solution.

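import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// json_ip is assumed to be a String holding the JSON document from the question.
val df = spark.read.json(spark.createDataset(json_ip :: Nil))

// display() is the Databricks notebook helper; in spark-shell, call .show(false) on the DataFrame instead.
display(
  df.select(explode($"result"))              // one row per inner array, in the default column "col"
    // Number each inner array. A window with no partitionBy moves all rows to a single partition,
    // and the numbering follows the array contents, not the original position.
    .withColumn("r_num", row_number().over(Window.orderBy($"col")))
    .withColumn("res_exp", explode($"col"))  // one row per {key, value, type} struct
    .drop("col")
    .withColumn("all_row_values", $"res_exp.value")
    .withColumn("columns", $"res_exp.key")
    .drop("res_exp")
    .groupBy($"r_num")
    .pivot($"columns")                       // each distinct key becomes a column
    .agg(first($"all_row_values"))
    .drop("r_num")
)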

Output:
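The result has one row per inner array and one column per distinct key. The same reshaping can be mirrored on plain Scala collections, which may help when checking what the Spark pipeline should return. This is only a sketch under assumptions: `Entry` and `PivotSketch` are hypothetical names, and the data is trimmed to two of the question's inner arrays with three keys each.

```scala
// Hypothetical model of one {"key", "value", "type"} object from the JSON.
final case class Entry(key: String, value: String, `type`: String)

object PivotSketch {
  // Two of the three inner arrays from the question, trimmed to three keys each.
  val result: Seq[Seq[Entry]] = Seq(
    Seq(
      Entry("projects.name", "Project 1", "TEXT"),
      Entry("projects.status", "Archived", "ENUM"),
      Entry("project_sections.question_count", "24", "SMALLINT")
    ),
    Seq(
      Entry("projects.name", "Project 2", "TEXT"),
      Entry("projects.status", "Archived", "ENUM"),
      Entry("project_sections.question_count", "23", "SMALLINT")
    )
  )

  // Column headers: the distinct keys, sorted. Spark's pivot likewise sorts
  // the pivot values when none are supplied explicitly.
  val columns: Seq[String] = result.flatten.map(_.key).distinct.sorted

  // One row per inner array; a key missing from a row becomes None
  // (the Spark pipeline would produce null there).
  val rows: Seq[Seq[Option[String]]] = result.map { entries =>
    val byKey = entries.map(e => e.key -> e.value).toMap
    columns.map(byKey.get)
  }

  def main(args: Array[String]): Unit = {
    println(columns.mkString(" | "))
    rows.foreach(r => println(r.map(_.getOrElse("null")).mkString(" | ")))
  }
}
```

Every value stays a string here, as it does in the Spark answer; the "type" field is carried along but not used for casting.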

【Comments】:

  • Thank you very much, Pradeep. This is exactly what I needed.