【Question title】: Convert a nested JSON to a DataFrame in PySpark
【Posted】: 2021-05-07 14:20:57
【Question】:

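I am trying to create a DataFrame from a JSON file that has nested date fields, and I want to concatenate the fields of each date struct into a single value: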

root
 |-- MODEL: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- START_Time: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- WEIGHT: string (nullable = true)
 |-- REGISTED: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- TOTAL: string (nullable = true)
 |-- SCHEDULED: struct (nullable = true)
 |    |-- day: long (nullable = true)
 |    |-- hour: long (nullable = true)
 |    |-- minute: long (nullable = true)
 |    |-- month: long (nullable = true)
 |    |-- second: long (nullable = true)
 |    |-- year: long (nullable = true)
 |-- PACKAGE: string (nullable = true)

The goal is to get something more like:

+---------+------------------+----------+-----------------+---------+-----------------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |SCHEDULED        |   
+---------+------------------+----------+-----------------+---------+-----------------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |yy-mm-dd-hh-mm-ss| 

where yy-mm-dd-hh-mm-ss is built from the year, month, day, hour, minute and second fields in the JSON, each of which has this shape:

|-- example: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)

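I tried the explode function, possibly not in the way it is meant to be used, but it did not work. Can anyone point me to a solution? Thanks.

【Question comments】:

Tags: json dataframe pyspark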


【Solution 1】:

You can do this with the following simple steps.

  1. Suppose the file data.json contains the following data.

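{"MODEL": "abc", "CODE": "CODE1", "START_Time": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "WEIGHT": "231", "REGISTED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "TOTAL": "1", "SCHEDULED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "PACKAGE": "CAR"}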

This data has the same structure as the schema you shared.

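  2. Read this JSON file in PySpark as follows.

    from pyspark.sql.functions import col, concat, lit
    
    # `spark` is the SparkSession (created automatically in the pyspark shell)
    df = spark.read.json('data.json')
    
  3. You can now read the nested values and overwrite each struct column, as shown below.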

    # Replace each struct column with its fields joined as year-month-day-hour-minute-second
    df.withColumn(
        'START_Time',
        concat(col('START_Time.year'), lit('-'), col('START_Time.month'), lit('-'),
               col('START_Time.day'), lit('-'), col('START_Time.hour'), lit('-'),
               col('START_Time.minute'), lit('-'), col('START_Time.second'))
    ).withColumn(
        'REGISTED',
        concat(col('REGISTED.year'), lit('-'), col('REGISTED.month'), lit('-'),
               col('REGISTED.day'), lit('-'), col('REGISTED.hour'), lit('-'),
               col('REGISTED.minute'), lit('-'), col('REGISTED.second'))
    ).withColumn(
        'SCHEDULED',
        concat(col('SCHEDULED.year'), lit('-'), col('SCHEDULED.month'), lit('-'),
               col('SCHEDULED.day'), lit('-'), col('SCHEDULED.hour'), lit('-'),
               col('SCHEDULED.minute'), lit('-'), col('SCHEDULED.second'))
    ).show()
    

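The output will be:

+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
| CODE|MODEL|PACKAGE|         REGISTED|        SCHEDULED|       START_Time|TOTAL|WEIGHT|
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
|CODE1|  abc|    CAR|21-08-05-08-30-30|21-08-05-08-30-30|21-08-05-08-30-30|    1|   231|
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+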
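
Since the three struct columns are handled identically, the repetition can be factored out. Below is a minimal sketch, not part of the original answer: it builds a Spark SQL expression string for each struct and passes the list to `DataFrame.selectExpr`. The names `TIME_FIELDS`, `joined_struct_expr` and `flatten_time_structs` are hypothetical helpers introduced here for illustration. Each field is cast to STRING explicitly, on the assumption that some structs (such as SCHEDULED in the question's schema) hold long values rather than strings.

```python
# Field order used in the answer: year-month-day-hour-minute-second.
TIME_FIELDS = ("year", "month", "day", "hour", "minute", "second")


def joined_struct_expr(column, fields=TIME_FIELDS, sep="-"):
    """Return a Spark SQL expression that joins a struct's fields into one string.

    Every field is cast to STRING so that string-typed and long-typed
    structs are treated the same way.
    """
    parts = [f"CAST(`{column}`.`{field}` AS STRING)" for field in fields]
    joined = f", '{sep}', ".join(parts)
    return f"concat({joined}) AS `{column}`"


def flatten_time_structs(df, struct_columns):
    """Replace each listed struct column with its joined string; keep the rest.

    `df` is expected to be a Spark DataFrame; only its `columns`
    attribute and `selectExpr` method are used.
    """
    exprs = [
        joined_struct_expr(name) if name in struct_columns else f"`{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*exprs)
```

With the `df` read from data.json, this would be called as `flatten_time_structs(df, ["START_Time", "REGISTED", "SCHEDULED"]).show()`. Like `concat` in the answer, the generated expression yields null for a row if any of its fields is null.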

【Comments】:
