【Question title】: Combining JSON and normal columns with PySpark
【Posted】: 2021-07-09 00:44:52
【Question description】:

I have a flat file that mixes normal columns with a JSON column:

2020-08-05 00:00:04,489|{"Colour":"Blue", "Reason":"Sky","number":"1"}
2020-10-05 00:00:04,489|{"Colour":"Yellow", "Reason":"Flower","number":"2"}

I want to flatten it with PySpark:

|Timestamp|Colour|Reason|
|--------|--------|--------|
|2020-08-05 00:00:04,489|Blue| Sky|
|2020-10-05 00:00:04,489|Yellow| Flower|

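So far I have only figured out how to convert the JSON using spark.read.json and a map, but how do I combine it with regular columns such as the timestamp?

【Question discussion】:

    Tags: json apache-spark pyspark apache-spark-sql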


    【Solution 1】:

    Let's recreate your data:

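    from pyspark.sql.types import StructType, StructField, StringType

    # Assumes an active SparkSession is available as `spark`.
    data2 = [("2020-08-05 00:00:04,489", '{"Colour":"Blue", "Reason":"Sky","number":"1"}'),
             ("2020-10-05 00:00:04,489", '{"Colour":"Yellow", "Reason":"Flower","number":"2"}')]

    # Both columns are plain strings; "y" holds the raw JSON text.
    schema = StructType([
        StructField("x", StringType(), True),
        StructField("y", StringType(), True)])
    df = spark.createDataFrame(data=data2, schema=schema)
    df.printSchema()
    df.show(truncate=False)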
    

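    According to the documentation, we can use schema_of_json to parse a JSON string and infer its schema in DDL format. Here the schema is inferred from the first row only, so this assumes every row carries the same keys: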

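    import pyspark.sql.functions as F
    # Infer the schema from the JSON string in the first row.
    schema = df.select(F.schema_of_json(df.select("y").first()[0])).first()[0]

    # Parse the JSON column into a struct, then expand its fields into columns.
    df.withColumn("y", F.from_json("y", schema)).selectExpr("x", "y.*").show(truncate=False)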
    
    +-----------------------+------+------+------+
    |x                      |Colour|Reason|number|
    +-----------------------+------+------+------+
    |2020-08-05 00:00:04,489|Blue  |Sky   |1     |
    |2020-10-05 00:00:04,489|Yellow|Flower|2     |
    +-----------------------+------+------+------+
    

    【Discussion】:

      【Solution 2】:

      You can use get_json_object. Assuming the original columns are called col1 and col2, you can do:

      import pyspark.sql.functions as F

      df2 = df.select(
          F.col('col1').alias('Timestamp'), 
          F.get_json_object('col2', '$.Colour').alias('Colour'), 
          F.get_json_object('col2', '$.Reason').alias('Reason')
      )
      
      df2.show(truncate=False)
      +-----------------------+------+------+
      |Timestamp              |Colour|Reason|
      +-----------------------+------+------+
      |2020-08-05 00:00:04,489|Blue  |Sky   |
      |2020-10-05 00:00:04,489|Yellow|Flower|
      +-----------------------+------+------+
      

      Or you can use from_json:

      import pyspark.sql.functions as F
      
      df2 = df.select(
          F.col('col1').alias('Timestamp'), 
          F.from_json('col2', 'Colour string, Reason string').alias('col2')
      ).select('Timestamp', 'col2.*')
      
      df2.show(truncate=False)
      +-----------------------+------+------+
      |Timestamp              |Colour|Reason|
      +-----------------------+------+------+
      |2020-08-05 00:00:04,489|Blue  |Sky   |
      |2020-10-05 00:00:04,489|Yellow|Flower|
      +-----------------------+------+------+
      

      【Discussion】:
