【Question title】: Convert a PySpark DataFrame into a nested JSON structure
【Posted】: 2021-04-13 15:29:35
【Question description】:

I am trying to convert the DataFrame below into nested JSON (string).

Input:

+---+---+-------+------+
| id|age| name  |number|
+---+---+-------+------+
|  1| 12|  smith|  uber|
|  2| 13|    jon| lunch|
|  3| 15|jocelyn|rental|
|  3| 15|  megan|   sds|
+---+---+-------+------+

Output:

+---+---+-----------------------------------------------------------------------------+
|id |age|values                                                                       |
+---+---+-----------------------------------------------------------------------------+
|1  |12 |[{"number": "uber", "name": "smith"}]                                        |
|2  |13 |[{"number": "lunch", "name": "jon"}]                                         |
|3  |15 |[{"number": "rental", "name": "jocelyn"}, {"number": "sds", "name": "megan"}]|
+---+---+-----------------------------------------------------------------------------+

My code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Sample rows: (id, age, name, number)
data = [(1, 12, "smith", "uber"),
        (2, 13, "jon", "lunch"),
        (3, 15, "jocelyn", "rental"),
        (3, 15, "megan", "sds")]

# Create a schema for the dataframe; field order must match the tuple order
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('name', StringType(), True),
    StructField('number', StringType(), True)])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)

I tried using collect_list and collect_set, but could not get the desired output.

【Comments】:

    Tags: json apache-spark pyspark apache-spark-sql


    【Solution 1】:

    You can use collect_list together with to_json to collect an array of JSON strings for each group:

    import pyspark.sql.functions as F
    
    df2 = df.groupBy(
        'id', 'age'
    ).agg(
        F.collect_list(
            F.to_json(
                F.struct('number', 'name')
            )
        ).alias('values')
    ).orderBy(
        'id', 'age'
    )
    
    df2.show(truncate=False)
    +---+---+-----------------------------------------------------------------------+
    |id |age|values                                                                 |
    +---+---+-----------------------------------------------------------------------+
    |1  |12 |[{"number":"uber","name":"smith"}]                                     |
    |2  |13 |[{"number":"lunch","name":"jon"}]                                      |
    |3  |15 |[{"number":"rental","name":"jocelyn"}, {"number":"sds","name":"megan"}]|
    +---+---+-----------------------------------------------------------------------+
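
To make it concrete what the groupBy / collect_list / to_json combination computes, here is a minimal plain-Python sketch with no Spark dependency. It only illustrates the semantics on the question's sample rows and says nothing about how Spark executes the aggregation. The names `rows`, `group_to_json` and `result` are hypothetical and chosen for this example.

```python
import json
from collections import defaultdict

# Sample rows from the question: (id, age, name, number)
rows = [
    (1, 12, "smith", "uber"),
    (2, 13, "jon", "lunch"),
    (3, 15, "jocelyn", "rental"),
    (3, 15, "megan", "sds"),
]


def group_to_json(rows):
    """Group rows by (id, age) and collect one JSON string per row in each group."""
    groups = defaultdict(list)
    for id_, age, name, number in rows:
        # Mirrors to_json(struct('number', 'name')): same key order, compact separators
        groups[(id_, age)].append(
            json.dumps({"number": number, "name": name}, separators=(",", ":"))
        )
    # Mirrors orderBy('id', 'age')
    return [(id_, age, values) for (id_, age), values in sorted(groups.items())]


result = group_to_json(rows)
for row in result:
    print(row)
```

Each group ends up with a list of JSON strings, which is what the `values` column above holds. If a single JSON string per group is needed instead, wrapping the aggregation as `F.to_json(F.collect_list(F.struct('number', 'name')))` should work in recent Spark versions, since `to_json` also accepts an array of structs. Note that Spark does not guarantee the order of elements produced by `collect_list`, whereas this sketch keeps input order.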
    

    【Discussion】:
