【Question title】: Convert an array of arrays to an array of structs in PySpark
【Posted】: 2021-09-10 03:43:44
【Question】:

I have a DataFrame like the following:

id  contact_persons
-----------------------
1   [[abc, abc@xyz.com, 896676, manager],[pqr, pqr@xyz.com, 89809043, director],[stu, stu@xyz.com, 09909343, programmer]]    

Its schema looks like this:

root
 |-- id: string (nullable = true)
 |-- contact_persons: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

I need to convert this DataFrame so that it has the schema below:

 root
 |-- id: string (nullable = true)
 |-- contact_persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- emails: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- phone: string (nullable = true)
 |    |    |-- roles: string (nullable = true)

I know PySpark has a struct function, but I don't know how to use it here because the outer array has a dynamic size.

【Comments】:

    Tags: pyspark apache-spark-sql


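    【Answer 1】:

    You can use a TRANSFORM expression (a Spark SQL higher-order function) to convert each inner array into a struct, regardless of how many inner arrays there are: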

    import pyspark.sql.functions as f
    
    df = spark.createDataFrame([
      [1, [['abc', 'abc@xyz.com', '896676', 'manager'],
           ['pqr', 'pqr@xyz.com', '89809043', 'director'],
           ['stu', 'stu@xyz.com', '09909343', 'programmer']]]
    ], schema='id string, contact_persons array<array<string>>')
    
    expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
    output_df = df.withColumn('contact_persons', f.expr(expression))
    
    # output_df.printSchema()
    # root
    #  |-- id: string (nullable = true)
    #  |-- contact_persons: array (nullable = true)
    #  |    |-- element: struct (containsNull = false)
    #  |    |    |-- name: string (nullable = true)
    #  |    |    |-- emails: string (nullable = true)
    #  |    |    |-- phone: string (nullable = true)
    #  |    |    |-- roles: string (nullable = true)
    
    output_df.show(truncate=False)
    +---+-----------------------------------------------------------------------------------------------------------------------+
    |id |contact_persons                                                                                                        |
    +---+-----------------------------------------------------------------------------------------------------------------------+
    |1  |[{abc, abc@xyz.com, 896676, manager}, {pqr, pqr@xyz.com, 89809043, director}, {stu, stu@xyz.com, 09909343, programmer}]|
    +---+-----------------------------------------------------------------------------------------------------------------------+
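
If the field list is longer or changes over time, the expression string above can be generated rather than typed by hand. The following is a minimal sketch in plain Python; `build_transform_expr` and its parameters are hypothetical names introduced here for illustration, not part of PySpark. It assumes every inner array lists its values in the same order as `field_names`, and that the names are trusted, plain identifiers, since they are interpolated into the SQL text as-is.

```python
def build_transform_expr(column, field_names, var="el"):
    """Build a Spark SQL TRANSFORM expression that maps each inner array
    to a struct, naming the positional elements after `field_names`.

    Hypothetical helper (not a PySpark API). Names are interpolated
    without escaping, so pass only trusted, plain identifiers.
    """
    # One "el[i] AS name" clause per field, in positional order.
    fields = ", ".join(
        f"{var}[{i}] AS {name}" for i, name in enumerate(field_names)
    )
    return f"TRANSFORM({column}, {var} -> STRUCT({fields}))"


expression = build_transform_expr(
    "contact_persons", ["name", "emails", "phone", "roles"]
)
print(expression)
```

The generated string is the same expression used in the answer, so it can be passed to `f.expr` in the same way: `df.withColumn('contact_persons', f.expr(expression))`.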
    
