【Question title】: PySpark extract struct key into column
【Posted】: 2022-11-03 16:48:47
【Question description】:

I am trying to convert the following schema:

 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- one: double (nullable = true)
 |    |    |-- two: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- three: string (nullable = true)
 |    |    |-- four: boolean (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- one: double (nullable = true)
 |    |    |-- two: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- three: string (nullable = true)
 |    |    |-- four: boolean (nullable = true)

into this:

 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- struct_key: string (nullable = true)
 |    |    |-- one: double (nullable = true)
 |    |    |-- two: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- three: string (nullable = true)
 |    |    |-- four: boolean (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- struct_key: string (nullable = true)
 |    |    |-- one: double (nullable = true)
 |    |    |-- two: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- three: string (nullable = true)
 |    |    |-- four: boolean (nullable = true)

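I really just want to take each struct key, turn it into a string, and add it as a field. The dataset contains many structs like b/c, so some kind of wildcard is needed to convert them all. Using Spark 3.2.1.

The data is generated from JSON, so it is read like this: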

df = spark.read.json(json_file)

【Comments】:

  • selectExpr('array(a.*) as a') should work in your case

Tags: apache-spark pyspark


【Solution 1】:

Here is one approach: first add a struct_key field to each inner struct, then build an array from those structs.

# input
data_sdf = spark.createDataFrame([(((1, 2), (3, 4)), )], 
                                 'a struct<b: struct<foo: int, bar: int>, c: struct<foo: int, bar: int>>'
                                 )

# +----------------+
# |               a|
# +----------------+
# |{{1, 2}, {3, 4}}|
# +----------------+

# root
#  |-- a: struct (nullable = true)
#  |    |-- b: struct (nullable = true)
#  |    |    |-- foo: integer (nullable = true)
#  |    |    |-- bar: integer (nullable = true)
#  |    |-- c: struct (nullable = true)
#  |    |    |-- foo: integer (nullable = true)
#  |    |    |-- bar: integer (nullable = true)

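# processing
from pyspark.sql import functions as func

inner_cols = data_sdf.selectExpr('a.*').columns  # inner struct names: ['b', 'c']

(data_sdf
    .selectExpr('a.*')
    # prepend each struct's own name as a string field
    .selectExpr(*['struct("{0}" as struct_key, {0}.*) as {0}'.format(c) for c in inner_cols])
    # select (rather than withColumn) so only the array column remains, as in the output below
    .select(func.array(*inner_cols).alias('a'))
    .show(truncate=False))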

# +----------------------+
# |a                     |
# +----------------------+
# |[{b, 1, 2}, {c, 3, 4}]|
# +----------------------+

# root
#  |-- a: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- struct_key: string (nullable = false)
#  |    |    |-- foo: integer (nullable = true)
#  |    |    |-- bar: integer (nullable = true)
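
Since the question mentions many b/c-style structs, the same two steps could be wrapped in a small helper. This is only a sketch: the function names (keyed_struct_exprs, array_expr, struct_to_keyed_array) and the key_col/parent parameters are made up for illustration and are not part of PySpark. It relies only on DataFrame.selectExpr and DataFrame.columns, and it assumes every inner struct has the same (or a compatible) schema, since array() needs a common element type.

```python
def keyed_struct_exprs(field_names, key_col="struct_key"):
    """Build one SQL expression per inner struct.

    Each expression prepends the struct's own name as a string field,
    e.g. struct('b' as struct_key, `b`.*) as `b`.
    """
    return [
        "struct('{0}' as {1}, `{0}`.*) as `{0}`".format(name, key_col)
        for name in field_names
    ]


def array_expr(field_names, out_col="a"):
    """Build the SQL expression that packs the keyed structs into one array."""
    cols = ", ".join("`{0}`".format(name) for name in field_names)
    return "array({0}) as {1}".format(cols, out_col)


def struct_to_keyed_array(df, parent="a", key_col="struct_key"):
    """Turn struct column `parent` into an array of structs carrying their key.

    `df` is expected to be a Spark DataFrame; the inner struct names are
    discovered at runtime, so no field has to be listed by hand.
    """
    names = df.selectExpr("{0}.*".format(parent)).columns
    return (
        df.selectExpr("{0}.*".format(parent))
          .selectExpr(*keyed_struct_exprs(names, key_col))
          .selectExpr(array_expr(names, parent))
    )


# The generated expressions for the example above:
print(keyed_struct_exprs(["b", "c"]))
print(array_expr(["b", "c"]))
```

With the example DataFrame this would be called as struct_to_keyed_array(data_sdf).show(truncate=False). Note that other top-level columns are dropped by this sketch, so they would need to be carried through separately if the DataFrame has any.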
