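【Posted】: 2021-08-27 09:06:43
【Question】:
How can I flatten nested arrays of different shapes in PySpark? The question "How to flatten nested arrays by merging values in spark" already has an answer for arrays of the same shape. For arrays with different shapes, I get the error described below.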
Data structure:
- Static names: id, date, val, num (can be hard-coded)
- Dynamic names: name_1_a, name_10000_xvz (cannot be hard-coded, because the DataFrame has up to 10,000 columns/arrays)
Input df:
root
 |-- id: long (nullable = true)
 |-- name_10000_xvz: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- num: long (nullable = true) **NOTE: additional `num` field**
 |    |    |-- val: long (nullable = true)
 |-- name_1_a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- val: long (nullable = true)
df.show(truncate=False)
+---+---------------------------------------------------------------------+---------------------------------+
|id |name_10000_xvz |name_1_a |
+---+---------------------------------------------------------------------+---------------------------------+
|1 |[{2000, null, 30}, {2001, null, 31}, {2002, null, 32}, {2003, 1, 33}]|[{2001, 1}, {2002, 2}, {2003, 3}]|
+---+---------------------------------------------------------------------+---------------------------------+
Desired output df:
+---+--------------+----+---+---+
| id|          name|date|val|num|
+---+--------------+----+---+---+
|  1|      name_1_a|2001|  1|   |
|  1|      name_1_a|2002|  2|   |
|  1|      name_1_a|2003|  3|   |
|  1|name_10000_xvz|2000| 30|   |
|  1|name_10000_xvz|2001| 31|   |
|  1|name_10000_xvz|2002| 32|   |
|  1|name_10000_xvz|2003| 33|  1|
+---+--------------+----+---+---+
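To pin down the target semantics independently of Spark, the desired output can be sketched in plain Python on the sample record. This is only a reference for what the result should contain, not a PySpark solution; `flatten_record` is a hypothetical helper name, and a field that is absent from an element is assumed to map to `None`.

```python
import json

# The sample record from the question, as plain JSON.
record = json.loads(
    '{"id":1,'
    '"name_1_a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],'
    '"name_10000_xvz":[{"date":2000,"val":30},{"date":2001,"val":31},'
    '{"date":2002,"val":32},{"date":2003,"val":33,"num":1}]}'
)


def flatten_record(rec, prefix="name_"):
    """Yield (id, name, date, val, num) tuples; absent fields become None."""
    for key, elements in rec.items():
        if not key.startswith(prefix):
            continue  # skip static top-level fields such as "id"
        for el in elements:
            # dict.get returns None when the element has no such field
            yield (rec["id"], key, el.get("date"), el.get("val"), el.get("num"))


rows = list(flatten_record(record))
for row in rows:
    print(row)
```

Each array element becomes one row, and only the last element of name_10000_xvz carries a non-null num.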
Code to reproduce:
Note: when I add el.num to TRANSFORM({name}, el -> STRUCT("{name}" AS name, el.date, el.val, el.num)), I get the error shown below.
import pyspark.sql.functions as f

df = spark.read.json(
    sc.parallelize(
        [
            """{"id":1,"name_1_a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"name_10000_xvz":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33, "num":1}]}"""
        ]
    )
).select("id", "name_1_a", "name_10000_xvz")

# Collect the dynamically named array columns.
names = [column for column in df.columns if column.startswith("name_")]

# Build one TRANSFORM expression per array column; el.num is what triggers the error.
expressions = []
for name in names:
    expressions.append(
        f.expr(
            'TRANSFORM({name}, el -> STRUCT("{name}" AS name, el.date, el.val, el.num))'.format(
                name=name
            )
        )
    )

flatten_df = df.withColumn("flatten", f.flatten(f.array(*expressions))).selectExpr(
    "id", "inline(flatten)"
)
Output:
AnalysisException: No such struct field num in date, val
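The exception arises because the element struct of name_1_a has no num field at all, so el.num cannot be resolved for that column. One possible workaround, sketched below, is to generate a different expression per column and substitute a typed NULL for every field the column lacks. The helper `build_transform_expr` and the `fields_by_column` mapping are hypothetical names introduced for illustration, and the sketch assumes every static field is a long (BIGINT), as in the schema shown above. It only builds the SQL strings and has not been run against a Spark session here.

```python
# Hypothetical helper: build one TRANSFORM expression per array column,
# substituting a typed NULL for struct fields that the column lacks.
WANTED_FIELDS = ["date", "val", "num"]  # the static names from the question


def build_transform_expr(name, present_fields, wanted_fields=WANTED_FIELDS):
    parts = ["'{0}' AS name".format(name)]
    for field in wanted_fields:
        if field in present_fields:
            parts.append("el.{0} AS {0}".format(field))
        else:
            # Assumption: all static fields are BIGINT, so the structs line up.
            parts.append("CAST(NULL AS BIGINT) AS {0}".format(field))
    return "TRANSFORM({0}, el -> STRUCT({1}))".format(name, ", ".join(parts))


# Field names per column, typed by hand here. With a real DataFrame they could
# be read from df.schema[name].dataType.elementType.fieldNames() instead.
fields_by_column = {
    "name_1_a": ["date", "val"],
    "name_10000_xvz": ["date", "num", "val"],
}

expressions = [
    build_transform_expr(name, fields) for name, fields in fields_by_column.items()
]
for expression in expressions:
    print(expression)
```

With a Spark session available, each string could be wrapped in f.expr(...) and passed to the same f.flatten(f.array(...)) and inline(flatten) steps used in the code above. Because every generated struct then has the same four fields in the same order, the arrays should be mergeable.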
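【Comments】:
- Is it expected that "name_1_a" does not contain the "num" field? That is the reason for your exception. The question is: when a field value is "missing", is the value set to NULL, or is the field absent entirely?
- Yes, the num field is absent entirely, and that is my problem.
- I can solve this in pandas, as shown in this example stackoverflow.com/questions/68941232/…, but not in PySpark.
- Is your input JSON? Can you show a bit more of it? More rows and more columns in the raw format?
Tags: python apache-spark pyspark apache-spark-sql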