Extract DataFrame from nested, tagged array in Spark (answers)

[Question title]: Extract DataFrame from nested, tagged array in Spark
[Posted]: 2021-05-07 19:41:40
[Question]:

I am using Spark to read JSON documents in the following format:

{
    "items": [
       {"type": "foo", "value": 1},
       {"type": "bar", "value": 2}
    ]
}

That is, each array item is tagged by its "type" field.

Given that I know the vocabulary of "type" (i.e. {foo, bar}), how can I get a DataFrame like this:

root
 |-- bar: integer (nullable = true)
 |-- foo: integer (nullable = true)

[Comments]:

Tags: json apache-spark pyspark apache-spark-sql

[Solution 1]:

You can build the schema manually, addressing the array items by position:

>>> df2 = df.selectExpr("array(struct(items[0].value as foo, items[1].value as bar)) as items")
>>> df2.printSchema()
root
 |-- items: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- foo: long (nullable = true)
 |    |    |-- bar: long (nullable = true)

Or, more generally, use filter to select each item by its tag rather than by position:

>>> df2 = df.selectExpr("array(struct(filter(items, x -> x.type = 'foo')[0].value as foo, filter(items, x -> x.type = 'bar')[0].value as bar)) as items")
>>> df2.printSchema()
root
 |-- items: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- foo: long (nullable = true)
 |    |    |-- bar: long (nullable = true)
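
Since the vocabulary is known up front, the filter expressions above can also be generated instead of written by hand. The following is only a sketch: the helper names `tagged_value_expr` and `tagged_value_exprs` are hypothetical, the column and field names (`items`, `type`, `value`) are taken from the question, and it assumes the tags contain no quotes or backticks. It emits one flat column per tag, which is closer to the schema the question asks for than the array-of-struct wrapper used above.

```python
def tagged_value_expr(tag, array_col="items", tag_field="type", value_field="value"):
    """Build a Spark SQL expression for the value of the first item tagged `tag`.

    Hypothetical helper: it only formats a string and assumes `tag`
    contains no single quotes or backticks.
    """
    return (
        f"filter({array_col}, x -> x.{tag_field} = '{tag}')[0].{value_field} "
        f"AS `{tag}`"
    )


def tagged_value_exprs(vocabulary, **kwargs):
    """Return one expression per tag, sorted so the column order is stable."""
    return [tagged_value_expr(tag, **kwargs) for tag in sorted(vocabulary)]


# Intended use with the DataFrame `df` from the question:
#     df2 = df.selectExpr(*tagged_value_exprs({"foo", "bar"}))
for expression in tagged_value_exprs({"foo", "bar"}):
    print(expression)
```

The generated strings rely on the SQL higher-order function `filter`, so this assumes a Spark version that supports it in SQL expressions (2.4 or later).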

Or use pivot:

>>> from pyspark.sql.functions import expr, first
>>> df2 = df.select(expr("inline_outer(items)")).groupBy().pivot("type").agg(
...     first("value")
... )
>>> df2.printSchema()
root
 |-- bar: integer (nullable = true)
 |-- foo: integer (nullable = true)

[Discussion]:

If there is no matching entry, I think I can use element_at instead of [0] to avoid an error.
I adjusted the answer with a third option, as shown in stackoverflow.com/a/63248480/647151.