Pyspark accessing and exploding nested items of a json: answers

【Question title】: Pyspark accessing and exploding nested items of a json
【Posted】: 2019-07-15 15:17:13
【Question description】:

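I am very new to Spark and I am trying to parse a json file that contains data to be aggregated, but I can't manage to navigate its content. I searched for other solutions but could not find anything that worked in my case.

This is the schema of the imported json DataFrame: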

root
  |-- UrbanDataset: struct (nullable = true)
  |    |-- context: struct (nullable = true)
  |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |-- format: string (nullable = true)
  |    |    |    |-- height: long (nullable = true)
  |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |-- longitude: double (nullable = true)
  |    |    |-- language: string (nullable = true)
  |    |    |-- producer: struct (nullable = true)
  |    |    |    |-- id: string (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |-- timeZone: string (nullable = true)
  |    |    |-- timestamp: string (nullable = true)
  |    |-- specification: struct (nullable = true)
  |    |    |-- id: struct (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |    |-- value: string (nullable = true)
  |    |    |-- name: string (nullable = true)
  |    |    |-- properties: struct (nullable = true)
  |    |    |    |-- propertyDefinition: array (nullable = true)
  |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |-- codeList: string (nullable = true)
  |    |    |    |    |    |-- dataType: string (nullable = true)
  |    |    |    |    |    |-- propertyDescription: string (nullable = true)
  |    |    |    |    |    |-- propertyName: string (nullable = true)
  |    |    |    |    |    |-- subProperties: struct (nullable = true)
  |    |    |    |    |    |    |-- propertyName: array (nullable = true)
  |    |    |    |    |    |    |    |-- element: string (containsNull = true)
  |    |    |    |    |    |-- unitOfMeasure: string (nullable = true)
  |    |    |-- uri: string (nullable = true)
  |    |    |-- version: string (nullable = true)
  |    |-- values: struct (nullable = true)
  |    |    |-- line: array (nullable = true)
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |    |    |-- format: string (nullable = true)
  |    |    |    |    |    |-- height: double (nullable = true)
  |    |    |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |    |    |-- longitude: double (nullable = true)
  |    |    |    |    |-- id: long (nullable = true)
  |    |    |    |    |-- period: struct (nullable = true)
  |    |    |    |    |    |-- end_ts: string (nullable = true)
  |    |    |    |    |    |-- start_ts: string (nullable = true)
  |    |    |    |    |-- property: array (nullable = true)
  |    |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |    |-- name: string (nullable = true)
  |    |    |    |    |    |    |-- val: string (nullable = true)

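A subset of the whole json is attached here

My goal is to retrieve the values struct from this schema and manipulate/aggregate all the val fields located in line.element.property.element.val

I also tried to explode it to get every field in a "csv style" column, but I got the error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type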

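import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format('json').load('data1.json')
# This call raises the AnalysisException quoted above
df.select(psf.explode(psf.array("UrbanDataset.*"))).show()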

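Thanks

【Comments】:

Can you provide a small excerpt of the dataset?
Sure, I just added a subset of the total lines (there should be 96, one every 15 minutes).
Well, I haven't loaded a json before. If I can see a picture of the DataFrame, then I can help you explode it. I am not able to load this json file. Maybe I am not doing it right, which is why I asked to see the DataFrame.
idk, the json has been validated. But how can I show you a picture of the DataFrame? Isn't that the posted schema? What exactly do you mean? Tnx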

Tags: python json pyspark

【Solution 1】:

You cannot access a nested array directly; you need to use explode first. It creates one row for each element of the array.

from pyspark.sql import functions as F

# "Values" stands for the array column to explode; a nested array needs its full dotted path
df = df.withColumn("Value", F.explode("Values"))

【Comments】:

So, in this specific case, how do I access the line field that contains the array to explode? ty