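【Posted】: 2019-07-15 15:17:13
【Question】:
I'm very new to Spark. I'm trying to parse a JSON file containing data that I want to aggregate, but I can't navigate its contents. I searched for other solutions but couldn't find one that works for my case.
This is the schema of the DataFrame imported from the JSON: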
root
|-- UrbanDataset: struct (nullable = true)
| |-- context: struct (nullable = true)
| | |-- coordinates: struct (nullable = true)
| | | |-- format: string (nullable = true)
| | | |-- height: long (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- language: string (nullable = true)
| | |-- producer: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| | |-- timestamp: string (nullable = true)
| |-- specification: struct (nullable = true)
| | |-- id: struct (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- propertyDefinition: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- codeList: string (nullable = true)
| | | | | |-- dataType: string (nullable = true)
| | | | | |-- propertyDescription: string (nullable = true)
| | | | | |-- propertyName: string (nullable = true)
| | | | | |-- subProperties: struct (nullable = true)
| | | | | | |-- propertyName: array (nullable = true)
| | | | | | | |-- element: string (containsNull = true)
| | | | | |-- unitOfMeasure: string (nullable = true)
| | |-- uri: string (nullable = true)
| | |-- version: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- line: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- coordinates: struct (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- height: double (nullable = true)
| | | | | |-- latitude: double (nullable = true)
| | | | | |-- longitude: double (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- period: struct (nullable = true)
| | | | | |-- end_ts: string (nullable = true)
| | | | | |-- start_ts: string (nullable = true)
| | | | |-- property: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- val: string (nullable = true)
A subset of the full JSON is attached here.
My goal is to retrieve the values struct from this schema and manipulate/aggregate every val located at line.element.property.element.val.
I also tried exploding it to get every field as a "CSV-style" column, but I got this error:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context,UrbanDataset.specification,UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type
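import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format('json').load('data1.json')
# This is the call that raises the AnalysisException: array() needs inputs of a single type
df.select(psf.explode(psf.array("UrbanDataset.*"))).show()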
Thanks
【Comments】:
-
Could you provide a small excerpt of the dataset?
-
Sure, I just added a subset of the total rows (there should be 96 rows, one every 15 minutes).
-
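Well, I haven't loaded JSON before. If I could see a picture of the DataFrame, I could help you break it down. I can't load this JSON file. Maybe I'm doing something wrong, which is why I asked to see the DataFrame.
-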
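I don't know, the JSON has been validated. But how would I show you a picture of the DataFrame? Isn't that what the posted schema is? What exactly do you mean? Thanks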