【Posted】: 2020-01-29 13:55:45
【Question】:
I have a JSON file stored in S3 that I am trying to open/read/store/whatever in PySpark as a dictionary or struct. It looks like this:
{
"filename": "some_file.csv",
"md5": "md5 hash",
"client_id": "some uuid",
"mappings": {
"shipping_city": "City",
"shipping_country": "Country",
"shipping_zipcode": "Zip",
"shipping_address1": "Street Line 1",
"shipping_address2": "Street Line 2",
"shipping_state_abbreviation": "State"
}
}
I want to read it from S3 and store it as a dictionary or struct. When I read it like this:
inputJSON = "s3://bucket/file.json"
dfJSON = sqlContext.read.json(inputJSON, multiLine=True)
I get a DataFrame in which the keys of the mappings are dropped, like this:
+---------+-------------+----------------------------------------------------------+-------+
|client_id|filename     |mappings                                                  |md5    |
+---------+-------------+----------------------------------------------------------+-------+
|some uuid|some_file.csv|[City, Country, Zip, Street Line 1, Street Line 2, State] |md5hash|
+---------+-------------+----------------------------------------------------------+-------+
Is it possible to open the file and read it into a dictionary so that I can access the mappings, or something similar? For example:
jsonDict = inputFile
mappingDict = jsonDict['mappings']
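To make the desired access pattern concrete, here is a minimal sketch in plain Python. It assumes the file's text is already available as a string; how it gets fetched from S3 (for example with boto3 or sc.wholeTextFiles) is not shown and is left as an assumption. The names RAW_JSON and load_mappings are hypothetical and used only for illustration.

```python
import json

# Sample text copied from the question. In practice this string would be
# read from S3; that step is assumed and not shown here.
RAW_JSON = """
{
  "filename": "some_file.csv",
  "md5": "md5 hash",
  "client_id": "some uuid",
  "mappings": {
    "shipping_city": "City",
    "shipping_country": "Country",
    "shipping_zipcode": "Zip",
    "shipping_address1": "Street Line 1",
    "shipping_address2": "Street Line 2",
    "shipping_state_abbreviation": "State"
  }
}
"""


def load_mappings(raw_text):
    """Parse JSON text and return (whole document dict, nested 'mappings' dict)."""
    json_dict = json.loads(raw_text)
    return json_dict, json_dict["mappings"]


json_dict, mapping_dict = load_mappings(RAW_JSON)
print(mapping_dict["shipping_city"])  # prints "City"
```

Parsed this way, the keys of mappings (shipping_city, shipping_country, and so on) stay available as ordinary dictionary keys.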
【Comments】:
Tags: python json apache-spark pyspark