How to access a nested attribute without passing the parent attribute in PySpark JSON

【Question title】: How to access nested attribute without passing parent attribute in pyspark json
【Posted】: 2021-07-04 07:49:26
【Question】:

I'm trying to access the inner attributes of the following JSON using PySpark:

[
 {
    "432": [
        {
            "atttr1": null,
            "atttr2": "7DG6",
            "id":432,
            "score": 100
        }
    ]
},
 {
    "238": [
        {
            "atttr1": null,
            "atttr2": "7SS8",
            "id":432,
            "score": 100
        }
    ]
}
]

For the output, I'm looking for something like the following, in CSV format:

atttr1,atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100

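I know I can get these details as shown below, but I don't want to pass 432 or 238 in the lambda expression, because in the larger JSON these keys will vary. I want to iterate over all available values.

print(inputDF.rdd.map(lambda x:(x['432'])).first())
print(inputDF.rdd.map(lambda x:(x['238'])).first())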

I also tried registering a temp table named "test", but it fails with an error saying that element._id doesn't exist.

inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")

Any help would be appreciated. I am using Spark 2.4.

【Comments】:

What is peopleDF? Can you show the output of peopleDF.show()?
That is the input df. I've renamed it. The output of .show() is:

+--------------------+--------------------+
|                 238|                 432|
+--------------------+--------------------+
|                null|[[, 7DG6, 432, 100]]|
|[[, 7SS8, 432, 100]]|                null|
+--------------------+--------------------+

Tags: python json apache-spark pyspark

【Solution 1】:

Without using any PySpark functionality, you can do it like this:

import json

data = json.loads(json_str)  # or whatever way you're getting the data

columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers

for item in data:
    for obj in list(item.values())[0]:  # since each list has only one object
        print(','.join(str(obj[col]) for col in columns))

Output:

atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100

Or:

for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))
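
Both loops above assume that each top-level key maps to a list holding exactly one object. If that may not hold in the larger JSON, a minimal sketch like the one below walks every key and every object without naming any key. The helper name `flatten_rows` and the inlined sample string are illustrative assumptions, not part of the original answer.

```python
import json

# Sample input copied from the question and inlined for illustration;
# in practice the string would come from a file or another source.
json_str = """
[
  {"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
  {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}
]
"""

COLUMNS = ["atttr1", "atttr2", "id", "score"]


def flatten_rows(data, columns=COLUMNS):
    """Yield one tuple per nested object, whatever the parent key is called."""
    for item in data:                  # each top-level dict
        for objects in item.values():  # every parent key, none hard-coded
            for obj in objects:        # every object in that key's list
                # .get() returns None for a missing column instead of raising
                yield tuple(obj.get(col) for col in columns)


rows = list(flatten_rows(json.loads(json_str)))
for row in rows:
    print(",".join(str(value) for value in row))
```

Using `obj.get(col)` rather than `obj[col]` is a design choice here: objects that lack one of the columns produce `None` instead of a `KeyError`.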

FYI, you can store the rows in variables or write them to a CSV, instead of or in addition to printing them.

If you just want to dump it to a CSV, see this answer.
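
As a rough sketch of the CSV route, the standard-library `csv` module can write the nested objects directly. The in-memory `io.StringIO` buffer below is a stand-in for a real file (for example `open("out.csv", "w", newline="")`, where the file name is hypothetical), and the sample string is inlined from the question for illustration.

```python
import csv
import io
import json

# Sample input from the question, inlined for illustration.
json_str = (
    '[{"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},'
    ' {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}]'
)
columns = ["atttr1", "atttr2", "id", "score"]

buffer = io.StringIO()  # stand-in for a file opened with newline=""
# extrasaction="ignore" skips any keys that are not listed in `columns`
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()

for item in json.loads(json_str):
    for objects in item.values():  # no parent key is hard-coded
        writer.writerows(objects)  # one CSV row per nested object

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the `csv` module writes `None` as an empty field, so `atttr1` comes out blank rather than as the literal text `null`; if the literal is needed, the values would have to be mapped before writing.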

【Comments】: