Here's some code to help you get started:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("hi", {"Name": "David", "Age": "25", "Location": "New York", "Height": "170", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "1", "Singing": "2"}, "Skills": {"Coding": "2", "Swimming": "4"}}}, "bye"),
    ("hi", {"Name": "Helen", "Age": "28", "Location": "New York", "Height": "160", "fields": {"Color": "Blue", "Shape": "Round", "Hobby": {"Dance": "5", "Singing": "6"}}}, "bye"),
]
df = spark.createDataFrame(data, ["greeting", "dic", "farewell"])

# getItem("key") and ["key"] are two equivalent ways to read a value from the map column
res = df.select(
    F.col("dic").getItem("Name").alias("Name"),
    F.col("dic")["Age"].alias("Age"),
)
res.show()
+-----+---+
| Name|Age|
+-----+---+
|David| 25|
|Helen| 28|
+-----+---+
res.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
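Spark can't handle dictionary values of several different types: a map column has a single key type and a single value type. Regular Python, by contrast, can handle dictionaries with mixed-type keys and values.
We can run df.printSchema() to see how PySpark interprets the dictionary values: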
root
|-- greeting: string (nullable = true)
|-- dic: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- farewell: string (nullable = true)
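Your example dataset mixes string and dictionary values in the same dictionary. Run df.select(F.col("dic").getItem("fields")).printSchema() to see how the nested dictionary is read: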
root
|-- dic[fields]: string (nullable = true)
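To check which keys cause the mix before the data ever reaches Spark, something like the following plain-Python sketch might help. The helper name value_types and the shortened sample record are made up for illustration; they are not part of PySpark or of your code.

```python
def value_types(record):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


# Shortened version of the "David" dictionary from the example data
david = {
    "Name": "David",
    "Age": "25",
    "fields": {"Color": "Blue", "Hobby": {"Dance": "1", "Singing": "2"}},
}

types = value_types(david)

# Keys whose values are not plain strings do not fit a map<string,string> column
mixed = sorted(key for key, type_name in types.items() if type_name != "str")
print(mixed)
```

Here only "fields" is reported, which matches the column that showed up as a string in the schema above.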
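There may be ways to parse the string and convert it back to a map, but that would be expensive. Can you add the printSchema output to your question? You might need to restructure the data, which would make this easier to answer ;)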
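As one possible way to restructure the data, here is a minimal sketch that flattens the nested dictionaries into dotted keys before the DataFrame is created, so that every value is a plain string. The flatten helper and the dotted key naming are my own assumptions, not something from your code or from PySpark, and the sample record is shortened.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Shortened version of the "Helen" dictionary from the example data
helen = {
    "Name": "Helen",
    "Age": "28",
    "fields": {"Color": "Blue", "Hobby": {"Dance": "5", "Singing": "6"}},
}

flat_helen = flatten(helen)
print(flat_helen)
```

If the flattened dictionaries are passed to spark.createDataFrame in place of the originals, a lookup such as F.col("dic").getItem("fields.Hobby.Dance") should then return the nested value, since all values are strings. Keys that are missing for a given row (for example "Skills" for Helen) would simply come back as null.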