Answers to: python dataframe collect() function

【Question title】: python dataframe collect() function
【Posted】: 2021-06-08 17:48:44
【Question description】:

I am running into a very strange problem when using the collect() function:

data = df.select("node_id", "bin", "type", "jsonObj").collect()

jsonObj looks like this:

[
 {
   "id" : 1,
   "name" : "hello"
 },
 {
   "id" : 2,
   "name" : "world"
 }
]

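Now, when I iterate over the list produced by collect() and print row["jsonObj"], I get the JSON objects as strings rather than as plain JSON objects: each object in the array is now wrapped in single quotes ('). The problem is that when I try to write it to a file, it becomes an array of strings instead of an array of JSON objects: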

['{
   "id" : 1,
   "name" : "hello"
 }',
 '{
   "id" : 2,
   "name" : "world"
 }'
]

Has anyone else run into the same problem? I just want to store jsonObj in the file as it is, not as strings. Sample row and schema:

node_id	bin	type	jsonObj
1	a	type1	[ { "id" : 11, "name" : "hello" }, { "id" : 12, "name" : "world" } ]

root
 |-- node_id: long (nullable = true)
 |-- bin: string (nullable = true)
 |-- type: string (nullable = true)
 |-- jsonObj: array (nullable = true)
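
The symptom described above can be reproduced without Spark. The following is a minimal sketch, assuming the elements of jsonObj reach Python as JSON strings (which is what a column of type array<string> would give); the variable names are illustrative only.

```python
import json

# Assumption: each element of jsonObj is a JSON *string*, not a dict,
# as it would be for a Spark column of type array<string>.
json_obj = ['{"id": 11, "name": "hello"}', '{"id": 12, "name": "world"}']

# Serializing the list as-is encodes every element as a JSON string,
# so the file ends up holding an array of strings, not an array of objects.
serialized = json.dumps(json_obj)

print(type(json_obj[0]).__name__)
print(serialized)
```

Reading the serialized text back yields strings again, which matches the output shown in the question.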

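【Comments】:

The dataframe probably has the jsonObj column as an array of strings. If you want JSON objects, you need to use from_json to convert it to an array of structs.
Can you give me an example?
I have added a sample response to the question above.
How are you writing it to a file? Are you writing a CSV file or a JSON file?
I am writing it to a JSON file.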

Tags: python json apache-spark pyspark apache-spark-sql

【Solution 1】:

You can use from_json to convert the JSON string to an array of structs:

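import pyspark.sql.functions as F
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType,
)

# Schema of the parsed column: an array of {id, name} structs.
json_schema = ArrayType(StructType([
    StructField('id', IntegerType()),
    StructField('name', StringType()),
]))

# Cast the column to a string, then parse that string with the schema.
df2 = df.withColumn(
    "jsonObj",
    F.from_json(F.col('jsonObj').cast('string'), json_schema)
)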

df2.show(truncate=False)
+-------+---+-----+--------------------------+
|node_id|bin|type |jsonObj                   |
+-------+---+-----+--------------------------+
|1      |a  |type1|[[11, hello], [12, world]]|
+-------+---+-----+--------------------------+

df2.write.json('filepath')

This should give the following output:

{"node_id":"1","bin":"a","type":"type1","jsonObj":[{"id":11,"name":"hello"},{"id":12,"name":"world"}]}
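
If the rows have already been collected to the driver, as in the question, the strings can also be parsed there with the standard json module. This is only a sketch under stated assumptions: the rows are mimicked by plain dicts (a real pyspark Row would need row.asDict() first), the helper name parse_json_array is hypothetical, and the output path is an arbitrary temporary file.

```python
import json
import os
import tempfile


def parse_json_array(elements):
    """Hypothetical helper: turn a list of JSON strings into a list of dicts."""
    return [json.loads(e) if isinstance(e, str) else e for e in elements]


# Stand-in for df.select(...).collect(); with real Row objects,
# call row.asDict() to obtain a dict like these.
rows = [
    {
        "node_id": 1,
        "bin": "a",
        "type": "type1",
        "jsonObj": ['{"id": 11, "name": "hello"}', '{"id": 12, "name": "world"}'],
    },
]

# Replace the list of strings with a list of parsed objects.
records = [dict(row, jsonObj=parse_json_array(row["jsonObj"])) for row in rows]

# Write one JSON object per line, the layout df.write.json also uses.
out_path = os.path.join(tempfile.gettempdir(), "collect_output.json")
with open(out_path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
```

This keeps the parsing on the driver, so it only suits results small enough to collect; for larger data the from_json approach above keeps the work inside Spark.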

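【Comments】:

Thanks for posting a solution. But I get: org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input '>'(line 1, pos 28)
@A007 ... Spark version issue. You need Spark >= 2.4 to use transform.
Got it, the Spark version.
Do we also need to import array and struct? Because now I get: no viable alternative at input '
The Spark version is '2.2.0'.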