[Posted]: 2020-06-24 09:12:35
[Question]:
I am using Spark SQL to process JSON data that contains nested arrays.
{
  "isActive": true,
  "sample": {
    "someitem": {
      "thesearecool": [
        { "neat": "wow" },
        { "neat": "tubular" }
      ]
    },
    "coolcolors": [
      { "color": "red", "hex": "ff0000" },
      { "color": "blue", "hex": "0000ff" }
    ]
  }
}
Schema:
root
 |-- isActive: boolean (nullable = true)
 |-- sample: struct (nullable = true)
 |    |-- coolcolors: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- hex: string (nullable = true)
 |    |-- someitem: struct (nullable = true)
 |    |    |-- thesearecool: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- neat: string (nullable = true)
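Code:
import org.apache.spark.sql.functions.explode // $"..." also needs spark.implicits._ (preloaded in spark-shell)

// Explode coolcolors only; thesearecool.neat stays an array column
val nested1 = nested
  .withColumn("color_data", explode($"sample.coolcolors"))
  .select("isActive", "color_data.color", "color_data.hex", "sample.someitem.thesearecool.neat")

// Explode thesearecool on its own
val nested2 = nested
  .withColumn("thesearecool_data", explode($"sample.someitem.thesearecool"))
  .select("thesearecool_data.neat")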
Sample output:
+--------+-----+------+--------------+
|isActive|color|hex   |neat          |
+--------+-----+------+--------------+
|true    |red  |ff0000|[wow, tubular]|
|true    |blue |0000ff|[wow, tubular]|
+--------+-----+------+--------------+
+-------+
|neat   |
+-------+
|wow    |
|tubular|
+-------+
We need both arrays flattened into a single result instead of two separate outputs.
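One way to read "a single result" is one DataFrame in which both arrays are exploded. The following is a minimal sketch under that assumption, not a confirmed answer: it chains a second explode onto the first, so every color is paired with every "neat" value (a cross product of the two arrays). The names FlattenNestedArrays, flattenBoth and sampleJson, and the local SparkSession setup, are illustrative and do not come from the question. The RDD-based spark.read.json overload is used because the question is tagged Spark 2.0; later releases mark it as deprecated in favor of a Dataset[String] overload.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object, for illustration only.
object FlattenNestedArrays {

  // The sample document from the question, written on one line.
  val sampleJson: String =
    """{"isActive":true,"sample":{"someitem":{"thesearecool":[{"neat":"wow"},{"neat":"tubular"}]},"coolcolors":[{"color":"red","hex":"ff0000"},{"color":"blue","hex":"0000ff"}]}}"""

  // Explodes both arrays one after the other. Each explode emits one row per
  // array element, so the result is the cross product of the two arrays.
  // Assumption: pairing every color with every "neat" value is acceptable.
  def flattenBoth(nested: DataFrame): DataFrame =
    nested
      .withColumn("color_data", explode(col("sample.coolcolors")))
      .withColumn("thesearecool_data", explode(col("sample.someitem.thesearecool")))
      .select(
        col("isActive"),
        col("color_data.color"),
        col("color_data.hex"),
        col("thesearecool_data.neat"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-nested-arrays")
      .getOrCreate()

    // Each RDD element is parsed as one JSON record.
    val nested = spark.read.json(spark.sparkContext.parallelize(Seq(sampleJson)))

    flattenBoth(nested).show(false)
    spark.stop()
  }
}
```

With the sample document this yields 2 x 2 = 4 rows with the columns isActive, color, hex and neat. If the two arrays should instead be matched element by element, a cross product is the wrong shape and a position-based join would be needed.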
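[Comments]:
-
I don't understand. What exactly is the question?
-
With the data above I can process the two arrays as two separate RDDs, but I need a single table. One explode handles one array; I need to handle the second array in the same RDD.
-
nested.withColumn("color_data", explode($"sample.coolcolors")).select("isActive","color_data.color","color_data.hex","sample.someitem.thesearecool.neat") ..... this gives one output
-
nested.withColumn("thesearecool_data", explode($"sample.someitem.thesearecool")).select("thesearecool_data.neat") ... this gives another output. I need a merged, single output containing all of the JSON data nested in the arrays.
-
Can you help us?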
Tags: scala apache-spark-sql apache-spark-2.0