Add a JSON object field to a JSON array field in a DataFrame using Scala

【Question title】: Add JSON object field to a JSON array field in the dataframe using scala
【Posted】: 2020-09-02 09:54:37
【Question description】:

Is there a way to add a JSON object to an existing array of JSON objects?

I have a DataFrame:

+-------------------------+---------------------------------------------------------+------------+
|   name                  |       hit_songs                                         |  column1   |
+-------------------------+---------------------------------------------------------+------------+
|{"HomePhone":"34567002"} | [{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}] | value1     |
|{"HomePhone":"34567011"} | [{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}] |  value2    |
+-------------------------+---------------------------------------------------------+------------+

I want the resulting DataFrame to look like this:

+-----------------------------------------------------------------------------------+------------+
|   name                                                                            |  column1   |
+-----------------------------------------------------------------------------------+------------+
|[ {"HomePhone":"34567002"},{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"} ] |  value1    |
|[ {"HomePhone":"34567011"},{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"} ] |  value2    |
+-----------------------------------------------------------------------------------+------------+

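【Comments on the question】:

What are you trying to achieve? Could you show what you need as a DataFrame? I assume that in the end you want the data in tabular form rather than as JSON.

Tags: arrays json scala dataframe apache-spark

【Solution 1】:

Use the array_union function.

name is a string column, so first wrap it in a one-element array with array(...). Both arguments to array_union must be arrays with the same element type.

See the code below.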

scala> df.show(false)
+------------------------+-------------------------------------------------------+
|name                    |hit_songs                                              |
+------------------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------------------------------------------------+


scala> df.withColumn("name",array_union(array($"name"),$"hit_songs")).show(false) // Use array_union function, to join name string column with hit_songs array column, first convert name to array(name).
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|name                                                                             |hit_songs                                              |
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+---------------------------------------------------------------------------------+-------------------------------------------------------+
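
One caveat when following this approach: array_union returns the union of the two arrays without duplicates, so repeated entries collapse into one. If duplicates must be kept, concat (which accepts array columns in Spark 2.4 and later) is an alternative. A minimal sketch, assuming Spark 2.4+; the row with a repeated entry is hypothetical and not taken from the question, and the session setup and variable names are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_union, col, concat}

// Assumption: a local session for experimenting; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("union-vs-concat").getOrCreate()

// Hypothetical row in which hit_songs contains the same JSON string twice.
val dup = spark.createDataFrame(Seq(
  ("""{"HomePhone":"34567002"}""",
   Seq("""{"Phonetypecode":"PTC001"}""", """{"Phonetypecode":"PTC001"}"""))
)).toDF("name", "hit_songs")

val compared = dup.select(
  // Set semantics: the repeated Phonetypecode entry appears once.
  array_union(array(col("name")), col("hit_songs")).as("with_array_union"),
  // Plain concatenation: every element is kept.
  concat(array(col("name")), col("hit_songs")).as("with_concat")
)

compared.show(false)
```

Which of the two fits depends on whether repeated JSON objects in hit_songs carry meaning in your data.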

scala> df.show(false)
+------------------------+-------------+-------------------------------------------------------+
|name                    |dammy        |hit_songs                                              |
+------------------------+-------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------+-------------------------------------------------------+


scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- dammy: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> df.withColumn("name",array_union(array_union(array($"name"),$"hit_songs"),array($"dammy"))).show(false)

+------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|name                                                                                            |dammy        |hit_songs                                              |
+------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}, {"aaa":"aaa"}]|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}, {"bbb":"bbb"}]|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
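
To get the exact shape asked for in the question (the merged name column plus column1, without hit_songs), the same call can be followed by drop. The sketch below assumes the column types from the printed schema (name and column1 as strings, hit_songs as an array of strings) and Spark 2.4+; the session setup and the variable names df and merged are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_union, col}

// Assumption: a local session for experimenting; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("merge-json-columns").getOrCreate()

// Sample rows copied from the question.
val df = spark.createDataFrame(Seq(
  ("""{"HomePhone":"34567002"}""",
   Seq("""{"Phonetypecode":"PTC001"}""", """{"Phonetypecode":"PTC002"}"""), "value1"),
  ("""{"HomePhone":"34567011"}""",
   Seq("""{"Phonetypecode":"PTC021"}""", """{"Phonetypecode":"PTC022"}"""), "value2")
)).toDF("name", "hit_songs", "column1")

val merged = df
  // Wrap name in a one-element array so both arguments are array<string>.
  .withColumn("name", array_union(array(col("name")), col("hit_songs")))
  // hit_songs is now folded into name; every other column is left untouched.
  .drop("hit_songs")

merged.show(false)
merged.printSchema()
```

withColumn replaces only the named column, so any additional columns in the real DataFrame pass through unchanged.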

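【Discussion】:

What if we want to add a third column?
Sir, I actually have more columns besides name and hit_songs. I want them to stay unchanged, with this concatenated array column alongside them. I have updated the DataFrame in the question; please check.
OK, I have updated the answer: use withColumn, and use the drop() function to remove the columns you no longer need.
Let me check.
It throws an error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_union(array(entitymappingJoinA.phonestruct11), entitymappingJoinA.phonestruct11)' due to data type mismatch: input to function array_union should have been two arrays with same element type, but it's [array>, struct];;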
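
The message indicates that the second argument passed to array_union was a single struct rather than an array, and that the real columns hold structs instead of JSON strings. array_union needs two arrays whose element types match. One possible workaround, sketched here under the assumption that JSON strings are acceptable in the result, is to serialize each struct with to_json before merging. The asker's actual schema is not shown, so structDf and its field names are hypothetical stand-ins modeled on the question's data; the SQL transform function assumes Spark 2.4+.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_union, col, expr, lit, struct, to_json}

// Assumption: a local session for experimenting; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("struct-to-json-union").getOrCreate()

// Hypothetical frame: name is a struct, hit_songs is an array of structs with a different field.
val structDf = spark.range(1).select(
  struct(lit("34567002").as("HomePhone")).as("name"),
  array(
    struct(lit("PTC001").as("Phonetypecode")),
    struct(lit("PTC002").as("Phonetypecode"))
  ).as("hit_songs")
)

val mergedJson = structDf.select(
  array_union(
    // One-element array<string> holding the JSON form of the name struct.
    array(to_json(col("name"))),
    // array<struct> -> array<string>, serializing each element to JSON.
    expr("transform(hit_songs, s -> to_json(s))")
  ).as("name")
)

mergedJson.show(false)
```

If the structs must stay as structs, both sides would instead need an identical struct type, which depends on the actual schema.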