【Title】:Add a JSON object field to a JSON array field in a DataFrame using Scala
【Posted】:2020-09-02 09:54:37
【Question】:

Is there a way to add a JSON object to an existing array of JSON objects?

I have a DataFrame:

+-------------------------+---------------------------------------------------------+------------+
|   name                  |       hit_songs                                         |  column1   |
+-------------------------+---------------------------------------------------------+------------+
|{"HomePhone":"34567002"} | [{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}] | value1     |
|{"HomePhone":"34567011"} | [{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}] |  value2    |
+-------------------------+---------------------------------------------------------+------------+ 

I want the resulting DataFrame to look like this:

+-----------------------------------------------------------------------------------+------------+
|   name                                                                            |  column1   |
+-----------------------------------------------------------------------------------+------------+
|[ {"HomePhone":"34567002"},{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"} ] |  value1    |
|[ {"HomePhone":"34567011"},{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"} ] |  value2    |
+-----------------------------------------------------------------------------------+------------+

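【Comments】:

  • What are you trying to achieve? Could you show what you need as a DataFrame? I assume that in the end you want the data in tabular form rather than as JSON.

Tags: arrays json scala dataframe apache-spark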


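【Solution 1】:

Use the array_union function (available since Spark 2.4).

name is a string column, so wrap it with array to turn it into an array column first.

See the code below.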

scala> df.show(false)
+------------------------+-------------------------------------------------------+
|name                    |hit_songs                                              |
+------------------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------------------------------------------------+


scala> df.withColumn("name",array_union(array($"name"),$"hit_songs")).show(false) // name is a string column, so wrap it as array(name) first, then merge it with the hit_songs array column using array_union.
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|name                                                                             |hit_songs                                              |
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+---------------------------------------------------------------------------------+-------------------------------------------------------+
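
One caveat with the call above: array_union returns the union of the two arrays without duplicates, so repeated entries in hit_songs would be collapsed. If duplicates need to be kept, concat (which also accepts array columns since Spark 2.4) may be a better fit. A minimal sketch, assuming a local SparkSession; the object name, helper names, and the sample row with a deliberately duplicated entry are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, array_union, col, concat}

object ConcatVsUnion {
  // array_union: merges the two arrays and removes duplicate elements.
  def withUnion(df: DataFrame): DataFrame =
    df.select(array_union(array(col("name")), col("hit_songs")).as("merged"))

  // concat: appends the arrays as-is, so duplicates are kept.
  def withConcat(df: DataFrame): DataFrame =
    df.select(concat(array(col("name")), col("hit_songs")).as("merged"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("concat-vs-union").getOrCreate()
    import spark.implicits._

    // Hypothetical sample: hit_songs contains the same JSON string twice.
    val df = Seq(
      ("""{"HomePhone":"34567002"}""",
        Seq("""{"Phonetypecode":"PTC001"}""", """{"Phonetypecode":"PTC001"}"""))
    ).toDF("name", "hit_songs")

    println(withUnion(df).as[Seq[String]].head().size)  // 2: the duplicate is dropped
    println(withConcat(df).as[Seq[String]].head().size) // 3: the duplicate is kept
    spark.stop()
  }
}
```

For the sample data in the question both functions give the same result, since no element repeats.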

To also append a third column (dammy in this example), nest a second array_union call:

scala> df.show(false)
+------------------------+-------------+-------------------------------------------------------+
|name                    |dammy        |hit_songs                                              |
+------------------------+-------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------+-------------------------------------------------------+


scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- dammy: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> df.withColumn("name",array_union(array_union(array($"name"),$"hit_songs"),array($"dammy"))).show(false)

+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|name                                                                             |dammy        |hit_songs                                              |
+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
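
The REPL steps above can be put together into a small standalone program that produces the shape the question asks for: the merged array in name, hit_songs dropped, and every other column (column1 here) left untouched. This is a sketch under the assumption that name is a string column and hit_songs is an array of strings, as in the printed schema; the object and method names are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, array_union, col}

object MergeNameIntoArray {
  // Wraps `name` in a one-element array, merges it with `hit_songs`,
  // then drops `hit_songs`. All remaining columns pass through unchanged.
  def mergeName(df: DataFrame): DataFrame =
    df.withColumn("name", array_union(array(col("name")), col("hit_songs")))
      .drop("hit_songs")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("merge-name").getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question.
    val df = Seq(
      ("""{"HomePhone":"34567002"}""",
        Seq("""{"Phonetypecode":"PTC001"}""", """{"Phonetypecode":"PTC002"}"""), "value1"),
      ("""{"HomePhone":"34567011"}""",
        Seq("""{"Phonetypecode":"PTC021"}""", """{"Phonetypecode":"PTC022"}"""), "value2")
    ).toDF("name", "hit_songs", "column1")

    mergeName(df).show(false)
    spark.stop()
  }
}
```

Because withColumn replaces name in place and drop only removes hit_songs, the result keeps the column order name, column1.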

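【Discussion】:

  • What if we want to add a third column?
  • Sir, I actually have more columns besides name and hit_songs. I want them to stay unchanged, with this concatenated array column alongside them. I have updated the DataFrame in the question. Please check.
  • OK, I have updated the answer: use withColumn, and use the drop() function to remove the columns you don't need.
  • Let me check.
  • It throws an error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_union(array(entitymappingJoinA.phonestruct11), entitymappingJoinA.phonestruct11)' due to data type mismatch: input to function array_union should have been two arrays with same element type, but it's [array>, struct];;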
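
The error in the last comment says that array_union needs two arrays with the same element type, while the second argument in that call is a bare struct (the same column, phonestruct11, is passed on both sides). The commenter's real schema is not shown, so the following is only a sketch under assumptions: a hypothetical struct column phone and a hypothetical array-of-structs column phones whose fields differ. Serializing every element to a JSON string with to_json gives both sides the same element type. The case classes, column names, and object name are all made up for illustration, and concat is used here in place of array_union so that duplicates are kept:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, concat, expr, to_json}

// Hypothetical schema, not taken from the thread.
case class Home(HomePhone: String)
case class PhoneType(Phonetypecode: String)
case class Rec(phone: Home, phones: Seq[PhoneType])

object StructsToJsonArray {
  // Converts the struct and each array element to a JSON string,
  // so both arguments become array<string> and can be combined.
  def merge(df: DataFrame): DataFrame =
    df.withColumn(
      "merged",
      concat(
        array(to_json(col("phone"))),
        // transform is a SQL higher-order function (Spark 2.4+).
        expr("transform(phones, p -> to_json(p))")
      )
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("structs-to-json").getOrCreate()
    import spark.implicits._

    val df = Seq(
      Rec(Home("34567002"), Seq(PhoneType("PTC001"), PhoneType("PTC002")))
    ).toDF()

    merge(df).select("merged").show(false)
    spark.stop()
  }
}
```

If the struct and the array elements already share exactly the same struct type, wrapping the struct with array() and passing the array column as the second argument should be enough, without the JSON conversion.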