【问题标题】:SPARK: How to parse a Array of JSON object using SparkSPARK:如何使用 Spark 解析 JSON 对象数组
【发布时间】:2020-01-18 02:17:36
【问题描述】:

我有一个包含普通列的文件和一个包含 Json 字符串的列,如下所示。还附上图片。每一行实际上都属于一个名为 Demo 的列(在图中不可见)。其他列已被删除,并且在 pic 中不可见,因为它们现在不值得关注。

[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

请不要更改 JSON 的格式,因为它在数据文件中与上面一样,除了所有内容都在一行中。

每一行在列下都有一个这样的对象,比如 JSON。这些对象都在一行中,但在一个数组中。我想使用 spark 解析此列并访问其中每个对象的值。请帮忙。

我想要的是获得关键“价值”的价值。我的目标是将每个 JSON 对象中的“值”键的值提取到单独的列中。

我尝试使用 get_json_object。它适用于以下 1) Json 字符串,但为 JSON 2) 返回 null

  1. {"key":"device_kind","value":"desktop"}
  2. [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","值":"windows"}]

我试过的代码如下

val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")

jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2)// prints desktop under column named device_kind

val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")

jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2)// print null but expected is desktop under column named device_kind

接下来我想使用 from_Json 但我无法弄清楚如何为 JSON 对象数组构建架构。我发现的所有示例都是嵌套 JSON 对象的示例,但与上述 JSON 字符串没有任何相似之处。

我确实发现在 sparkR 2.2 中 from_Json 有一个布尔参数,如果设置为 true,它将处理上述类型的 JSON 字符串,即 JSON 对象数组,但该选项在 Spark-Scala 2.3.3 中不可用

要清楚输入和预期输出,应如下所示。

i/p 下面

+------------------------------------------------------------------------+
|Demographics                                                            |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+

预期的 o/p 低于

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+

【问题讨论】:

标签: json apache-spark apache-spark-sql schema


【解决方案1】:

谢谢你的回答。它工作正常。 因为我使用的是 2.3.3 spark,所以我以稍微不同的方式解决了这个问题。

val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))

val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
  .withColumn("country_code", expr("Demographics[1].value"))
  .withColumn("device_platform", expr("Demographics[2].value"))

【讨论】:

  • 嗨,你上面引用的 mdf 是谁?
【解决方案2】:

如果您的 JSON 列看起来像这样

    import spark.implicits._

    val inputDF = Seq(
      ("""[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""),
      ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}]"""),
      ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}]""")
    ).toDF("Demographics")

  inputDF.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|Demographics                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------+
|[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]|
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}] |
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}] |
+-------------------------------------------------------------------------------------------------------------------------+

您可以尝试通过以下方式解析该列:

  val parsedJson: DataFrame = inputDF.selectExpr("Demographics", "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json")

  val splitted = parsedJson.select(
    col("parsed_json").as("Demographics"),
    col("parsed_json").getItem(0).as("device_kind_json"),
    col("parsed_json").getItem(1).as("country_code_json"),
    col("parsed_json").getItem(2).as("device_platform_json")
  )

  val result = splitted.select(
    col("Demographics"),
    col("device_kind_json.value").as("device_kind"),
    col("country_code_json.value").as("country_code"),
    col("device_platform_json.value").as("device_platform")
  )

  result.show(false)

你会得到输出:

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+

【讨论】:

  • 嗨,Aleh,这种转换方式使得查询相关国家代码变得困难,特定设备种类的设备平台比如桌面。我想形成列 device_kind country_code, device_platform 并有它们对应的值每一行。
  • 嗨 BishamonTen。我已经编辑了答案。检查解决方案是否是您所需要的。
  • 嗨,Aleh,我想知道您是否有一些关于单元测试以及 Spark 代码/应用程序集成测试的经验。我有一些代码要进行单元测试。你能帮帮我吗?。
  • 嗨,比沙蒙。如果您分享更多信息,我可以告诉您如何提供帮助。
  • 嗨,阿莱。我已在此链接中发布我的问题stackoverflow.com/questions/58066394/… 提供您的宝贵意见
猜你喜欢
  • 2020-09-28
  • 2020-12-15
  • 1970-01-01
  • 2021-09-24
  • 2021-11-03
  • 2015-07-09
  • 2016-11-15
  • 2016-11-01
  • 1970-01-01
相关资源
最近更新 更多