【Question title】: How do I work with Nested JSON that also contains invalid NULL strings?
【Posted】: 2021-05-04 12:35:31
【Question】:

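I'm not a complete beginner, but I'm still learning, so please include details where you can. I'm working with API output in a Databricks Python notebook. One field returns an odd nested JSON document that I'm struggling to process. Its content varies: it lists the answer options available for a question, so each row represents one question, and the value in the "options" column holds every potential answer option plus many more unused slots that the system makes available.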

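So my thinking is as follows:

  1. I need to handle the null values, since they are unused placeholder output.
  2. Unpivot the DataFrame so that the question is repeated for every associated answer option.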
{
    "1": {
        "Display": "Strongly disagree"
    },
    "2": {
        "Display": "Disagree"
    },
    "3": {
        "Display": "Neither agree nor disagree"
    },
    "4": {
        "Display": "Agree"
    },
    "5": {
        "Display": "Strongly agree"
    },
    "6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null,"18":null,"19":null,"20":null,"21":null,"22":null,"23":null,"24":null,"25":null,"26":null,"27":null,"28":null,"29":null,"30":null,"31":null,"32":null,"33":null,"34":null,"35":null,"36":null,"37":null,"38":null,"39":null,"40":null,"41":null,"42":null,"43":null,"44":null,"45":null,"46":null,"47":null,"48":null,"49":null,"50":null,"51":null,"52":null,"53":null,"54":null,"55":null,"56":null,"57":null,"58":null,"59":null,"60":null,"61":null,"62":null,"63":null,"64":null,"65":null,"66":null,"67":null,"68":null,"69":null,"70":null,"71":null,"72":null,"73":null,"74":null,"75":null,"76":null,"77":null,"78":null,"79":null,"80":null,"81":null,"82":null,"83":null,"84":null,"85":null,"86":null,"87":null,"88":null,"89":null,"90":null,"91":null,"92":null,"93":null,"94":null,"95":null,"96":null,"97":null,"98":null,"99":null,"100":null,"101":null,"102":null,"103":null,"104":null,"105":null,"106":null,"107":null,"108":null,"109":null,"110":null,"111":null,"112":null,"113":null,"114":null,"115":null,"116":null,"117":null,"118":null,"119":null,"120":null,"121":null,"122":null,"123":null,"124":null,"125":null,"126":null,"127":null,"128":null,"129":null,"130":null,"131":null,"132":null,"133":null,"134":null,"135":null,"136":null,"137":null,"138":null,"139":null,"140":null,"141":null,"142":null,"143":null,"144":null,"145":null
}

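Can you help me understand what I need to do here? I'm happy to provide more detail or context if my question isn't clear enough.

【Comments】:

  • Can you explain what you are trying to achieve in the second point (the unpivot)? What is your expected output?

Tags: json apache-spark pyspark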


【Solution 1】:

You can use the dropFieldIfAllNull option when reading the JSON file to drop the fields whose values are all null:

df = spark.read.json('option.json', dropFieldIfAllNull=True)

df.show(truncate=False)
+-------------------+----------+----------------------------+-------+----------------+
|1                  |2         |3                           |4      |5               |
+-------------------+----------+----------------------------+-------+----------------+
|[Strongly disagree]|[Disagree]|[Neither agree nor disagree]|[Agree]|[Strongly agree]|
+-------------------+----------+----------------------------+-------+----------------+
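
As far as I know, dropFieldIfAllNull is applied during schema inference, so it only drops a field that is null in every record. If some questions use more option slots than others, a per-value cleanup may be needed instead. The following is a minimal plain-Python sketch of that cleanup for a single value of the options field, assuming the field arrives as a JSON string; `drop_null_options` is a hypothetical helper name, not part of any library.

```python
import json


def drop_null_options(raw: str) -> dict:
    """Parse one options JSON string and drop the unused null placeholders."""
    parsed = json.loads(raw)
    # JSON null becomes Python None; keep only the slots that hold an option.
    return {key: value for key, value in parsed.items() if value is not None}


# Shortened sample in the same shape as the question's data.
raw = '{"1": {"Display": "Agree"}, "2": {"Display": "Disagree"}, "3": null, "4": null}'
cleaned = drop_null_options(raw)
print(cleaned)
```

The nulls here are valid JSON null values rather than strings, so the standard parser handles them without any pre-processing.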


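    【Solution 2】:
    1. I need to handle the null values, since they are unused placeholder output.

    When you load the JSON into a DataFrame with spark.read.json, use dropFieldIfAllNull to ignore the null answer options: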

    df = spark.read.json('/json/data', dropFieldIfAllNull=True)
    
    2. Unpivot the DataFrame so that the question is repeated for every associated answer option.

    To get one row per question | option_id | answer, you can use the following:

    from pyspark.sql import functions as F

    # Assuming the DataFrame looks like this after all-null options are dropped
    df.show(truncate=False)
    
    #+-------------+-------------------+----------+----------------------------+-------+----------------+
    #|question     |1                  |2         |3                           |4      |5               |
    #+-------------+-------------------+----------+----------------------------+-------+----------------+
    #|Some question|[Strongly disagree]|[Disagree]|[Neither agree nor disagree]|[Agree]|[Strongly agree]|
    #+-------------+-------------------+----------+----------------------------+-------+----------------+
    
    # create one struct per option column, holding the option ID (the column name) and the answer
    answer_options = F.array(*[
        F.struct(
            F.lit(c).alias("option_id"), 
            F.col(c).alias("answer")
        ) for c in df.columns[1:]
    ])
    
    df1 = df.select("question", F.explode(answer_options).alias("options")) \
            .select("question", "options.*")
    
    df1.show(truncate=False)
    
    #+-------------+---------+----------------------------+
    #|question     |option_id|answer                      |
    #+-------------+---------+----------------------------+
    #|Some question|1        |[Strongly disagree]         |
    #|Some question|2        |[Disagree]                  |
    #|Some question|3        |[Neither agree nor disagree]|
    #|Some question|4        |[Agree]                     |
    #|Some question|5        |[Strongly agree]            |
    #+-------------+---------+----------------------------+
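
The reshaping above can also be sketched for a single row in plain Python, which may help when checking the expected output before running it on the cluster. This is only an illustration under the assumption that the options field is available as a JSON string; `unpivot_options` is a hypothetical helper name. Unlike the Spark output, it extracts the Display text rather than keeping the whole struct.

```python
import json


def unpivot_options(question: str, raw_options: str) -> list:
    """Return one (question, option_id, answer) tuple per non-null option."""
    options = json.loads(raw_options)
    rows = []
    for option_id, value in options.items():
        if value is None:  # skip unused placeholder slots
            continue
        rows.append((question, option_id, value["Display"]))
    return rows


# Shortened sample in the same shape as the question's data.
rows = unpivot_options(
    "Some question",
    '{"1": {"Display": "Strongly disagree"}, "2": {"Display": "Disagree"}, "6": null}',
)
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the question | option_id | answer layout, with the null slots already filtered out.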
    
