json_normalize produces a KeyError when trying to extract certain attributes

【Question title】: json_normalize produces a KeyError when trying to extract certain attributes
【Posted】: 2021-05-10 04:36:48
【Question description】:

This is a subset of my JSON file:

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

I want to load it into a DataFrame with one row per question and answer pair.

Python code:

from pandas import json_normalize
import json

fields = ['text','answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]

print(data)

This produces a KeyError:

KeyError: "['answers.text'] not in index"

I have been playing with this for hours and cannot figure it out. It feels like it should be simple, but it never is.

【Question discussion】:

Tags: python json pandas json-normalize

【Solution 1】:

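Use record_prefix together with record_path and meta, so that d can be normalized in a single call.
- pd.json_normalize raises a ValueError when record_path and meta share key names, as 'id' and 'text' do here.
- ValueError: Conflicting metadata name id, need distinguishing prefix occurs when record_prefix is not used.
The KeyError occurs because 'answers.text' is not a key in d and is not a column that json_normalize() creates for this call: 'answers' holds a list, which stays in a single column instead of being flattened into dotted names.
If any top-level keys are not wanted in df, remove them from meta.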
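
The two errors described above can be reproduced with a small sketch. It assumes a reduced copy of the question's data (one question, two answers), and the variable names flat, conflict, and fixed are only illustrative.

```python
import pandas as pd

# Reduced copy of the question's data (assumption: one question, two answers).
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# The call from the question: 'answers' is a list, so it stays in one column
# and no 'answers.text' column is created.
flat = pd.json_normalize(d['data'], ['questions'], errors='ignore')
print(list(flat.columns))
print('answers.text' in flat.columns)

# record_path + meta without a prefix: 'id' and 'text' exist on both levels.
conflict = None
try:
    pd.json_normalize(d['data']['questions'], record_path=['answers'], meta=['id', 'text'])
except ValueError as exc:
    conflict = str(exc)
print(conflict)

# Adding record_prefix renames the answer-level columns and removes the clash.
fixed = pd.json_normalize(d['data']['questions'], record_path=['answers'],
                          meta=['id', 'text'], record_prefix='answers_')
print(list(fixed.columns))
```

Selecting data[fields] on flat fails for the same reason: the column being requested was never created.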

import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path= ['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
   answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
0      362949    Answer #1             None  6574  Question #1                                      0             None         1
1      362950    Answer #2             None  6574  Question #1                                      0             None         1
2      362951    Answer #3             None  6574  Question #1                                      0             None         1
3      362952    Answer #4             None  6574  Question #1                                      0             None         1
4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1
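
To get back to the two fields the question asked for, one option is to keep only 'text' in meta and then select the prefixed answer column. This is a minimal sketch on a reduced copy of the data; the list name questions is an assumption made for brevity.

```python
import pandas as pd

# Reduced copy of d['data']['questions'] (assumption: one question, two answers).
questions = [{'id': 6574,
              'text': 'Question #1',
              'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                          {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]

# Keeping only 'text' in meta leaves out the other question-level keys.
df = pd.json_normalize(questions, record_path=['answers'], meta=['text'],
                       record_prefix='answers_')

# The answer text now lives in 'answers_text', not 'answers.text'.
fields = ['text', 'answers_text']
data = df[fields]
print(data)
```

The column name follows from record_prefix, so a different prefix would change the name to select.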

Extended test data (used to produce the output above)

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                            {'id': 4756,
                             'text': 'Question #2',
                             'instructionalText': 'No cheating, cheater',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}

Regarding the other answer, using .apply(pd.Series) is not recommended because it is very slow.
- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
- 10M rows take 53 minutes
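
If the explode-based approach is preferred anyway, the per-row .apply(pd.Series) step can be swapped for a single DataFrame constructor call, which is commonly suggested as the faster route. This is a sketch, not a benchmark: no timings were measured for it, and it assumes every question has at least one answer, so that every exploded cell holds a dict.

```python
import pandas as pd

# Reduced copy of d['data']['questions'] (assumption: one question, two answers).
questions = [{'id': 6574,
              'text': 'Question #1',
              'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                          {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]

# One row per answer dict, with a fresh 0..n-1 index.
df = pd.json_normalize(questions).explode('answers').reset_index(drop=True)

# Build all answer columns at once from the list of dicts,
# instead of creating one pd.Series per row.
answers = pd.DataFrame(df['answers'].tolist(), index=df.index).add_suffix('_ans')

# Attach the answer columns and drop the original dict column.
out = df.drop(columns='answers').join(answers)
print(out)
```

Unlike join(..., rsuffix="_ans"), add_suffix renames every answer column, including parentId.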

【Discussion】:

【Solution 2】:

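This is the technique I usually use:

1. json_normalize() the top-level list
2. explode() the child list, then reset_index() in preparation for step 3
3. expand the dicts inside the child list with apply(pd.Series)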

import pandas as pd

d = {'data': {'questions': [{'id': 6574,
    'text': 'Question #1',
    'instructionalText': '',
    'minimumResponses': 0,
    'maximumResponses': None,
    'sortOrder': 1,
    'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
     {'id': 362950, 'text': 'Answer #2', 'parentId': None},
     {'id': 362951, 'text': 'Answer #3', 'parentId': None},
     {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

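# Steps 1 and 2: normalize the top-level list, then give each answer dict its own row
df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
# Step 3: expand each answer dict into columns; overlapping names get the "_ans" suffix
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")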

Output (only some of the columns are shown):

     id         text  sortOrder  id_ans   text_ans
0  6574  Question #1          1  362949  Answer #1
1  6574  Question #1          1  362950  Answer #2
2  6574  Question #1          1  362951  Answer #3
3  6574  Question #1          1  362952  Answer #4

【Discussion】: