如何在不丢失记录的情况下使用空列表对熊猫中的列进行 json_normalize答案

【问题标题】：How to json_normalize a column in pandas with empty lists, without losing records如何在不丢失记录的情况下使用空列表对熊猫中的列进行 json_normalize
【发布时间】：2020-12-27 23:43:56
【问题描述】：

我正在使用pd.json_normalize 将此数据中的"sections" 字段展平为行。除了"sections" 为空列表的行之外，它工作正常。

此 ID 被完全忽略，并且在最终展平的数据框中丢失。我需要确保数据中每个唯一 ID 至少有一行（某些 ID 可能有很多行，每个唯一 ID、每个唯一 section_id、question_id 和 answer_id 最多一行，因为我没有嵌套数据中的更多字段）：

     {'_id': '5f48f708fe22ca4d15fb3b55',
      'created_at': '2020-08-28T12:22:32Z',
      'sections': []}]

样本数据：

sample = [{'_id': '5f48bee4c54cf6b5e8048274',
          'created_at': '2020-08-28T08:23:00Z',
          'sections': [{'comment': '',
            'type_fail': None,
            'answers': [{'comment': 'stuff',
              'feedback': [],
              'value': 10.0,
              'answer_type': 'default',
              'question_id': '5e59599c68369c24069630fd',
              'answer_id': '5e595a7c3fbb70448b6ff935'},
             {'comment': 'stuff',
              'feedback': [],
              'value': 10.0,
              'answer_type': 'default',
              'question_id': '5e598939cedcaf5b865ef99a',
              'answer_id': '5e598939cedcaf5b865ef998'}],
            'score': 20.0,
            'passed': True,
            '_id': '5e59599c68369c24069630fe',
            'custom_fields': []},
           {'comment': '',
            'type_fail': None,
            'answers': [{'comment': '',
              'feedback': [],
              'value': None,
              'answer_type': 'not_applicable',
              'question_id': '5e59894f68369c2398eb68a8',
              'answer_id': '5eaad4e5b513aed9a3c996a5'},
             {'comment': '',
              'feedback': [],
              'value': None,
              'answer_type': 'not_applicable',
              'question_id': '5e598967cedcaf5b865efe3e',
              'answer_id': '5eaad4ece3f1e0794372f8b2'},
             {'comment': "stuff",
              'feedback': [],
              'value': 0.0,
              'answer_type': 'default',
              'question_id': '5e598976cedcaf5b865effd1',
              'answer_id': '5e598976cedcaf5b865effd3'}],
            'score': 0.0,
            'passed': True,
            '_id': '5e59894f68369c2398eb68a9',
            'custom_fields': []}]},
         {'_id': '5f48f708fe22ca4d15fb3b55',
          'created_at': '2020-08-28T12:22:32Z',
          'sections': []}]

测试：

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")

此时我缺少 ID“5f48f708fe22ca4d15fb3b55”的一行，我仍然需要它。

df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")

我能否以某种方式更改它以确保每个 ID 至少有一行？我正在处理数百万条记录，并且不想稍后意识到我的最终数据中缺少一些 ID。我能想到的唯一解决方案是将每个数据帧标准化，然后再次将其加入原始数据帧。

【问题讨论】：

标签： python pandas dictionary json-normalize

【解决方案1】：

解决问题的最佳方法是修复dict
如果sections 是空的list，则用[{'answers': [{}]}] 填充它

for i, d in enumerate(sample):
    if not d['sections']:
        sample[i]['sections'] = [{'answers': [{}]}]

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")

# display(df2)
  section_comment  section_type_fail                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               section_answers  section_score section_passed               section__id section_custom_fields                       _id            created_at
0                                NaN                                                                                                                                                                        [{'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e59599c68369c24069630fd', 'answer_id': '5e595a7c3fbb70448b6ff935'}, {'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e598939cedcaf5b865ef99a', 'answer_id': '5e598939cedcaf5b865ef998'}]           20.0           True  5e59599c68369c24069630fe                    []  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z
1                                NaN  [{'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e59894f68369c2398eb68a8', 'answer_id': '5eaad4e5b513aed9a3c996a5'}, {'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e598967cedcaf5b865efe3e', 'answer_id': '5eaad4ece3f1e0794372f8b2'}, {'comment': 'stuff', 'feedback': [], 'value': 0.0, 'answer_type': 'default', 'question_id': '5e598976cedcaf5b865effd1', 'answer_id': '5e598976cedcaf5b865effd3'}]            0.0           True  5e59894f68369c2398eb68a9                    []  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z
2             NaN                NaN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [{}]            NaN            NaN                       NaN                   NaN  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z

df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")

# display(df3)
  comment feedback  value     answer_type               question_id                 answer_id                       _id            created_at               section__id section_score section_passed section_type_fail section_comment
0   stuff       []   10.0         default  5e59599c68369c24069630fd  5e595a7c3fbb70448b6ff935  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59599c68369c24069630fe            20           True               NaN                
1   stuff       []   10.0         default  5e598939cedcaf5b865ef99a  5e598939cedcaf5b865ef998  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59599c68369c24069630fe            20           True               NaN                
2               []    NaN  not_applicable  5e59894f68369c2398eb68a8  5eaad4e5b513aed9a3c996a5  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
3               []    NaN  not_applicable  5e598967cedcaf5b865efe3e  5eaad4ece3f1e0794372f8b2  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
4   stuff       []    0.0         default  5e598976cedcaf5b865effd1  5e598976cedcaf5b865effd3  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
5     NaN      NaN    NaN             NaN                       NaN                       NaN  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z                       NaN           NaN            NaN               NaN             NaN

【讨论】：

用列表 + 空字典填充空值对我有用，并允许我纯粹使用 json_normalize（与使用另一个库相比）。我必须考虑的唯一另一件事是当您规范化一个字段而其他字段是列表类型时 - 我必须将这些列表列转换为字符串。不过，我现在有一个可行的解决方案，谢谢。
@ldacey 很高兴这对你有用。使用结构不正确的 JSON 文件可能会很痛苦。
@TrentonMcKinney 天哪！我刚遇到这个问题。在这里感谢您的解决方案！
@ScottBoston 嘿，斯科特！很高兴这对你有用。 JSON 对象可能会很痛苦。哎呀！ 1.5年前说过同样的话。我支持这个评论。

【解决方案2】：

这是json_normalize 的一个已知问题。我还没有找到使用json_normalize 的方法。你可以尝试使用flatten_json 这样的东西：

import flatten_json as fj

dic = (fj.flatten(d) for d in sample)
df = pd.DataFrame(dic)
print(df)

                        _id            created_at sections_0_comment  ...            sections_1__id sections_1_custom_fields sections
0  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z                     ...  5e59894f68369c2398eb68a9                       []      NaN
1  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z                NaN  ...                       NaN                      NaN       []

【讨论】：

已知问题，如存在当前问题报告，或者我应该创建一个潜在问题？我觉得至少应该有一个警告，因为您可能会在不知道原因的情况下丢失数据。我刚刚测试了 flatten_json 模块，谢谢 - 它有没有机会将数据展平/分解成行而不是列？此字段包含可以随时添加到结果中的自定义字段，实际上有数千个唯一的问题和答案 ID，因此为每个 ID 设置一个列对我来说是行不通的。
这里打开了一个问题：github.com/pandas-dev/pandas/issues/21830
要展平成行，您必须循环遍历字典，然后使用 flatten 函数展平它们
酷，让我试试。我还创建了一个似乎更具体的新问题：github.com/pandas-dev/pandas/issues/36245，其中包含使用元列时我期望的确切输出示例