【Question title】: Python: JSON with list of objects to dataframe
【Posted】: 2021-12-21 08:08:54
【Question description】:

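I have a sample JSON that I would like to flatten into a pandas DataFrame. So far I have been applying a few methods I wrote myself, but I wonder whether there is a better/shorter solution to this problem.

Sample JSON: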

{
  "documentName": "test1.json",
  "time": "2020-10-10T08:00:00Z",
  "data": [
    {
      "name":"john",
      "scores": [
        {
          "event":"one",
          "score":10
        },
        {
          "event":"two",
          "score":10
        },
        {
          "event":"three",
          "score":10
        }
      ]
    },
    {
      "name":"mary",
      "scores": [
        {
          "event":"one",
          "score":10
        },
        {
          "event":"two",
          "score":5
        }
      ]
    },
    {
      "name":"hope",
      "scores": [
      ]
    }
  ]
}

Desired output DataFrame:

index documentName time name one two three
0 test1.json 2020-10-10T08:00:00Z john 10 10 10
1 test1.json 2020-10-10T08:00:00Z mary 10 5 Null
2 test1.json 2020-10-10T08:00:00Z hope Null Null Null

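So the event names should be added as columns and filled accordingly. There are 4 events, but if the number and names of the events could be detected dynamically (so not fixed), that would be a huge advantage.

Currently I use the following approach: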

import pandas as pd

def object_to_columns(df_row, column):
  # Expand a dict-valued cell into "<column>-<key>" columns.
  if isinstance(df_row[column], dict):
    for key, value in df_row[column].items():
      column_name = "{}-{}".format(column.lower(), key.lower())
      df_row[column_name] = value
  return df_row

def list_of_objects_to_columns(df_row, column):
  # Turn each {"event": ..., "score": ...} item into a column named after the event.
  if isinstance(df_row[column], list):
    for item in df_row[column]:
      column_name = f"{item['event']}"
      df_row[column_name] = item['score']
  return df_row

with open("test1.json") as file:
  df = pd.read_json(file)
  df = df.apply(object_to_columns, column="data", axis=1)
  df = df.apply(list_of_objects_to_columns, column="data-scores", axis=1)

### CODE TO REMOVE UNUSED COLUMNS AND RENAMING ###

Any ideas for a better, cleaner, faster way to do this?

【Comments】:

  • Do you really need the "hope" row?
  • "hope" can be removed :)

Tags: python json pandas dataframe


【Solution 1】:

A more direct approach is to use json_normalize, but you lose the information about "hope":

import pandas as pd
import json

with open("data.json") as file:
    data = json.load(file)

out = pd.json_normalize(data, ['data', 'scores'],
                        meta=['documentName', 'time', ['data', 'name']]) \
        .pivot(index=['documentName', 'time', 'data.name'],
               columns='event', values='score').reset_index()

Output:

>>> out
event documentName                  time data.name   one  three   two
0       test1.json  2020-10-10T08:00:00Z      john  10.0   10.0  10.0
1       test1.json  2020-10-10T08:00:00Z      mary  10.0    NaN   5.0
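
To see why "hope" disappears, it can help to look at the long-format frame that json_normalize produces before the pivot. The following is a minimal sketch with the question's JSON inlined as a Python dict; the names `sample` and `long_df` are illustrative and not part of the answer. Because the record path is `['data', 'scores']`, one row is emitted per score entry, so a person with an empty `scores` list contributes no rows at all.

```python
import pandas as pd

# The question's JSON, inlined so the example runs without a file.
sample = {
    "documentName": "test1.json",
    "time": "2020-10-10T08:00:00Z",
    "data": [
        {"name": "john", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 10},
                                    {"event": "three", "score": 10}]},
        {"name": "mary", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 5}]},
        {"name": "hope", "scores": []},
    ],
}

# One row per element of data[*].scores; meta fields are repeated on each row.
long_df = pd.json_normalize(
    sample,
    record_path=["data", "scores"],
    meta=["documentName", "time", ["data", "name"]],
)

print(long_df)
print(sorted(long_df["data.name"].unique()))  # "hope" has no score rows
```

The pivot then only reshapes these rows, so any name that is missing here is missing from the final result as well.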

Update: another option that keeps the "hope" row:

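import numpy as np

with open("data.json") as file:
    data = json.load(file)

# One row per score dict; "hope" (empty list) becomes a single NaN row.
out = pd.json_normalize(data, 'data', meta=['documentName', 'time']) \
        .explode('scores', ignore_index=True)

# Split each score dict into separate 'event' and 'score' columns.
out[['event', 'score']] = out.pop('scores').dropna() \
                             .agg(lambda x: pd.Series(x.values()))

# The "hope" row produces a NaN event column in the pivot; drop it afterwards.
out = out.pivot(index=['documentName', 'time', 'name'],
                columns='event', values='score') \
         .reset_index().drop(columns=np.nan)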

Output:

>>> out
event documentName                  time  name   one  three   two
0       test1.json  2020-10-10T08:00:00Z  hope   NaN    NaN   NaN
1       test1.json  2020-10-10T08:00:00Z  john  10.0   10.0  10.0
2       test1.json  2020-10-10T08:00:00Z  mary  10.0    NaN   5.0
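
For comparison, here is a hedged sketch of a plain-Python alternative that is not taken from the answer above: it builds one dict per person and lets the DataFrame constructor take the union of the keys. This also picks up the event names dynamically and keeps "hope", at the cost of an explicit loop. The names `sample`, `rows` and `flat` are illustrative, and the JSON is inlined so the snippet runs on its own.

```python
import pandas as pd

# The question's JSON, inlined so the example runs without a file.
sample = {
    "documentName": "test1.json",
    "time": "2020-10-10T08:00:00Z",
    "data": [
        {"name": "john", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 10},
                                    {"event": "three", "score": 10}]},
        {"name": "mary", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 5}]},
        {"name": "hope", "scores": []},
    ],
}

rows = []
for person in sample["data"]:
    # Start each row with the document-level fields and the person's name.
    row = {
        "documentName": sample["documentName"],
        "time": sample["time"],
        "name": person["name"],
    }
    # Add one key per event; an empty scores list simply adds nothing.
    for item in person["scores"]:
        row[item["event"]] = item["score"]
    rows.append(row)

# Missing events are filled with NaN when the dicts are combined.
flat = pd.DataFrame(rows)
print(flat)
```

Unlike the pivot-based versions, this keeps the rows in their original order rather than sorting them by the index columns.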

【Comments】:

  • Awesome! This works like a charm, thank you very much! The first one drops "hope", which is exactly what I need. Thanks!