【Title】: Using Pandas to Flatten a JSON with a nested array
【Posted】: 2021-12-31 06:27:05
【Question】:

I have the following JSON. I want to pull out the tasks, flatten them into their own DataFrame, and include the id from the parent.

[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
         "assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "taskId":898989,
         "status":"Closed"
        },
        {
         "assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
         "resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
         "taskId":999999
         }
       ],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]  

I would like to get a DataFrame from the tasks with these columns:

id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status

I flattened the entire DataFrame using:

df=pd.json_normalize(json.loads(df.to_json(orient='records')))

That left the tasks as [{}], which I think is fine, because I want to pull the tasks out into their own DataFrame and include the id from the parent.

I put the id and the tasks into a DataFrame like this:

tasksdf=storiesdf[['tasks','id']]

Then I tried to normalize it:

tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))

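But I know that, because the tasks are in an array, I need to do something different, and I have not been able to figure out what. I have been looking at other examples and reading what others have done. Any help would be appreciated.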

【Question discussion】:

    Tags: json pandas


    【Solution 1】:

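    The main issue is that the task list is empty for some records, so if you build the DataFrame with json_normalize on the task path, those records will not appear in it.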
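To make the empty-list behaviour concrete, here is a minimal sketch. The `records` list is a trimmed-down subset of the question's data that I made up for illustration (only `id`, `taskId` and `status` are kept); it is not the full JSON.

```python
import pandas as pd

# Illustrative subset of the question's data (assumption: other fields omitted).
records = [
    {"id": 123456, "task": [{"taskId": 898989, "status": "Closed"},
                            {"taskId": 999999}]},
    {"id": 123477, "task": []},
]

# record_path emits one row per element of "task";
# meta copies the parent's id onto each of those rows.
tasks = pd.json_normalize(records, record_path="task", meta=["id"])

print(tasks)
# The parent with an empty task list contributes no rows at all.
print(123477 in tasks["id"].tolist())
```

The second print shows that id 123477 is absent from the result, which is why the parent rows have to be brought back in some other way if they are needed.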

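    Second, some columns are redundant between the top-level assignee / resolvedBy and the ones nested inside task. So I would first create the columns assignee.id, resolvedBy.id, etc., and then merge them with the normalized task: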

    import json
    import pandas as pd

    # json_str is assumed to hold the JSON text from the question
    json_data = json.loads(json_str)
    
    df = pd.DataFrame.from_dict(json_data)
    df = df.explode('task')
    
    df_assign = pd.DataFrame()
    df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
    df = df.join(df_assign).drop('assignee', axis=1)
    
    df_resolv = pd.DataFrame()
    df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
    df = df.join(df_resolv).drop('resolvedBy', axis=1)
    
    df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
    df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)
    
    print(df.drop_duplicates().reset_index(drop=True))
    

    Output:

             id       state  assignee.id assignee.firstName  ... resolvedBy.firstName  resolvedBy.lastName    taskId  status
    0  123456.0    Complete         5757                Jim  ...                  Jim              Johnson  898989.0  Closed
    1  123477.0  Inprogress         8576               Jack  ...                 None                 None       NaN     NaN
    2    123456    Complete         5857               Nacy  ...               George              Johnson  999999.0     NaN
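If only the task-level fields plus the parent id are wanted (the column list given in the question), a shorter route might be to normalize the task path directly and left-merge onto the parent ids. This is a sketch of an alternative, not the code above: the names `tasks`, `parents`, `wanted` and `result` are my own, and for a parent with no tasks every task-level column (including assignee.*) is simply left as NaN rather than filled from the parent.

```python
import json
import pandas as pd

# The JSON from the question, compacted.
json_str = """
[
 {"id": 123456,
  "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
  "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
  "task": [
    {"assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "taskId": 898989, "status": "Closed"},
    {"assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
     "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
     "taskId": 999999}
  ],
  "state": "Complete"},
 {"id": 123477,
  "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
  "resolvedBy": {"id": null, "firstName": null, "lastName": null},
  "task": [],
  "state": "Inprogress"}
]
"""
json_data = json.loads(json_str)

# One row per task; nested dicts inside each task become dotted columns
# such as "assignee.id". meta copies the parent's id onto each row.
tasks = pd.json_normalize(json_data, record_path="task", meta=["id"])

# meta columns come back as object dtype; cast so the merge keys line up.
tasks["id"] = tasks["id"].astype("int64")

# Left-merge onto every parent id so parents with no tasks are kept.
parents = pd.DataFrame({"id": [rec["id"] for rec in json_data]})
result = parents.merge(tasks, on="id", how="left")

# Keep only the columns requested in the question.
wanted = ["id", "assignee.id", "assignee.firstName", "assignee.lastName",
          "resolvedBy.firstName", "resolvedBy.lastName", "taskId", "status"]
result = result[wanted]
print(result)
```

Because of the NaN row, taskId and assignee.id end up as floats here; if the empty parent is not needed, dropping the merge and using `tasks` directly avoids that.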
    

    【Discussion】:
