JSON to pandas dataframe for multiple filepaths: answers

【Question title】: JSON to pandas dataframe for multiple filepaths
【Posted】: 2020-10-13 23:15:13
【Question description】:

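I have a folder containing data files for 40 customers. Each customer has a json file listing their purchases. An example path is ../customer_data/customer_1/transaction.json

I want to load this json file into a dataframe with the columns customer_id, date, instore and rewards. The customer ID is the folder name, and I want a new row for each instore/rewards pair.

Goal: the file above should end up looking like this: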

   customer_id | date                      | instore           | rewards
   customer_1  | 2018-12-21T12:02:42-08:00 | 0                 | 0
   customer_1  | 2018-12-24T06:19:03-08:00 | 98.25211334228516 | 16.764389038085938
   customer_1  | 2018-12-24T06:19:03-08:00 | 99.88800811767578 | 18.61212158203125

I tried the following code, but got this error: ValueError: Conflicting metadata name flexion, need distinct prefix

import json
from pathlib import Path

from pandas import json_normalize

# path to file
p = Path('../customer_data/customer_1/transaction.json')

# read json
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())

# create dataframe (this is the call that raises the ValueError)
df = json_normalize(data, record_path='purchase', meta=['instore', 'rewards'], errors='ignore')

Any suggestions would be helpful.

【Question discussion】:

Tags: python json pandas json-normalize

【Solution 1】:

You can try this. customer_id is not in your json, so I just made it up:

import json

import pandas as pd

path = '../customer_data/customer_1/transaction.json'
with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path=['purchase'], meta=[['date'], ['tierLevel']])
# the customer folder is the third component of this relative path
df['customer_id'] = path.split('/')[2]
print(df)


     instore    rewards                       date tierLevel customer_id
0  98.252113  16.764389  2018-12-24T06:19:03-08:00         7  customer_1
1  99.888008  18.612122  2018-12-24T06:19:03-08:00         7  customer_1
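
For context on the ValueError in the question: json_normalize raises it when a name listed in meta is also a column produced from record_path. instore and rewards live inside the purchase records, so listing them as meta collides with the record columns. Below is a minimal sketch of that collision, assuming the file holds a list of records shaped like the ones this answer implies. The sample values are made up, and the key named in the message depends on the actual data.

```python
import pandas as pd

# Hypothetical sample; the real transaction.json is not shown in the question.
data = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {
        "date": "2018-12-24T06:19:03-08:00",
        "tierLevel": 7,
        "purchase": [
            {"instore": 98.25, "rewards": 16.76},
            {"instore": 99.89, "rewards": 18.61},
        ],
    },
]

# meta names that also appear as record columns trigger the error
try:
    pd.json_normalize(
        data, record_path="purchase", meta=["instore", "rewards"], errors="ignore"
    )
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# meta should name fields that sit next to 'purchase', not inside it
fixed = pd.json_normalize(data, record_path="purchase", meta=["date", "tierLevel"])
print(fixed.columns.tolist())
```

If a meta field really does share a name with a record field, passing meta_prefix to json_normalize is one way to keep the two apart.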

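【Discussion】:

Thanks. The customer ID is only in the file path. Is there a way to extract it from the file path itself?
This doesn't pick up the first dictionary.
Hmm, interesting. I don't think json_normalize supports that.
@TrentonMcKinney Yes, is there a way to get the null values?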
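
To illustrate the point raised in these comments: a record whose purchase list is empty contributes no rows, so it silently disappears from the result. One possible way to keep it with null values is to substitute a single placeholder purchase before normalizing. This is only a sketch under assumptions: the sample data is made up, and nan_row and patched are names invented for the example.

```python
import pandas as pd

# Hypothetical sample; the first record has an empty purchase list.
data = [
    {"date": "2018-12-21T12:02:42-08:00", "tierLevel": 7, "purchase": []},
    {
        "date": "2018-12-24T06:19:03-08:00",
        "tierLevel": 7,
        "purchase": [
            {"instore": 98.25, "rewards": 16.76},
            {"instore": 99.89, "rewards": 18.61},
        ],
    },
]

# The empty list yields zero rows, so the first date never appears.
dropped = pd.json_normalize(data, record_path="purchase", meta=["date"])
print(dropped)

# Swap each empty list for one placeholder purchase holding NaN values.
nan_row = {"instore": float("nan"), "rewards": float("nan")}
patched = [{**rec, "purchase": rec["purchase"] or [nan_row]} for rec in data]

kept = pd.json_normalize(patched, record_path="purchase", meta=["date"])
print(kept)
```

Using 0 instead of NaN in the placeholder would match the goal table in the question, at the cost of making "no purchase" indistinguishable from a purchase of 0.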

【Solution 2】:

Use rglob to find all the files.
Fix data by filling in the empty lists under the purchase key.
Use parent and stem to get the customer ID from the path.
- Given p = Path('../customer_data/customer_1/transaction.json')
- p.parent.stem returns 'customer_1'
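
A small sketch of the path handling described above. One caveat worth knowing: stem strips the final suffix, so for a folder name that contains a dot, name is the safer attribute. The customer.v2 folder is a made-up example, not something from the question.

```python
from pathlib import Path

p = Path("../customer_data/customer_1/transaction.json")

# For a folder name without dots, stem and name agree.
customer_id = p.parent.stem
print(customer_id, p.parent.name)

# Hypothetical folder name containing a dot: stem drops the '.v2' part.
dotted = Path("../customer_data/customer.v2/transaction.json")
stem_result = dotted.parent.stem
name_result = dotted.parent.name
print(stem_result, name_result)
```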

import pandas as pd
import json
from pathlib import Path

file_path = Path('../customer_data')
files = file_path.rglob('transaction.json')

df_list = []
for file in files:

    # read json
    with file.open('r', encoding='utf-8') as f:
        data = json.load(f)

    # fix purchase where list is empty
    for x in data:
        if not x['purchase']:  # checks if list is empty
            x['purchase'] = [{'instore': 0, 'rewards': 0}]

    # create dataframe
    df = pd.json_normalize(data, 'purchase', ['date', 'tierLevel'])

    # add customer
    df['customer_id'] = file.parent.stem

    # add to dataframe list
    df_list.append(df)


df = pd.concat(df_list)
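
The frame built by the loop above has its columns in the order instore, rewards, date, tierLevel, customer_id, and each file restarts its index at 0. To match the layout asked for in the question, one option is to renumber the rows during the concat and then select the columns in the desired order. A sketch, using two small made-up frames in place of the per-file results:

```python
import pandas as pd

# Hypothetical per-file frames standing in for the df_list built by the loop.
df_list = [
    pd.DataFrame({
        "instore": [0.0, 98.25],
        "rewards": [0.0, 16.76],
        "date": ["2018-12-21T12:02:42-08:00", "2018-12-24T06:19:03-08:00"],
        "tierLevel": [7, 7],
        "customer_id": "customer_1",
    }),
    pd.DataFrame({
        "instore": [12.5],
        "rewards": [1.5],
        "date": ["2018-12-25T09:00:00-08:00"],
        "tierLevel": [3],
        "customer_id": "customer_2",
    }),
]

# ignore_index=True gives one continuous 0..n-1 index across all files
df = pd.concat(df_list, ignore_index=True)

# keep only the requested columns, in the requested order
df = df[["customer_id", "date", "instore", "rewards"]]
print(df)
```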

【Discussion】:

【Solution 3】:

You can use my library, anyjsontodf.py

Basically:

import anyjsontodf as jd

df = jd.jsontodf(jsonfile)

GitHub: https://github.com/fSEACHAD/anyjsontodf

Medium article: https://medium.com/@fernando.garcia.varela/dancing-with-the-dictionary-transforming-any-json-to-pandas-3328b49269d0

Hope this helps!

【Discussion】: