【Question title】: Parsing nested JSON with lists and dicts to separate dataframes for each
【Posted】: 2021-03-13 13:39:38
【Question】:

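My JSON consists of dicts and lists.

I want to write the dicts and lists into separate dataframes, as shown below.

Here is a sample JSON; I have thousands like it: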

{
  "zone_id" : "1001",
  "timezone" : "Eastern Time",
  "address" : {
    "city" : "Niagara Falls",
    "country_code" : "US"
  },
  "financial" : {
    "currency" : {
      "code" : "USA"
    }
  },
  "amenities" : {
    "self_park" : true,
    "paved" : true,
    "mobile_pass" : null,
    "handicap" : null
  },
  "description" : "",
  "html_description" : null,
  "reserve" : false,
  "access_type" : "mobile_pay",
  "product_types" : [ "ondemand" ],
  "rates" : [ {
    "id" : 50000.1,
    "rate_type" : "valid_for",
    "zone_id" : "1001",
    "description" : "1 Hour",
    "price" : "1.50"
  }, {
    "id" : 50001.1,
    "rate_type" : "valid_for",
    "zone_id" : "1001",
    "description" : "4 Hours",
    "price" : "3.00"
  }, {
    "id" : 50002.1,
    "rate_type" : "valid_for",
    "zone_id" : "1001",
    "description" : "8 Hours",
    "price" : "6.00"
  }],
  "reservation_configuration" : null,
  "company" : {
    "proper_name" : "Niagara Falls",
    "logo_thumbnail" : null,
    "unique" : "niagarafalls"
  }
}

I want to flatten the JSON into the following dataframes, with these columns and the corresponding data:

df1:
    zone_id,
    timezone,
    description,
    html_description,
    reserve,
    access_type,
    product_types,
    rates,
    reservation_configuration,
    address.city,
    address.country_code,
    financial.currency.code,
    amenities.self_park,
    amenities.paved,
    amenities.mobile_pass,
    amenities.handicap,
    company.proper_name,
    company.logo_thumbnail,
    company.unique

address:
    zone_id
    city
    country_code
    
financial.currency:
    zone_id
    code
    
amenities:
    zone_id
    self_park
    paved
    mobile_pass
    handicap
    
product_types:
    zone_id
    product_types

rates:
    id
    rate_type
    zone_id
    description
    price
    
company:
    zone_id
    proper_name
    logo_thumbnail
    unique

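This is what I have done so far. I can generate df1 with it, but I cannot split the lists/dicts in the JSON into separate dataframes, each carrying a key. zone_id is the unique identifier for each record (like a primary key in a database table) and is meant for joining the dataframes later. product_types and rates hold the information I am trying to extract. I need help splitting each dict or list into its own dataframe, each with zone_id attached.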

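import json
import os

import pandas as pd

# json_files (file names) and path_to_json (directory) are assumed to be defined earlier
dfs = []

for index, js in enumerate(json_files):
    print(index, js)
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
    # Flattens nested dicts into dotted columns; lists stay as single cells
    a = pd.json_normalize(json_text)
    dfs.append(a)

df1 = pd.concat(dfs, ignore_index=True)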
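
To illustrate why the loop above only yields df1: `pd.json_normalize` flattens nested dicts into dotted column names, but a list is left as a Python object inside a single cell. A minimal sketch on a trimmed-down record (values copied from the sample JSON; the variable names `record`, `flat`, `columns` and `rates_cell` are only for illustration):

```python
import pandas as pd

# Trimmed-down version of the sample record from the question
record = {
    "zone_id": "1001",
    "address": {"city": "Niagara Falls", "country_code": "US"},
    "product_types": ["ondemand"],
    "rates": [
        {"id": 50000.1, "zone_id": "1001", "price": "1.50"},
        {"id": 50001.1, "zone_id": "1001", "price": "3.00"},
    ],
}

flat = pd.json_normalize(record)

# Nested dict keys become dotted columns; "rates" and "product_types" stay single columns
columns = sorted(flat.columns)

# The whole list of rate dicts sits in one cell of the single row
rates_cell = flat.loc[0, "rates"]
```

So the nested dicts are already covered by df1, while the lists need a separate step to become rows of their own.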

【Discussion】:

    Tags: python json pandas dataframe dictionary


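    【Solution 1】:

    I think your main problem is that you are trying to normalize the JSON while reading it. Sometimes that works, but in your case you need the actual nested fields to construct the different dataframes.

    This should do what you want: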

    import json
    import pandas
    import itertools
    
    
    # Your data
    raw_data = """{
        "zone_id": "1001",
        "timezone": "Eastern Time",
        "address": {"city": "Niagara Falls", "country_code": "US"},
        "financial": {"currency": {"code": "USA"}},
        "amenities": {
            "self_park": true,
            "paved": true,
            "mobile_pass": null,
            "handicap": null
        },
        "description": "",
        "html_description": null,
        "reserve": false,
        "access_type": "mobile_pay",
        "product_types": ["ondemand"],
        "rates": [
            {
                "id": 50000.1,
                "rate_type": "valid_for",
                "zone_id": "1001",
                "description": "1 Hour",
                "price": "1.50"
            },
            {
                "id": 50001.1,
                "rate_type": "valid_for",
                "zone_id": "1001",
                "description": "4 Hours",
                "price": "3.00"
            },
            {
                "id": 50002.1,
                "rate_type": "valid_for",
                "zone_id": "1001",
                "description": "8 Hours",
                "price": "6.00"
            }
        ],
        "reservation_configuration": null,
        "company": {
            "proper_name": "Niagara Falls",
            "logo_thumbnail": null,
            "unique": "niagarafalls"
        }
    }
    """
    data = json.loads(raw_data)
    
    # Let's pretend we have multiple records
    data = [data] * 100
    
    
    # NOTE: You will probably want to use something like this (uncomment it and add `import os`):
    # data = []
    # for json_file_path in json_files:
    #     with open(os.path.join(path_to_json, json_file_path)) as json_file:
    #         data.append(json.load(json_file))
    
    # =========
    # Construct the different types of dataframes
    
    # df1
    df1_columns = [
        "zone_id",
        "timezone",
        "description",
        "html_description",
        "reserve",
        "access_type",
        "product_types",
        "rates",
        "reservation_configuration",
        "address.city",
        "address.country_code",
        "financial.currency.code",
        "amenities.self_park",
        "amenities.paved",
        "amenities.mobile_pass",
        "amenities.handicap",
        "company.proper_name",
        "company.logo_thumbnail",
        "company.unique",
    ]
    df1 = pandas.json_normalize(data)[df1_columns]
    
    # Address
    address = pandas.DataFrame([{"zone_id": row["zone_id"], **row["address"]} for row in data])
    
    # financial_currency
    financial_currency = pandas.DataFrame([{"zone_id": row["zone_id"], **row["financial"]["currency"]} for row in data])
    
    # amenities
    amenities = pandas.DataFrame([{"zone_id": row["zone_id"], **row["amenities"]} for row in data])
    
    # product_types
    product_types = pandas.DataFrame([{"zone_id": row["zone_id"], "product_types": row["product_types"]} for row in data])
    
    # rates
    rates = pandas.DataFrame(itertools.chain.from_iterable([row["rates"] for row in data]))
    
    # company
    company = pandas.DataFrame([{"zone_id": row["zone_id"], **row["company"]} for row in data])
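
As a possible alternative for the two list fields, here is a sketch that uses `pandas.json_normalize` with `record_path` for `rates` and `DataFrame.explode` for `product_types`, so each list element becomes its own row. The second zone ("1002") and its values are made up for illustration. Since the rate entries in the sample already carry `zone_id`, no `meta` argument is passed; as far as I know, adding `meta=["zone_id"]` would clash with the existing column unless a `meta_prefix` is supplied.

```python
import pandas as pd

data = [
    {
        "zone_id": "1001",
        "product_types": ["ondemand"],
        "rates": [
            {"id": 50000.1, "zone_id": "1001", "description": "1 Hour", "price": "1.50"},
            {"id": 50001.1, "zone_id": "1001", "description": "4 Hours", "price": "3.00"},
        ],
    },
    {
        "zone_id": "1002",  # hypothetical second zone, not from the question
        "product_types": ["ondemand", "reserve"],
        "rates": [
            {"id": 60000.1, "zone_id": "1002", "description": "1 Hour", "price": "2.00"},
        ],
    },
]

# One row per rate entry; columns come from the keys of the rate dicts
rates = pd.json_normalize(data, record_path="rates")

# One row per (zone_id, product type) pair instead of a list in a cell
product_types = (
    pd.DataFrame(
        [{"zone_id": row["zone_id"], "product_types": row["product_types"]} for row in data]
    )
    .explode("product_types")
    .reset_index(drop=True)
)
```

This assumes every record has a `rates` list; records where the key is missing or null would need to be filtered out or defaulted to an empty list first.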
    

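    【Comments】:

    • Thanks for the solution. While I check whether it works for me, I want to mention that the column names in each individual dataframe are not a fixed list: some JSON files have more keys than others, and some do not have all the keys. How would I handle dynamic column names?
    • Also, since the keys in each list/dict may or may not be consistent across the JSON files, I would like to take the dataframe column names from the JSON itself rather than hard-coding them. Is that possible?
    • The solution you provided does not seem to produce what I am looking for.
    • @cyrus24, so you want a dynamic format that magically converts into something structured? No, that is not possible. You have to specify the format of the dataframe at some point. You can have it include all fields from one part of the JSON (**row["company"]). Please give an example of the output you are looking for; as far as I can tell, I did exactly what you asked.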
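
Building on the `**row[...]` idea from the last comment, one way the table and column names could be taken from the JSON itself is to walk each record and collect rows per nested key. This is only a rough sketch: the helpers `split_record` and `build_tables` are hypothetical names, not pandas functions, and the second record is invented to show keys that differ between files. It assumes every record has a top-level `zone_id`.

```python
import pandas as pd


def split_record(record, key="zone_id"):
    """Return {table_name: [row, ...]} for every nested dict or list in one record."""
    key_value = record[key]
    tables = {}

    def walk(name, value):
        if isinstance(value, dict):
            # Scalar fields of this dict form one row of the table called `name`
            scalars = {k: v for k, v in value.items() if not isinstance(v, (dict, list))}
            if scalars:
                tables.setdefault(name, []).append({key: key_value, **scalars})
            # Deeper dicts/lists get their own dotted table name
            for k, v in value.items():
                if isinstance(v, (dict, list)):
                    walk(f"{name}.{k}", v)
        elif isinstance(value, list):
            column = name.split(".")[-1]
            for item in value:
                if isinstance(item, (dict, list)):
                    walk(name, item)
                else:
                    # List of scalars: one row per element
                    tables.setdefault(name, []).append({key: key_value, column: item})

    for k, v in record.items():
        if isinstance(v, (dict, list)):
            walk(k, v)
    return tables


def build_tables(records, key="zone_id"):
    """Combine the rows of all records into one DataFrame per table name."""
    rows = {}
    for record in records:
        for name, new_rows in split_record(record, key).items():
            rows.setdefault(name, []).extend(new_rows)
    # Keys missing from some rows simply become NaN in that column
    return {name: pd.DataFrame(table_rows) for name, table_rows in rows.items()}


records = [
    {
        "zone_id": "1001",
        "address": {"city": "Niagara Falls", "country_code": "US"},
        "financial": {"currency": {"code": "USA"}},
        "product_types": ["ondemand"],
        "rates": [{"id": 50000.1, "zone_id": "1001", "price": "1.50"}],
        "reservation_configuration": None,
    },
    # Hypothetical record with a different set of keys
    {"zone_id": "1002", "address": {"city": "Buffalo", "state": "NY"}, "rates": []},
]

tables = build_tables(records)
```

Here `tables` is a dict keyed by names such as `"address"` or `"financial.currency"`. The column set of each dataframe is the union of the keys seen across all records, so nothing is hard-coded, at the cost of columns that may be mostly empty.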