Pandas deep nested JSON with selective key:value for columns

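【Question title】: Pandas deep nested json with selective key:value for columns
【Posted】: 2020-11-07 04:30:07
【Question】:

I am trying to create a DataFrame from the deeply nested AWS pricing API. When I restrict it to the first-level key "terms" and then the second-level key "OnDemand", I get the sku as the index and a single OnDemand column whose cells hold several levels of nested JSON/dicts. Here is the code and the output: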

import requests
import json
import os
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import flatten_json


ec2_url = requests.get("https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-east-1/index.json")
ec2_dict = json.loads(ec2_url.text)

df_init_terms = pd.DataFrame(ec2_dict['terms'])
df_init_terms 
#print(df_init_terms.values)
df_init_terms = df_init_terms.drop(['Reserved'], axis = 1) 

df_dropna = df_init_terms.dropna()
df_dropna1 = df_dropna[:1000]
df_init_terms.values

Output:

  array([[{'QUMEF4UK3NPT4MN3.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'QUMEF4UK3NPT4MN3', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7', 'description': '$0.376 per Unused Reservation Windows c3.xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '0.3760000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           [{'DBCQPZ6Z853WRE98.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'DBCQPZ6Z853WRE98', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7', 'description': '$3.586 per Unused Reservation RHEL r5d.12xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '3.5860000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           [{'MK44K7QNJQCC2E98.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'MK44K7QNJQCC2E98', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'MK44K7QNJQCC2E98.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'MK44K7QNJQCC2E98.JRTCKXETXF.6YS6EN2CT7', 'description': '$1.40 per Dedicated Linux with SQL Std m4.2xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '1.4000000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           ...,
           [nan],
           [nan],
           [nan]], dtype=object)

Output with head():

                                                          OnDemand
QUMEF4UK3NPT4MN3    {'QUMEF4UK3NPT4MN3.JRTCKXETXF': {'offerTermCod...
DBCQPZ6Z853WRE98    {'DBCQPZ6Z853WRE98.JRTCKXETXF': {'offerTermCod...
MK44K7QNJQCC2E98    {'MK44K7QNJQCC2E98.JRTCKXETXF': {'offerTermCod...
86MNM35KQ46XCFDQ    {'86MNM35KQ46XCFDQ.JRTCKXETXF': {'offerTermCod...
NCQF4R2S47SB2QE5    {'NCQF4R2S47SB2QE5.JRTCKXETXF': {'offerTermCod...

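How do I normalize the OnDemand column so that each sku becomes a row, with separate columns for effectiveDate, description and pricePerUnit (which is itself another dictionary, nested even deeper)? The desired result looks like this: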

       sku         effectiveDate         description          priceUnit
QUMEF4UK3NPT4MN3   2020-07-01T00:00:00Z  $0.376 per Unus...   $0.376
DBCQPZ6Z853WRE98   2020-07-01T00:00:00Z  $3.586 per Unuse...  $3.586
MK44K7QNJQCC2E98   ...and so on...

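Thanks in advance!

【Comments】:

Tags: json pandas dataframe nested

【Solution 1】:

You can normally use json_normalize for this kind of task. In this case it won't help, though, because the data structure is a dict of dicts of dicts and so on, so I'm not sure this is possible without iterative preprocessing. Just an example: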

import pandas as pd
import requests


def load_terms():
    url = 'your_url_here...'
    # requests can parse the JSON body with .json(), so json.loads is not needed
    # iterate over the per-sku dicts stored under terms -> OnDemand
    for terms in requests.get(url).json()['terms']['OnDemand'].values():  # type: dict
        # each offer term carries the sku and effectiveDate
        for term in terms.values():  # type: dict
            # each price dimension carries the description and pricePerUnit
            for dimensions in term['priceDimensions'].values():  # type: dict
                for currency_key, price in dimensions['pricePerUnit'].items():  # type: str, str
                    # generate one row per price as a dict for the DataFrame
                    yield {
                        'sku': term['sku'],
                        'effectiveDate': term['effectiveDate'],
                        'description': dimensions['description'],
                        # don't know the price details, so only USD gets a '$' prefix
                        'priceUnit': '$' + price if currency_key == 'USD' else price,
                    }


df = pd.DataFrame(list(load_terms()))
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df.head())

#                 sku         effectiveDate                                        description       priceUnit
# 0  QUMEF4UK3NPT4MN3  2020-07-01T00:00:00Z  $0.376 per Unused Reservation Windows c3.xlarg...   $0.3760000000
# 1  DBCQPZ6Z853WRE98  2020-07-01T00:00:00Z  $3.586 per Unused Reservation RHEL r5d.12xlarg...   $3.5860000000
# 2  MK44K7QNJQCC2E98  2020-07-01T00:00:00Z  $1.40 per Dedicated Linux with SQL Std m4.2xla...   $1.4000000000
# 3  86MNM35KQ46XCFDQ  2020-07-01T00:00:00Z  $48.432 per Dedicated Unused Reservation Windo...  $48.4320000000
# 4  NCQF4R2S47SB2QE5  2020-07-01T00:00:00Z  $7.336 per On Demand Linux with SQL Server Ent...   $7.3360000000
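
To try the idea without downloading the full price list, here is a minimal offline sketch on a two-sku excerpt. It is a variation on the generator above, not a replacement for it. The names `sample_on_demand` and `iter_price_rows` and the separate `currency` column are my own assumptions for illustration. The sample values are trimmed from the output printed in the question. The sketch first shows why `pd.json_normalize` alone does not help here (everything ends up in one wide row, because the sku values are dict keys rather than records), then flattens the dicts into one row per price and converts the price and date strings to proper dtypes.

```python
import pandas as pd

# Hypothetical excerpt shaped like ec2_dict['terms']['OnDemand'];
# values are trimmed from the output shown in the question.
sample_on_demand = {
    "QUMEF4UK3NPT4MN3": {
        "QUMEF4UK3NPT4MN3.JRTCKXETXF": {
            "offerTermCode": "JRTCKXETXF",
            "sku": "QUMEF4UK3NPT4MN3",
            "effectiveDate": "2020-07-01T00:00:00Z",
            "priceDimensions": {
                "QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7": {
                    "description": "$0.376 per Unused Reservation Windows c3.xlarge Instance Hour",
                    "unit": "Hrs",
                    "pricePerUnit": {"USD": "0.3760000000"},
                }
            },
            "termAttributes": {},
        }
    },
    "DBCQPZ6Z853WRE98": {
        "DBCQPZ6Z853WRE98.JRTCKXETXF": {
            "offerTermCode": "JRTCKXETXF",
            "sku": "DBCQPZ6Z853WRE98",
            "effectiveDate": "2020-07-01T00:00:00Z",
            "priceDimensions": {
                "DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7": {
                    "description": "$3.586 per Unused Reservation RHEL r5d.12xlarge Instance Hour",
                    "unit": "Hrs",
                    "pricePerUnit": {"USD": "3.5860000000"},
                }
            },
            "termAttributes": {},
        }
    },
}

# json_normalize treats the whole dict as ONE record: the sku keys become
# part of the column names, so the result is a single, very wide row.
wide = pd.json_normalize(sample_on_demand)
print(wide.shape)


def iter_price_rows(on_demand):
    """Yield one flat dict per (offer term, price dimension, currency)."""
    for offers in on_demand.values():
        for term in offers.values():
            for dimension in term["priceDimensions"].values():
                for currency, price in dimension["pricePerUnit"].items():
                    yield {
                        "sku": term["sku"],
                        "effectiveDate": term["effectiveDate"],
                        "description": dimension["description"],
                        "currency": currency,
                        "pricePerUnit": price,
                    }


df = pd.DataFrame(list(iter_price_rows(sample_on_demand)))
# The API delivers prices and dates as strings; convert them so they can
# be sorted, filtered and aggregated.
df["pricePerUnit"] = pd.to_numeric(df["pricePerUnit"])
df["effectiveDate"] = pd.to_datetime(df["effectiveDate"])
print(df)
```

Keeping the currency in its own column, rather than prefixing the price with '$', leaves pricePerUnit numeric. Whether that is preferable depends on what you want to do with the prices afterwards.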

【Comments】: