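【Question title】: Pandas deep nested json with selective key:value for columns
【Posted】: 2020-11-07 04:30:07
【Question description】:

I am trying to create a DataFrame from the deeply nested AWS pricing API. After restricting it to the first-level key "terms" and the second-level key "OnDemand", I get the sku as the index and an OnDemand column that holds multiple nested JSON objects/dicts. Here is the code and the output: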

import requests
import json
import os
import pandas as pd 
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import flatten_json


ec2_url = requests.get("https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-east-1/index.json")
ec2_dict = json.loads(ec2_url.text)

df_init_terms = pd.DataFrame(ec2_dict['terms'])
df_init_terms 
#print(df_init_terms.values)
df_init_terms = df_init_terms.drop(['Reserved'], axis = 1) 

df_dropna = df_init_terms.dropna()
df_dropna1 = df_dropna[:1000]
df_init_terms.values 

Output:

  array([[{'QUMEF4UK3NPT4MN3.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'QUMEF4UK3NPT4MN3', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7', 'description': '$0.376 per Unused Reservation Windows c3.xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '0.3760000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           [{'DBCQPZ6Z853WRE98.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'DBCQPZ6Z853WRE98', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7', 'description': '$3.586 per Unused Reservation RHEL r5d.12xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '3.5860000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           [{'MK44K7QNJQCC2E98.JRTCKXETXF': {'offerTermCode': 'JRTCKXETXF', 'sku': 'MK44K7QNJQCC2E98', 'effectiveDate': '2020-07-01T00:00:00Z', 'priceDimensions': {'MK44K7QNJQCC2E98.JRTCKXETXF.6YS6EN2CT7': {'rateCode': 'MK44K7QNJQCC2E98.JRTCKXETXF.6YS6EN2CT7', 'description': '$1.40 per Dedicated Linux with SQL Std m4.2xlarge Instance Hour', 'beginRange': '0', 'endRange': 'Inf', 'unit': 'Hrs', 'pricePerUnit': {'USD': '1.4000000000'}, 'appliesTo': []}}, 'termAttributes': {}}}],
           ...,
           [nan],
           [nan],
           [nan]], dtype=object)

Output with head():

                                                          OnDemand
QUMEF4UK3NPT4MN3    {'QUMEF4UK3NPT4MN3.JRTCKXETXF': {'offerTermCod...
DBCQPZ6Z853WRE98    {'DBCQPZ6Z853WRE98.JRTCKXETXF': {'offerTermCod...
MK44K7QNJQCC2E98    {'MK44K7QNJQCC2E98.JRTCKXETXF': {'offerTermCod...
86MNM35KQ46XCFDQ    {'86MNM35KQ46XCFDQ.JRTCKXETXF': {'offerTermCod...
NCQF4R2S47SB2QE5    {'NCQF4R2S47SB2QE5.JRTCKXETXF': {'offerTermCod...
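
For reference, the shape shown above (sku as index, one dict per cell, and `nan` rows) is what `pd.DataFrame` produces for a dict of dicts: the outer keys become columns and the inner keys become the index. A toy sketch, assuming a payload laid out like `ec2_dict['terms']`; `SKU_A`, `SKU_B` and the `.TERM` suffix are made-up placeholders, not real AWS identifiers:

```python
import pandas as pd

# Placeholder stand-in for ec2_dict['terms']; the keys are invented for illustration.
sample_terms = {
    "OnDemand": {"SKU_A": {"SKU_A.TERM": {"sku": "SKU_A"}}},
    "Reserved": {"SKU_B": {"SKU_B.TERM": {"sku": "SKU_B"}}},
}

# Outer keys -> columns, inner keys (skus) -> index, inner values stay as dicts.
df = pd.DataFrame(sample_terms)

# A sku that only exists under "Reserved" gets NaN in the "OnDemand" column,
# so NaN rows remain after the "Reserved" column is dropped.
on_demand = df.drop(["Reserved"], axis=1)
print(on_demand)
```

This would explain why `dropna()` is needed before the remaining dicts can be processed further.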

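How can I normalize the OnDemand column so that each sku becomes its own row, with separate columns for effectiveDate, description, and pricePerUnit? These values sit in further, deeply nested dicts: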

       sku         effectiveDate         description          priceUnit
QUMEF4UK3NPT4MN3   2020-07-01T00:00:00Z  $0.376 per Unus...   $0.376
DBCQPZ6Z853WRE98   2020-07-01T00:00:00Z  $3.586 per Unuse...  $3.586
MK44K7QNJQCC2E98   ...and so on...

Thanks in advance!

【Question comments】:

    Tags: json pandas dataframe nested


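    【Solution 1】:

    You can use json_normalize for this kind of task. In this case, however, it will not help on its own, because the data structure is a dict of dicts of dicts and so on, so I am not sure this is possible without iterative preprocessing. Just an example: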

    import pandas as pd
    import requests


    def load_terms():
        url = 'your_url_here...'
        # you can parse json using .json() - without json.loads
        # iterate by each OnDemand record inside terms
        for terms in requests.get(url).json()['terms']['OnDemand'].values():  # type: dict
            # generate each row as dict for df
            for _, term in terms.items():  # type: str, dict
                for _, dimensions in term['priceDimensions'].items():  # type: str, dict
                    for currency_key, price in dimensions['pricePerUnit'].items():  # type: str, str
                        yield {
                            'sku': term['sku'],
                            'effectiveDate': term['effectiveDate'],
                            'description': dimensions['description'],
                            # don't know prices details...
                            'priceUnit': '$' + price if currency_key == 'USD' else price,
                        }
    
    
    df = pd.DataFrame(list(load_terms()))
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    print(df.head())
    
    #                 sku         effectiveDate                                        description       priceUnit
    # 0  QUMEF4UK3NPT4MN3  2020-07-01T00:00:00Z  $0.376 per Unused Reservation Windows c3.xlarg...   $0.3760000000
    # 1  DBCQPZ6Z853WRE98  2020-07-01T00:00:00Z  $3.586 per Unused Reservation RHEL r5d.12xlarg...   $3.5860000000
    # 2  MK44K7QNJQCC2E98  2020-07-01T00:00:00Z  $1.40 per Dedicated Linux with SQL Std m4.2xla...   $1.4000000000
    # 3  86MNM35KQ46XCFDQ  2020-07-01T00:00:00Z  $48.432 per Dedicated Unused Reservation Windo...  $48.4320000000
    # 4  NCQF4R2S47SB2QE5  2020-07-01T00:00:00Z  $7.336 per On Demand Linux with SQL Server Ent...   $7.3360000000
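
A possible variation on the generator above, as a sketch rather than a definitive implementation: once the loops have stripped the dynamic keys (the `sku.offerTermCode` and rate-code levels), the remaining records are regular enough for `pd.json_normalize` to flatten `pricePerUnit`. The `sample_terms` dict is a hand-trimmed stand-in for the real payload, built from two entries of the question's output, and the helper name `iter_price_records` is hypothetical.

```python
import pandas as pd

# Hand-trimmed stand-in for ec2_dict['terms'], based on the question's output.
sample_terms = {
    "OnDemand": {
        "QUMEF4UK3NPT4MN3": {
            "QUMEF4UK3NPT4MN3.JRTCKXETXF": {
                "offerTermCode": "JRTCKXETXF",
                "sku": "QUMEF4UK3NPT4MN3",
                "effectiveDate": "2020-07-01T00:00:00Z",
                "priceDimensions": {
                    "QUMEF4UK3NPT4MN3.JRTCKXETXF.6YS6EN2CT7": {
                        "description": "$0.376 per Unused Reservation Windows c3.xlarge Instance Hour",
                        "unit": "Hrs",
                        "pricePerUnit": {"USD": "0.3760000000"},
                    }
                },
            }
        },
        "DBCQPZ6Z853WRE98": {
            "DBCQPZ6Z853WRE98.JRTCKXETXF": {
                "offerTermCode": "JRTCKXETXF",
                "sku": "DBCQPZ6Z853WRE98",
                "effectiveDate": "2020-07-01T00:00:00Z",
                "priceDimensions": {
                    "DBCQPZ6Z853WRE98.JRTCKXETXF.6YS6EN2CT7": {
                        "description": "$3.586 per Unused Reservation RHEL r5d.12xlarge Instance Hour",
                        "unit": "Hrs",
                        "pricePerUnit": {"USD": "3.5860000000"},
                    }
                },
            }
        },
    }
}


def iter_price_records(terms, term_type="OnDemand"):
    """Yield one dict per price dimension, with the dynamic keys removed."""
    for offers in terms[term_type].values():
        for offer in offers.values():
            for dimension in offer["priceDimensions"].values():
                yield {
                    "sku": offer["sku"],
                    "effectiveDate": offer["effectiveDate"],
                    **dimension,
                }


# json_normalize expands the nested pricePerUnit dict into "pricePerUnit.USD".
flat = pd.json_normalize(list(iter_price_records(sample_terms)))
flat = flat.rename(columns={"pricePerUnit.USD": "priceUnit"})
flat["priceUnit"] = pd.to_numeric(flat["priceUnit"])
print(flat[["sku", "effectiveDate", "description", "priceUnit"]])
```

Keeping `priceUnit` numeric instead of prefixing it with `$` is a design choice of this sketch; it makes sorting and arithmetic on prices possible, at the cost of assuming every record is priced in USD.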
    

    【Comments】:
