[Question title]: Parsing JSON strings from API with Pandas
[Posted]: 2019-09-02 15:14:41
[Question]:

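I am trying to extract the JSON keys and associated data nested in an API response into a Pandas DataFrame, with each key and its data element as a separate column.

I have already tried the solution here: Parsing a JSON string which was loaded from a CSV using Pandas

But there are two problems. First, I have to convert the API response to CSV and then from CSV back to a DataFrame, which seems like a wasted step, though I am willing to do it if it works.

Second, even when I do that, I get "JSONDecodeError: Expecting property name enclosed in double quotes."

I also tried the solution described in this tutorial, without success: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas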

import requests
from pandas import DataFrame
import pandas as pd
import json

teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'

teamgamedataresp = requests.get(teamgamedata_url)

dftg = DataFrame(teamgamedataresp.json())

This works, but it produces a 'teams' column containing a lot of nested data. So, to follow the CSV suggestion, I converted to CSV:

dftg.to_csv(r'/path/teamgameinfoapi.csv')

def CustomParser(data):
    j1 = json.loads(data)
    return j1

csvtodf = pd.read_csv('/path/teamgameinfoapi.csv', 
                      converters={'teams':CustomParser}, header=0)

csvtodf[sorted(csvtodf['teams'][0].keys())] = csvtodf['teams'].apply(pd.Series)

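I expected CustomParser to extract the JSON data into separate columns, but got:

JSONDecodeError: Expecting property name enclosed in double quotes

I expected the last line of code to append the columns to the DataFrame, but instead got:

KeyError: 'teams'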
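
A likely cause of the JSONDecodeError, sketched under assumptions: when a DataFrame cell holds a list of dicts, `to_csv` writes its Python string form, which uses single quotes and is therefore not valid JSON. The small `teams` value below is a made-up stand-in for one cell of the 'teams' column. `ast.literal_eval` is shown as one possible way to read such a cell back; it is not part of the original question.

```python
import ast
import json

# Hypothetical stand-in for one cell of the 'teams' column.
teams = [{"school": "Vanderbilt", "points": 6}]

# The Python string form of the cell uses single quotes.
cell = str(teams)

# json.loads requires double-quoted property names, so it rejects the text.
try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError as exc:
    json_ok = False
    print("json.loads failed:", exc)

# ast.literal_eval parses Python literals, so it accepts the single-quoted form.
parsed = ast.literal_eval(cell)

# Serializing with json.dumps instead produces text that json.loads accepts.
roundtrip = json.loads(json.dumps(teams))
```

If the CSV detour is kept, passing `ast.literal_eval` as the converter in place of `CustomParser` may be enough, but skipping the CSV step and working from `teamgamedataresp.json()` directly avoids the issue altogether.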

[Comments]:

    Tags: python json pandas csv


    [Solution 1]:

    An efficient and more pandas-idiomatic way to do this

    On pandas >= 0.25.1

    teamgamedataresp = requests.get(teamgamedata_url)
    
    d = teamgamedataresp.json()
    
    # errors='ignore' is used because some records may lack an 'id'; without it, a missing key would raise an error
    teams_df = pd.io.json.json_normalize(d, ['teams'], ['id'], errors='ignore')
    print(teams_df)
    
    teams_df = teams_df.explode('stats')
    print(teams_df)
    
    stats_df = pd.io.json.json_normalize(teams_df['stats'])
    print(stats_df)
    
    teams_df.drop(columns=['stats'], inplace=True)
    data = pd.concat([teams_df.reset_index(drop=True), stats_df.reset_index(drop=True)], axis=1)
    print(data)
    

    TL;DR (data shown for ease of understanding)

    Normalize the first-level records, i.e. teams

            school         conference homeAway  points                                              stats         id
    0       Vanderbilt                SEC     home       6  [{'category': 'rushingTDs', 'stat': '0'}, {'ca...  401110732
    1          Georgia                SEC     away      30  [{'category': 'rushingTDs', 'stat': '2'}, {'ca...  401110732
    2            Miami                ACC     home      20  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
    3          Florida                SEC     away      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
    4    Georgia State           Sun Belt     away      38  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401110730
    ..             ...                ...      ...     ...                                                ...        ...
    163           Navy  American Athletic     home      45  [{'category': 'rushingTDs', 'stat': '6'}, {'ca...  401117857
    164   Gardner-Webb               None     away      28  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401135910
    165      Charlotte     Conference USA     home      49  [{'category': 'rushingTDs', 'stat': '4'}, {'ca...  401135910
    166  Alabama State               None     away      19  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237
    167            UAB     Conference USA     home      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237
    
    [168 rows x 6 columns]
    

    Explode the lists in the stats column into rows

             school      conference homeAway  points                                           stats         id
    0    Vanderbilt             SEC     home       6         {'category': 'rushingTDs', 'stat': '0'}  401110732
    0    Vanderbilt             SEC     home       6         {'category': 'passingTDs', 'stat': '0'}  401110732
    0    Vanderbilt             SEC     home       6   {'category': 'kickReturnYards', 'stat': '35'}  401110732
    0    Vanderbilt             SEC     home       6      {'category': 'kickReturnTDs', 'stat': '0'}  401110732
    0    Vanderbilt             SEC     home       6        {'category': 'kickReturns', 'stat': '2'}  401110732
    ..          ...             ...      ...     ...                                             ...        ...
    167         UAB  Conference USA     home      24  {'category': 'netPassingYards', 'stat': '114'}  401114237
    167         UAB  Conference USA     home      24       {'category': 'totalYards', 'stat': '290'}  401114237
    167         UAB  Conference USA     home      24    {'category': 'fourthDownEff', 'stat': '0-1'}  401114237
    167         UAB  Conference USA     home      24    {'category': 'thirdDownEff', 'stat': '1-13'}  401114237
    167         UAB  Conference USA     home      24        {'category': 'firstDowns', 'stat': '16'}  401114237
    
    [3927 rows x 6 columns]
    

    Normalize the stats column to get a DataFrame

                 category  stat
    0          rushingTDs     0
    1          passingTDs     0
    2     kickReturnYards    35
    3       kickReturnTDs     0
    4         kickReturns     2
    ...               ...   ...
    3922  netPassingYards   114
    3923       totalYards   290
    3924    fourthDownEff   0-1
    3925     thirdDownEff  1-13
    3926       firstDowns    16
    
    [3927 rows x 2 columns]
    

    Finally, concatenate the two DataFrames.

              school      conference homeAway  points         id         category  stat
    0     Vanderbilt             SEC     home       6  401110732       rushingTDs     0
    1     Vanderbilt             SEC     home       6  401110732       passingTDs     0
    2     Vanderbilt             SEC     home       6  401110732  kickReturnYards    35
    3     Vanderbilt             SEC     home       6  401110732    kickReturnTDs     0
    4     Vanderbilt             SEC     home       6  401110732      kickReturns     2
    ...          ...             ...      ...     ...        ...              ...   ...
    3922         UAB  Conference USA     home      24  401114237  netPassingYards   114
    3923         UAB  Conference USA     home      24  401114237       totalYards   290
    3924         UAB  Conference USA     home      24  401114237    fourthDownEff   0-1
    3925         UAB  Conference USA     home      24  401114237     thirdDownEff  1-13
    3926         UAB  Conference USA     home      24  401114237       firstDowns    16
    
    [3927 rows x 7 columns]
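
The four steps above can be tried offline on a small hand-written sample shaped like the API response. This is a sketch under assumptions: the `games` list is made up from a few of the values shown above, and it calls the top-level `pd.json_normalize`, which is where pandas 1.0 and later expose the function. On 0.25.x, `pd.io.json.json_normalize` as used in the answer is the equivalent call.

```python
import pandas as pd

# Hand-written sample mimicking the shape of the API response (an assumption).
games = [
    {
        "id": 401110732,
        "teams": [
            {"school": "Vanderbilt", "conference": "SEC", "homeAway": "home", "points": 6,
             "stats": [{"category": "rushingTDs", "stat": "0"},
                       {"category": "kickReturns", "stat": "2"}]},
            {"school": "Georgia", "conference": "SEC", "homeAway": "away", "points": 30,
             "stats": [{"category": "rushingTDs", "stat": "2"}]},
        ],
    },
]

# Step 1: one row per team, carrying the game id along as metadata.
teams_df = pd.json_normalize(games, record_path=["teams"], meta=["id"], errors="ignore")

# Step 2: one row per (team, stat) pair.
teams_df = teams_df.explode("stats")

# Step 3: turn each stat dict into 'category' and 'stat' columns.
stats_df = pd.json_normalize(teams_df["stats"].tolist())

# Step 4: drop the nested column and glue the two frames side by side.
teams_df = teams_df.drop(columns=["stats"]).reset_index(drop=True)
data = pd.concat([teams_df, stats_df.reset_index(drop=True)], axis=1)
print(data)
```

Resetting both indexes before `pd.concat` matters here: after `explode`, the team rows repeat their original index label, so aligning on the index without a reset would not pair rows one to one.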
    

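    [Discussion]:

    • Thank you. I'll work through this after I finish with the original response, since I want a clear picture of what is happening in each one so I can develop my own skills.
    • I'll help. I'll update my answer.
    • Very helpful, @Vishnudev.
    • If you found it helpful, please encourage by upvoting/accepting it as the answer. @janalytics
    • In this particular case, the true first-level record is probably the game, as defined by 'id'. But I also see the value in being able to define things by other record levels. Depending on the use case, that can be easier to read visually. I guess that is where learning Pandas matters: developing the ability to structure DataFrames in different ways depending on the goal.
    [Solution 2]:

    I would rather drill down manually, since the structure is complex: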

    a = teamgamedataresp.json()
    buf = []
    for game in a:
        for team in game['teams']:
            game_dict = dict(id=game['id'])
            for cat in ('school', 'conference', 'homeAway', 'points'):
                game_dict[cat] = team[cat]
            for stat in team['stats']:
                game_dict[stat['category']] = stat['stat']
            buf.append(game_dict)
    
    >>> df = pd.DataFrame(buf)
    >>> df
                id         school         conference homeAway  points  ... puntReturnTDs puntReturns interceptionYards interceptionTDs passesIntercepted
    0    401110732     Vanderbilt                SEC     home       6  ...           NaN         NaN               NaN             NaN               NaN
    1    401110732        Georgia                SEC     away      30  ...             0           4               NaN             NaN               NaN
    2    401110723          Miami                ACC     home      20  ...             0           1                41               0                 2
    3    401110723        Florida                SEC     away      24  ...             0           3               NaN             NaN               NaN
    4    401110730  Georgia State           Sun Belt     away      38  ...           NaN         NaN                 0               0                 1
    ..         ...            ...                ...      ...     ...  ...           ...         ...               ...             ...               ...
    163  401117857           Navy  American Athletic     home      45  ...             0           1               NaN             NaN               NaN
    164  401135910   Gardner-Webb               None     away      28  ...           NaN         NaN                45               0                 3
    165  401135910      Charlotte     Conference USA     home      49  ...             1           3               NaN             NaN               NaN
    166  401114237  Alabama State               None     away      19  ...             0           2               NaN             NaN               NaN
    167  401114237            UAB     Conference USA     home      24  ...             0           2                 0               0                 1
    
    [168 rows x 31 columns]
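
Column order from a list of dicts can vary with the pandas and Python versions in use, so it may help to set it explicitly. The following is a minimal sketch of the same drill-down on a hand-written sample (the `games` list and the names `fixed` and `stat_cols` are assumptions for illustration), followed by an explicit reordering that puts the identifying columns first and the stat columns after them in alphabetical order.

```python
import pandas as pd

# Hand-written sample mimicking the shape of the API response (an assumption).
games = [
    {
        "id": 401110732,
        "teams": [
            {"school": "Vanderbilt", "conference": "SEC", "homeAway": "home", "points": 6,
             "stats": [{"category": "rushingTDs", "stat": "0"}]},
            {"school": "Georgia", "conference": "SEC", "homeAway": "away", "points": 30,
             "stats": [{"category": "rushingTDs", "stat": "2"},
                       {"category": "kickReturns", "stat": "4"}]},
        ],
    },
]

rows = []
for game in games:
    for team in game["teams"]:
        row = {"id": game["id"]}                 # fresh dict for every team
        for key in ("school", "conference", "homeAway", "points"):
            row[key] = team[key]                 # copy the flat fields
        for stat in team["stats"]:
            row[stat["category"]] = stat["stat"] # one column per stat category
        rows.append(row)

df = pd.DataFrame(rows)

# Pin the column order instead of relying on version-dependent defaults.
fixed = ["id", "school", "conference", "homeAway", "points"]
stat_cols = sorted(c for c in df.columns if c not in fixed)
df = df[fixed + stat_cols]
print(df)
```

Teams that lack a given stat category get NaN in that column, which matches the NaN cells in the output above.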
    

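    [Discussion]:

    • Hmm. Did you take a step to reorder the columns? Mine come out in a different order. Also, if you'll forgive a newbie question, how do you get your code to display that way (rather than the raw form mine appears in)?
    • No, I didn't. Could it be the pandas version? Mine is 0.25.1 in a Linux Anaconda Python 3.7 environment. For code formatting, you have to select the text in the edit box and mark it as code. As for the pandas version, I meant the Python version: since Python 3.6, all dicts are ordered, meaning they preserve the insertion order of their keys.
    • I'm on 0.24.2 in a Mac Anaconda Python 7 environment. It seems to want to order the columns alphabetically, which is a bit annoying. Your solution works great, but since I'm learning, I want to confirm my understanding of how you attacked it. I'll try to post the commented code back using the select-and-mark method. Thanks for your patience.
    • Due to length limits, I had to answer my own question.
    [Solution 3]:

    I couldn't fit this into a comment; I apologize for the breach of protocol. Hopefully I followed you, @crayxt. Let me know if I've misunderstood.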

    a = teamgamedataresp.json()
    #defining a as the response to our get request for this data, in JSON format 
    buf = []
    #Not sure what this means, but I think we're creating a bucket where a new dict can go
    for game in a:
        #defining game as the key unit for evaluation in our desired dataset
        for team in game['teams']:
            #telling python that for each unique game we want to identify each
            #team identified by the 'teams' key, and  . . . 
            game_dict = dict(id=game['id'])
            #tells python that our key unit, game, is tied to the game ID#
            for cat in ('school', 'conference', 'homeAway', 'points'):
                #defining the categories where there is no further nesting to explore
                game_dict[cat]=team[cat]
                #we're making sure that python knows that the categories used for the game
                #should be the same one used for the team.
            for stat in team['stats']:
                game_dict[stat['category']] = stat['stat']
                #now we need to dive into the further nesting in the stats category.
                #each team's statistics should be labeled with the JSON "category" key
                #and given a value equal to the "stat" value key given in JSON
            buf.append(game_dict)
            #adds the newly created stat columns to the dataset
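
On the `buf` question in the comments above: it is a plain Python list that collects one dict per team, and each `buf.append(game_dict)` adds one future row rather than new columns. A minimal sketch with made-up values (the ids, school names, and stats below are hypothetical) shows what `pd.DataFrame` does with such a list.

```python
import pandas as pd

buf = []  # empty list; each appended dict becomes one DataFrame row

# Hypothetical rows: the second dict has a key the first one lacks.
buf.append({"id": 1, "school": "A", "rushingTDs": "0"})
buf.append({"id": 1, "school": "B", "rushingTDs": "2", "kickReturns": "4"})

# One row per dict; the columns are the union of all keys seen,
# and keys missing from a given dict are filled with NaN.
df = pd.DataFrame(buf)
print(df)
```

This is also why `game_dict` is created inside the `for team` loop: each team needs its own dict, otherwise every entry in `buf` would point at the same object.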
    

    [Discussion]:
