【Title】: Faster way to make pandas Multiindex dataframe than append
【Posted】: 2021-04-10 20:57:43
【Question】:

I am looking for a faster way to load data from my JSON object into a multi-index DataFrame.

My JSON looks like this:

    {
        "1990-1991": {
            "Cleveland": {
                "salary": "$14,403,000",
                "players": {
                    "Hot Rod Williams": "$3,785,000",
                    "Danny Ferry": "$2,640,000",
                    "Mark Price": "$1,400,000",
                    "Brad Daugherty": "$1,320,000",
                    "Larry Nance": "$1,260,000",
                    "Chucky Brown": "$630,000",
                    "Steve Kerr": "$548,000",
                    "Derrick Chievous": "$525,000",
                    "Winston Bennett": "$525,000",
                    "John Morton": "$350,000",
                    "Milos Babic": "$200,000",
                    "Gerald Paddio": "$120,000",
                    "Darnell Valentine": "$100,000",
                    "Henry James": "$75,000"
                },
                "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/"
            },

I am building the DataFrame like this:

    df = pd.DataFrame(columns=["year", "team", "player", "salary"])
    
    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            for player in nbaSalaryData[year][team]['players']:
                df = df.append({
                        "year": year,
                        "team": team,
                        "player": player,
                        "salary": nbaSalaryData[year][team]['players'][player]
                    }, ignore_index=True)
    
    df = df.set_index(['year', 'team', 'player']).sort_index()
    df

Result:

                                              salary 
    year       team     player
    1990-1991  Atlanta  Doc Rivers          $895,000
                        Dominique Wilkins   $2,065,000
                        Gary Leonard        $200,000
                        John Battle         $590,000
                        Kevin Willis        $685,000
    ... ... ... ...
    2020-2021   Washington  Robin Lopez     $7,300,000
                        Rui Hachimura       $4,692,840
                        Russell Westbrook   $41,358,814
                        Thomas Bryant       $8,333,333
                        Troy Brown          $3,372,840

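This is the shape I want: year, team, and player as the index, and salary as a column. I know that using append is slow, but I could not find an alternative. I tried building it with tuples (in a slightly different configuration, without players and salaries), but it did not work out.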

    tuples = []
    index = None

    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            t = nbaSalaryData[year][team]
            tuples.append((year, team))

    index = pd.MultiIndex.from_tuples(tuples, names=["year", "team"])
    df = index.to_frame()
    df

Which outputs:

                             year   team
    year    team        
    1990-1991   Cleveland   1990-1991   Cleveland
                New York    1990-1991   New York
                Detroit     1990-1991   Detroit
                LA Lakers   1990-1991   LA Lakers
                Atlanta     1990-1991   Atlanta  

I am not very familiar with pandas, but I realize there must be a faster way than append().

【Comments】:

    Tags: python pandas dataframe multi-index


    【Answer 1】:

    You can adapt the answer to a very similar question as follows:

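    import json
    import pandas as pd
    
    # json_data holds the JSON text shown in the question
    z = json.loads(json_data)
    
    # Flatten the nested dict into {(year, team, player): salary}
    out = pd.Series({
        (i,j,m): z[i][j][k][m]
        for i in z
        for j in z[i]
        for k in ['players']
        for m in z[i][j][k]
    }).to_frame('salary').rename_axis('year team player'.split())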
    
    # out:
    
                                               salary
    year      team      player                       
    1990-1991 Cleveland Hot Rod Williams   $3,785,000
                        Danny Ferry        $2,640,000
                        Mark Price         $1,400,000
                        Brad Daugherty     $1,320,000
                        Larry Nance        $1,260,000
                        Chucky Brown         $630,000
                        Steve Kerr           $548,000
                        Derrick Chievous     $525,000
                        Winston Bennett      $525,000
                        John Morton          $350,000
                        Milos Babic          $200,000
                        Gerald Paddio        $120,000
                        Darnell Valentine    $100,000
                        Henry James           $75,000
    

    Also, if you plan to do any numerical analysis on these salaries, you probably want them as numbers rather than strings. If so, also consider:

    out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))
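
    A minimal sketch of that conversion on a small frame, assuming pandas 2.x, where `str.replace` treats the pattern as a literal string unless `regex=True` is passed. The three-row sample and the `team_totals` name are illustrative only and not part of the original answer.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the question's salary strings
out = pd.DataFrame(
    {"salary": ["$3,785,000", "$2,640,000", "$75,000"]},
    index=pd.MultiIndex.from_tuples(
        [
            ("1990-1991", "Cleveland", "Hot Rod Williams"),
            ("1990-1991", "Cleveland", "Danny Ferry"),
            ("1990-1991", "Cleveland", "Henry James"),
        ],
        names=["year", "team", "player"],
    ),
)

# Strip every non-digit character, then convert the strings to integers.
# regex=True makes r"\D" a pattern rather than a literal two-character string.
out["salary"] = pd.to_numeric(out["salary"].str.replace(r"\D", "", regex=True))

# With numeric values, salaries can be aggregated per index level
team_totals = out.groupby(level=["year", "team"])["salary"].sum()
print(team_totals)
```

    Once the column is numeric, grouping by index level (as in `team_totals`) works without any further string handling.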
    

    PS: Explanation:

    The for lines are just one big comprehension that flattens the nested dict. To see how it works, first try:

    [
        (i,j)
        for i in z
        for j in z[i]
    ]
    

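    The third for would otherwise list all the keys of z[i][j], i.e. ['salary', 'players', 'url'], but we are only interested in 'players', so we say so explicitly.

    The last point is that we want a dict, not a list. Try the expression without the surrounding pd.Series() and you will see exactly what is going on.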
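
    As a sketch of that last suggestion, here is the comprehension on its own, run against a hypothetical two-player sample shaped like the question's JSON (the `flat` name and the trimmed sample are assumptions for illustration). The result is a plain dict keyed by (year, team, player) tuples; when such a dict is passed to pd.Series(), the tuple keys become the levels of a MultiIndex.

```python
# Hypothetical sample shaped like the question's JSON, trimmed to two players
z = {
    "1990-1991": {
        "Cleveland": {
            "salary": "$14,403,000",
            "players": {"Mark Price": "$1,400,000", "Steve Kerr": "$548,000"},
            "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/",
        }
    }
}

# Same comprehension as in the answer, without the pd.Series() wrapper
flat = {
    (i, j, m): z[i][j][k][m]
    for i in z                  # year
    for j in z[i]               # team
    for k in ["players"]        # only the 'players' key, skipping 'salary' and 'url'
    for m in z[i][j][k]         # player name
}
print(flat)
```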

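    【Comments】:

    • Wow, that is mind-bending to take in right now. That's a good writeup; I'll go through it in detail to try to understand how the series is built. This is really fast, especially compared to what I was doing. Thanks Pierre
    • Sure, I added a bit of explanation and how to work through it step by step to see what each piece does.
    【Answer 2】:

    We can create the DataFrames in a for loop and append them to a list, then concatenate once at the end. Deferring the concatenation to the end is much better than appending to a DataFrame inside the loop.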

    box = []
    # data refers to the shared json in the question
    for year, value in data.items():
        for team, players in value.items():
            content = players["players"]
            content = pd.DataFrame.from_dict(
                content, orient="index", columns=["salary"]
            ).rename_axis(index="player")
            content = content.assign(year=year, team=team)
            box.append(content)
    
    box
    
    [                       salary       year       team
     player                                             
     Hot Rod Williams   $3,785,000  1990-1991  Cleveland
     Danny Ferry        $2,640,000  1990-1991  Cleveland
     Mark Price         $1,400,000  1990-1991  Cleveland
     Brad Daugherty     $1,320,000  1990-1991  Cleveland
     Larry Nance        $1,260,000  1990-1991  Cleveland
     Chucky Brown         $630,000  1990-1991  Cleveland
     Steve Kerr           $548,000  1990-1991  Cleveland
     Derrick Chievous     $525,000  1990-1991  Cleveland
     Winston Bennett      $525,000  1990-1991  Cleveland
     John Morton          $350,000  1990-1991  Cleveland
     Milos Babic          $200,000  1990-1991  Cleveland
     Gerald Paddio        $120,000  1990-1991  Cleveland
     Darnell Valentine    $100,000  1990-1991  Cleveland
     Henry James           $75,000  1990-1991  Cleveland]
    

    Concatenate and reorder the index levels:

    (
        pd.concat(box)
        .set_index(["year", "team"], append=True)
        .reorder_levels(["year", "team", "player"])
    )
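
    A variation on the same "defer the work to the end" idea, sketched under the assumption that `data` has the shape shown in the question: collect plain tuples in a list and construct a single DataFrame at the end, so no pandas objects are created inside the loop. The `rows` and `info` names and the small sample are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample shaped like the question's JSON
data = {
    "1990-1991": {
        "Cleveland": {
            "players": {"Mark Price": "$1,400,000", "Steve Kerr": "$548,000"},
        },
        "Atlanta": {
            "players": {"Doc Rivers": "$895,000"},
        },
    }
}

# One plain tuple per player; appending to a Python list is cheap
rows = [
    (year, team, player, salary)
    for year, teams in data.items()
    for team, info in teams.items()
    for player, salary in info["players"].items()
]

# Build the frame once, then move the three key columns into the index
df = (
    pd.DataFrame(rows, columns=["year", "team", "player", "salary"])
    .set_index(["year", "team", "player"])
    .sort_index()
)
print(df)
```

    This keeps the structure of the question's original loop but replaces the repeated DataFrame append with a single constructor call.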
    

    【Comments】:

    • That's an interesting approach. I wouldn't have thought of it.