【Question title】: Merge a data frame with another and calculate a groupby percentage based on a specific condition
【Posted】: 2021-09-10 13:06:03
【Question description】:

I have two data frames, as shown below.

df1:

Sports     Expected_%
Cricket    70
Football   20
Tennis     10

df2:

Region    Sports     Count    Percentage     
North     Cricket    800      75                              
North     Football   50       5            
North     Tennis     150      20           
South     Cricket    1300     65           
South     Football   550      27.5         
South     Tennis     150      7.5  

    

Expected output:

Region    Sports     Count    Percentage   Expected_%     Expected_count    
North     Cricket    800      75           70             700
North     Football   50       5            20             200
North     Tennis     150      20           10             100
South     Cricket    1300     65           70             1400
South     Football   550      27.5         20             400
South     Tennis     150      7.5          10             200

Explanation:

Expected_% for Cricket = 70

Total Count for North = 1000

Expected_count for Cricket in North = 1000*70/100 = 700
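The arithmetic above can be reproduced by rebuilding the two tables as DataFrames. This is a minimal sketch that assumes the column names and values shown in the question; the names `north_total`, `cricket_pct`, and `cricket_expected` are illustrative only.

```python
import pandas as pd

# Data copied from the tables in the question.
df1 = pd.DataFrame({
    "Sports": ["Cricket", "Football", "Tennis"],
    "Expected_%": [70, 20, 10],
})
df2 = pd.DataFrame({
    "Region": ["North"] * 3 + ["South"] * 3,
    "Sports": ["Cricket", "Football", "Tennis"] * 2,
    "Count": [800, 50, 150, 1300, 550, 150],
    "Percentage": [75, 5, 20, 65, 27.5, 7.5],
})

# Total Count for North: 800 + 50 + 150.
north_total = df2.loc[df2["Region"] == "North", "Count"].sum()

# Expected_% for Cricket, looked up in df1.
cricket_pct = df1.loc[df1["Sports"] == "Cricket", "Expected_%"].iloc[0]

# Expected count for Cricket in North = regional total * expected % / 100.
cricket_expected = north_total * cricket_pct / 100
print(north_total, cricket_expected)
```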

【Question comments】:

    Tags: python-3.x pandas dataframe group-by


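    【Solution 1】:

    Use DataFrame.merge with a left join to add the new column, then use GroupBy.transform with sum to build a Series of per-region totals, multiply it by the new column, and divide by 100: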

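    # Left join keeps every row of df2 and adds Expected_% from df1
    df = df2.merge(df1, on='Sports', how='left')
    # Total Count per Region, repeated on every row of that Region
    summed = df.groupby('Region')['Count'].transform('sum')
    df['Expected_count'] = summed.mul(df['Expected_%']).div(100)
    print(df)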
      Region    Sports  Count  Percentage  Expected_%  Expected_count
    0  North   Cricket    800        75.0          70           700.0
    1  North  Football     50         5.0          20           200.0
    2  North    Tennis    150        20.0          10           100.0
    3  South   Cricket   1300        65.0          70          1400.0
    4  South  Football    550        27.5          20           400.0
    5  South    Tennis    150         7.5          10           200.0
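If df1 might contain duplicate sports, or df2 might contain a sport that df1 lacks, the merge can be made to report it. The following is a sketch under those assumptions, not part of the original answer: it uses the `validate` and `indicator` parameters of `DataFrame.merge`, and the name `unmatched` is illustrative.

```python
import pandas as pd

# Data copied from the tables in the question.
df1 = pd.DataFrame({
    "Sports": ["Cricket", "Football", "Tennis"],
    "Expected_%": [70, 20, 10],
})
df2 = pd.DataFrame({
    "Region": ["North"] * 3 + ["South"] * 3,
    "Sports": ["Cricket", "Football", "Tennis"] * 2,
    "Count": [800, 50, 150, 1300, 550, 150],
    "Percentage": [75, 5, 20, 65, 27.5, 7.5],
})

# validate="many_to_one" raises a MergeError if a sport is duplicated in df1.
# indicator=True adds a "_merge" column that marks rows without a match.
df = df2.merge(df1, on="Sports", how="left",
               validate="many_to_one", indicator=True)

# Sports present in df2 but missing from df1 (empty for the question's data).
unmatched = df.loc[df["_merge"] == "left_only", "Sports"].unique()
df = df.drop(columns="_merge")

# Same calculation as in the answer: regional total * Expected_% / 100.
region_total = df.groupby("Region")["Count"].transform("sum")
df["Expected_count"] = region_total * df["Expected_%"] / 100
print(df)
```

Rows listed in `unmatched` would end up with NaN in both Expected_% and Expected_count, so checking them before the calculation makes the gap visible.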
    

    Or use Series.map to create the new column:

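    # Look up each sport's Expected_% in df1 (this adds the columns to df2 itself)
    df2['Expected_%'] = df2['Sports'].map(df1.set_index('Sports')['Expected_%'])
    # Total Count per Region, repeated on every row of that Region
    summed = df2.groupby('Region')['Count'].transform('sum')
    df2['Expected_count'] = summed.mul(df2['Expected_%']).div(100)
    print(df2)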
      Region    Sports  Count  Percentage  Expected_%  Expected_count
    0  North   Cricket    800        75.0          70           700.0
    1  North  Football     50         5.0          20           200.0
    2  North    Tennis    150        20.0          10           100.0
    3  South   Cricket   1300        65.0          70          1400.0
    4  South  Football    550        27.5          20           400.0
    5  South    Tennis    150         7.5          10           200.0
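The Series.map variant writes the new columns into df2 directly. Where df2 should stay unchanged, one option is `DataFrame.assign`, which returns a new frame. This is a sketch rather than the answer's own code; `pct`, `region_total`, and `out` are illustrative names.

```python
import pandas as pd

# Data copied from the tables in the question.
df1 = pd.DataFrame({
    "Sports": ["Cricket", "Football", "Tennis"],
    "Expected_%": [70, 20, 10],
})
df2 = pd.DataFrame({
    "Region": ["North"] * 3 + ["South"] * 3,
    "Sports": ["Cricket", "Football", "Tennis"] * 2,
    "Count": [800, 50, 150, 1300, 550, 150],
    "Percentage": [75, 5, 20, 65, 27.5, 7.5],
})

# Map each sport to its expected percentage; unknown sports would become NaN.
pct = df2["Sports"].map(df1.set_index("Sports")["Expected_%"])
region_total = df2.groupby("Region")["Count"].transform("sum")

# "Expected_%" is not a valid keyword name, so the columns are passed as a dict.
out = df2.assign(**{
    "Expected_%": pct,
    "Expected_count": region_total * pct / 100,
})
print(out)
```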
    

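    【Comments】:

      【Solution 2】:

      Another way:

      # Build a {sport: expected %} lookup from df1's two columns
      map_dict = dict(df1.values)
      # Assumes df2's rows are already ordered by Region, because .values drops the index.
      # Written to Expected_count so the existing Percentage column is not overwritten.
      df2['Expected_count'] = df2.groupby('Region').apply(lambda x: (x['Count'].sum() * x['Sports'].map(map_dict))).div(100).values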
      

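      【Comments】:

      • Performance... avoid this, because it is slow.
      • @jezrael Ohh!! I hadn't checked the performance of this!
      • Yes, it depends on the data; I would guess about 10 times slower, but maybe more.
      • A simple general rule: native pandas functions are fast, custom functions are not. If a somewhat complicated function is called per group, performance drops because of groupby.apply, because of the complicated function, and because it is called N times (once per group).
      • @jezrael Makes sense!!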
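The rule of thumb in the comments can be checked with a rough timing sketch. It runs on synthetic data (the sizes `n_rows` and `n_groups` are arbitrary assumptions) and compares the built-in `transform('sum')` with a custom lambda passed to `transform`, which stands in for the per-group Python call of Solution 2 rather than reproducing its `groupby.apply` exactly. The measured ratio depends on the machine, the pandas version, and the number of groups.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: many small groups make the per-group call overhead visible.
rng = np.random.default_rng(0)
n_rows, n_groups = 200_000, 2_000
big = pd.DataFrame({
    "Region": rng.integers(0, n_groups, size=n_rows),
    "Count": rng.integers(1, 1_000, size=n_rows),
})
grouped = big.groupby("Region")["Count"]


def native():
    # Built-in aggregation, computed without a Python call per group.
    return grouped.transform("sum")


def custom():
    # Custom function, called once per group from Python.
    return grouped.transform(lambda s: s.sum())


# Both approaches should give the same per-row regional totals.
same = bool((native() == custom()).all())

t_native = timeit.timeit(native, number=3)
t_custom = timeit.timeit(custom, number=3)
print(f"native: {t_native:.3f}s  custom: {t_custom:.3f}s  equal: {same}")
```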