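【Question title】:How to summarize on different groupby combinations?
【Posted】:2019-02-17 20:21:45
【Question】:

I am compiling a table of the top 3 crops by county. Some counties have the same crops in the same order. Other counties have the same crops but in a different order.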

import pandas as pd

df1 = pd.DataFrame( { 
    "County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] , 
    "Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
    "Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
    "Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
    "Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )

I can group by Crop1, Crop2 and Crop3 and get the sum of Total_pop:

df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()

This gives me the totals for each specific, ordered combination of crops:

df1_grouped
apples  melons  grain   1500
grain   melons  apples  2000
melons  grain   apples  4500
pears   carrots raddish 6700
raddish pears   carrots 2500

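What I want, however, is the total population for each distinct combination of crops, regardless of whether a crop is listed as Crop1, Crop2, or Crop3. The desired result looks like this: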

apples  melons   grain    8000
pears   carrots  raddish  9200 

Thanks for any guidance.

【Comments】:

    Tags: python pandas dataframe pandas-groupby itertools


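    【Solution 1】:

    Since your data seems to guarantee 3 unique crops per county ("I am compiling a table of the top 3 crops by county."), it is enough to sort the values within each row and assign them back.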

    import numpy as np
    
    cols = ['Crop1', 'Crop2', 'Crop3']
    df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
    
           County    Crop1  Crop2    Crop3  Total_pop
    0      Harney   apples  grain   melons       2000
    1       Baker   apples  grain   melons       1500
    2     Wheeler   apples  grain   melons       3000
    3  Hood River   apples  grain   melons       1500
    4       Wasco  carrots  pears  raddish       2000
    5      Morrow  carrots  pears  raddish       2500
    6       Union  carrots  pears  raddish       2700
    7        Lake  carrots  pears  raddish       2000
    

    Then sum:

    # Select Total_pop explicitly so the text column County is not aggregated
    df1.groupby(cols)[['Total_pop']].sum()
    
    #                       Total_pop
    #Crop1   Crop2 Crop3             
    #apples  grain melons        8000
    #carrots pears raddish       9200
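
If df1 needs to keep its original crop order, the same row-wise sort can be applied to a copy instead. The following is a minimal sketch, assuming every crop cell is a non-missing string; `normalized` and `totals` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
    "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
    "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
    "Crop3": ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
    "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000],
})
cols = ["Crop1", "Crop2", "Crop3"]

# Work on a copy so df1 keeps the crop order it was given.
normalized = df1.copy()
# Sort the crop names alphabetically within each row (axis=1).
normalized[cols] = np.sort(df1[cols].to_numpy(), axis=1)

# as_index=False keeps the crops as ordinary columns in the result.
totals = normalized.groupby(cols, as_index=False)["Total_pop"].sum()
print(totals)
```

After sorting, the column names Crop1 to Crop3 no longer carry a ranking; they are just three slots holding the alphabetically ordered combination.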
    

    The benefit is that you avoid Series.apply and .apply(axis=1). For larger DataFrames the performance difference is noticeable:

    df1 = pd.concat([df1]*10000, ignore_index=True)
    
    cols = ['Crop1', 'Crop2', 'Crop3']
    %timeit df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
    #36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    to_sum = ['Crop1', 'Crop2', 'Crop3']
    %timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
    #1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
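
%timeit is an IPython magic, so the comparison above only runs in IPython or Jupyter. A rough sketch of a similar comparison with the standard-library timeit module follows. The frame size, the repeat count, and the helper names `sort_with_numpy` and `sort_with_apply` are assumptions chosen to keep the run short, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Crop1": ["grain", "melons", "apples", "raddish"],
    "Crop2": ["melons", "grain", "melons", "pears"],
    "Crop3": ["apples", "apples", "grain", "carrots"],
})
cols = ["Crop1", "Crop2", "Crop3"]

# Repeat the small frame to get a few thousand rows.
big = pd.concat([base] * 1000, ignore_index=True)

def sort_with_numpy():
    # One vectorized sort over the whole 2-D array.
    return np.sort(big[cols].to_numpy(), axis=1)

def sort_with_apply():
    # One Python-level sorted() call per row.
    return big[cols].apply(sorted, axis=1).tolist()

t_numpy = timeit.timeit(sort_with_numpy, number=3)
t_apply = timeit.timeit(sort_with_apply, number=3)
print(f"np.sort:       {t_numpy:.4f} s for 3 runs")
print(f"apply(sorted): {t_apply:.4f} s for 3 runs")
```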
    

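    【Comments】:

      【Solution 2】:

      Here is one approach.

      First, get the unique values across the columns, then assign them back to the DataFrame. We do this on a copy of the original data, since you may want to keep the original.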

      df = df1.copy()
      
      to_sum = ['Crop1', 'Crop2', 'Crop3']
      
      df[to_sum] = pd.DataFrame(df.loc[:, to_sum] \
                                  .apply(set, axis=1) \
                                  .apply(sorted) \
                                  .values \
                                  .tolist(), columns=to_sum)
      
      print(df)
      
             County    Crop1  Crop2    Crop3  Total_pop
      0      Harney   apples  grain   melons       2000
      1       Baker   apples  grain   melons       1500
      2     Wheeler   apples  grain   melons       3000
      3  Hood River   apples  grain   melons       1500
      4       Wasco  carrots  pears  raddish       2000
      5      Morrow  carrots  pears  raddish       2500
      6       Union  carrots  pears  raddish       2700
      7        Lake  carrots  pears  raddish       2000
      

      Now we can perform the groupby to get the desired result.

      df.groupby(to_sum).Total_pop.sum()
      
      Crop1    Crop2  Crop3  
      apples   grain  melons     8000
      carrots  pears  raddish    9200
      Name: Total_pop, dtype: int64
      

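      【Comments】:

      • This doesn't quite give the right answer, because you haven't sorted it. I've posted a corrected version based on your answer!
      • So how exactly are we sorting?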
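
On the sorting question in the comments: a set has no guaranteed iteration order, so two rows holding the same crops can come back from list(set(...)) in different orders, while sorted(...) always yields one canonical order. A small plain-Python sketch, using two rows from the question's data (the variable names are illustrative):

```python
# Crops for Harney and Hood River, in the order given in the question.
row_harney = ("grain", "melons", "apples")
row_hood_river = ("apples", "melons", "grain")

# As sets, the two rows hold the same crops.
same_crops = set(row_harney) == set(row_hood_river)

# sorted() turns each set into a list in alphabetical order,
# so both rows end up with the identical representation.
canonical_harney = sorted(set(row_harney))
canonical_hood_river = sorted(set(row_hood_river))

print(same_crops)            # True
print(canonical_harney)      # ['apples', 'grain', 'melons']
print(canonical_hood_river)  # ['apples', 'grain', 'melons']
```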
      【Solution 3】:

      np.bincount

      i, u = pd.factorize([*map(frozenset, zip(df1.Crop1, df1.Crop2, df1.Crop3))])
      s = np.bincount(i, df1.Total_pop)
      
      pd.Series(s, u)
      
      (melons, grain, apples)      8000.0
      (carrots, raddish, pears)    9200.0
      dtype: float64
      

      Or, if you want separate columns

      pd.Series(dict(zip(map(tuple, u), s)))
      
      melons   grain    apples    8000.0
      carrots  raddish  pears     9200.0
      dtype: float64
      

      And to make it really pretty

      pd.Series(dict(zip(map(tuple, u), s))) \
        .rename_axis(['Crop1', 'Crop2', 'Crop3']).reset_index(name='Total_pop')
      
           Crop1    Crop2   Crop3  Total_pop
      0   melons    grain  apples     8000.0
      1  carrots  raddish   pears     9200.0
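
The factorize and bincount steps in this answer can be illustrated on a smaller case. This is only a sketch: it swaps the frozenset keys for joined, sorted strings so the group labels print in a fixed order, and the three rows and all variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Three illustrative rows: the first two hold the same crops in a different order.
crops = [
    ("grain", "melons", "apples"),
    ("melons", "grain", "apples"),
    ("pears", "carrots", "raddish"),
]
pop = [2000, 1500, 2000]

# One order-independent label per row.
keys = pd.Series(["|".join(sorted(c)) for c in crops])

# factorize maps each distinct label to an integer code, in order of first appearance.
codes, uniques = pd.factorize(keys)

# bincount with weights adds up the weights that share the same code.
totals = np.bincount(codes, weights=pop)

result = pd.Series(totals, index=uniques)
print(codes)   # [0 0 1]
print(result)
```

Because weights are supplied, np.bincount returns floats, which is why the totals in this answer appear as 8000.0 and 9200.0.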
      

      【Comments】:

        【Solution 4】:

        Method 1:

        Combine the crops

        >>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
        ...                           x['Crop2'],
        ...                           x['Crop3']]),axis=1)
        >>> df1.head()
               County   Crop1    Crop2    Crop3  Total_pop              combined_temp
        0      Harney   grain   melons   apples       2000    [grain, melons, apples]
        1       Baker  melons    grain   apples       1500    [melons, grain, apples]
        2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]
        3  Hood River  apples   melons    grain       1500    [apples, melons, grain]
        4       Wasco   pears  carrots  raddish       2000  [pears, carrots, raddish]
        

        Make it a sorted tuple

        >>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
        >>> df1.head()
               County   Crop1    Crop2            ...             Total_pop              combined_temp                     sorted
        0      Harney   grain   melons            ...                  2000    [grain, melons, apples]    (apples, grain, melons)
        1       Baker  melons    grain            ...                  1500    [melons, grain, apples]    (apples, grain, melons)
        2     Wheeler  melons    grain            ...                  3000    [melons, grain, apples]    (apples, grain, melons)
        3  Hood River  apples   melons            ...                  1500    [apples, melons, grain]    (apples, grain, melons)
        4       Wasco   pears  carrots            ...                  2000  [pears, carrots, raddish]  (carrots, pears, raddish)
        

        Then do your normal group by operation

        >>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
        >>> df1_grouped
                              sorted  Total_pop
        0    (apples, grain, melons)       8000
        1  (carrots, pears, raddish)       9200
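
The helper columns combined_temp and sorted are not strictly needed, since groupby also accepts a Series of keys aligned with the frame's index. A sketch under that assumption follows, with `key`, `totals` and `result` as illustrative names; the tuples are unpacked back into three columns at the end.

```python
import pandas as pd

df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Wheeler", "Hood River", "Wasco", "Morrow", "Union", "Lake"],
    "Crop1": ["grain", "melons", "melons", "apples", "pears", "raddish", "pears", "pears"],
    "Crop2": ["melons", "grain", "grain", "melons", "carrots", "pears", "carrots", "carrots"],
    "Crop3": ["apples", "apples", "apples", "grain", "raddish", "carrots", "raddish", "raddish"],
    "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000],
})
cols = ["Crop1", "Crop2", "Crop3"]

# One sorted tuple per row; df1 itself is left unchanged.
key = df1[cols].apply(lambda row: tuple(sorted(row)), axis=1)

# Group by the key Series instead of by a column name.
totals = df1.groupby(key)["Total_pop"].sum()

# Unpack each tuple label into three columns and attach the sums.
result = pd.DataFrame(list(totals.index), columns=cols)
result["Total_pop"] = totals.to_numpy()
print(result)
```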
        

        Method 2: a shortened version based on the answer by aws-apprentice

        df = df1.copy()
        
        grouping_cols = ['Crop1', 'Crop2', 'Crop3']
        
        df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
                                    .apply(set, axis=1) \
                                    .apply(sorted) \
                                    .values \
                                    .tolist(), columns=grouping_cols)
        
        >>> df.head()
               County    Crop1  Crop2    Crop3  Total_pop
        0      Harney   apples  grain   melons       2000
        1       Baker   apples  grain   melons       1500
        2     Wheeler   apples  grain   melons       3000
        3  Hood River   apples  grain   melons       1500
        4       Wasco  carrots  pears  raddish       2000
        

        Now group by

        >>> df.groupby(grouping_cols).Total_pop.sum()
        Crop1    Crop2  Crop3  
        apples   grain  melons     8000
        carrots  pears  raddish    9200
        Name: Total_pop, dtype: int64
        

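        But personally I prefer this answer using numpy

        【Comments】:

        • Because your original answer did not work correctly without sorting (I ran it and checked), I changed .apply(list) in your original answer to .apply(sorted) to make it work. I have also upvoted your answer.
        • Without sorting across the columns, the group by in your answer did not produce the output the OP wants on my system. Maybe you should recheck? That is why I changed it to sort the 3 columns, building on your answer.
        • Hey, I'm not suggesting that the OP wants the answer sorted. I mean that when I follow your answer exactly, the final group by output is incorrect. So I used apply(sorted) to correct it when updating my answer.
        • Thank you for all the answers. They were very helpful and educational. Method 1 above produced the result I was looking for, so I went with it.
        【Solution 5】: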
        import pandas as pd
        
        df = pd.DataFrame( {
            "County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
            "Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
            "Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
            "Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
            "Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
        print(df)
        df["Merged"] = df[["Crop1", "Crop2", "Crop3"]].apply(lambda x: ','.join(x.dropna().astype(str).values), axis=1).str.split(",")
        df["Merged"] = df["Merged"].sort_values().apply(lambda x: sorted(x)).apply(lambda x: ",".join(x))
        df[["x", "y", "z"]] = df["Merged"].str.split(",", expand=True)
        df1=df.groupby(['x',"y","z"])['Total_pop'].sum().reset_index()
        print(df1)
        

        Output:

              County    Crop1    Crop2    Crop3  Total_pop
              Harney    grain   melons   apples       2000
               Baker   melons    grain   apples       1500
             Wheeler   melons    grain   apples       3000
          Hood River   apples   melons    grain       1500
               Wasco    pears  carrots  raddish       2000
              Morrow  raddish    pears  carrots       2500
               Union    pears  carrots  raddish       2700
                Lake    pears  carrots  raddish       2000
        
                   x      y        z  Total_pop
              apples  grain   melons       8000
             carrots  pears  raddish       9200
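
The dropna() call in this answer suggests it is meant to tolerate rows with fewer than three crops, although the question's data has none. A shorter sketch of the same join-based idea under that assumption follows; the counties, crops and populations below are hypothetical, and `key` and `totals` are illustrative names.

```python
import pandas as pd

# Hypothetical data: some counties list fewer than three crops.
df = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Crop1": ["grain", "melons", "pears"],
    "Crop2": ["melons", "grain", None],
    "Crop3": [None, None, "carrots"],
    "Total_pop": [100, 200, 50],
})
cols = ["Crop1", "Crop2", "Crop3"]

# Drop missing crops, sort the rest, and join them into one label per row.
key = df[cols].apply(lambda row: ",".join(sorted(row.dropna())), axis=1)

# Group by the label; rows with the same crops share a label regardless of order.
totals = df.groupby(key)["Total_pop"].sum()
print(totals)
```

Dropping the missing values before sorting matters here, because sorted() cannot compare a missing value with a string.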
        

        【Comments】:
