【Question title】: pandas dataframe add column with values using multiple conditions on other column values
【Posted】: 2017-10-16 02:00:12
【Question】:

I have a full pandas DataFrame named adjusted. I want to add a "stage" column whose values depend on conditions involving "fyear" and "conm".

    fyear   conm                indadjsg
1   1999    1-800-FLOWERS.COM   26.646086
2   2000    1-800-FLOWERS.COM   22.727175 
3   2001    1-800-FLOWERS.COM   7.312014
4   2002    1-800-FLOWERS.COM   4.948308
5   2003    1-800-FLOWERS.COM   6.278798
23  1996    ABERCROMBIE & FITCH -CL A   34.831691
24  1997    ABERCROMBIE & FITCH -CL A   48.053137
25  1998    ABERCROMBIE & FITCH -CL A   48.918326
26  1999    ABERCROMBIE & FITCH -CL A   46.956456
27  2000    ABERCROMBIE & FITCH -CL A   33.91436
28  2001    ABERCROMBIE & FITCH -CL A   67.23423
29  2002    ABERCROMBIE & FITCH -CL A   99.09342
11929   2006    CLIFTON BANCORP INC 0.236418
11930   2007    CLIFTON BANCORP INC -1.366626
11931   2008    CLIFTON BANCORP INC 8.564019
11932   2009    CLIFTON BANCORP INC -4.966110
11933   2010    CLIFTON BANCORP INC -4.359552
11934   2011    CLIFTON BANCORP INC -16.313852
11935   2012    CLIFTON BANCORP INC -18.193550
11936   2013    CLIFTON BANCORP INC -10.126603
11937   2014    CLIFTON BANCORP INC 4.718584
11938   2015    CLIFTON BANCORP INC -11.889065
11940   2015    CLIPPER REALTY INC  70.945767
11941   2016    CLIPPER REALTY INC  3.776001
11980   2014    CM FINANCE INC  205.894048
11981   2015    CM FINANCE INC  68.518555
121247  2009    VCA INC -5.552030
121248  2010    VCA INC -3.357275
121249  2011    VCA INC -0.930798
121250  2012    VCA INC 5.974914
121256  2007    VIASPACE INC    -50.966869
121257  2008    VIASPACE INC    149.957403
121258  2009    VIASPACE INC    197.776855
121259  2010    VIASPACE INC    -25.201733
121260  2011    VIASPACE INC    77.082624
121261  2012    VIASPACE INC    78.034233
121266  2005    YASHENG GROUP   -3.728098
121267  2006    YASHENG GROUP   -2.233927
121268  2007    YASHENG GROUP   0.349349
121279  2009    YUHE INTERNATIONAL INC  27.995324
121280  2010    YUHE INTERNATIONAL INC  34.375630

1) If a company has 5 or fewer fyear rows, I want to fill in "start".

 byyr = adjusted.groupby(by=['conm'])['fyear']
 dfbyyr =byyr.count().to_frame()
 start = dfbyyr[dfbyyr['fyear'] <= 5]

                               fyear
    conm                
    1-800-FLOWERS.COM           5
    ABERCROMBIE & FITCH -CL A   7
    CLIFTON BANCORP INC        10
    CLIPPER REALTY INC          2
    CM FINANCE INC              2
    VCA INC                     4
    VIASPACE INC                6
    YASHENG GROUP               3
    YUHE INTERNATIONAL INC      2

2) After applying the "start" condition, I want to fill the remaining rows with other values. I calculated the mean indadjsg for each unique company.

mask2 = adjusted.groupby(by=['conm'])['indadjsg']
countsg = mask2.mean().to_frame().reset_index()
c = countsg.dropna()   

DataFrame 'c':

    conm                indadjsg
0   1-800-FLOWERS.COM   3.291539
1   ABERCROMBIE & FITCH -CL A   105.335324
2   CLIFTON BANCORP INC 22.920683
3   CLIPPER REALTY INC  36.784677
4   CM FINANCE INC  1.605919
5   VCA INC 3.116871
6   VIASPACE INC    -106.153789
7   YASHENG GROUP   -2.676296
8   YUHE INTERNATIONAL INC  12.306557

The conditions I want to apply are as follows:

      indadjsg  < 0,  'decline'
 0 <= indadjsg  <= 15, 'revival'
 15< indadjsg  <= 100, 'mature'
 100< indadjsg        , 'growth'

The final DataFrame I want to produce looks like this:

    fyear   conm                indadjsg    stage
1   1999    1-800-FLOWERS.COM   26.646086   start
2   2000    1-800-FLOWERS.COM   22.727175   start
3   2001    1-800-FLOWERS.COM   7.312014    start
4   2002    1-800-FLOWERS.COM   4.948308    start
5   2003    1-800-FLOWERS.COM   6.278798    start
23  1996    ABERCROMBIE & FITCH -CL A   34.831691  growth 
24  1997    ABERCROMBIE & FITCH -CL A   48.053137  growth    
25  1998    ABERCROMBIE & FITCH -CL A   48.918326  growth    
26  1999    ABERCROMBIE & FITCH -CL A   46.956456  growth 
27  2000    ABERCROMBIE & FITCH -CL A   33.91436  growth 
28  2001    ABERCROMBIE & FITCH -CL A   67.23423  growth 
29  2002    ABERCROMBIE & FITCH -CL A   99.09342    growth 
11929   2006    CLIFTON BANCORP INC 0.236418        mature
11930   2007    CLIFTON BANCORP INC -1.366626       mature
11931   2008    CLIFTON BANCORP INC 8.564019        mature 
11932   2009    CLIFTON BANCORP INC -4.966110       mature 
11933   2010    CLIFTON BANCORP INC -4.359552       mature 
11934   2011    CLIFTON BANCORP INC -16.313852      mature 
11935   2012    CLIFTON BANCORP INC -18.193550      mature 
11936   2013    CLIFTON BANCORP INC -10.126603      mature 
11937   2014    CLIFTON BANCORP INC 4.718584        mature 
11938   2015    CLIFTON BANCORP INC -11.889065      mature 
11940   2015    CLIPPER REALTY INC  70.945767       start
11941   2016    CLIPPER REALTY INC  3.776001        start
11980   2014    CM FINANCE INC  205.894048    start
11981   2015    CM FINANCE INC  68.518555     start
121247  2009    VCA INC -5.552030             start
121248  2010    VCA INC -3.357275             start
121249  2011    VCA INC -0.930798             start
121250  2012    VCA INC 5.974914              start
121256  2007    VIASPACE INC    -50.966869    decline
121257  2008    VIASPACE INC    149.957403    decline
121258  2009    VIASPACE INC    197.776855    decline
121259  2010    VIASPACE INC    -25.201733    decline
121260  2011    VIASPACE INC    77.082624     decline
121261  2012    VIASPACE INC    78.034233     decline 
121266  2005    YASHENG GROUP   -3.728098        start
121267  2006    YASHENG GROUP   -2.233927        start
121268  2007    YASHENG GROUP   0.349349         start
121279  2009    YUHE INTERNATIONAL INC  27.995324    start
121280  2010    YUHE INTERNATIONAL INC  34.375630    start

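Is there a way to do this in one pass? All I can think of is building separate columns and merging them. Can you help me think about this efficiently? Thank you in advance.

【Comments】:

    Tags: python pandas dataframe group-by conditional-statements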


    【Solution 1】:

    Starting with:

    Adjusted:
            fyear   conm                        indadjsg     
    0       1999    1-800-FLOWERS.COM           26.646086             
    1       2000    1-800-FLOWERS.COM           22.727175             
    2       2001    1-800-FLOWERS.COM           7.312014              
    3       2002    1-800-FLOWERS.COM           4.948308              
    4       2003    1-800-FLOWERS.COM           6.278798              
    5       1996    ABERCROMBIE & FITCH -CL A   34.831691             
    6       1997    ABERCROMBIE & FITCH -CL A   48.053137             
    ...
    35      2012    VIASPACE INC                78.034233             
    36      2005    YASHENG GROUP               -3.728098             
    37      2006    YASHENG GROUP               -2.233927             
    38      2007    YASHENG GROUP               0.349349              
    39      2009    YUHE INTERNATIONAL INC      27.995324             
    40      2010    YUHE INTERNATIONAL INC      34.375630             
    

    This code isn't particularly clever, but it is straightforward:

    # add an empty "stage" column
    adjusted['stage'] = ''
    
    # create boolean masks for each stage classification
    g = adjusted.groupby(by='conm')
    decline = g['indadjsg'].transform('mean') < 0
    revival = (g['indadjsg'].transform('mean') >= 0) & (g['indadjsg'].transform('mean') <= 15)
    mature = (g['indadjsg'].transform('mean') > 15) & (g['indadjsg'].transform('mean') <= 100)
    growth = (g['indadjsg'].transform('mean') > 100)
    start = g['fyear'].transform('count') <= 5
    
    adjusted.loc[decline, 'stage'] = 'decline'
    adjusted.loc[revival, 'stage'] = 'revival'
    adjusted.loc[mature, 'stage'] = 'mature'
    adjusted.loc[growth, 'stage'] = 'growth'
    
    # set 'start' classification last so it overwrites 
    # the classification set based on 
    adjusted.loc[start, 'stage'] = 'start'
    

    The output looks like this:

        fyear   conm                        indadjsg    stage  
    0   1999    1-800-FLOWERS.COM           26.646086   start  
    1   2000    1-800-FLOWERS.COM           22.727175   start  
    2   2001    1-800-FLOWERS.COM           7.312014    start  
    3   2002    1-800-FLOWERS.COM           4.948308    start  
    4   2003    1-800-FLOWERS.COM           6.278798    start  
    5   1996    ABERCROMBIE & FITCH -CL A   34.831691   mature 
    6   1997    ABERCROMBIE & FITCH -CL A   48.053137   mature 
    7   1998    ABERCROMBIE & FITCH -CL A   48.918326   mature 
    8   1999    ABERCROMBIE & FITCH -CL A   46.956456   mature 
    9   2000    ABERCROMBIE & FITCH -CL A   33.914360   mature 
    10  2001    ABERCROMBIE & FITCH -CL A   67.234230   mature 
    11  2002    ABERCROMBIE & FITCH -CL A   99.093420   mature 
    12  2006    CLIFTON BANCORP INC         0.236418    decline
    13  2007    CLIFTON BANCORP INC         -1.366626   decline
    14  2008    CLIFTON BANCORP INC         8.564019    decline
    15  2009    CLIFTON BANCORP INC         -4.966110   decline
    16  2010    CLIFTON BANCORP INC         -4.359552   decline
    17  2011    CLIFTON BANCORP INC         -16.313852  decline
    18  2012    CLIFTON BANCORP INC         -18.193550  decline
    19  2013    CLIFTON BANCORP INC         -10.126603  decline
    20  2014    CLIFTON BANCORP INC         4.718584    decline
    21  2015    CLIFTON BANCORP INC         -11.889065  decline
    22  2015    CLIPPER REALTY INC          70.945767   start  
    23  2016    CLIPPER REALTY INC          3.776001    start  
    24  2014    CM FINANCE INC              205.894048  start  
    25  2015    CM FINANCE INC              68.518555   start  
    26  2009    VCA INC                     -5.552030   start  
    27  2010    VCA INC                     -3.357275   start  
    28  2011    VCA INC                     -0.930798   start  
    29  2012    VCA INC                     5.974914    start  
    30  2007    VIASPACE INC                -50.966869  mature 
    31  2008    VIASPACE INC                149.957403  mature 
    32  2009    VIASPACE INC                197.776855  mature 
    33  2010    VIASPACE INC                -25.201733  mature 
    34  2011    VIASPACE INC                77.082624   mature 
    35  2012    VIASPACE INC                78.034233   mature 
    36  2005    YASHENG GROUP               -3.728098   start  
    37  2006    YASHENG GROUP               -2.233927   start  
    38  2007    YASHENG GROUP               0.349349    start  
    39  2009    YUHE INTERNATIONAL INC      27.995324   start  
    40  2010    YUHE INTERNATIONAL INC      34.375630   start            
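
The masks above call transform('mean') several times. As a minimal sketch of the same idea, the per-company mean and row count can be computed once and passed to np.select, which takes the first condition that is True. The small frame below is made-up illustration data (not the asker's), and the variable names mean_sg and n_years are assumptions. Note that a company whose mean is NaN would fall through to the default label in this sketch.

```python
import numpy as np
import pandas as pd

# Made-up example data with the same column names as the question.
adjusted = pd.DataFrame({
    "conm": ["A"] * 6 + ["B"] * 2 + ["C"] * 6,
    "fyear": list(range(2000, 2006)) + [2010, 2011] + list(range(2000, 2006)),
    "indadjsg": [120.0, 130.0, 110.0, 150.0, 140.0, 130.0,
                 50.0, 60.0,
                 -10.0, -20.0, 5.0, -30.0, -5.0, -6.0],
})

g = adjusted.groupby("conm")
mean_sg = g["indadjsg"].transform("mean")  # per-company mean, aligned to rows
n_years = g["fyear"].transform("count")    # per-company row count, aligned to rows

# np.select uses the first matching condition, so 'start' is listed first
# and each later test only needs an upper bound.
conditions = [
    n_years <= 5,
    mean_sg < 0,
    mean_sg <= 15,
    mean_sg <= 100,
]
choices = ["start", "decline", "revival", "mature"]
adjusted["stage"] = np.select(conditions, choices, default="growth")
print(adjusted)
```

Listing 'start' first plays the same role as assigning it last in the loc-based version: it takes priority over the mean-based labels.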
    

    【Comments】:

      【Solution 2】:

      I believe you can do this with pd.cut and np.where:

      adjusted # copied text from your example
      Out[86]: 
          fyear               conm   indadjsg
      0    1999  1-800-FLOWERS.COM   26.64609
      1    2000  1-800-FLOWERS.COM   22.72717
      2    2001  1-800-FLOWERS.COM    7.31201
      3    2002  1-800-FLOWERS.COM    4.94831
      4    2003  1-800-FLOWERS.COM    6.27880
      5    1996        ABERCROMBIE   34.83169
      6    1997        ABERCROMBIE   48.05314
      7    1998        ABERCROMBIE   48.91833
      8    1999        ABERCROMBIE   46.95646
      9    2000        ABERCROMBIE   33.91436
      10   2001        ABERCROMBIE   67.23423
      11   2002        ABERCROMBIE   99.09342
      ..    ...                ...        ...
      25   2015                 CM   68.51856
      26   2009                VCA   -5.55203
      27   2010                VCA   -3.35728
      28   2011                VCA   -0.93080
      29   2012                VCA    5.97491
      30   2007           VIASPACE  -50.96687
      31   2008           VIASPACE  149.95740
      32   2009           VIASPACE  197.77686
      33   2010           VIASPACE  -25.20173
      34   2011           VIASPACE   77.08262
      35   2012           VIASPACE   78.03423
      36   2005            YASHENG   -3.72810
      
      byyr = adjusted.groupby(by='conm')['fyear'].count().to_frame()
      start = byyr.fyear[adjusted.conm]
      
      indadjsg = adjusted.groupby(by='conm')['indadjsg'].mean().to_frame()
      px = indadjsg.indadjsg[adjusted.conm]
      categories = pd.cut(px.values.reshape((len(px), )), 
                          bins= [-np.inf, 0, 15, 100, np.inf], 
                          labels=['decline', 'revival', 'mature', 'growth'])
      
      adjusted.loc[:, 'stage'] = np.where(start <= 5, 'start', categories)
      
      adjusted # result
      Out[130]: 
          fyear               conm   indadjsg   stage
      0    1999  1-800-FLOWERS.COM   26.64609   start
      1    2000  1-800-FLOWERS.COM   22.72717   start
      2    2001  1-800-FLOWERS.COM    7.31201   start
      3    2002  1-800-FLOWERS.COM    4.94831   start
      4    2003  1-800-FLOWERS.COM    6.27880   start
      5    1996        ABERCROMBIE   34.83169  mature
      6    1997        ABERCROMBIE   48.05314  mature
      7    1998        ABERCROMBIE   48.91833  mature
      8    1999        ABERCROMBIE   46.95646  mature
      9    2000        ABERCROMBIE   33.91436  mature
      10   2001        ABERCROMBIE   67.23423  mature
      11   2002        ABERCROMBIE   99.09342  mature
      ..    ...                ...        ...     ...
      25   2015                 CM   68.51856   start
      26   2009                VCA   -5.55203   start
      27   2010                VCA   -3.35728   start
      28   2011                VCA   -0.93080   start
      29   2012                VCA    5.97491   start
      30   2007           VIASPACE  -50.96687  mature
      31   2008           VIASPACE  149.95740  mature
      32   2009           VIASPACE  197.77686  mature
      33   2010           VIASPACE  -25.20173  mature
      34   2011           VIASPACE   77.08262  mature
      35   2012           VIASPACE   78.03423  mature
      36   2005            YASHENG   -3.72810   start
      

      With pd.cut, make sure to specify which edge of each bin is inclusive by passing right=True or right=False.
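
To illustrate the point about bin edges, here is a small sketch showing how values that sit exactly on an edge are labeled under each setting. The three test values are chosen for illustration only. Neither setting reproduces the question's rules exactly at every edge: right=True agrees with them everywhere except for a mean of exactly 0, which the question puts in 'revival'.

```python
import numpy as np
import pandas as pd

means = pd.Series([0.0, 15.0, 100.0])  # values lying exactly on bin edges
bins = [-np.inf, 0, 15, 100, np.inf]
labels = ["decline", "revival", "mature", "growth"]

# right=True (the default): bins are (-inf, 0], (0, 15], (15, 100], (100, inf]
right_closed = pd.cut(means, bins=bins, labels=labels, right=True)

# right=False: bins are [-inf, 0), [0, 15), [15, 100), [100, inf)
left_closed = pd.cut(means, bins=bins, labels=labels, right=False)

print(list(right_closed))  # ['decline', 'revival', 'mature']
print(list(left_closed))   # ['revival', 'mature', 'growth']
```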

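      【Comments】:

      • Sorry for the duplicate answer; @unutbu and I answered at the same time, and our answers take slightly different routes to the same solution.
      • No need to apologize. Seeing alternative approaches can be very useful.
      【Solution 3】:

      There is a way to compute the stage column with a single groupby/transform operation (see the classify function below), but it involves calling a custom Python function once per group. If there are many groups, this tends to be inefficient.

      In general, you get better performance when you replace numerous Python function calls with vectorized operations on whole (large) DataFrames or large columns of a DataFrame.

      So if there are many conms (i.e. many groups), it is better to do as you first thought: compute the stage for each company and then merge the result back into adjusted. Here is one way to do it; the merge is done by calling join: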

      import numpy as np
      import pandas as pd
      adjusted = pd.DataFrame({'conm': ['1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIPPER REALTY INC', 'CLIPPER REALTY INC', 'CM FINANCE INC', 'CM FINANCE INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'YASHENG GROUP', 'YASHENG GROUP', 'YASHENG GROUP', 'YUHE INTERNATIONAL INC', 'YUHE INTERNATIONAL INC'], 'fyear': [1999, 2000, 2001, 2002, 2003, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2014, 2015, 2009, 2010, 2011, 2012, 2007, 2008, 2009, 2010, 2011, 2012, 2005, 2006, 2007, 2009, 2010], 'indadjsg': [26.646085999999997, 22.727175, 7.312014, 4.948308, 6.278798, 34.831691, 48.053137, 48.918326, 46.956456, 33.914359999999995, 67.23423000000001, 99.09342, 0.236418, -1.3666260000000001, 8.564019, -4.96611, -4.359552, -16.313852, -18.19355, -10.126603, 4.718584, -11.889064999999999, 70.945767, 3.7760010000000004, 205.894048, 68.518555, -5.55203, -3.357275, -0.9307979999999999, 5.974914, -50.966869, 149.957403, 197.776855, -25.201732999999997, 77.082624, 78.034233, -3.728098, -2.233927, 0.34934899999999997, 27.995324, 34.37563]}, index=[1, 2, 3, 4, 5, 23, 24, 25, 26, 27, 28, 29, 11929, 11930, 11931, 11932, 11933, 11934, 11935, 11936, 11937, 11938, 11940, 11941, 11980, 11981, 121247, 121248, 121249, 121250, 121256, 121257, 121258, 121259, 121260, 121261, 121266, 121267, 121268, 121279, 121280])
      
      grouped = adjusted.groupby(by=['conm'])
      stage = pd.cut(grouped['indadjsg'].mean(), bins=[-np.inf,0,15,100,np.inf], labels=False)
      stage.name = 'stage'
      labels = np.array(['decline', 'revival', 'mature', 'growth'])
      adjusted = adjusted.join(stage, on='conm')
      adjusted['stage'] = labels[adjusted['stage']]
      mask = (grouped['fyear'].transform('count') <= 5)
      adjusted.loc[mask, 'stage'] = 'start'
      print(adjusted)
      

      which yields

                                   conm  fyear    indadjsg    stage
      1               1-800-FLOWERS.COM   1999   26.646086    start
      2               1-800-FLOWERS.COM   2000   22.727175    start
      3               1-800-FLOWERS.COM   2001    7.312014    start
      4               1-800-FLOWERS.COM   2002    4.948308    start
      5               1-800-FLOWERS.COM   2003    6.278798    start
      23      ABERCROMBIE & FITCH -CL A   1996   34.831691   mature
      24      ABERCROMBIE & FITCH -CL A   1997   48.053137   mature
      25      ABERCROMBIE & FITCH -CL A   1998   48.918326   mature
      26      ABERCROMBIE & FITCH -CL A   1999   46.956456   mature
      27      ABERCROMBIE & FITCH -CL A   2000   33.914360   mature
      28      ABERCROMBIE & FITCH -CL A   2001   67.234230   mature
      29      ABERCROMBIE & FITCH -CL A   2002   99.093420   mature
      11929         CLIFTON BANCORP INC   2006    0.236418  decline
      11930         CLIFTON BANCORP INC   2007   -1.366626  decline
      11931         CLIFTON BANCORP INC   2008    8.564019  decline
      11932         CLIFTON BANCORP INC   2009   -4.966110  decline
      11933         CLIFTON BANCORP INC   2010   -4.359552  decline
      11934         CLIFTON BANCORP INC   2011  -16.313852  decline
      11935         CLIFTON BANCORP INC   2012  -18.193550  decline
      11936         CLIFTON BANCORP INC   2013  -10.126603  decline
      11937         CLIFTON BANCORP INC   2014    4.718584  decline
      11938         CLIFTON BANCORP INC   2015  -11.889065  decline
      11940          CLIPPER REALTY INC   2015   70.945767    start
      11941          CLIPPER REALTY INC   2016    3.776001    start
      11980              CM FINANCE INC   2014  205.894048    start
      11981              CM FINANCE INC   2015   68.518555    start
      121247                    VCA INC   2009   -5.552030    start
      121248                    VCA INC   2010   -3.357275    start
      121249                    VCA INC   2011   -0.930798    start
      121250                    VCA INC   2012    5.974914    start
      121256               VIASPACE INC   2007  -50.966869   mature
      121257               VIASPACE INC   2008  149.957403   mature
      121258               VIASPACE INC   2009  197.776855   mature
      121259               VIASPACE INC   2010  -25.201733   mature
      121260               VIASPACE INC   2011   77.082624   mature
      121261               VIASPACE INC   2012   78.034233   mature
      121266              YASHENG GROUP   2005   -3.728098    start
      121267              YASHENG GROUP   2006   -2.233927    start
      121268              YASHENG GROUP   2007    0.349349    start
      121279     YUHE INTERNATIONAL INC   2009   27.995324    start
      121280     YUHE INTERNATIONAL INC   2010   34.375630    start
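
As a variation on the join-based approach above, here is a sketch that builds one label per company and then broadcasts it back to the rows with Series.map. This is an alternative, not the answer's own method. The frame is made-up illustration data, and the name stage_by_company is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up example data with the same column names as the question.
adjusted = pd.DataFrame({
    "conm": ["A"] * 6 + ["B"] * 2 + ["C"] * 6,
    "fyear": list(range(2000, 2006)) + [2010, 2011] + list(range(2000, 2006)),
    "indadjsg": [120.0, 130.0, 110.0, 150.0, 140.0, 130.0,
                 50.0, 60.0,
                 -10.0, -20.0, 5.0, -30.0, -5.0, -6.0],
})

grouped = adjusted.groupby("conm")

# One label per company: a small Series indexed by conm.
# astype(object) turns the categorical result into plain strings so that
# a label outside the original categories ('start') can be set next.
stage_by_company = pd.cut(
    grouped["indadjsg"].mean(),
    bins=[-np.inf, 0, 15, 100, np.inf],
    labels=["decline", "revival", "mature", "growth"],
).astype(object)

# Companies with 5 or fewer rows are relabeled 'start'.
stage_by_company = stage_by_company.mask(grouped["fyear"].count() <= 5, "start")

# Broadcast the company-level label to every row of that company.
adjusted["stage"] = adjusted["conm"].map(stage_by_company)
print(adjusted)
```

All the classification work happens on the per-company Series, which has one entry per group, so the row-level step is a single lookup.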
      

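      Here is another approach, which will be slower when there are many groups (but may be faster if there are few groups).

      You can compute the stage column with a single groupby/transform operation using a custom Python function, classify. classify is called once for each group, that is, once for each value of conm.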

      import bisect
      def classify(grp, grid=[0,15,100,np.inf], 
                   labels=['decline', 'revival', 'mature', 'growth']):
          return 'start' if len(grp) <= 5 else labels[bisect.bisect_left(grid, grp.mean())]
      
      grouped = adjusted.groupby(by=['conm'])
      adjusted['stage'] = grouped['indadjsg'].transform(classify)
      print(adjusted)
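
To make the bisect lookup inside classify easier to follow, here is a standalone sketch using only the standard library and the same grid and labels. The helper name label_for and the sample values are assumptions for illustration. As with pd.cut's default right=True, a mean of exactly 0 lands in 'decline' here, whereas the question's rules put it in 'revival'.

```python
import bisect

grid = [0, 15, 100, float("inf")]
labels = ["decline", "revival", "mature", "growth"]

def label_for(mean_value):
    # bisect_left returns the index of the first grid entry >= mean_value,
    # which is also the index of the matching label.
    return labels[bisect.bisect_left(grid, mean_value)]

for value in (-3.0, 0.0, 7.5, 15.0, 99.9, 100.0, 250.0):
    print(value, label_for(value))
```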
      

      【Comments】:
