【Question title】:Add column calculation with condition in pandas
【Posted】:2022-01-16 19:51:33
【Question description】:

I have a DataFrame such as:

COL1 start1 end1    start2  end2
A    5000   6000    5000    6500
B    5000   6000    4550    6000
C    5000   6000    2000    5300
D    5000   6000    5900    8000
E    5000   6000    5600    5800
F    5000   6000    5000    6000
G    5000   6000    4000    7000 
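
For anyone who wants to reproduce the problem, the table above can be built as a DataFrame along these lines. This is only a convenience sketch; the values are copied from the table and the variable name df is an assumption that matches the answers further down.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
print(df)
```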

I would like to add a new column NEWCOL, computed as follows:

if start1 == start2 and end1 == end2:
  NEWCOL = 1
elif (start1 == start2 and end2 > start1) or (end1 == end2 and start1 < start2):
  NEWCOL = (end2 - start2) / (end1 - start1)
elif start2 < start1 and end2 > end1:
  NEWCOL = (end2 - start2) / (end1 - start1)
elif start1 < start2 and end1 > end2:
  NEWCOL = (end2 - start2) / (end1 - start1)
elif start1 > start2 and end1 > end2:
  NEWCOL = (end2 - start1) / (end1 - start1)
elif start1 < start2 and end1 < end2:
  NEWCOL = (end1 - start2) / (end1 - start1)

I should then get:

COL1    start1  end1    start2  end2    NEWCOL
A   5000    6000    5000    6500    1.5
B   5000    6000    4550    6000    1.45
C   5000    6000    2000    5300    0.3
D   5000    6000    5900    8000    0.1
E   5000    6000    5600    5800    0.2
F   5000    6000    5000    6000    1
G   5000    6000    4000    7000    3

【Question discussion】:

    Tags: python python-3.x pandas numpy


    【Solution 1】:

    A solution with a custom function is possible, but it will be slow on a larger DataFrame:

    def f(x):
        # x is a single row, as passed by df.apply(f, axis=1)
        start1 = x['start1']
        start2 = x['start2']
        end1 = x['end1']
        end2 = x['end2']

        if start1 == start2 and end1 == end2:
            return 1
        elif (start1 == start2 and end2 > start1) or (end1 == end2 and start1 < start2):
            return (end2 - start2) / (end1 - start1)
        elif start2 < start1 and end2 > end1:
            return (end2 - start2) / (end1 - start1)
        elif start1 < start2 and end1 > end2:
            return (end2 - start2) / (end1 - start1)
        elif start1 > start2 and end1 > end2:
            return (end2 - start1) / (end1 - start1)
        elif start1 < start2 and end1 < end2:
            return (end1 - start2) / (end1 - start1)
        # no condition matched: the function returns None (NaN in the output below)
    

    For better performance, use numpy.select:

    import numpy as np

    # m1..m6 are boolean masks, s1..s6 the value used where the mask is True
    m1 = (df.start1 == df.start2) & (df.end1 == df.end2)
    s1 = 1
    m2 = ((df.start1 == df.start2) & (df.end2 > df.start1)) | ((df.end1 == df.end2) & (df.start1 < df.start2))
    s2 = (df.end2 - df.start2) / (df.end1 - df.start1)
    m3 = (df.start2 < df.start1) & (df.end2 > df.end1)
    s3 = (df.end2 - df.start2) / (df.end1 - df.start1)
    m4 = (df.start1 < df.start2) & (df.end1 > df.end2)
    s4 = (df.end2 - df.start2) / (df.end1 - df.start1)
    m5 = (df.start1 > df.start2) & (df.end1 > df.end2)
    s5 = (df.end2 - df.start1) / (df.end1 - df.start1)
    m6 = (df.start1 < df.start2) & (df.end1 < df.end2)
    s6 = (df.end1 - df.start2) / (df.end1 - df.start1)

    # np.select takes the value of the first mask that is True for each row
    masks = [m1, m2, m3, m4, m5, m6]
    vals = [s1, s2, s3, s4, s5, s6]
    

    df['VAL'] = np.select(masks, vals, default=np.nan)
    df['val1'] = df.apply(f, axis=1)
    
    print (df)
      COL1  start1  end1  start2  end2  VAL  val1
    0    A    5000  6000    5000  6500  1.5   1.5
    1    B    5000  6000    4550  6000  NaN   NaN
    2    C    5000  6000    2000  5300  0.3   0.3
    3    D    5000  6000    5900  8000  0.1   0.1
    4    E    5000  6000    5600  5800  0.2   0.2
    5    F    5000  6000    5000  6000  1.0   1.0
    6    G    5000  6000    4000  7000  3.0   3.0
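
Row B comes out as NaN rather than the 1.45 the question expects: it has end1 == end2 but start1 > start2, and the second condition as written only accepts start1 < start2. The sketch below is an assumption about what the asker may have intended, not part of the original answer. It drops the start1 < start2 restriction, so any row with equal ends uses the ratio of the two interval lengths (rows where both starts and ends are equal are still caught by the first mask). The frame is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})

len1 = df.end1 - df.start1            # length of the first interval
ratio = (df.end2 - df.start2) / len1  # length of second interval / first

masks = [
    (df.start1 == df.start2) & (df.end1 == df.end2),
    # Assumed intent: equal ends match whichever start is smaller.
    ((df.start1 == df.start2) & (df.end2 > df.start1)) | (df.end1 == df.end2),
    (df.start2 < df.start1) & (df.end2 > df.end1),
    (df.start1 < df.start2) & (df.end1 > df.end2),
    (df.start1 > df.start2) & (df.end1 > df.end2),
    (df.start1 < df.start2) & (df.end1 < df.end2),
]
vals = [
    1.0,
    ratio,
    ratio,
    ratio,
    (df.end2 - df.start1) / len1,
    (df.end1 - df.start2) / len1,
]

# The first mask that is True wins; rows matching nothing get NaN.
df["NEWCOL"] = np.select(masks, vals, default=np.nan)
print(df)
```

On the sample data this variant reproduces the NEWCOL values listed in the question, including 1.45 for row B.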
    

    Performance

    #70k rows
    df = pd.concat([df] * 10000, ignore_index=True)
    
    
    In [111]: %timeit df['VAL'] = np.select(masks, vals, default=np.nan)
    1.79 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [112]: %timeit df['val1'] = df.apply(f, axis=1)
    1.41 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
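
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard timeit module. The helper name add_newcol is hypothetical; it rebuilds the masks from whatever frame it receives, which matters once the frame has been enlarged with pd.concat because masks built on 7 rows do not align with 70,000 rows. Unlike the timing above, this sketch also counts the time spent building the masks, and no figures are quoted since they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd


def add_newcol(df):
    """Hypothetical helper: build the masks for df and return the NEWCOL values."""
    s1, e1, s2, e2 = df.start1, df.end1, df.start2, df.end2
    len1 = e1 - s1
    masks = [
        (s1 == s2) & (e1 == e2),
        ((s1 == s2) & (e2 > s1)) | ((e1 == e2) & (s1 < s2)),
        (s2 < s1) & (e2 > e1),
        (s1 < s2) & (e1 > e2),
        (s1 > s2) & (e1 > e2),
        (s1 < s2) & (e1 < e2),
    ]
    vals = [
        1.0,
        (e2 - s2) / len1,
        (e2 - s2) / len1,
        (e2 - s2) / len1,
        (e2 - s1) / len1,
        (e1 - s2) / len1,
    ]
    return np.select(masks, vals, default=np.nan)


small = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
big = pd.concat([small] * 10000, ignore_index=True)  # 70k rows

# Average over 10 calls; masks are rebuilt on every call.
seconds = timeit.timeit(lambda: add_newcol(big), number=10) / 10
print(f"np.select on {len(big)} rows: {seconds * 1e3:.2f} ms per call")
```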
    

    【Discussion】:

      【Solution 2】:

      Another vectorized option is case_when from pyjanitor, which is similar to SQL's CASE WHEN, Python's if/else, or np.select:

      # pip install pyjanitor
      import numpy as np
      import pandas as pd
      import janitor

      # reusing @jezrael's already coded options :)
      df.case_when(
          m1, s1,  # condition, value if True
          m2, s2,
          m3, s3,
          m4, s4,
          m5, s5,
          m6, s6,
          np.nan,  # default if False
          column_name='NEWCOL')
      
      
      
        COL1  start1  end1  start2  end2  NEWCOL
      0    A    5000  6000    5000  6500     1.5
      1    B    5000  6000    4550  6000     NaN
      2    C    5000  6000    2000  5300     0.3
      3    D    5000  6000    5900  8000     0.1
      4    E    5000  6000    5600  5800     0.2
      5    F    5000  6000    5000  6000     1.0
      6    G    5000  6000    4000  7000     3.0
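
If installing pyjanitor is not an option, pandas itself has a Series.case_when method from version 2.2 onward. The sketch below is not from the answer above and assumes pandas 2.2 or newer; on older versions it falls back to np.select. It starts from an all-NaN Series, which plays the role of the default, and replaces values wherever a condition holds, with earlier conditions taking precedence. The conditions are the ones from Solution 1, so row B stays NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})

s1, e1, s2, e2 = df.start1, df.end1, df.start2, df.end2
len1 = e1 - s1

# (condition, replacement) pairs, in priority order.
cases = [
    ((s1 == s2) & (e1 == e2), 1.0),
    (((s1 == s2) & (e2 > s1)) | ((e1 == e2) & (s1 < s2)), (e2 - s2) / len1),
    ((s2 < s1) & (e2 > e1), (e2 - s2) / len1),
    ((s1 < s2) & (e1 > e2), (e2 - s2) / len1),
    ((s1 > s2) & (e1 > e2), (e2 - s1) / len1),
    ((s1 < s2) & (e1 < e2), (e1 - s2) / len1),
]

default = pd.Series(np.nan, index=df.index)  # value where nothing matches
if hasattr(pd.Series, "case_when"):
    # pandas >= 2.2
    df["NEWCOL"] = default.case_when(cases)
else:
    # Fallback for older pandas: same logic through np.select.
    conds, vals = zip(*cases)
    df["NEWCOL"] = np.select(list(conds), list(vals), default=np.nan)

print(df)
```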
      

      【Discussion】:
