Adding a column calculated with conditions in pandas: answers

【Question title】: add column calculation with condition in pandas
【Posted】: 2022-01-16 19:51:33
【Question description】:

I have a DataFrame such as:

COL1 start1 end1    start2  end2
A    5000   6000    5000    6500
B    5000   6000    4550    6000
C    5000   6000    2000    5300
D    5000   6000    5900    8000
E    5000   6000    5600    5800
F    5000   6000    5000    6000
G    5000   6000    4000    7000
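
To try the answers on this data, the table can be built as a DataFrame first. This is only a minimal sketch; the variable name `df` is an assumption chosen to match the name the answers use.

```python
import pandas as pd

# Sample data copied from the table above.
# `df` is an assumed name; it matches what the answers refer to.
df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
print(df)
```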

I would like to add a new column NEWCOL computed as follows:

if start1 == start2 & end1 == end2:
  NEWCOL = 1
elif (start1 == start2 & end2 > start1) | (end1 == end2 & start1 < start2):
  NEWCOL = (end2-start2) / (end1-start1)
elif start2 < start1 & end2 > end1:
  NEWCOL = (end2-start2) / (end1-start1)
elif start1 < start2 & end1 > end2:
  NEWCOL = (end2-start2) / (end1-start1)
elif start1 > start2 & end1 > end2:
  NEWCOL = (end2-start1) / (end1-start1)
elif start1 < start2 & end1 < end2:
  NEWCOL = (end1-start2) / (end1-start1)

I should then get:

COL1    start1  end1    start2  end2    NEWCOL
A   5000    6000    5000    6500    1.5
B   5000    6000    4550    6000    1.45
C   5000    6000    2000    5300    0.3
D   5000    6000    5900    8000    0.1
E   5000    6000    5600    5800    0.2
F   5000    6000    5000    6000    1
G   5000    6000    4000    7000    3

【Question discussion】:

Tags: python python-3.x pandas numpy

【Solution 1】:

A solution with a custom function is possible, but it will be slow on a larger DataFrame:

def f(x):
    # x is one row of the DataFrame (passed by df.apply with axis=1)
    start1 = x['start1']
    start2 = x['start2']
    end1 = x['end1']
    end2 = x['end2']

    if start1 == start2 and end1 == end2:
        return 1
    elif (start1 == start2 and end2 > start1) or (end1 == end2 and start1 < start2):
        return (end2 - start2) / (end1 - start1)
    elif start2 < start1 and end2 > end1:
        return (end2 - start2) / (end1 - start1)
    elif start1 < start2 and end1 > end2:
        return (end2 - start2) / (end1 - start1)
    elif start1 > start2 and end1 > end2:
        return (end2 - start1) / (end1 - start1)
    elif start1 < start2 and end1 < end2:
        return (end1 - start2) / (end1 - start1)

For better performance, use numpy.select:

import numpy as np

# Each m* is a boolean mask; each s* is the value used where that mask is True.
m1 = (df.start1 == df.start2) & (df.end1 == df.end2)
s1 = 1
m2 = ((df.start1 == df.start2) & (df.end2 > df.start1)) | ((df.end1 == df.end2) & (df.start1 < df.start2))
s2 = (df.end2 - df.start2) / (df.end1 - df.start1)
m3 = (df.start2 < df.start1) & (df.end2 > df.end1)
s3 = (df.end2 - df.start2) / (df.end1 - df.start1)
m4 = (df.start1 < df.start2) & (df.end1 > df.end2)
s4 = (df.end2 - df.start2) / (df.end1 - df.start1)
m5 = (df.start1 > df.start2) & (df.end1 > df.end2)
s5 = (df.end2 - df.start1) / (df.end1 - df.start1)
m6 = (df.start1 < df.start2) & (df.end1 < df.end2)
s6 = (df.end1 - df.start2) / (df.end1 - df.start1)

masks = [m1,m2,m3,m4,m5,m6]
vals = [s1,s2,s3,s4,s5,s6]

df['VAL'] = np.select(masks, vals, default=np.nan)
df['val1'] = df.apply(f, axis=1)

print(df)
  COL1  start1  end1  start2  end2  VAL  val1
0    A    5000  6000    5000  6500  1.5   1.5
1    B    5000  6000    4550  6000  NaN   NaN
2    C    5000  6000    2000  5300  0.3   0.3
3    D    5000  6000    5900  8000  0.1   0.1
4    E    5000  6000    5600  5800  0.2   0.2
5    F    5000  6000    5000  6000  1.0   1.0
6    G    5000  6000    4000  7000  3.0   3.0
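
Row B is NaN in both columns because none of the six conditions matches it: end1 == end2 holds, but start2 (4550) is smaller than start1 (5000), while the second condition requires start1 < start2. The question expects 1.45 for that row, which equals (end2-start2) / (end1-start1). One possible repair is sketched below, under the assumption that the asker wanted the second branch to apply whenever the ends match and the starts differ. The function name `overlap_ratio` and the relaxed condition are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})

def overlap_ratio(df):
    """Hypothetical variant of the np.select approach that also covers row B."""
    len1 = df.end1 - df.start1
    full = (df.end2 - df.start2) / len1
    masks = [
        (df.start1 == df.start2) & (df.end1 == df.end2),
        # Assumption: equal ends with *different* starts (either direction)
        # use the same formula; the original only handled start1 < start2.
        ((df.start1 == df.start2) & (df.end2 > df.start1))
        | ((df.end1 == df.end2) & (df.start1 != df.start2)),
        (df.start2 < df.start1) & (df.end2 > df.end1),
        (df.start1 < df.start2) & (df.end1 > df.end2),
        (df.start1 > df.start2) & (df.end1 > df.end2),
        (df.start1 < df.start2) & (df.end1 < df.end2),
    ]
    vals = [
        1,
        full,
        full,
        full,
        (df.end2 - df.start1) / len1,
        (df.end1 - df.start2) / len1,
    ]
    # The first matching mask wins; rows matching nothing become NaN.
    return np.select(masks, vals, default=np.nan)

df["NEWCOL"] = overlap_ratio(df)
print(df)
```

With this change row B evaluates to (6000-4550) / (6000-5000) = 1.45, and the other rows keep the values shown above. Whether this is the intended rule is for the asker to confirm.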

Performance:

#70k rows
df = pd.concat([df] * 10000, ignore_index=True)


In [111]: %timeit df['VAL'] = np.select(masks, vals, default=np.nan)
1.79 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [112]: %timeit df['val1'] = df.apply(f, axis=1)
1.41 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
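
The `%timeit` lines are IPython magics. Outside IPython, a rough comparison can be made with the standard `timeit` module. The sketch below is an illustration under several assumptions: it rebuilds the sample frame, uses 7k rows rather than 70k to keep the run short, includes mask construction in the vectorized timing, and the helper names `with_select` and `with_apply` are hypothetical. Absolute numbers will vary with machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
df = pd.concat([base] * 1000, ignore_index=True)  # 7k rows

def f(x):
    # Row-wise version from the answer.
    start1, start2, end1, end2 = x["start1"], x["start2"], x["end1"], x["end2"]
    if start1 == start2 and end1 == end2:
        return 1
    elif (start1 == start2 and end2 > start1) or (end1 == end2 and start1 < start2):
        return (end2 - start2) / (end1 - start1)
    elif start2 < start1 and end2 > end1:
        return (end2 - start2) / (end1 - start1)
    elif start1 < start2 and end1 > end2:
        return (end2 - start2) / (end1 - start1)
    elif start1 > start2 and end1 > end2:
        return (end2 - start1) / (end1 - start1)
    elif start1 < start2 and end1 < end2:
        return (end1 - start2) / (end1 - start1)
    # Explicit fallback; the original returns None implicitly here.
    return np.nan

def with_select(df):
    # Vectorized version: masks and values are rebuilt on every call.
    len1 = df.end1 - df.start1
    full = (df.end2 - df.start2) / len1
    masks = [
        (df.start1 == df.start2) & (df.end1 == df.end2),
        ((df.start1 == df.start2) & (df.end2 > df.start1))
        | ((df.end1 == df.end2) & (df.start1 < df.start2)),
        (df.start2 < df.start1) & (df.end2 > df.end1),
        (df.start1 < df.start2) & (df.end1 > df.end2),
        (df.start1 > df.start2) & (df.end1 > df.end2),
        (df.start1 < df.start2) & (df.end1 < df.end2),
    ]
    vals = [1, full, full, full,
            (df.end2 - df.start1) / len1, (df.end1 - df.start2) / len1]
    return np.select(masks, vals, default=np.nan)

def with_apply(df):
    return df.apply(f, axis=1)

# Average over a few runs for the fast path; one run is enough for apply.
t_select = timeit.timeit(lambda: with_select(df), number=5) / 5
t_apply = timeit.timeit(lambda: with_apply(df), number=1)
print(f"np.select: {t_select * 1e3:.2f} ms per call")
print(f"apply:     {t_apply * 1e3:.2f} ms per call")
```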

【Discussion】:

【Solution 2】:

Another vectorized option is case_when from pyjanitor, which is similar to a SQL CASE WHEN expression, Python's if/else, or np.select:

# pip install pyjanitor
import numpy as np
import pandas as pd
import janitor  # registers the case_when method on DataFrame

# reusing @jezrael's already coded options :)
df.case_when(
    m1, s1,  # condition, value if True
    m2, s2,
    m3, s3,
    m4, s4,
    m5, s5,
    m6, s6,
    np.nan,  # default if False
    column_name='NEWCOL')



  COL1  start1  end1  start2  end2  NEWCOL
0    A    5000  6000    5000  6500     1.5
1    B    5000  6000    4550  6000     NaN
2    C    5000  6000    2000  5300     0.3
3    D    5000  6000    5900  8000     0.1
4    E    5000  6000    5600  5800     0.2
5    F    5000  6000    5000  6000     1.0
6    G    5000  6000    4000  7000     3.0

【Discussion】: