Use contains with | (regex OR) to build a boolean mask, then filter with boolean indexing:
df = df[df['Process Parameter'].str.contains('|'.join(keys))]
print (df)
Process Parameter Value
1 System Clk 2.0
2 Core Clk 3.0
3 Bilinear Coeff 5.1
4 Prec Coeff 6.2
Detail:
print (df['Process Parameter'].str.contains('|'.join(keys)))
0 False
1 True
2 True
3 True
4 True
5 False
Name: Process Parameter, dtype: bool
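A minimal, self-contained sketch of the mask-then-filter approach above. The sample frame mirrors the one printed in the EDIT section of this answer. The call to re.escape is an addition that is not in the original: it is only an assumption that keys might one day contain regex metacharacters such as + or ., in which case the raw '|'.join(keys) pattern would misbehave.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Process Parameter': ['Clockspeed', 'System Clk', 'Core Clk',
                          'Bilinear Coeff', 'Prec Coeff', 'Yield'],
    'Value': [1.2, 2.0, 3.0, 5.1, 6.2, 7.4],
})
keys = ['Clk', 'Coeff']

# Escape each key so it is matched literally, then OR the keys together.
pattern = '|'.join(re.escape(k) for k in keys)

# Boolean mask: True where any key occurs as a substring.
mask = df['Process Parameter'].str.contains(pattern)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

If the column can contain missing values, passing na=False to str.contains keeps the mask strictly boolean.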
Another solution uses extract, which returns NaN for values that do not match, so notnull is needed to turn the result into a boolean mask:
df = df[df['Process Parameter'].str.extract('('+'|'.join(keys)+')',expand=False).notnull()]
print (df)
Process Parameter Value
1 System Clk 2.0
2 Core Clk 3.0
3 Bilinear Coeff 5.1
4 Prec Coeff 6.2
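To make the intermediate step visible, here is a small sketch that separates the extract call from the notnull call. The sample frame is the one from the EDIT section of this answer; the variable names extracted and mask are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Process Parameter': ['Clockspeed', 'System Clk', 'Core Clk',
                          'Bilinear Coeff', 'Prec Coeff', 'Yield'],
    'Value': [1.2, 2.0, 3.0, 5.1, 6.2, 7.4],
})
keys = ['Clk', 'Coeff']

# str.extract needs a capture group, hence the surrounding parentheses.
pat = '(' + '|'.join(keys) + ')'

# With expand=False and one group, the result is a Series holding the
# first matched key per row, or NaN where nothing matched.
extracted = df['Process Parameter'].str.extract(pat, expand=False)

# notnull converts "matched something" into a boolean mask.
mask = extracted.notnull()
filtered = df[mask]
print(extracted)
print(filtered)
```

This variant is mainly useful when the matched key itself is needed later; for pure filtering, contains does less work.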
Timings:
import numpy as np
import pandas as pd
a = 'Temperature System Clk Core Clk Bilinear Coeff Prec Coeff Yield'.split()
N = 200000
df = pd.DataFrame({'Process Parameter': [np.random.choice(a, size=np.random.randint(1,10)) for x in range(N)]})
df['Process Parameter'] = df['Process Parameter'].str.join(' ')
keys =['Clk', 'Coeff']
In [115]: %timeit df[df['Process Parameter'].str.contains('|'.join(keys))]
10 loops, best of 3: 140 ms per loop
In [116]: %timeit df[df['Process Parameter'].str.extract('('+'|'.join(keys)+')',expand=False).notnull()]
1 loop, best of 3: 247 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ's solution 1
In [117]: %timeit df[df['Process Parameter'].str.findall('|'.join(keys)).astype(bool)]
10 loops, best of 3: 177 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ's solution 2
In [118]: %timeit df[df['Process Parameter'].str.split(expand=True).isin(keys).any(axis=1)]
1 loop, best of 3: 527 ms per loop
#piRSquared solution 1 (find is assumed to be NumPy's vectorized string find, e.g. np.char.find)
In [136]: %timeit df[(find(df['Process Parameter'].values.astype(str)[:, None], keys) >= 0).any(1)]
1 loop, best of 3: 487 ms per loop
#piRSquared solution 2
In [137]: %timeit df[df['Process Parameter'].str.split().apply(set) & set(keys)]
1 loop, best of 3: 401 ms per loop
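The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run from a plain script with the standard timeit module. Everything specific here is an assumption made for illustration: the smaller N, the fixed random seed, and the helper names with_contains and with_extract. Absolute numbers will differ by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

words = 'Temperature System Clk Core Clk Bilinear Coeff Prec Coeff Yield'.split()
N = 20000  # assumption: smaller than the 200000 rows used above, for a quick run
rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability

# Each row is 1 to 9 randomly chosen words joined by spaces.
df = pd.DataFrame({
    'Process Parameter': [' '.join(rng.choice(words, size=rng.integers(1, 10)))
                          for _ in range(N)]
})
keys = ['Clk', 'Coeff']
pat = '|'.join(keys)


def with_contains():
    # Hypothetical helper wrapping the contains-based filter.
    return df[df['Process Parameter'].str.contains(pat)]


def with_extract():
    # Hypothetical helper wrapping the extract-based filter.
    return df[df['Process Parameter'].str.extract('(' + pat + ')', expand=False).notnull()]


# Best of 3 repeats, 3 calls each, reported per call in milliseconds.
t_contains = min(timeit.repeat(with_contains, number=3, repeat=3)) / 3
t_extract = min(timeit.repeat(with_extract, number=3, repeat=3)) / 3
print('contains: {:.1f} ms'.format(t_contains * 1000))
print('extract:  {:.1f} ms'.format(t_extract * 1000))
```

Both helpers should select the same rows, which is worth checking before comparing their speed.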
EDIT: To match whole words only, use a word boundary (\b):
df = pd.DataFrame({'Process Parameter' : ['Clockspeed', 'System Clk', 'Core Clk',
'Bilinear Coeff', 'Prec Coeff', 'Yield'],
'Value' : [1.2,2.0,3.0, 5.1, 6.2, 7.4]})
keys =['Clk', 'Coeff']
print (df)
Process Parameter Value
0 Clockspeed 1.2
1 System Clk 2.0
2 Core Clk 3.0
3 Bilinear Coeff 5.1
4 Prec Coeff 6.2
5 Yield 7.4
pat = '|'.join(r"\b{}\b".format(x) for x in keys)
df = df[df['Process Parameter'].str.contains(pat)]
print (df)
Process Parameter Value
1 System Clk 2.0
2 Core Clk 3.0
3 Bilinear Coeff 5.1
4 Prec Coeff 6.2
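In the sample data above, the plain pattern and the word-boundary pattern happen to select the same rows. The sketch below uses two hypothetical values, 'Clkspeed' and 'Coefficient', which are not in the original data, to show where the two patterns differ. The re.escape call is likewise an addition for safety rather than part of the original answer.

```python
import re

import pandas as pd

# 'Clkspeed' and 'Coefficient' are made-up values for illustration:
# they contain a key as a substring but not as a whole word.
s = pd.Series(['Clkspeed', 'System Clk', 'Prec Coeff', 'Coefficient'])
keys = ['Clk', 'Coeff']

# Plain alternation: matches the key anywhere inside the string.
plain = '|'.join(re.escape(k) for k in keys)

# \b on both sides: the key must start and end at a word boundary.
bounded = '|'.join(r'\b{}\b'.format(re.escape(k)) for k in keys)

plain_mask = s.str.contains(plain)
bounded_mask = s.str.contains(bounded)

print(s[plain_mask].tolist())    # all four values
print(s[bounded_mask].tolist())  # only 'System Clk' and 'Prec Coeff'
```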