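Since there are many equivalent ways to accomplish this task, and it looked like a useful test case for getting a better feel for the general performance of Pandas, I benchmarked all of the submitted answers (I am the author of the code in cells #3, #4, #7 and #10).
I increased the input data size by a factor of one thousand to make the comparison more realistic and fair. Otherwise the pure Python solutions outperform the Pandas-based approaches because of a constant overhead that becomes irrelevant for large datasets. The solutions are sorted from best to worst.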
In [1]:
import pandas as pd
df = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                   'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                   'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc']})
strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']
mapping = dict(zip(strings, replacement))
a_regex = '(jan|feb|mar|apr|may)'
month_replacements = {'jan': 'January', 'feb': 'February',
                      'mar': 'March', 'apr': 'April', 'may': 'May'}
In [2]:
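# Use a more realistic input size: repeat the six rows 1000 times
df['NewCol4'] = ''
# Older pandas releases allowed chaining .consolidate() here; that method has since been removed
df = pd.concat([df] * 1000).reset_index()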
In [3]:
%%timeit -n 100
result = []
for searchcol, default in zip(df["SearchCol3"], df["Col2"]):
    for s in mapping:
        if s in searchcol:
            result.append(mapping[s])
            break
    else:
        # for-else: reached only when no abbreviation matched
        result.append(default)
df['NewCol4'] = result
100 loops, best of 3: 2.69 ms per loop
In [4]:
%%timeit -n 100
result = []
for index, searchcol, default in df[["SearchCol3", "Col2"]].itertuples():
    for s in mapping:
        if s in searchcol:
            result.append(mapping[s])
            break
    else:
        # for-else: reached only when no abbreviation matched
        result.append(default)
df['NewCol4'] = result
100 loops, best of 3: 8.64 ms per loop
In [5]:
%%timeit -n 100
df['NewCol4'] = df.Col2
for i, s in enumerate(strings):
    df.loc[df.SearchCol3.str.contains(s), 'NewCol4'] = replacement[i]
100 loops, best of 3: 23.1 ms per loop
In [6]:
%%timeit -n 100
df['NewCol4'] = None
# Use month name if abbreviation in `SearchCol3`.
for month_code, month in zip(strings, replacement):
    df.loc[df.SearchCol3.str.contains(month_code), 'NewCol4'] = month
# Create a mask of null values and apply Col2 if null.
mask = df.NewCol4.isnull()
df.loc[mask, 'NewCol4'] = df.loc[mask, 'Col2']
100 loops, best of 3: 24.4 ms per loop
In [7]:
%%timeit -n 100
def match_string(searchcol, default):
    for s in mapping:
        if s in searchcol:
            return mapping[s]
    return default
df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
100 loops, best of 3: 135 ms per loop
In [8]:
%%timeit -n 100
def match_string(col3, col2):
    k = [s for s in strings if s in col3]
    if k:  # if found in col3, return that result
        return replacement[strings.index(k[0])]
    l = [s for s in replacement if s in col2]
    if l:  # else if found in col2, return second best option
        return l[0]
    return ''  # if neither, return empty string
df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
100 loops, best of 3: 144 ms per loop
In [9]:
%%timeit -n 100
# Extract using regex (expand=False returns a Series for a single capture group)
df['NewCol4'] = df['SearchCol3'].str.extract(a_regex, expand=False).fillna('')
# Look up values from dictionary
df['NewCol4'] = df['NewCol4'].apply(lambda x: month_replacements.get(x, ''))
# Use default value from other column if no other value
df['NewCol4'] = df.apply(lambda row: row['Col2'] if row['NewCol4'] == '' else row['NewCol4'], axis=1)
100 loops, best of 3: 147 ms per loop
In [10]:
%%timeit -n 10
df['NewCol4'] = ''
for index, row in df.iterrows():
    searchcol = row["SearchCol3"]
    for s in mapping:
        if s in searchcol:
            df.loc[index, "NewCol4"] = mapping[s]
            break
    else:
        # for-else: reached only when no abbreviation matched
        df.loc[index, "NewCol4"] = row["Col2"]
10 loops, best of 3: 2.82 s per loop
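For comparison, the regex idea from cell #9 can probably be written without any row-wise `apply`. The following is only a sketch that I have not benchmarked, so it has no place in the ranking above. It assumes that `str.extract(..., expand=False)`, `Series.map` and `Series.fillna` are available in your pandas version, and the variable names `pattern` and `codes` are mine, not taken from any of the answers.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
    'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc'],
})
mapping = {'jan': 'January', 'feb': 'February', 'mar': 'March',
           'apr': 'April', 'may': 'May'}

# Build '(jan|feb|mar|apr|may)' from the dictionary keys
pattern = '(' + '|'.join(mapping) + ')'

# Leftmost abbreviation found in each string, NaN where nothing matches
codes = df['SearchCol3'].str.extract(pattern, expand=False)

# Translate abbreviations to month names, then fall back to Col2 for the NaN rows
df['NewCol4'] = codes.map(mapping).fillna(df['Col2'])

print(df['NewCol4'].tolist())
```

One difference to keep in mind: the regex picks the abbreviation that appears first in the string, whereas the loop-based cells pick the first key of `mapping` that occurs anywhere in it. For the sample data both rules give the same result.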