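I tested some of these. @A-Za-z's suggestion is a major improvement, but it can be made faster still.
Edit: I re-ran the tests with the replacement dictionary and the DataFrame (and the compiled regexes) built beforehand, outside the timed function. The new timings are:
- Original: 11.71 seconds
- @A-Za-z: 4.72 seconds, a 60% improvement.
- @piRSquared: 4.95 seconds, a 58% improvement.
- Precompiled: 2.81 seconds, a 76% improvement.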
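For reference, the percentages above follow directly from the measured times. This is only a small sketch of that arithmetic; the helper name `improvement_percent` is made up for illustration.

```python
# Timings (in seconds) copied from the list above.
baseline = 11.71
timings = {"@A-Za-z": 4.72, "@piRSquared": 4.95, "precompiled": 2.81}


def improvement_percent(baseline_s, new_s):
    """Return the reduction in run time relative to the baseline, in whole percent."""
    return round((1 - new_s / baseline_s) * 100)


for name, seconds in timings.items():
    print(f"{name}: {improvement_percent(baseline, seconds)}% improvement")
```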
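Original results, with data generation and regex compilation included in the timing:
"Testing your code I got 15 seconds, @A-Za-z's code gave 8-9 seconds, and my own solution brought it down to 6 seconds. It uses precompiled regular expressions. See the end of this answer."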
Imports:

    import pandas as pd
    import re
    import timeit

Your original code:

    miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
    data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

    def org(printout=False):
        def parse_text(data):
            # One pass over the whole column per dictionary entry.
            for key, replacement in miscdict.items():
                data['q1'] = data['q1'].str.replace(key, replacement)
            return data
        data2 = parse_text(data)
        if printout:
            print(data2)

    org(printout=True)
    print(timeit.timeit(org, number=10000))

This took 11.7 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    11.71043858179268

User @A-Za-z's code:

    miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
    data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

    def alt1(printout=False):
        # In-place replacement on the selected column. Note: with copy-on-write
        # in recent pandas versions, a chained in-place call like this may not
        # update `data`.
        data['q1'].replace(miscdict, regex=True, inplace=True)
        if printout:
            print(data)

    alt1(printout=True)
    print(timeit.timeit(alt1, number=10000))

This took 4.7 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    4.721581550644499

User @piRSquared's code:

    miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
    data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

    def alt2(printout=False):
        # `data` is rebound below, so it must be declared global;
        # otherwise Python raises UnboundLocalError.
        global data
        # regex=True was added afterwards because the replacement does not work without it.
        data = data.replace(miscdict, regex=True)
        if printout:
            print(data)

    alt2(printout=True)
    print(timeit.timeit(alt2, number=10000))

This took 5.0 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    4.951810616074919

My version, with precompiled regular expressions:

    miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
    # Compile every pattern once, outside the timed function.
    miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}
    data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

    def alt3(printout=False):
        def parse_text(text):
            # Apply each compiled pattern in turn to a single string.
            for pattern, replacement in miscdict_comp.items():
                text = pattern.sub(replacement, text)
            return text
        data["q1"] = data["q1"].apply(parse_text)
        if printout:
            print(data)

    alt3(printout=True)
    print(timeit.timeit(alt3, number=10000))

This took 2.8 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    2.810334940701157

The idea is to precompile the patterns you want to replace.
I got the idea from here: https://jerel.co/blog/2011/12/using-python-for-super-fast-regex-search-and-replace
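To show the precompilation idea in isolation, here is a minimal sketch that uses only the standard library and works on a plain list of strings instead of a DataFrame. The function names `replace_sequential` and `replace_single_pass` are made up for illustration. The second function is a variation that was not timed above: it joins all keys into one alternation so each string is scanned once. I have not benchmarked it, so treat it as something to try, not as a guaranteed speed-up.

```python
import re

miscdict = {
    " isn't ": " is not ",
    " aren't ": " are not ",
    " wasn't ": " was not ",
    " snevada ": " Sierra Nevada ",
}

# Variant 1: compile each key once, then apply the patterns one after another.
compiled = {re.compile(re.escape(key)): value for key, value in miscdict.items()}


def replace_sequential(text):
    for pattern, replacement in compiled.items():
        text = pattern.sub(replacement, text)
    return text


# Variant 2 (assumption, not part of the timings above): one combined pattern.
# Keys are escaped so they match literally; the matched text is looked up in the dict.
combined = re.compile("|".join(re.escape(key) for key in miscdict))


def replace_single_pass(text):
    return combined.sub(lambda match: miscdict[match.group(0)], text)


samples = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]
for sample in samples:
    print(repr(replace_sequential(sample)), repr(replace_single_pass(sample)))
```

The two variants agree on this sample data, but they can differ when one replacement's output could itself be matched by another key: the sequential version would replace it again, while the single-pass version would not.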