How to implement a faster condition checking function? (Python answers)

【Question title】: How to implement a faster condition checking function? (Python)
【Posted】: 2014-02-26 08:38:10
【Question】:

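I have a small script that extracts unique words based on several conditions, and checking those conditions takes a long time.

This is probably because it checks against a large dictionary and also applies a stemmer to every token.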

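The conditions are:

the token is not in the chosen dictionary
the token is longer than 1 character
the token is not in a fixed set of punctuation
the token is not purely digits
the token does not end with "'s"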

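Is there a faster implementation for checking multiple conditions? Any Python-based solution is acceptable, even one that uses subprocess, Cython, or calls into a C/C++ implementation.

Keep in mind that in reality there are more conditions, and the dictionary has up to 100,000 entries. I have done something like the following, and even with yield, chaining multiple conditions is slow.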

import string
from nltk.stem import PorterStemmer

porter = PorterStemmer()

dictionary = ['apple', 'pear', 'orange', 'water', 'eat', 'the', 'with', 'an', 'pie', 'full', 'of', 'water', 'at', 'lake', 'on', 'wednesday', 'plus', 'and', 'many', 'more', 'word']

text = "PEAR eats the Orange, with an Apple's MX2000 full of water - h20 - at Lake 0129 on wednesday."

def extract(txt, dic):
    # Yield each cleaned token that passes every filter.
    for i in txt.split():
        _i = i.strip().strip(string.punctuation).lower()
        # Use the dic argument for both lookups (not the global list).
        if _i not in dic and len(_i) > 1 and not _i.isdigit() \
        and porter.stem(_i) not in dic and not i.endswith("'s"):
            yield _i

for i in extract(text, dictionary):
    print(i)

[out]

mx2000
h20

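【Comments】:

You won't speed up the conditions themselves; combining them with and is the most efficient approach (if they all need to be true). But you can tune other parts of the algorithm. You could use re.finditer() instead of split() to iterate over the tokens in txt; that way you don't build a list of all tokens that you don't need.
You could start by making dictionary a real dictionary instead of a list.
Or even a set. dictionary = { 'apple', 'pear', 'orange', ... }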
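
A minimal sketch of the two suggestions in these comments, not code from the thread. The names `TOKEN_RE`, `iter_tokens`, `word_list` and `word_set` are made up for illustration, and the 100,000-word vocabulary is synthetic. For ordinary text, the pattern `\S+` should yield the same tokens as `txt.split()`, but lazily.

```python
import re
import timeit

# Runs of non-whitespace characters: the same tokens str.split() produces.
TOKEN_RE = re.compile(r"\S+")

def iter_tokens(txt):
    """Lazily yield whitespace-separated tokens without building a list."""
    for match in TOKEN_RE.finditer(txt):
        yield match.group()

text = "PEAR eats the Orange, with an Apple's MX2000 full of water"
tokens = list(iter_tokens(text))  # materialized here only for inspection

# Synthetic vocabulary of 100,000 made-up entries, used for timing only.
word_list = ["word%d" % n for n in range(100000)]
word_set = set(word_list)

# A miss is the worst case for a list: every entry has to be compared.
list_time = timeit.timeit(lambda: "mx2000" in word_list, number=100)
set_time = timeit.timeit(lambda: "mx2000" in word_set, number=100)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

The exact timings depend on the machine, but the set lookup should not slow down as the vocabulary grows, while the list lookup scans linearly.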

Tags: python c++ list dictionary cython

【Solution 1】:

Two things come to mind:

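Change the dictionary to a set (as @Alfe suggested). Given how large your data is, this will definitely help with speed.
Since the evaluation stops as soon as one rule is false, you can rearrange the tests so that the fastest and/or most discriminating rules run first. In this case, though, the best order is not obvious to me. Experiment with it.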
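
The two points above can be sketched roughly as follows. This is one possible ordering, not a measured optimum. `toy_stem` is a hypothetical stand-in for `porter.stem`, used only so the snippet runs without NLTK, and `words` is a shortened copy of the question's dictionary stored as a set.

```python
import string

def extract(txt, words, stem):
    """Yield cleaned tokens that pass every filter, cheapest checks first."""
    for raw in txt.split():
        token = raw.strip(string.punctuation).lower()
        # `and` short-circuits, so the cheap string checks run first and
        # the stemmer is only called for tokens that survive them.
        if (len(token) > 1
                and not token.isdigit()
                and not raw.endswith("'s")
                and token not in words          # hash lookup on a set
                and stem(token) not in words):  # most expensive, so last
            yield token

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word

words = {'apple', 'pear', 'orange', 'water', 'eat', 'the', 'with', 'an',
         'full', 'of', 'at', 'lake', 'on', 'wednesday'}
text = ("PEAR eats the Orange, with an Apple's MX2000 full of water "
        "- h20 - at Lake 0129 on wednesday.")

result = list(extract(text, words, toy_stem))
print(result)
```

With the stemmer from the question, the call would be `extract(text, words, porter.stem)`. Whether this order is actually the fastest depends on the data, so it is worth timing a few alternatives.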

【Discussion】:

It's spelled "defin__i__tely".
@alvas "definately" is a common misspelling of "definitely", and one of my (and others') favorite pet peeves.
I always mix them up. Edited.