
【Question title】: Improving the performance of text cleanup on a dataframe
【Posted】: 2017-08-28 12:57:23
【Question】:

I have a df:

id    text
1     This is a good sentence
2     This is a sentence with a number: 2015
3     This is a third sentence

I have a text-cleaning function:

import re
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]',' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    stops = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

get_wordnet_pos() is this:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

I am applying clean() to a pandas column and creating a new column with the results:

df['cleanText'] = df['text'].apply(clean)

Resulting df:

id    cleanText
1     good sentence
2     sentence number
3     third sentence

The loop time seems to grow exponentially. For example, using %%timeit, applying it to five rows runs at 17 ms per loop. 300 rows run at 800 ms per loop. 500 rows run at 1.26 s per loop.
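
A quick check of the timings quoted above, as a rough sketch: dividing each total by its row count gives the per-row cost. The three data points are taken from the question; the variable names are only illustrative. The per-row figure stays in the same range, which suggests growth closer to linear, so the gains are most likely to come from making each call to clean() cheaper.

```python
# Timings quoted in the question: (number of rows, seconds per loop)
timings = [(5, 0.017), (300, 0.800), (500, 1.26)]

# Cost of a single row in milliseconds for each measurement
per_row_ms = [round(seconds / rows * 1000, 2) for rows, seconds in timings]

for (rows, _), ms in zip(timings, per_row_ms):
    print(f"{rows:>3} rows: {ms} ms per row")
```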

I changed it by instantiating stops and WordNetLemmatizer() outside the function, since they only need to be created once.

stops = set(stopwords.words('english'))
lem = WordNetLemmatizer()
def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]',' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    tokens = [w for w in tokens if not w in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(lem.lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

Running %prun -l 10 on the apply line produces this table:

         672542 function calls (672538 primitive calls) in 2.798 seconds

   Ordered by: internal time
   List reduced from 211 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4097    0.727    0.000    0.942    0.000 perceptron.py:48(predict)
     4500    0.584    0.000    0.584    0.000 {built-in method nt.stat}
     3500    0.243    0.000    0.243    0.000 {built-in method nt._isdir}
    14971    0.157    0.000    0.178    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
    57358    0.129    0.000    0.155    0.000 perceptron.py:250(add)
     4105    0.117    0.000    0.201    0.000 {built-in method builtins.max}
   184365    0.084    0.000    0.084    0.000 perceptron.py:58(<lambda>)
     4097    0.057    0.000    0.213    0.000 perceptron.py:245(_get_features)
      500    0.038    0.000    1.220    0.002 perceptron.py:143(tag)
     2000    0.034    0.000    0.068    0.000 ntpath.py:471(normpath)

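Predictably, the perceptron tagger seems to take up a lot of resources, but I'm not sure how to streamline it. Also, I'm not sure where nt.stat or nt._isdir is being called.

How should I change the function or the apply method to improve performance? Is this function a candidate for Cython or Numba?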

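【Question comments】:

Can't say without your data and expected output.
Added sample input data and the result of the cleaning function. I'm getting the right output - the question is more about how to get the right output faster.
Interesting. Does the order of the words matter? I'm guessing yes?
Yes, because cleanedText is later sent to a vectorizer to collect ngrams, frequencies, tf-idf weights, and so on.
The most obvious improvement I see is reducing get_wordnet_pos to a str defaultdict.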

Tags: python performance pandas nltk apply

【Solution 1】:

The first obvious improvement I see here is that the whole get_wordnet_pos function can be reduced to a dictionary lookup:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Instead, initialize a defaultdict from the collections module:

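import collections

# Keys are the first letter of the Treebank tag; any other letter
# falls back to NOUN, like the else branch of the original function.
get_wordnet_pos = collections.defaultdict(lambda: wordnet.NOUN)
get_wordnet_pos.update({'J': wordnet.ADJ,
                        'V': wordnet.VERB,
                        'N': wordnet.NOUN,
                        'R': wordnet.ADV})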

You would then access the lookup like this:

get_wordnet_pos[w[1][0]]
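
To illustrate why the lookup is keyed on the first character of the tag, here is a minimal sketch that compares the dictionary against the original if/elif chain. The single-letter constants are stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV (assumed values, so that the sketch runs without NLTK), and pos_by_startswith and pos_lookup are hypothetical names.

```python
from collections import defaultdict

# Stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV
# (assumed values) so the sketch runs without NLTK installed.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def pos_by_startswith(treebank_tag):
    """The original if/elif chain, kept only for comparison."""
    if treebank_tag.startswith("J"):
        return ADJ
    elif treebank_tag.startswith("V"):
        return VERB
    elif treebank_tag.startswith("R"):
        return ADV
    return NOUN  # covers 'N...' tags and everything else

# Keys are first letters; any other letter falls back to NOUN.
pos_lookup = defaultdict(lambda: NOUN, {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV})

sample_tags = ["JJ", "JJR", "VBD", "VBZ", "NN", "NNS", "RB", "DT", "CD"]

# Indexing with the first character agrees with startswith() for every tag.
mismatches = [t for t in sample_tags if pos_lookup[t[0]] != pos_by_startswith(t)]
print(mismatches)  # prints []

# Indexing with the whole tag misses every key, so the default (NOUN) is returned.
whole_tag_result = pos_lookup["VBD"]
print(whole_tag_result)  # prints n
```

Note that a defaultdict stores every missing key it is asked for, and that indexing t[0] raises IndexError on an empty tag; a plain dict with .get(t[:1], NOUN) would be one way to avoid both.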

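Next, you could consider precompiling the regex pattern if it is going to be used in multiple places. The speedup you get isn't much, but it all adds up.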

pattern = re.compile('[^a-zA-Z]')

Inside your function, you would call:

pattern.sub(' ', text)
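
As a small sketch of how the precompiled pattern slots into the function, the block below checks that it produces the same result as the module-level re.sub call. The helper name letters_only is hypothetical. The gain tends to be modest because the re module already caches recently compiled patterns.

```python
import re

# Compiled once at module level instead of on every call.
pattern = re.compile(r"[^a-zA-Z]")

def letters_only(text):
    """Replace every character that is not an ASCII letter with a space."""
    return pattern.sub(" ", text)

sample = "This is a sentence with a number: 2015"

# Same result as the module-level call used in the question.
same_as_re_sub = letters_only(sample) == re.sub("[^a-zA-Z]", " ", sample)
print(same_as_re_sub)
print(letters_only(sample).split())
```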

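OTOH, if you know where your text comes from and have some idea of what you might and might not see, you can precompute a table of characters and use str.translate instead, which would be much faster than the clunky regex-based substitution: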

# Precomputed, build-once translation table (keep this outside the function).
# Every listed character maps to '', so translate() deletes it.
tab = str.maketrans(dict.fromkeys("1234567890!@#$%^&*()_+-={}[]|\'\":;,<.>/?\\~`", ''))

text = 'hello., hi! lol, what\'s up'
new_text = text.translate(tab)  # this would run inside your function

print(new_text)

hello hi lol whats up
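
One difference from the regex version is worth illustrating: the regex replaces each unwanted character with a space, while the table above deletes it, which can fuse neighbouring words. The sketch below contrasts the two; it assumes the set of unwanted characters is string.digits plus string.punctuation, and the names to_space and to_nothing are hypothetical. Unlike [^a-zA-Z], neither table touches non-ASCII characters.

```python
import string

unwanted = string.digits + string.punctuation

# Deletes each unwanted character, like the table in the answer.
to_nothing = str.maketrans("", "", unwanted)

# Replaces each unwanted character with a space, like the regex in the question.
to_space = str.maketrans({ch: " " for ch in unwanted})

text = "a well-known fact (2015)"

deleted_tokens = text.translate(to_nothing).split()
spaced_tokens = text.translate(to_space).split()

print(deleted_tokens)  # 'well-known' collapses into a single token
print(spaced_tokens)   # 'well' and 'known' stay separate
```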

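Furthermore, I'd say word_tokenize is overkill - all you are doing is stripping out special characters anyway, so you lose all the benefits of word_tokenize, which really matter when punctuation and the like are present. You could opt for text.split() instead.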

Finally, skip the clean = " ".join(tokensLemmatized) step. Just return the list, and call df.applymap(" ".join) in the very last step.
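
The last two suggestions can be sketched together without NLTK or pandas: split on whitespace instead of tokenizing, return the token list, and join only at the end. This is a simplified illustration, not the full pipeline: POS tagging and lemmatization are left out, STOPS is a tiny made-up stop list standing in for stopwords.words('english'), and clean_tokens is a hypothetical name.

```python
import re

NON_LETTERS = re.compile(r"[^a-zA-Z]")

# Tiny illustrative stop list; the real code uses stopwords.words('english').
STOPS = {"this", "is", "a", "with"}

def clean_tokens(text):
    """Return the cleaned tokens as a list; joining is left to the caller."""
    words = NON_LETTERS.sub(" ", text).lower().split()
    return [w for w in words if w not in STOPS]

rows = [
    "This is a good sentence",
    "This is a sentence with a number: 2015",
    "This is a third sentence",
]

# Keep lists while processing, then join once at the very end.
token_lists = [clean_tokens(text) for text in rows]
cleaned = [" ".join(tokens) for tokens in token_lists]
print(cleaned)
```

In pandas, if only one column holds the token lists, joining that column alone, for example with df['cleanText'].str.join(' '), is likely a safer final step than mapping over every cell of the frame.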

I'll leave the benchmarking to you.

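【Comments】:

Thanks a lot - very helpful. With the defaultdict, it throws an error saying TypeError: 'collections.defaultdict' object is not callable. Other than that, what you said about the substitution and the split makes a lot of sense.
@CameronTaylor There was a small mistake. You index the dictionary as get_wordnet_pos[...], not (...). Will edit my answer.
Another quirk might be that in the original function the tag is matched with startswith. Is there a way to implement that in the defaultdict? Because currently I believe it treats most things as nouns, since a lot of tags are more than one letter.
@CameronTaylor Oh, my bad. I thought this was understood. Actually, you can index the dictionary like this: get_wordnet_pos[w[1][0]], where w[1] is the tag and w[1][0] is its first character.
No, I was just being dense. Thanks again. I'll accept your answer while I work through the rest of your suggestions. Already seeing speed improvements.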