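Improving the speed of preprocessing: answers

【Question title】: Improving the speed of preprocessing
【Posted】: 2019-01-08 10:26:54
【Question description】:

The following code preprocesses text using a custom lemmatization function: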

%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.utils import simple_preprocess, lemmatize
from gensim.parsing.preprocessing import STOPWORDS
STOPWORDS = list(STOPWORDS)

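def preprocessor(s):
    # lemmatize() yields byte strings of the form b'lemma/POS';
    # decode each one and keep only the lemma part.
    result = []
    for token in lemmatize(s, stopwords=STOPWORDS, min_length=2):
        result.append(token.decode('utf-8').split('/')[0])
    return result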

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')

%%time
X_train, X_test, y_train, y_test = train_test_split([preprocessor(x) for x in data.text],
                                                    data.label, test_size=0.2, random_state=0)
# 10.8 seconds

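Question: Can the lemmatization step be made faster?

On a large corpus of about 80,000 documents, it currently takes about two hours. The lemmatize() function seems to be the main bottleneck, since gensim functions such as simple_preprocess are very fast.

Thanks for your help!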

【Question comments】:

Tags: gensim lemmatization

【Answer 1】:

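You may want to refactor your code so that each part can be timed separately. lemmatize() may be part of your bottleneck, but other significant contributors could be: (1) building up large documents one token at a time via list .append(); (2) the utf-8 decoding.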
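One way to time the stages separately is sketched below. It is only an illustration: fake_lemmatize is a hypothetical stand-in that mimics the b'lemma/POS' output format so the sketch runs without gensim or Pattern, and timed and strip_tags are helper names made up for this example. Swap in the real lemmatize() call to measure your own pipeline.

```python
import time


def fake_lemmatize(text):
    # Hypothetical stand-in for gensim's lemmatize(): returns byte strings
    # shaped like b'token/POS'. It does no real lemmatization.
    return [f"{word}/NN".encode("utf-8") for word in text.lower().split()]


def strip_tags(tagged_doc):
    # The decode-and-split step from the question, isolated so that it can
    # be timed on its own.
    return [token.decode("utf-8").split("/")[0] for token in tagged_doc]


def timed(fn, items):
    # Apply fn to every item and return (results, elapsed wall-clock seconds).
    start = time.perf_counter()
    results = [fn(item) for item in items]
    return results, time.perf_counter() - start


docs = ["The cats sat", "Dogs bark loudly"]
tagged, t_lemmatize = timed(fake_lemmatize, docs)
tokens, t_decode = timed(strip_tags, tagged)
print(f"lemmatize: {t_lemmatize:.6f}s  decode/split: {t_decode:.6f}s")
```

If the lemmatize stage dominates on a real sample, the decoding and .append() calls are not worth optimizing first.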

Separately, gensim's lemmatize() relies on the parse() function from the Pattern library; you could try an alternative lemmatization utility, such as those in NLTK or spaCy.
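To make it easy to try a different lemmatizer, the preprocessing step can take the lemmatizer as a parameter. The sketch below is an assumption-laden illustration, not a drop-in replacement: make_preprocessor and nltk_lemmatizer are hypothetical helper names, the regex tokenizer is a simplification, and NLTK's WordNetLemmatizer (which treats words as nouns unless told otherwise) will not produce exactly the same lemmas as Pattern.

```python
import re
from typing import Callable, Iterable, List

TOKEN_RE = re.compile(r"[a-z]+")  # simplistic tokenizer, assumed for this sketch


def make_preprocessor(lemmatize_token: Callable[[str], str],
                      stopwords: Iterable[str],
                      min_length: int = 2) -> Callable[[str], List[str]]:
    """Build a preprocess(text) function around any token -> lemma callable."""
    stop = frozenset(stopwords)  # set lookup instead of scanning a list

    def preprocess(text: str) -> List[str]:
        tokens = TOKEN_RE.findall(text.lower())
        return [lemmatize_token(t) for t in tokens
                if len(t) >= min_length and t not in stop]

    return preprocess


def nltk_lemmatizer() -> Callable[[str], str]:
    """Return NLTK's WordNet lemmatizer as a token -> lemma callable.

    Assumes nltk is installed and the 'wordnet' corpus has been downloaded
    (for example via nltk.download('wordnet')).
    """
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize
```

With NLTK available, preprocess = make_preprocessor(nltk_lemmatizer(), STOPWORDS) gives a function that can replace preprocessor in the list comprehension from the question, so the two lemmatizers can be timed on the same sample.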

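Finally, lemmatization may be an inherently costly operation, and the same source data may be processed many times in your pipeline. You may therefore want to design your process so that the results are written to disk and then reused on subsequent runs, rather than always being recomputed "in-line".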
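A minimal sketch of that caching idea, using only the standard library, follows. load_or_compute and toy_preprocess are hypothetical names, JSON is just one possible on-disk format, and the temporary directory exists only to keep the demonstration self-contained.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(cache_path, texts, preprocess):
    """Return cached token lists if the cache file exists, else compute and save."""
    path = Path(cache_path)
    if path.exists():
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    result = [preprocess(text) for text in texts]
    with path.open("w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


calls = []  # records each text that is actually processed


def toy_preprocess(text):
    # Stand-in for the expensive lemmatizing preprocessor.
    calls.append(text)
    return text.lower().split()


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "tokens.json"
    first = load_or_compute(cache_file, ["The cats sat"], toy_preprocess)
    second = load_or_compute(cache_file, ["The cats sat"], toy_preprocess)
```

The second call reads the file instead of running toy_preprocess again. Note that this sketch never invalidates the cache: if the input texts or the preprocessing logic change, the cache file has to be deleted or given a new name.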

【Comments】: