【Posted】: 2019-01-08 10:26:54
【Question】:
The following code preprocesses text with a custom lemmatization function:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.utils import simple_preprocess, lemmatize
from gensim.parsing.preprocessing import STOPWORDS
STOPWORDS = list(STOPWORDS)
def preprocessor(s):
    result = []
    # lemmatize() yields byte strings of the form b'lemma/POS'; keep only the lemma
    for token in lemmatize(s, stopwords=STOPWORDS, min_length=2):
        result.append(token.decode('utf-8').split('/')[0])
    return result
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
%%time
X_train, X_test, y_train, y_test = train_test_split(
    [preprocessor(x) for x in data.text],
    data.label, test_size=0.2, random_state=0)
# 10.8 seconds
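A minimal sketch for reproducing a timing like the one above outside a notebook and extrapolating it to a larger corpus. The helper name `estimate_total_seconds` and its parameters are hypothetical and not part of gensim, and the extrapolation assumes that cost grows linearly with the number of documents.

```python
import time


def estimate_total_seconds(func, sample, corpus_size):
    """Time func over a sample and extrapolate linearly to corpus_size documents.

    Hypothetical helper: assumes every document costs about the same to process.
    """
    sample = list(sample)
    if not sample:
        raise ValueError("sample must not be empty")
    start = time.perf_counter()
    for doc in sample:
        func(doc)  # result is discarded; only the elapsed time matters here
    elapsed = time.perf_counter() - start
    # average seconds per document, scaled up to the full corpus
    return elapsed / len(sample) * corpus_size
```

Calling it as `estimate_total_seconds(preprocessor, data.text[:200], len(data.text))` would give a rough projected runtime for `preprocessor`, which makes it easier to compare alternative implementations on the same sample.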
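Question: can the lemmatization step be made faster?
On a large corpus of about 80,000 documents it currently takes roughly two hours. The lemmatize() function appears to be the main bottleneck, since gensim functions such as simple_preprocess are very fast.
Thanks for your help!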
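Two hours over about 80,000 documents works out to roughly 0.09 seconds per document, so the work is dominated by per-document CPU time. One common way to reduce wall-clock time in that situation is to spread the documents across processes. The sketch below is a general pattern, not a gensim feature: `parallel_preprocess`, `workers`, and `chunksize` are hypothetical names, `toy_preprocessor` is a stand-in for the real `preprocessor`, and it is an assumption that `lemmatize()` behaves correctly when called from worker processes.

```python
from concurrent.futures import ProcessPoolExecutor


def toy_preprocessor(text):
    """Stand-in for the real preprocessor (assumption): lowercase and split."""
    return text.lower().split()


def parallel_preprocess(texts, func, workers=4, chunksize=256):
    """Apply func to every text, optionally across several processes.

    func must be defined at module level so it can be pickled.
    The output order matches the input order.
    """
    texts = list(texts)
    if workers <= 1:
        # serial fallback, useful for debugging and for small inputs
        return [func(t) for t in texts]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches documents to reduce inter-process overhead
        return list(pool.map(func, texts, chunksize=chunksize))
```

With the question's code this would be called as `parallel_preprocess(data.text, preprocessor, workers=4)` in place of the list comprehension. On platforms that start workers by spawning a fresh interpreter, the call belongs under an `if __name__ == "__main__":` guard and `preprocessor` should live in an importable module. The speedup is bounded by the number of cores and by the cost of sending documents between processes.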
【Discussion】:
Tags: gensim lemmatization