【Posted】:2020-12-07 19:49:22
【Question】:
I have the following code, which runs in an offline environment:
import pandas as pd
import re
from nltk.stem import PorterStemmer
test = {'grams': ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}
def rower(x):
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    originQ = []
    for i in splitQ:
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
rower(test.grams)
It produces this:
['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']
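The first list shows the sentences before PorterStemmer() is applied. The second list shows the sentences after PorterStemmer() is applied.
As you can see, PorterStemmer() trims the word three to thre only when it is the last word in the sentence. When three is not the last word, it stays three. I can't figure out why it does this. I'm also worried that if I apply rower(x) to other sentences, it may produce similar results without my noticing.
How can I prevent PorterStemmer from treating the last word differently?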
【Discussion】:
Tags: python nltk porter-stemmer
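A likely explanation, offered as a sketch rather than a definitive answer: PorterStemmer().stem() expects a single word, and rower() passes it a whole joined sentence. The stemmer then treats the entire string as one long "word", so only the ending of the string (which happens to be the last word) is affected. Stemming each token separately and rejoining should avoid this. The helper name stem_each_word below is hypothetical and not part of NLTK.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_each_word(sentence):
    """Hypothetical helper: stem every whitespace-separated token, then rejoin."""
    return ' '.join(stemmer.stem(word) for word in sentence.split())

sentence = 'third donkey three'

# The whole string is handled as a single word, so only its ending changes.
whole = stemmer.stem(sentence)

# Each token is stemmed independently, so position in the sentence no longer matters.
per_word = stem_each_word(sentence)

print(whole)
print(per_word)
```

In rower(), this would amount to replacing PorterStemmer().stem(i) with a per-word call such as stem_each_word(i). Note that other words may still change once they are stemmed individually, since the stemmer now sees every token on its own.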