PorterStemmer() trims the last word in a sentence differently

【Question title】: PorterStemmer() trims the last word in a sentence differently
【Posted】: 2020-12-07 19:49:22
【Question】:

I have the following code, which runs in an offline environment:

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams':  ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    cleanQ = []  
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    
    originQ = []    
    for i in splitQ: 
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
    
rower(test.grams)

It produces this:

['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']

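The first list shows the sentences before PorterStemmer() is applied. The second list shows the sentences after PorterStemmer() is applied.

As you can see, PorterStemmer() trims the word three to thre only when it is the last word in the sentence. When three is not the last word, it stays three. I can't figure out why it does this. I'm also worried that if I apply rower(x) to other sentences, it may produce similar results without my noticing.

How can I stop PorterStemmer from treating the last word differently?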

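【Question comments】:

Tags: python nltk porter-stemmer

【Solution 1】:

The main mistake here is that you pass several words at once to the stemmer instead of one word at a time. The entire string (third donkey three) is treated as a single word, so the suffix rules are applied only to its very end.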

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                  'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}

ps = PorterStemmer()

def rower(x):
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())

    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)


rower(test.grams)

Output:

IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
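
To see the effect in isolation, here is a minimal sketch that stems the same text both ways. It assumes NLTK is installed; the variable names (sentence, whole, per_word) are only illustrative.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

sentence = 'third donkey three'

# One call on the whole string: the stemmer sees a single long "word"
# and applies its suffix rules only to the very end of it.
whole = ps.stem(sentence)

# One call per token: each word is stemmed on its own.
per_word = [ps.stem(word) for word in sentence.split()]

print(whole)
print(per_word)
```

With the NLTK version used in this thread, the first call should give 'third donkey thre' (as in the question's output) and the second ['third', 'donkey', 'three'] (as in the output above).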

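There is a good explanation of why the stemmer drops the trailing "e" from some words (such as value becoming valu) in the question referenced below. If the output does not meet your expectations, consider using a lemmatizer instead.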

How to stop NLTK stemmer from removing the trailing “e”?

【Comments】:

To get each sentence back as a single string, change that line to originQ = [' '.join([ps.stem(word) for word in sent]) for sent in splitQ]
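
The same idea can be wrapped in a small reusable helper. The following is only a sketch: stem_sentences, stem_word and toy_stem are hypothetical names, the regex cleaning step from rower() is left out for brevity, and the stemmer is passed in as a callable so the helper does not depend on a particular library. The demo uses a toy stand-in that is not the Porter algorithm.

```python
from typing import Callable, Iterable, List


def stem_sentences(sentences: Iterable[str],
                   stem_word: Callable[[str], str],
                   stopwords: frozenset = frozenset()) -> List[str]:
    """Stem each word separately, then join every sentence back into a string."""
    result = []
    for sentence in sentences:
        # Split first, so the stemmer only ever receives one word per call.
        words = [w for w in sentence.lower().split() if w not in stopwords]
        result.append(' '.join(stem_word(w) for w in words))
    return result


def toy_stem(word: str) -> str:
    """Stand-in stemmer for the demo only: drops one trailing 's' from longer words."""
    return word[:-1] if word.endswith('s') and len(word) > 3 else word


demo = stem_sentences(['Cats and dogs', 'Three donkeys'], toy_stem, frozenset({'and'}))
print(demo)  # ['cat dog', 'three donkey']
```

With NLTK, one would pass PorterStemmer().stem as stem_word and the STOPWORDS set from the question as stopwords.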