Why does "cowardly" become "cowardli" after stemming?

【Question title】: Why cowardly becomes cowardli after stemming?
【Posted】: 2014-06-12 16:22:10
【Question】:

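I noticed that after applying the Porter stemmer (from the NLTK library) I get strange stems such as "cowardli" or "contrari". To me, they don't look like stems at all.

Is this expected? Or did I make a mistake somewhere?

Here is my code: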

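import nltk

# `string` holds the input text: lowercase it and keep alphabetic tokens only
string = string.lower()
tokenized = nltk.tokenize.regexp_tokenize(string, "[a-z]+")
# Build the stopword set once instead of reloading the list for every token
stopwords = set(nltk.corpus.stopwords.words("english"))
filtered = [w for w in tokenized if w not in stopwords]

# Stem every remaining token with the Porter stemmer
stemmer = nltk.stem.porter.PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]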

This is the text I am processing: http://pastebin.com/XUMNCYAU (the beginning of Dostoevsky's "Crime and Punishment").

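【Comments】:

If the results don't look right to you, then you are probably looking for lemmas, not stems. See stackoverflow.com/questions/771918/…
No, I am specifically looking for stems. But I now see that the Porter stemmer transforms some of its own example words the same way it transforms mine. I probably don't understand the concept of a stem well enough; I'll dig deeper.
See stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers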

Tags: nlp nltk stemming

【Solution 1】:

First, let's look at the different stemmers and lemmatizers that NLTK provides:

>>> from nltk import stem
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> porter = stem.porter.PorterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> wnl = stem.wordnet.WordNetLemmatizer()
>>> word = "cowardly"
>>> lancaster.stem(word)
'coward'
>>> porter.stem(word)
u'cowardli'
>>> snowball.stem(word)
u'coward'
>>> wnl.stem(word)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'WordNetLemmatizer' object has no attribute 'stem'
>>> wnl.lemmatize(word)
'cowardly'

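Note: WordNetLemmatizer is not a stemmer, so it has no stem() method. It outputs the lemma of cowardly, which in this case is the same word.

The Porter stemmer appears to be the only one that changes cowardly -> cowardli. Let's look at the code to see why; see http://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer.

This appears to be the part responsible for ly -> li: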

def _step1c(self, word):
    """step1c() turns terminal y to i when there is another vowel in the stem.
    --NEW--: This has been modified from the original Porter algorithm so that y->i
    is only done when y is preceded by a consonant, but not if the stem
    is only a single consonant, i.e.

       (*c and not c) Y -> I

    So 'happy' -> 'happi', but
      'enjoy' -> 'enjoy'  etc

    This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
    'enjoy'. Step 1c is perhaps done too soon; but with this modification that
    no longer really matters.

    Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
    'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
    'flies' ...
    """
    if word[-1] == 'y' and len(word) > 2 and self._cons(word, len(word) - 2):
        return word[:-1] + 'i'
    else:
        return word
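
To see the rule in isolation, here is a minimal standalone sketch of the y -> i step. It is not NLTK's code: the name step1c_sketch is made up for illustration, and the consonant test is a simplifying assumption (any letter other than a, e, i, o, u counts as a consonant, whereas NLTK's _cons() treats 'y' more carefully).

```python
VOWELS = set("aeiou")


def step1c_sketch(word):
    """Simplified stand-in for the y -> i rule quoted above (hypothetical helper)."""
    # Assumption: every letter outside a/e/i/o/u is a consonant.
    # The rule fires only for words longer than two letters that end in
    # 'y' preceded by a consonant.
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word


for w in ["cowardly", "contrary", "happy", "enjoy"]:
    print(w, "->", step1c_sketch(w))
```

Under these assumptions, "cowardly" and "contrary" end in a consonant followed by y, so the final y is rewritten to i, while "enjoy" is left alone because the y follows a vowel. This only mirrors the single step shown above; the full stemmer applies further suffix rules afterwards.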

【Comments】:

So if we try to look up the sense of such a stem using SentiWordNet or WordNet, will it return the appropriate sense?