First, let's look at the different stemmers/lemmatizers that NLTK provides:
>>> from nltk import stem
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> porter = stem.porter.PorterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> wnl = stem.wordnet.WordNetLemmatizer()
>>> word = "cowardly"
>>> lancaster.stem(word)
'coward'
>>> porter.stem(word)
u'cowardli'
>>> snowball.stem(word)
u'coward'
>>> wnl.stem(word)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'WordNetLemmatizer' object has no attribute 'stem'
>>> wnl.lemmatize(word)
'cowardly'
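Note: WordNetLemmatizer is not a stemmer, so it has no stem() method. Instead it returns the lemma of cowardly, which in this case is the same word.
It seems the Porter stemmer is the only one that changes cowardly -> cowardli. Let's look at the code to see why that happens; see http://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer.
This appears to be the part responsible for ly -> li: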
def _step1c(self, word):
    """step1c() turns terminal y to i when there is another vowel in the stem.
    --NEW--: This has been modified from the original Porter algorithm so that y->i
    is only done when y is preceded by a consonant, but not if the stem
    is only a single consonant, i.e.
       (*c and not c) Y -> I
    So 'happy' -> 'happi', but
       'enjoy' -> 'enjoy' etc
    This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->
    'enjoy'. Step 1c is perhaps done too soon; but with this modification that
    no longer really matters.
    Also, the removal of the vowelinstem(z) condition means that 'spy', 'fly',
    'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried',
    'flies' ...
    """
    if word[-1] == 'y' and len(word) > 2 and self._cons(word, len(word) - 2):
        return word[:-1] + 'i'
    else:
        return word
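To see the rule in isolation, here is a minimal standalone sketch of that step that runs without NLTK. The names `is_consonant` and `step1c_sketch` are hypothetical, and the consonant test is an assumption modelled on the usual Porter convention (a `y` counts as a consonant at the start of a word or after a vowel); it is not NLTK's actual `_cons` implementation.

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    """Return True if word[i] counts as a consonant (hypothetical helper).

    Assumption: 'y' is a consonant at the start of a word or after a vowel,
    and acts as a vowel after a consonant.
    """
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c_sketch(word):
    """Turn a terminal 'y' into 'i' when a consonant precedes it."""
    # The length check comes first so very short or empty input is left alone.
    if len(word) > 2 and word[-1] == "y" and is_consonant(word, len(word) - 2):
        return word[:-1] + "i"
    return word


for w in ("cowardly", "happy", "enjoy", "spy", "by"):
    print(w, "->", step1c_sketch(w))
```

Under these assumptions, cowardly ends in a consonant followed by y, so it becomes cowardli, just like happy -> happi and spy -> spi, while enjoy (vowel before the y) and by (too short) are left unchanged.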