Spacy token.lemma_ not identifying nouns and pronouns

【Question title】: Spacy token.lemma_ not identifying nouns and pronouns
【Posted】: 2021-02-16 07:01:47
【Question】:

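I have been following a tutorial on lemmatization -> https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

As described in its spaCy lemmatization section, I loaded the 'en_core_web_sm' model, parsed the given sentence, and extracted the lemma of each word.

My code is as follows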

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

doc = nlp(sentence)

lemmatized_spacy_output = " ".join([token.lemma_ for token in doc])
print(lemmatized_spacy_output)

For the input

"The striped bats are hanging on their feet for best"

it gives the output

the stripe bat be hang on their foot for good

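whereas the expected output is

the strip bat be hang on -PRON- foot for good

As can be seen, the word "striped" was supposed to be identified as a verb, but for some reason it is treated as a noun (the output is "stripe" instead of "strip"). Also, the personal pronoun is not identified; the token is simply returned as it is.

I have gone through many GitHub issues and Stack Overflow questions, but none of them addresses my problem.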

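【Comments】:

This tutorial looks like it was written for spaCy v2.x rather than v3.x, where some of the behavior has changed.
@aab My spacy version shows 3.0.3. Could you elaborate on which behaviors changed in spacy 3.x?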
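
A quick way to tell which behavior to expect is to look at the major version of the installed package. The sketch below is only an illustration: `expects_pron_placeholder` is a hypothetical helper, not part of spaCy, and it only separates 2.x (the version the tutorial appears to target, where pronoun lemmas were the placeholder "-PRON-") from everything else.

```python
def expects_pron_placeholder(version: str) -> bool:
    """Return True for a spaCy 2.x version string.

    Hypothetical helper: spaCy 2.x lemmatized personal pronouns to "-PRON-",
    while 3.x returns an ordinary lemma. Other major versions are not
    considered here and simply yield False.
    """
    major = int(version.split(".")[0])  # "3.0.3" -> 3
    return major == 2


# With spaCy installed, pass spacy.__version__ instead of a literal string.
print(expects_pron_placeholder("3.0.3"))
print(expects_pron_placeholder("2.3.5"))
```

On 3.0.3, the version reported above, the check is negative, which is consistent with "their" coming back unchanged instead of "-PRON-".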

Tags: spacy pos-tagger lemmatization

【Solution 1】:

As aab said in his comment: which version are you using? I use spaCy version 3 and call

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet for best"
doc = nlp(sentence)

for token in doc:
    print(token.text, " -- ", token.pos_, " -- ",token.lemma_)

The  --  DET  --  the
striped  --  VERB  --  stripe
bats  --  NOUN  --  bat
are  --  VERB  --  be
hanging  --  VERB  --  hang
on  --  ADP  --  on
their  --  PRON  --  their
feet  --  NOUN  --  foot
for  --  ADP  --  for
best  --  ADJ  --  good

This means that "striped" is identified as a verb.

【Discussion】:

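My spacy version is 3.0.3. As far as I remember, the verb lemma of "striped" should be strip and the noun lemma should be stripe (see the article I linked in the question). Also, even though token.pos_ shows PRON, the lemma is still shown as their rather than -PRON-. What can I do to get the -PRON- output?
If you want exactly the same output as in the article, you need to downgrade your spaCy to v2. I'm not sure what your concern is. Do you think you did something wrong because you get different results? That is fine, don't worry; just keep learning from your article.
I will definitely try downgrading to v2. All I care about is getting the results shown in the article. I thought some other function might give me the exact output from the article. Anyway, thanks for your help!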
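
If downgrading is not an option, the "-PRON-" part of the article's output can be approximated on v3 by post-processing the lemmas. The following is a minimal sketch, not spaCy's own behavior: `v2_style_lemmas`, `lemmatize_with_spacy` and `FakeToken` are hypothetical names, and treating the fine-grained tags PRP and PRP$ as "personal pronoun" is an assumption about what the v2 placeholder covered. The demo tokens reuse the lemmas printed in the answer above; their tags are assumed for illustration.

```python
from typing import Iterable, NamedTuple

PRON_PLACEHOLDER = "-PRON-"              # lemma used for personal pronouns in spaCy v2
PERSONAL_PRONOUN_TAGS = {"PRP", "PRP$"}  # assumption: tags that received the placeholder


class FakeToken(NamedTuple):
    """Hypothetical stand-in exposing the two Token attributes used below."""
    tag_: str
    lemma_: str


def v2_style_lemmas(tokens: Iterable) -> str:
    """Join lemmas, substituting the v2 placeholder for personal pronouns."""
    return " ".join(
        PRON_PLACEHOLDER if tok.tag_ in PERSONAL_PRONOUN_TAGS else tok.lemma_
        for tok in tokens
    )


def lemmatize_with_spacy(sentence: str, model: str = "en_core_web_sm") -> str:
    """Run the same pipeline as in the question, then apply the substitution."""
    import spacy  # imported lazily so v2_style_lemmas works without spaCy installed

    nlp = spacy.load(model, disable=["parser", "ner"])
    return v2_style_lemmas(nlp(sentence))


# Lemmas as printed in the answer; the tags are assumed for this demo.
demo = [
    FakeToken("DT", "the"), FakeToken("VBN", "stripe"), FakeToken("NNS", "bat"),
    FakeToken("VBP", "be"), FakeToken("VBG", "hang"), FakeToken("IN", "on"),
    FakeToken("PRP$", "their"), FakeToken("NNS", "foot"), FakeToken("IN", "for"),
    FakeToken("JJS", "good"),
]
print(v2_style_lemmas(demo))  # the stripe bat be hang on -PRON- foot for good
```

This only restores the pronoun placeholder. The stripe/strip difference comes from the lemmatizer itself and is not changed by this post-processing, so reproducing the article exactly would still require v2 as suggested above.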