Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error

【Question title】: Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error
【Posted】: 2023-03-19 11:11:01
【Question】:

I have code that uses NLTK's averaged perceptron tagger for POS tagging:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I then tried to loop over the tagged tokens and lemmatize each one with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
       lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

This produces the following error:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

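I think there are two problems here:

The POS tags are not converted to tags that WordNet understands (I tried to implement something similar to the answer "wordnet lemmatization and pos tagging in python", without success).
The data structure is not in the right format for looping over each tuple (I could not find much about this error beyond os-related code).

How can I follow POS tagging with lemmatization and avoid these errors?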

【Comments】:

Tags: python python-3.x nlp nltk pos-tagger

【Solution 1】:

The Python interpreter tells you exactly what is wrong:

AttributeError: 'tuple' object has no attribute 'endswith'

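tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the source code of the WordNetLemmatizer class). Only string objects have an endswith() method, so you need to pass the first element of each tuple in tokensPOS, like this: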

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
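
To make the cause of the AttributeError concrete, here is a minimal sketch that does not need NLTK at all. It hard-codes the tagger output shown in the question (the variable name tokens_pos is just for illustration) and shows that each element is a (word, tag) tuple, which is why a string method such as endswith() is missing until the tuple is unpacked:

```python
# Tagger output copied from the question, hard-coded so this runs without NLTK.
tokens_pos = [('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

first = tokens_pos[0]
print(type(first).__name__)           # tuple
print(hasattr(first, 'endswith'))     # False: tuples have no endswith()
print(hasattr(first[0], 'endswith'))  # True: the word itself is a str

# Unpack each pair so the word and its tag can be handled separately.
words = [word for word, tag in tokens_pos]
tags = [tag for word, tag in tokens_pos]
print(words)
print(tags)
```

Unpacking with "for word, tag in ..." is equivalent to indexing with w[0] and w[1]; it only makes the two parts explicit.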

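The lemmatize() method uses wordnet.NOUN as its default POS. Unfortunately, WordNet uses different tags from the ones the NLTK tagger produces, so you have to translate them manually (as in the link you provided) and pass the translated tag as the second argument to lemmatize(). Here is the full script, with the get_wordnet_pos() function taken from the linked answer: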

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
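    else:
        # Fall back to noun, the default POS of lemmatize();
        # an empty string is not a WordNet POS and may raise a KeyError
        return wordnet.NOUN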

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

# Create the lemmatizer once instead of on every iteration
lemmatizer = WordNetLemmatizer()
lemmatizedWords = []
for word, tag in tokensPOS:
    lemmatizedWords.append(lemmatizer.lemmatize(word, get_wordnet_pos(tag)))

print(lemmatizedWords)
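
The tag translation can also be written as a lookup table. The sketch below is one possible variant, not part of the original answer: the names PENN_PREFIX_TO_WORDNET and penn_to_wordnet are hypothetical, and the WordNet constants are hard-coded as the single-character strings that wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV stand for in NLTK ('a', 'v', 'n', 'r'), so that it runs without NLTK installed. Falling back to the noun tag for everything else is an assumption that mirrors the default pos of lemmatize().

```python
# Assumption: these are the string values behind NLTK's wordnet.ADJ,
# wordnet.VERB, wordnet.NOUN and wordnet.ADV constants.
PENN_PREFIX_TO_WORDNET = {
    'J': 'a',  # JJ, JJR, JJS   -> adjective
    'V': 'v',  # VB, VBD, VBZ.. -> verb
    'N': 'n',  # NN, NNS, NNP.. -> noun
    'R': 'r',  # RB, RBR, RBS   -> adverb
}


def penn_to_wordnet(penn_tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character.

    Tags outside the four groups (e.g. 'DT', 'IN') get `default`, which is
    noun here to match the default POS of WordNetLemmatizer.lemmatize().
    """
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1], default)


# Tagger output copied from the question.
tokens_pos = [('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tokens_pos]
print(converted)
```

With NLTK available, each (word, pos) pair in converted could then be passed to lemmatize(word, pos) in place of the get_wordnet_pos() call.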

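【Discussion】:

That gets rid of the tuple error, thanks. But the lemmatizer still defaults to noun, right? How do I convert from the averaged perceptron tagger's tags to WordNet tags?
@CameronTaylor I have updated my answer. I hope it helps.
Right, so this works with the treebank model. I am looking for a conversion between the averaged perceptron model and WordNet, and I am just not sure where to start. For example, what is the averaged perceptron equivalent of treebank_tag?
@CameronTaylor The averaged perceptron tagger uses the treebank tag set: ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html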