Word sense disambiguation in NLTK Python: answers

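【Question title】: Word sense disambiguation in NLTK Python
【Posted】: 2011-04-11 14:50:39
【Question description】:

I am new to NLTK Python and I am looking for a sample application that performs word sense disambiguation. My searches turned up plenty of algorithms, but no sample application. I simply want to pass in a sentence and find out the sense of each word by looking it up in the WordNet library. Thanks.

I found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html Is there an equivalent module in NLTK Python?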

【Question comments】:

Here is a Python implementation: github.com/alvations/pywsd

Tags: python nltk

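【Solution 1】:

Yes, this is possible with the wordnet module in NLTK. The similarity measures used by the tool mentioned in your post are also available in the NLTK wordnet module.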

【Discussion】:

【Solution 2】:

See http://jaganadhg.freeflux.net/blog/archive/2010/10/16/wordnet-sense-similarity-with-nltk-some-basics.html

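【Discussion】:

【Solution 3】:

NLTK has an API for accessing WordNet. WordNet groups words into synsets (sets of synonyms). A synset gives you some information about the word: its hypernyms, hyponyms, root word, and so on.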
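The synset information mentioned here can be pulled out with a few method calls. The following is a minimal sketch, assuming NLTK 3.x (where `name()`, `definition()`, `lemma_names()`, `hypernyms()` and `hyponyms()` are methods on a synset) and assuming the WordNet data has been fetched with `nltk.download('wordnet')`. The helper names `summarize_synset` and `show_senses` are illustrative and not part of NLTK.

```python
def summarize_synset(synset):
    """Collect the basic WordNet facts for one synset into a plain dict."""
    return {
        "name": synset.name(),                                # e.g. 'bank.n.01'
        "definition": synset.definition(),                    # the gloss
        "lemmas": synset.lemma_names(),                       # words sharing this sense
        "hypernyms": [h.name() for h in synset.hypernyms()],  # more general senses
        "hyponyms": [h.name() for h in synset.hyponyms()],    # more specific senses
    }


def show_senses(word):
    """Print every WordNet sense of `word` (needs NLTK and its WordNet data)."""
    # Imported lazily so the helper above stays usable without NLTK installed.
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets(word):
        info = summarize_synset(synset)
        print(info["name"], "-", info["definition"])
        print("  hypernyms:", info["hypernyms"])
```

Calling `show_senses('bank')` lists every sense WordNet knows for the word. Picking the right one for a given sentence is the disambiguation step, which this API does not do for you.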

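"Python Text Processing with NLTK 2.0 Cookbook" is a good book for getting started with NLTK's various features. It is easy to read, understand, and implement.

In addition, you can look at papers (outside the NLTK world) on using Wikipedia for word sense disambiguation.

【Discussion】: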

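【Solution 4】:

Yes. In fact, the NLTK team wrote a book that has several chapters on classification and explicitly covers how to use WordNet. You can also buy a physical copy of the book from Safari.

FYI: NLTK was written by natural language programming academics for use in their introductory programming courses.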

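【Discussion】:

As far as I can tell, that chapter is specifically about classification and does not go very deep into word sense disambiguation.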

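【Solution 5】:

As a practical answer to the OP's request, here is a Python implementation of several WSD methods that returns senses in the form of NLTK synset(s): https://github.com/alvations/pywsd

It includes

Lesk algorithms (original Lesk, adapted Lesk, and simple Lesk)
Baseline algorithms (random sense, first sense, most frequent sense)

It can be used like this: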

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Updated to Python 3 syntax. Assumes pywsd is installed as a package
# (hence the pywsd.* imports) and NLTK 3, where definition() is a method.

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

print("======== TESTING simple_lesk ===========\n")
from pywsd.lesk import simple_lesk
print("#TESTING simple_lesk() ...")
print("Context:", bank_sents[0])
answer = simple_lesk(bank_sents[0], 'bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS ...")
print("Context:", bank_sents[1])
answer = simple_lesk(bank_sents[1], 'bank', 'n')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS and stems ...")
print("Context:", plant_sents[0])
# Stemming is requested by keyword here (assumption: the parameter is named
# `stem`); the original call passed True as the fourth positional argument.
answer = simple_lesk(plant_sents[0], 'plant', 'n', stem=True)
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("======== TESTING baseline ===========\n")
from pywsd.baseline import random_sense, first_sense
from pywsd.baseline import max_lemma_count as most_frequent_sense

print("#TESTING random_sense() ...")
print("Context:", bank_sents[0])
answer = random_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING first_sense() ...")
print("Context:", bank_sents[0])
answer = first_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING most_frequent_sense() ...")
print("Context:", bank_sents[0])
answer = most_frequent_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

[out]:

======== TESTING simple_lesk ===========

#TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities

#TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING simple_lesk() with POS and stems ...
Context: The workers at the industrial plant were overworked
Sense: Synset('plant.n.01')
Definition: buildings for carrying on industrial labor

======== TESTING baseline ===========
#TESTING random_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition: put into a bank account

#TESTING first_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING most_frequent_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
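The idea behind the Lesk family can be sketched in a few lines: score each candidate sense by how many words its gloss shares with the context, and return the best-scoring sense. The sketch below is not pywsd's implementation. To run without WordNet it uses a tiny hand-written sense inventory (`TOY_SENSES`, with hypothetical labels and glosses loosely modelled on WordNet) and a made-up stopword list.

```python
# Made-up stopword list, just large enough for the two example sentences.
STOPWORDS = {"a", "an", "the", "to", "of", "my", "i", "was", "that", "and", "into", "in"}

# Hypothetical mini sense inventory; a real system takes glosses from WordNet.
TOY_SENSES = {
    "bank": {
        "bank.river": "sloping land beside a body of water such as a river",
        "bank.money": "a financial institution that accepts deposits and lends money",
    }
}


def content_words(text):
    """Lowercase, split on whitespace, strip basic punctuation, drop stopwords."""
    return {w.strip(".,") for w in text.lower().split()} - STOPWORDS


def toy_lesk(sentence, word, senses=TOY_SENSES):
    """Return the sense of `word` whose gloss overlaps most with the sentence.

    Ties (including zero overlap everywhere) go to the first sense listed,
    which behaves like a first-sense back-off.
    """
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(toy_lesk("I went to the bank to deposit my money", "bank"))
print(toy_lesk("The river bank was full of dead fishes", "bank"))
```

The first sentence is decided by a single shared word ("money"); "deposit" does not match "deposits" in the gloss because nothing is stemmed. That fragility is typical of overlap counting.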

【Discussion】:

【Solution 6】:

Recently, part of the pywsd code was ported into the bleeding-edge version of NLTK, in the wsd.py module. Try:

>>> from nltk.wsd import lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> lesk(sent, ambiguous)
Synset('bank.v.04')
>>> lesk(sent, ambiguous).definition()
u'act as the banker in a game or in gambling'

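For better WSD performance, use the pywsd library rather than the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I will try to update the NLTK module when I have time.

In response to Chris Spencer's comment, please note the limitations of the Lesk algorithm. I am simply providing a faithful implementation of the algorithm. It is not a silver bullet; see http://en.wikipedia.org/wiki/Lesk_algorithm

Also note that although:

lesk("My cat likes to eat mice.", "cat", "n")

does not give you the right answer, you can use the max_similarity() implementation from pywsd: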

>>> from pywsd.similarity import max_similarity
>>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
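Roughly, the similarity-based approach picks the sense of the target word that is closest, in the WordNet hierarchy, to the senses of the surrounding words. The sketch below illustrates that idea only; it is not pywsd's code. It uses a plain path-based similarity rather than the `wup` or `lin` measures, and a hypothetical mini taxonomy (`PARENT`, `SENSES`) in place of WordNet.

```python
# Hypothetical mini taxonomy: child -> parent ("is-a" links).
PARENT = {
    "cat.feline": "carnivore",
    "carnivore": "mammal",
    "mouse.rodent": "rodent",
    "rodent": "mammal",
    "mammal": "animal",
    "animal": "entity",
    "cat.ct_scan": "medical_imaging",
    "medical_imaging": "procedure",
    "procedure": "entity",
    "mouse.device": "device",
    "device": "artifact",
    "artifact": "entity",
}

# Candidate senses per word; the wrong sense of "cat" is deliberately listed first.
SENSES = {
    "cat": ["cat.ct_scan", "cat.feline"],
    "mice": ["mouse.rodent", "mouse.device"],
}


def path_to_root(node):
    """Return the chain of nodes from `node` up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    depth_from_b = {n: i for i, n in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in depth_from_b:  # lowest common ancestor in a tree
            return 1.0 / (1 + steps_from_a + depth_from_b[node])
    return 0.0


def toy_max_similarity(context_words, target):
    """Pick the sense of `target` with the highest total similarity to the context."""
    def score(sense):
        total = 0.0
        for w in context_words:
            if w == target or w not in SENSES:
                continue  # skip the target itself and words with no known senses
            total += max(path_similarity(sense, other) for other in SENSES[w])
        return total

    return max(SENSES[target], key=score)


print(toy_max_similarity("my cat likes to eat mice".split(), "cat"))
```

Here the feline sense wins because it meets the rodent sense of "mice" at "mammal", a much shorter path than anything linking the CT-scan sense to either sense of "mice".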

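@Chris, if you want a python setup.py, ask politely and I'll write it...

【Discussion】:

Unfortunately, the accuracy is terrible. lesk("My cat likes to eat mice.", "cat", "n") => Synset('computerized_tomography.n.01'). And pywsd doesn't even have an install script...
Dear Chris, have you tried the other Lesk variants, especially simple_lesk() or adapted_lesk()? The original Lesk is known to have problems, which is why the package offers other solutions; see en.wikipedia.org/wiki/Lesk_algorithm. Also, I maintain this in my free time; it is not what I do for a living...
Yes, I tried all the Lesk variants in your package, and none of them worked on my sample corpus. I had to create a variant that also uses the glosses of all the hyponyms and meronyms related to the word just to get some positive results, and even then it was only 15% accurate. The problem isn't your code, it's Lesk. It is simply not a reliable heuristic.
Try maximizing similarity; it might do better. I am also coding up more algorithms, but that will have to wait for a code sprint in September. Also, look into more state-of-the-art methods. Finally, the most frequent sense (MFS) usually does quite well: with an MFS back-off, state-of-the-art systems beat it by 1-2%, 5% at most...
Guys, would it make sense to take a manually tagged corpus (with synsets disambiguated by humans) and train some kind of ML classifier on it? The trained classifier could then be included in your package as another disambiguation algorithm, provided it proves accurate on text it has not seen during training.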
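The last two comments (a classifier trained on hand-tagged text, with a most-frequent-sense back-off) can be sketched together. This is only an illustration under stated assumptions: the five labelled sentences in `TRAIN` and the sense labels are made up, the model is a bare-bones multinomial Naive Bayes with add-one smoothing, and `NaiveBayesWSD` is a hypothetical class, not something pywsd or NLTK provides.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled examples: (context sentence, sense label).
TRAIN = [
    ("i went to the bank to deposit my money", "bank.money"),
    ("the bank approved my loan", "bank.money"),
    ("she opened an account at the bank", "bank.money"),
    ("we sat on the river bank", "bank.river"),
    ("the boat drifted toward the muddy bank", "bank.river"),
]


class NaiveBayesWSD:
    """Multinomial Naive Bayes over bag-of-words contexts, add-one smoothing."""

    def fit(self, examples):
        self.sense_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        for text, label in examples:
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def most_frequent_sense(self):
        return self.sense_counts.most_common(1)[0][0]

    def predict(self, text):
        # Only words seen during training carry any evidence.
        words = [w for w in text.lower().split() if w in self.vocab]
        if not words:
            return self.most_frequent_sense()  # MFS back-off
        total = sum(self.sense_counts.values())
        best, best_score = None, -math.inf
        for sense, n in self.sense_counts.items():
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in words:
                score += math.log((self.word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best, best_score = sense, score
        return best


clf = NaiveBayesWSD().fit(TRAIN)
print(clf.predict("he fished from the bank of the river"))
print(clf.predict("deposit the cheque at the bank"))
print(clf.predict("xyzzy"))  # nothing known, so the back-off answers
```

A real experiment would train on a sense-annotated corpus and measure accuracy on held-out text, comparing against the MFS baseline on its own.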