【Question title】: Word sense disambiguation in NLTK Python
【Posted】: 2011-04-11 14:50:39
【Question】:

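I am new to NLTK Python and I am looking for a sample application that can do word sense disambiguation. My searches turned up plenty of algorithms but no sample application. I just want to pass in a sentence and find out the sense of each word by referring to the WordNet library. Thanks.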

I found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html Is there such a module in NLTK Python?

【Comments】:

Tags: python nltk


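【Answer 1】:

Yes, this is possible with the wordnet module in NLTK. The similarity measures used in the tool mentioned in your post are also available in the NLTK wordnet module.

【Comments】: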

    【Answer 2】:
    【Answer 3】:

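    NLTK has an API for accessing WordNet. WordNet organizes words into synsets (sets of synonyms). A synset gives you information about a word, such as its hypernyms, hyponyms, and root form.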
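
As a rough illustration of the synset lookups described above, the sketch below lists each sense of a word together with its definition, hypernyms, and hyponyms. It assumes NLTK 3 with the WordNet corpus already downloaded (for example via `nltk.download('wordnet')`); the helper name `describe_senses` is made up for this example and is not part of NLTK.

```python
def describe_senses(word, pos=None):
    """Return one dict per WordNet sense of `word` (hypothetical helper)."""
    # Imported lazily so the function can be defined even where NLTK is absent.
    from nltk.corpus import wordnet as wn

    senses = []
    for synset in wn.synsets(word, pos=pos):
        senses.append({
            "name": synset.name(),                # e.g. a string like 'bank.n.01'
            "definition": synset.definition(),    # the gloss for this sense
            "lemmas": synset.lemma_names(),       # synonyms grouped in the synset
            "hypernyms": [h.name() for h in synset.hypernyms()],
            "hyponyms": [h.name() for h in synset.hyponyms()],
        })
    return senses
```

Calling `describe_senses("bank", pos="n")` should return one entry per noun sense of "bank"; printing the `name` and `definition` fields is a quick way to see how many senses a disambiguator has to choose from.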

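    "Python Text Processing with NLTK 2.0 Cookbook" is a good book for getting started with the various features of NLTK. It is easy to read, understand, and implement.

    You can also look at other papers, outside the scope of NLTK, on using Wikipedia for word sense disambiguation.

    【Comments】: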

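      【Answer 4】:

      Yes. In fact, the NLTK team wrote a book that has several chapters on classification and explicitly covers how to use WordNet. You can also buy a physical copy of the book from Safari.

      FYI: NLTK was written by natural language programming academics for use in their introductory programming courses.

      【Comments】:

      • As I understand it, that chapter is specifically about classification and does not go very deep into word sense disambiguation.
      【Answer 5】: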

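      As a practical answer to the OP's request, here is a Python implementation of several WSD methods that returns senses in the form of NLTK synset(s): https://github.com/alvations/pywsd

      It includes

      • Lesk algorithms (original Lesk, adapted Lesk, and simple Lesk)
      • Baseline algorithms (random sense, first sense, most frequent sense)

      It can be used like this: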

      #!/usr/bin/env python
      # -*- coding: utf-8 -*-
      # Updated to Python 3 / NLTK 3 syntax: print() calls, definition() as a method.

      bank_sents = ['I went to the bank to deposit my money',
                    'The river bank was full of dead fishes']

      plant_sents = ['The workers at the industrial plant were overworked',
                     'The plant was no longer bearing flowers']

      print("======== TESTING simple_lesk ===========\n")
      from pywsd.lesk import simple_lesk
      print("#TESTING simple_lesk() ...")
      print("Context:", bank_sents[0])
      answer = simple_lesk(bank_sents[0], 'bank')
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()

      # Restrict the candidate senses to nouns.
      print("#TESTING simple_lesk() with POS ...")
      print("Context:", bank_sents[1])
      answer = simple_lesk(bank_sents[1], 'bank', 'n')
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()

      # Additionally stem the context and signature words before matching.
      print("#TESTING simple_lesk() with POS and stems ...")
      print("Context:", plant_sents[0])
      answer = simple_lesk(plant_sents[0], 'plant', 'n', stem=True)
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()

      print("======== TESTING baseline ===========\n")
      from pywsd.baseline import random_sense, first_sense
      from pywsd.baseline import max_lemma_count as most_frequent_sense

      print("#TESTING random_sense() ...")
      print("Context:", bank_sents[0])
      answer = random_sense('bank')
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()

      print("#TESTING first_sense() ...")
      print("Context:", bank_sents[0])
      answer = first_sense('bank')
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()

      print("#TESTING most_frequent_sense() ...")
      print("Context:", bank_sents[0])
      answer = most_frequent_sense('bank')
      print("Sense:", answer)
      print("Definition:", answer.definition())
      print()
      

      [out]:

      ======== TESTING simple_lesk ===========
      
      #TESTING simple_lesk() ...
      Context: I went to the bank to deposit my money
      Sense: Synset('depository_financial_institution.n.01')
      Definition: a financial institution that accepts deposits and channels the money into lending activities
      
      #TESTING simple_lesk() with POS ...
      Context: The river bank was full of dead fishes
      Sense: Synset('bank.n.01')
      Definition: sloping land (especially the slope beside a body of water)
      
      #TESTING simple_lesk() with POS and stems ...
      Context: The workers at the industrial plant were overworked
      Sense: Synset('plant.n.01')
      Definition: buildings for carrying on industrial labor
      
      ======== TESTING baseline ===========
      #TESTING random_sense() ...
      Context: I went to the bank to deposit my money
      Sense: Synset('deposit.v.02')
      Definition: put into a bank account
      
      #TESTING first_sense() ...
      Context: I went to the bank to deposit my money
      Sense: Synset('bank.n.01')
      Definition: sloping land (especially the slope beside a body of water)
      
      #TESTING most_frequent_sense() ...
      Context: I went to the bank to deposit my money
      Sense: Synset('bank.n.01')
      Definition: sloping land (especially the slope beside a body of water)
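
The two deterministic baselines in the output above can be approximated directly with NLTK. The following is a minimal sketch, not pywsd's own code: it assumes NLTK 3 with the WordNet corpus installed, and the function names `first_sense_baseline` and `lemma_count_baseline` are invented for this example. The first returns the synset that WordNet lists first for the word; the second sums the corpus counts attached to each synset's lemmas and returns the synset with the highest total.

```python
def first_sense_baseline(word, pos=None):
    """Return the first synset WordNet lists for `word`, or None."""
    from nltk.corpus import wordnet as wn  # lazy import; needs the WordNet corpus

    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None


def lemma_count_baseline(word, pos=None):
    """Return the synset whose lemmas have the highest summed tagged-corpus count."""
    from nltk.corpus import wordnet as wn

    def total_count(synset):
        # Lemma.count() is the frequency recorded in WordNet's sense-tagged data.
        return sum(lemma.count() for lemma in synset.lemmas())

    synsets = wn.synsets(word, pos=pos)
    # max() keeps the earliest synset on ties, so ties fall back to WordNet order.
    return max(synsets, key=total_count) if synsets else None
```

Neither function looks at the sentence at all, which is why both baselines return the same sense for "bank" regardless of context.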
      

      【Comments】:

        【Answer 6】:

        Recently, part of the pywsd code was ported into the bleeding-edge version of NLTK, in the wsd.py module. Try:

        >>> from nltk.wsd import lesk
        >>> sent = 'I went to the bank to deposit my money'
        >>> ambiguous = 'bank'
        >>> lesk(sent, ambiguous)
        Synset('bank.v.04')
        >>> lesk(sent, ambiguous).definition()
        u'act as the banker in a game or in gambling'
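
In current NLTK releases, `nltk.wsd.lesk` is documented as taking the context as an iterable of words rather than a raw string, so the call above may behave differently on a newer install. The sketch below tokenizes the sentence first. It assumes NLTK 3 with the WordNet corpus installed; the wrapper name `lesk_sense` is hypothetical, and plain whitespace splitting stands in for a real tokenizer.

```python
def lesk_sense(sentence, word, pos=None):
    """Run NLTK's Lesk on a whitespace-tokenized sentence (hypothetical wrapper)."""
    from nltk.wsd import lesk  # lazy import; needs NLTK and the WordNet corpus

    tokens = sentence.split()  # crude tokenization, enough for a short example
    # lesk() returns a Synset, or None if the word has no synsets for this POS.
    return lesk(tokens, word, pos)
```

For example, `lesk_sense('I went to the bank to deposit my money', 'bank', 'n')` restricts the candidates to noun senses; the result can then be inspected with `.definition()`.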
        

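        For better WSD performance, use the pywsd library rather than the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I will try to update the NLTK module when I have time.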


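        In response to Chris Spencer's comment, please note the limitations of the Lesk algorithms. I am only providing an accurate implementation of the algorithms. It is not a silver bullet: http://en.wikipedia.org/wiki/Lesk_algorithm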
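
To make that limitation concrete, here is a toy version of the gloss-overlap idea behind simplified Lesk: pick the sense whose definition shares the most words with the context. This is a sketch under loud assumptions: the sense inventory `TOY_SENSES` and its glosses are invented for illustration (they are not WordNet entries), there is no stemming, and the function `toy_lesk` is not pywsd's or NLTK's implementation.

```python
import re

# Hypothetical mini sense inventory; glosses are made up, NOT taken from WordNet.
TOY_SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts money deposits and makes loans",
        "bank/river": "sloping land beside a river or other body of water",
    },
    "cat": {
        "cat/scan": "a medical x-ray scan of the body made with a computer",
        "cat/animal": "a small feline mammal with soft fur kept as a pet",
    },
}

STOPWORDS = {"a", "an", "the", "of", "to", "my", "i", "and", "or",
             "with", "as", "that", "was", "in", "by"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def toy_lesk(sentence, word):
    """Return (best_sense, overlaps) where overlaps maps sense -> shared word count."""
    context = tokenize(sentence) - {word}
    overlaps = {
        sense: len(context & tokenize(gloss))
        for sense, gloss in TOY_SENSES[word].items()
    }
    # max() keeps the first sense on ties, so zero overlap yields an arbitrary pick.
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


print(toy_lesk("I went to the bank to deposit my money", "bank"))
# → ('bank/finance', {'bank/finance': 1, 'bank/river': 0})
print(toy_lesk("My cat likes to eat mice.", "cat"))
# → ('cat/scan', {'cat/scan': 0, 'cat/animal': 0})
```

In the second call no gloss shares a word with the context, so the result is decided purely by listing order. Short sentences often have little or no overlap with dictionary glosses, which is the usual reason Lesk-style methods return surprising senses.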

        Also note that although:

        lesk("My cat likes to eat mice.", "cat", "n")
        

        does not give you the right answer, you can use the max_similarity() implementation in pywsd:

        >>> from pywsd.similarity import max_similarity
        >>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition()
        'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
        >>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition()
        'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
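
The idea behind maximizing similarity can be sketched with NLTK alone: score each candidate sense by how similar it is to the best-matching sense of every other word in the sentence, then keep the highest-scoring candidate. The code below is an illustration under assumptions, not pywsd's implementation: it uses Wu-Palmer similarity only, splits on whitespace, and the name `max_similarity_sense` is invented for this example. It needs NLTK 3 with the WordNet corpus installed.

```python
def max_similarity_sense(sentence, word, pos="n"):
    """Pick the sense of `word` most similar to the other words (hypothetical sketch)."""
    from nltk.corpus import wordnet as wn  # lazy import; needs the WordNet corpus

    # Crude tokenization: split on whitespace and trim common punctuation.
    context_words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    context_words = [w for w in context_words if w and w != word]

    best_sense, best_score = None, -1.0
    for candidate in wn.synsets(word, pos=pos):
        score = 0.0
        for other in context_words:
            # wup_similarity() may return None when no common ancestor is found.
            sims = [candidate.wup_similarity(s) or 0.0
                    for s in wn.synsets(other, pos=pos)]
            # Context words with no synsets for this POS contribute nothing.
            score += max(sims, default=0.0)
        if score > best_score:
            best_sense, best_score = candidate, score
    return best_sense
```

Unlike Lesk, this does not depend on words literally appearing in a gloss, so a context word such as "mice" can still pull the choice toward the animal sense through the WordNet hierarchy. The exact results may differ from pywsd's max_similarity().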
        

        @Chris, if you want a python setup.py, ask politely and I will write it...

        【Comments】:

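        • Unfortunately, the accuracy is pretty bad. lesk("My cat likes to eat mice.", "cat", "n") => Synset('computerized_tomography.n.01'). And pywsd does not even have an install script...
        • Dear Chris, have you tried the other Lesk variants, especially simple_lesk() or adapted_lesk? The original Lesk is known to have problems, which is why other solutions are provided in the package. en.wikipedia.org/wiki/Lesk_algorithm. Also, I maintain this in my free time; it is not what I do for a living...
        • Yes, I tried all the Lesk variants in your package, and none of them worked on my sample corpus. I had to create a variant that also uses the glosses of all hyponyms and meronyms related to the word just to get some positive results, and even then it was only 15% accurate. The problem is not your code, it is Lesk. It is simply not a reliable heuristic.
        • Try maximizing similarity; it may do better. I am also coding up more algorithms, but that is left for a code sprint in September. Also, take a look at more state-of-the-art methods. Lastly, the most frequent sense (MFS) usually does quite well; with MFS as a back-off, state-of-the-art systems beat it by 1-2%, at most 5%...
        • Guys, would it make sense to take a manually tagged corpus (with senses disambiguated by humans) and train some kind of ML classifier on it? The trained classifier could then be included in your package as another disambiguation algorithm, provided it shows high accuracy on text unseen during training.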