【Question Title】:Compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet
【Posted】:2017-10-05 10:38:00
【Question】:

I am trying to compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet. This is the function I defined:

def averagePolysemy(synsets):
    allSynsets = list(wn.all_synsets(synsets))
    lemmas = [synset.lemma_names() for synset in allSynsets]
    senseCount = 0
    for lemma in lemmas:
        senseCount = senseCount + len(wn.synsets(lemma, synsets))
    return senseCount/len(allSynsets)

averagePolysemy(wn.NOUN)

When I call it, I get this error:

Traceback (most recent call last):

File "<ipython-input-214-345e72500ae3>", line 1, in <module>
averagePolysemy(wn.NOUN)

File "<ipython-input-213-616cc4af89d1>", line 6, in averagePolysemy
senseCount = senseCount + len(wn.synsets(lemma, synsets))

File "/Users/anna/anaconda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1483, in synsets
lemma = lemma.lower()

AttributeError: 'list' object has no attribute 'lower'

I'm not sure which part of my function is causing this error.

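【Comments】:

  • Please show the full traceback.
  • It looks like synset.lemma_names should be synset.lemma_names()
  • I changed that, but I still get the same error
  • You need to think about what lemma_names returns. It looks like it returns a list. Does synsets expect a list? It doesn't seem so.
  • You may also be interested in computing the "sense entropy" of words by category; see Eqn 1 in aclweb.org/anthology/S/S16/S16-1147.pdf (disclaimer: co-author of the paper)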
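
The point about lemma_names() returning a list can be illustrated with a minimal sketch. Each synset contributes a list of names, so the question's `lemmas` variable is a list of lists, and each inner list is what reaches wn.synsets(). Flattening the names into a set of strings first avoids the AttributeError. The names `average_polysemy`, `FakeSynset`, `toy_synsets` and `toy_counts` are hypothetical and exist only for illustration; the toy data stands in for WordNet so the sketch runs without the corpus. Note also that this version divides by the number of distinct lemma names rather than by the number of synsets, which is an assumption about what "average polysemy" should mean.

```python
def average_polysemy(all_synsets, count_senses):
    """Average number of senses per distinct lemma name.

    all_synsets: iterable of objects that have a lemma_names() method
    count_senses: callable mapping a lemma name (str) to its sense count
    """
    # Flatten the per-synset lists into one set of plain strings.
    lemma_names = {name
                   for synset in all_synsets
                   for name in synset.lemma_names()}
    if not lemma_names:
        return 0.0
    total = sum(count_senses(name) for name in lemma_names)
    return total / len(lemma_names)


class FakeSynset:
    """Hypothetical stand-in for an NLTK Synset (illustration only)."""

    def __init__(self, names):
        self._names = names

    def lemma_names(self):
        return list(self._names)


# Toy data, not real WordNet counts.
toy_synsets = [FakeSynset(['run', 'tally']),
               FakeSynset(['test', 'trial', 'run'])]
toy_counts = {'run': 2, 'tally': 1, 'test': 1, 'trial': 1}

toy_average = average_polysemy(toy_synsets, toy_counts.__getitem__)
print(toy_average)  # 1.25
```

With NLTK and the WordNet corpus installed, the same function could presumably be called as average_polysemy(wn.all_synsets(wn.NOUN), lambda name: len(wn.synsets(name, wn.NOUN))).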

Tags: python nltk wordnet


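【Solution 1】:

First, let's define what polysemy is.

Polysemy: the coexistence of many possible meanings for a word or phrase.

(Source: https://www.google.com/search?q=polysemy)

From WordNet:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

There are a few terms in WordNet that we should be familiar with:

Synset: a distinct concept/meaning

Lemma: the root form of a word

Part-of-speech (POS): the linguistic category of a word

Word: the surface form of a word (surface words are not in WordNet)

(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; see also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)

In code: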

from nltk.corpus import wordnet as wn
# Given a word "run"
word = 'run'
# We get multiple meaning (i.e. synsets) for 
# the word in wordnet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with their 
    # linguistic category (i.e. POS)
    pos = synset.pos()
    # Usually, offset + POS is the way 
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with their
    # distinct definition.
    definition = synset.definition()
    # For each meaning, there are multiple
    # root words (i.e. lemma)
    lemmas = ','.join(synset.lemma_names())
    print ('\t'.join([idx, definition, lemmas]))

[out]:

00189565-n  a score in baseball made by a runner touching all four bases safely run,tally
00791078-n  the act of testing something    test,trial,run
07460104-n  a race run on foot  footrace,foot_race,run
00309011-n  a short trip    run
01926311-v  move fast by using one's feet, with one foot off the ground at any given time   run
02075049-v  flee; take to one's heels; cut and run  scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away
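
The "offset + POS" identifier shown in the output above is plain string formatting, so it can be built and parsed without WordNet at all. The helper names `synset_id` and `parse_synset_id` below are hypothetical, chosen only for this sketch.

```python
def synset_id(offset, pos):
    """Format an ID as an 8-digit zero-padded offset, a dash, then the POS."""
    return '{}-{}'.format(str(offset).zfill(8), pos)


def parse_synset_id(idx):
    """Split an ID such as '00189565-n' back into (offset, pos)."""
    offset, pos = idx.split('-')
    return int(offset), pos


example_id = synset_id(189565, 'n')
print(example_id)                   # 00189565-n
print(parse_synset_id(example_id))  # (189565, 'n')
```

Recent NLTK versions appear to offer wn.synset_from_pos_and_offset(pos, offset) for the reverse lookup from such a pair to a Synset; check the installed version before relying on it.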

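Back to the question of how to "compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet":

Since we are working with WordNet, surface words are out of the picture and we are left with lemmas only.

First, we need to define which lemmas are nouns, verbs or adjectives.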

from nltk.corpus import wordnet as wn
from collections import defaultdict

words_by_pos = defaultdict(set)

for synset in wn.all_synsets():
    pos = synset.pos()
    for lemma in synset.lemmas():
        # Store the lemma's name (a str): wn.synsets() below expects a string.
        words_by_pos[pos].add(lemma.name())

But this is a simplified view of the relationship between lemmas and POS:

# There are 5 POS.
>>> words_by_pos.keys() 
dict_keys(['a', 's', 'r', 'n', 'v'])

# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
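
The noun/verb overlap above is a single set intersection, and the same check can be run for every pair of categories. Here is a minimal sketch, assuming `words_by_pos` maps each POS tag to a set of strings; `pos_overlaps` and the toy data are hypothetical and only there so the example runs without the corpus.

```python
from itertools import combinations


def pos_overlaps(words_by_pos):
    """Count the words shared by each pair of POS categories."""
    return {
        (a, b): len(words_by_pos[a] & words_by_pos[b])
        for a, b in combinations(sorted(words_by_pos), 2)
    }


# Toy data, not real WordNet contents.
toy_words_by_pos = {
    'n': {'run', 'test', 'dog'},
    'v': {'run', 'test', 'eat'},
    'r': {'quickly'},
}

overlaps = pos_overlaps(toy_words_by_pos)
for pair, shared in sorted(overlaps.items()):
    print(pair, shared)
```

Keys are sorted before pairing so each pair appears once, in a stable order.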

Let's see whether we can ignore that and move on:

# Let's look at the verb 'v' category
num_meanings_per_verb = []

for word in words_by_pos['v']:
    # No. of meaning for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)
print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))

[out]:

2.1866273523545225

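What does this number mean? (If it means anything at all.)

It means that

  • for each verb in WordNet,
  • there are on average about 2 meanings;
  • ignoring the fact that some of these words have more meanings under other POS categories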
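
The verb-only computation extends naturally to every POS category. The sketch below assumes `words_by_pos` maps a POS tag to a set of lemma-name strings and takes the sense lookup as a callable, so it can run on toy data; `average_polysemy_by_pos`, `toy_sense_counts` and the numbers in them are hypothetical, not WordNet figures.

```python
def average_polysemy_by_pos(words_by_pos, count_senses):
    """Return {pos: mean number of senses per word within that POS}."""
    averages = {}
    for pos, words in words_by_pos.items():
        if not words:
            continue  # skip empty categories to avoid dividing by zero
        total = sum(count_senses(word, pos) for word in words)
        averages[pos] = total / len(words)
    return averages


# Toy data, not real WordNet counts.
toy_words_by_pos = {'n': {'run', 'dog'}, 'v': {'run', 'eat', 'bunk'}}
toy_sense_counts = {
    ('run', 'n'): 4, ('dog', 'n'): 2,
    ('run', 'v'): 2, ('eat', 'v'): 1, ('bunk', 'v'): 3,
}

toy_averages = average_polysemy_by_pos(
    toy_words_by_pos, lambda word, pos: toy_sense_counts[(word, pos)])
print(toy_averages['n'], toy_averages['v'])  # 3.0 2.0
```

With NLTK, `count_senses` could presumably be lambda word, pos: len(wn.synsets(word, pos=pos)), mirroring the verb loop above.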

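Maybe that tells us something, but look at how the values in num_meanings_per_verb are distributed (note that the counts below sum to 119034, the size of words_by_pos['n'] above, so this particular tally appears to be for the noun category):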

Counter({1: 101168,
         2: 11136,
         3: 3384,
         4: 1398,
         5: 747,
         6: 393,
         7: 265,
         8: 139,
         9: 122,
         10: 85,
         11: 74,
         12: 39,
         13: 29,
         14: 10,
         15: 19,
         16: 10,
         17: 6,
         18: 2,
         20: 5,
         26: 1,
         30: 1,
         33: 1})
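
A tally like the one above can be produced with collections.Counter. Because the distribution is so heavily skewed toward single-sense words, the median and the share of single-sense words may be worth reporting next to the mean. This is a minimal sketch on toy numbers; `summarize_sense_counts` and `toy_num_meanings` are hypothetical names.

```python
from collections import Counter
from statistics import median


def summarize_sense_counts(num_meanings):
    """Summarize a list holding the number of senses for each word."""
    tally = Counter(num_meanings)
    return {
        'tally': tally,
        'mean': sum(num_meanings) / len(num_meanings),
        'median': median(num_meanings),
        # Fraction of words that have exactly one sense.
        'share_monosemous': tally[1] / len(num_meanings),
    }


# Toy data: six single-sense words and a few polysemous ones.
toy_num_meanings = [1, 1, 1, 1, 1, 1, 2, 2, 3, 7]
summary = summarize_sense_counts(toy_num_meanings)
print(summary['mean'], summary['median'], summary['share_monosemous'])
```

In the toy data the mean is pulled up by one highly polysemous word while the median stays at 1, which is the same pattern the Counter above suggests.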
