Check if two words are related to each other: answers

【Question title】: Check if two words are related to each other
【Posted】: 2013-09-23 04:21:48
【Question】:

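I have two lists: the first holds a user's interests, the second holds keywords about a book. I want to recommend the book to the user based on the given list of interests. I am using the SequenceMatcher class from the Python library difflib to match similar words such as "game", "games", "gaming" and "gamer". Its ratio() method gives me a number in [0, 1] that says how similar two strings are. But I got stuck on one example, where I computed the similarity between "looping" and "shooting". It comes out as 0.6667.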

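# Fragment of a method: self.interests, keywords, self.limit and final_score
# are assumed to be defined elsewhere.
for interest in self.interests:
    for keyword in keywords:
        s = SequenceMatcher(None, interest, keyword)
        match_freq = s.ratio()
        if match_freq >= self.limit:
            # print interest, keyword, match_freq
            final_score += 1
            break  # count each interest at most once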
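
The two figures discussed in this thread can be reproduced with a minimal sketch like the one below. It uses only difflib.SequenceMatcher from the standard library; the helper name `ratio` is made up for illustration.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Illustrative helper: SequenceMatcher similarity of two strings."""
    return SequenceMatcher(None, a, b).ratio()


# The unrelated pair scores higher than the related one, so no single
# threshold on the ratio can separate them.
print(round(ratio("looping", "shooting"), 4))  # 0.6667
print(round(ratio("gaming", "game"), 4))       # 0.6
```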

Is there any other way to perform this kind of matching in Python?

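【Comments】:

Take a look at this library: github.com/seatgeek/fuzzywuzzy
Hello. Do you mean that 0.6667 is below self.limit, so "looping" and "shooting" are declared non-matching, while you would like these two words to match? Or the opposite?
@eyquem There is no similarity between "looping" and "shooting", yet the ratio function reports a high match... that is the problem
What do you think about raising the value of self.limit?
@eyquem Calling ratio() on "gaming" and "game" gives 0.6, while "looping" and "shooting" give 0.66667, so raising self.limit won't help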

Tags: python python-2.7 nlp nltk

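【Solution 1】:

First of all, a word can have several senses, so when you try to find similar words you may need some word-sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation.

Given a pair of words, if we take the most similar pair of senses as the gauge of whether the two words are similar, we can try this: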

from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat", "dog"
# All WordNet senses (synsets) of each word
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

maxscore = 0
for i, j in product(*[sem1, sem2]):
    # Wu-Palmer similarity; some NLTK versions return None when the two
    # senses have no common ancestor, so treat that as 0
    score = i.wup_similarity(j) or 0
    maxscore = max(maxscore, score)

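You can also use other similarity functions, see http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you run into words that are not in WordNet. In that case I suggest you fall back on difflib.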
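
The fallback suggested here could look roughly like the sketch below. The names `lookup_synsets` and `relatedness` are hypothetical, and treating a missing NLTK install or a missing WordNet corpus the same way as an unknown word is an assumption made for this example, not part of the answer.

```python
from difflib import SequenceMatcher
from itertools import product

try:
    from nltk.corpus import wordnet as wn
except ImportError:  # NLTK not installed: only the difflib fallback is used
    wn = None


def lookup_synsets(word):
    """Return the WordNet synsets of word, or [] if WordNet is unavailable."""
    if wn is None:
        return []
    try:
        return wn.synsets(word)
    except LookupError:  # NLTK is installed but the WordNet data is missing
        return []


def relatedness(word_a, word_b):
    """Best Wu-Palmer score over all sense pairs, else the difflib ratio."""
    synsets_a, synsets_b = lookup_synsets(word_a), lookup_synsets(word_b)
    if synsets_a and synsets_b:
        # wup_similarity may return None for unrelated senses; count it as 0
        return max(sa.wup_similarity(sb) or 0.0
                   for sa, sb in product(synsets_a, synsets_b))
    # At least one word is unknown to WordNet: compare the spelling instead
    return SequenceMatcher(None, word_a, word_b).ratio()


print(relatedness("cat", "dog"))
print(relatedness("zzqxv", "zzqxw"))  # not dictionary words, so difflib is used
```

Note that the two branches return numbers on different scales (a semantic score versus a spelling ratio), so one threshold may not suit both.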

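【Comments】:

Is there a reason for using *[...] in your product call? It seems you could just use product(sem1, sem2) to get the same effect (the product of the iterables sem1 and sem2). Am I missing something?
Pardon the *, my code is a little verbose, hahaha. product(*[sem1,sem2]) is indeed the same as product(sem1,sem2): the first unpacks a single list [sem1,sem2], while the second passes sem1 and sem2 as two arguments.
issame = list(product(x,y)) == list(product(*[x,y])) and you will get issame == True =)
@alvas Just spitballing here... wouldn't you also want to compute j.wup_similarity(i)? I'm finding that "hot", "warm" is scored differently from "warm", "hot".
See stackoverflow.com/questions/20075335/…, and note that wup_similarity is path-based =)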
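
If the order dependence raised in these comments matters for your data, one simple workaround is to evaluate both directions and keep the larger score. The sketch below is generic: `symmetric` and `lopsided` are made-up names, and `lopsided` is only a stand-in for an order-dependent measure.

```python
def symmetric(similarity):
    """Wrap a two-argument similarity function so argument order is irrelevant."""
    def wrapped(a, b):
        forward = similarity(a, b) or 0.0   # None is treated as 0
        backward = similarity(b, a) or 0.0
        return max(forward, backward)
    return wrapped


# Stand-in for an order-dependent measure (invented for this example).
def lopsided(a, b):
    return len(a) / float(len(a) + 2 * len(b))


sym = symmetric(lopsided)
print(lopsided("hot", "warm") == lopsided("warm", "hot"))  # False
print(sym("hot", "warm") == sym("warm", "hot"))            # True
```

With NLTK synsets the same wrapper could be applied as symmetric(lambda a, b: a.wup_similarity(b)).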

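【Solution 2】:

At first, I thought of using regular expressions to perform additional tests that discriminate among matches with a low ratio. That can solve a specific problem, such as the one that occurs with words ending in "ing". But it covers only a limited case, and there may be many other cases that would each need their own specific treatment.

Then I thought we could try to find an additional criterion that eliminates words which do not match semantically, even though their letters are similar enough to be detected as a match despite a low ratio,
while still catching genuinely matching terms whose ratio is low because they are short.

Here is one possibility: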

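from difflib import SequenceMatcher

interests = ('shooting','gaming','looping')
keywords = ('loop','looping','game')

s = SequenceMatcher(None)

limit = 0.50

for interest in interests:
    s.set_seq2(interest)  # seq2 is cached, so set it once per interest
    for keyword in keywords:
        s.set_seq1(keyword)
        # get_matching_blocks() always ends with a zero-size sentinel block,
        # so a length of 2 means exactly one contiguous matching block
        b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
        print('%10s %-10s  %f  %s' % (interest, keyword,
                                      s.ratio(),
                                      '** MATCH **' if b else ''))
    print('')  # blank line between interests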

which gives

  shooting loop        0.333333  
  shooting looping     0.666667  
  shooting game        0.166667  

    gaming loop        0.000000  
    gaming looping     0.461538  
    gaming game        0.600000  ** MATCH **

   looping loop        0.727273  ** MATCH **
   looping looping     1.000000  ** MATCH **
   looping game        0.181818
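
To see why the extra test rejects "shooting" / "looping", it helps to look at the matching blocks themselves. The sketch below is illustrative only; `real_blocks` and `is_match` are made-up helper names that wrap the same SequenceMatcher calls used in the answer.

```python
from difflib import SequenceMatcher


def real_blocks(a, b):
    """Matching substrings of a and b, without the zero-size sentinel block."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return [a[m.a:m.a + m.size] for m in blocks if m.size > 0]


def is_match(a, b, limit=0.5):
    """The answer's criterion: ratio >= limit and one contiguous block."""
    s = SequenceMatcher(None, a, b)
    return s.ratio() >= limit and len(s.get_matching_blocks()) == 2


print(real_blocks("looping", "shooting"))  # ['oo', 'ing']
print(real_blocks("game", "gaming"))       # ['gam']
print(is_match("looping", "shooting"))     # False
print(is_match("game", "gaming"))          # True
```

"looping" and "shooting" share two separate fragments, while "game" and "gaming" share a single run of letters, which is what the length test picks up.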

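Note this from the documentation:

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.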

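【Comments】:

Nice idea to account for stemming. Thanks for sharing.
@2ero Sorry, English isn't my mother tongue and I sometimes have difficulty understanding it. Right now I don't understand what you mean by stemming
No worries, it's a technique from information retrieval, en.wikipedia.org/wiki/Stemming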
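
For readers unfamiliar with the term, stemming reduces inflected forms to a common root before comparing them. The sketch below is a deliberately crude, pure-Python stand-in: the suffix list and the names `crude_stem` and `same_stem` are invented for this example, and a real stemmer such as NLTK's PorterStemmer would handle far more cases.

```python
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")  # illustrative, far from complete


def crude_stem(word, min_len=3):
    """Strip the first matching suffix, keeping at least min_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[:-len(suffix)]
    return word


def same_stem(a, b):
    """True when both words reduce to the same crude stem."""
    return crude_stem(a.lower()) == crude_stem(b.lower())


print([crude_stem(w) for w in ("game", "games", "gaming", "gamer")])
# ['gam', 'gam', 'gam', 'gam']
print(same_stem("looping", "loop"))      # True
print(same_stem("looping", "shooting"))  # False
```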

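【Solution 3】:

That is because SequenceMatcher is based on edit distance or something similar. Semantic similarity is better suited to your case, or a mix of the two.

Take a look at the NLTK package (code example), since you are using Python, and perhaps this paper

People using C++ can check this open source project for reference

【Comments】:

+1 for recommending NLTK! It is well suited to this kind of use.
this question or this other question might help with the semantic similarity bit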