Sentence scoring from word scores: answers

【Question title】: Sentence scoring from word score
【Posted】: 2019-09-02 15:31:34
【Question】:

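I have a large forum about dogs with tagged posts. Index scores computed as document frequency * text frequency give me a perfect measure of what a topic should be about. For example: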

print (getscores('dog food'))
# keyword scores range between 1 and 2
# {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5, ..... 'like':1.00001}

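From there it seemed easy to score sentences and find the one that best represents the topic, or so I thought. In this example, the second sentence would be a great fit.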

def method1 (sen):
    score = 1
    for word in sen.split():
        score=score*scores.get(word,1)
    return score

def method2 (sen):
    score = 1
    for word in sen.split():
        score=score*scores.get(word,1)
    return score / len(sen.split())

scores = {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5,'intended':1.4}
sens = ['dog food','dog food is food intended for consumption by dogs','like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']


for sen in sens:
    print (sen)
    print (method1(sen))
    print (method2(sen))

#dog food
#3.6
#1.8 (winner method 2)
#dog food is food intended for consumption by dogs
#13.607999999999999
#1.5119999999999998
#like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty
#22.032220320000004 (winner method 1)
#0.7868650114285716

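Averaging the score favors short sentences, while multiplying the scores favors long ones. Compensating for sentence length (multiplying in a factor of about 0.92 per word) works for one topic, but the next topic needs a different factor.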
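The per-word length compensation can be sketched roughly as follows. This is only an illustration: the function name `method3` and the `penalty` parameter are assumptions and not part of the original code, and 0.92 is just the value that happened to suit this topic.

```python
scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog '
        'this dog who ate all the food and then my bowl was empty']

def method3(sen, penalty=0.92):
    """Product of word scores, where every word also contributes `penalty`.

    `method3` and `penalty` are hypothetical names used for illustration.
    Unknown words score 1, so each of them shrinks the total by `penalty`.
    """
    score = 1.0
    for word in sen.split():
        score *= scores.get(word, 1) * penalty
    return score

for sen in sens:
    print(round(method3(sen), 2), sen)

# Pick the sentence with the highest compensated score
best = max(sens, key=method3)
print(best)
```

With these scores and this factor the second sentence comes out on top, but the factor is tuned by hand for this one topic.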

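So this approach gets me nowhere. Is there a known sentence-scoring method that returns the sentence with the highest keyword weights while also taking keyword density and sentence length into account?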

【Question comments】:

Tags: python nlp

【Solution 1】:

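If you use multi-word expressions (MWEs) in your processing pipeline, your results may improve. This preprocessing is usually done before the TfIdf step. The code below illustrates how they are used: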

from nltk.tokenize import MWETokenizer

#Instantiate the tokenizer with a list of MWEs:
tokenizer = MWETokenizer( [('dog', 'food'), ('band', 'camp')])

tl1  = tokenizer.tokenize('dog food is food intended for consumption by dogs'.split())
print(tl1)
tl2 = tokenizer.tokenize('like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty'.split())
print(tl2)

#['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']
#['like', 'this', 'one', 'time', 'at', 'band_camp', 'there', 'was', 'all', 'this', 'food', 'and', 'and', 'a', 'dog', 'this', 'dog', 'who', 'ate', 'all', 'the', 'food', 'and', 'then', 'my', 'bowl', 'was', 'empty']
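As a rough sketch of how such tokens could feed back into the scoring from the question, the first token list printed above can be scored with a dictionary that has an entry for the merged token. The `'dog_food'` entry and its value of 2.5 are made up for illustration; in practice the scores would come from re-running the TfIdf step on the MWE-tokenized corpus. The helper name `score_tokens` is likewise an assumption.

```python
# Token list as printed by the MWETokenizer example above
tl1 = ['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']

# Hypothetical scores: the 'dog_food' entry is invented for this sketch
scores = {'dog_food': 2.5, 'dog': 2, 'food': 1.8, 'bowl': 1.7,
          'consumption': 1.5, 'intended': 1.4}

def score_tokens(tokens, scores):
    """Multiply the scores of all tokens; unknown tokens count as 1 (as in method1)."""
    result = 1.0
    for token in tokens:
        result *= scores.get(token, 1)
    return result

print(score_tokens(tl1, scores))
```

Here `dog_food` is matched as a single unit rather than as separate `dog` and `food` hits, so the remaining standalone `food` is scored on its own.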

The spaCy dependency parser and POS tagger can be used to extract such MWEs.

The following example detects some compound nouns that are likely to be MWEs:

import spacy
nlp = spacy.load('en_core_web_sm')

sens = ['dog food','dog food is food intended for consumption by dogs','like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']

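def getCompoundNouns(sentence):
    """Return (noun, next noun) text pairs for adjacent noun compounds, or None."""
    doc = nlp(sentence)
    answer = []
    for t in doc:
        # A noun attached to its head via the 'compound' dependency relation
        if t.dep_ == 'compound' and t.pos_ == 'NOUN':
            neighboringToken = t.nbor()
            if neighboringToken.pos_ == 'NOUN':
                # Store plain strings so the pairs can be passed to MWETokenizer
                answer.append((t.text, neighboringToken.text))
    if not answer:
        return None
    return answer

for s in sens:
    print(getCompoundNouns(s))

#[('dog', 'food')]
#[('dog', 'food')]
#[('band', 'camp')]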

【Comments】: