【Posted】: 2019-09-02 15:31:34
【Question】:
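I have a large forum about dogs with tagged posts. An index score computed as document frequency * term frequency gives me a very good measure of what a topic should be about. For example: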
print(getscores('dog food'))
# keyword scores range between 1 and 2
# {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5, ..... 'like':1.00001}
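From there it seemed easy to score sentences and find the one that best represents the topic, or so I thought. In this example, the second sentence is the ideal fit.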
def method1(sen):
    # Product of the keyword scores; words without a score count as 1.
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score

def method2(sen):
    # Same product, divided by the number of words in the sentence.
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score / len(sen.split())

scores = {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5,'intended':1.4}
sens = ['dog food','dog food is food intended for consumption by dogs','like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']

for sen in sens:
    print(sen)
    print(method1(sen))
    print(method2(sen))
#dog food
#3.6
#1.8 (winner method 2)
#dog food is food intended for consumption by dogs
#13.607999999999999
#1.5119999999999998
#like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty
#22.032220320000004 (winner method 1)
#0.7868650114285716
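Averaging the score favors short sentences, while the plain product favors long ones. Compensating for sentence length (multiplying each word by roughly 0.92) works for one topic, but the next topic needs a different factor.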
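The per-word compensation mentioned above could look roughly like the sketch below. The name `method3` and the `penalty` parameter are made up for illustration and are not part of the original code; the value 0.80 is just an arbitrary second factor to show how sensitive the ranking is.

```python
# Hypothetical sketch of the per-word length compensation described above.
# `method3` and `penalty` are illustrative names, not from the original post.
scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}

def method3(sen, penalty=0.92):
    """Multiply the keyword scores, scaling every word by `penalty`."""
    score = 1.0
    for word in sen.split():
        # Words without a keyword score count as 1, so they only pay the penalty.
        score *= scores.get(word, 1) * penalty
    return score

sens = [
    'dog food',
    'dog food is food intended for consumption by dogs',
    'like this one time at band camp there was all this food and and a dog '
    'this dog who ate all the food and then my bowl was empty',
]

# Compare which sentence wins under two different per-word factors.
for penalty in (0.92, 0.80):
    best = max(sens, key=lambda s: method3(s, penalty))
    print(penalty, '->', best)
```

With a factor of 0.92 the second sentence ranks first, but with 0.80 the two-word sentence 'dog food' wins again, which is the tuning problem: the factor that works depends on the keyword weights of the topic.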
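So this approach gets me nowhere. Is there a known sentence scoring method that returns the sentence with the highest keyword weights while also taking keyword density and sentence length into account?
【Discussion】: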