Python - top 3 words with highest frequency

【Question title】: Python - top 3 words with highest frequency
【Posted】: 2023-03-14 01:26:01
【Question description】:

As the title says, I need to write code that returns a list of the 3 most frequent words in an input string. This is what I have so far:

Input:

import collections

print(sstr)

Output:

['22574999', 'communication was sent']
['22582857', 'message originated from an industrial area in pacoima']
['22585166', 'your message will never be delivered']
['22585424', 'message has been delivered ']

Input:

import collections

id = sstr[0]
info = (sstr[1]).split()
print(id,info)

Output:

22574999 ['communication', 'was', 'sent']
22582857 ['message', 'originated', 'from', 'an', 'industrial', 'area', 'in', 'pacoima']
22585166 ['your', 'message', 'will', 'never', 'be', 'delivered']
22585424 ['message', 'has', 'been', 'delivered']

Input:

import collections

id = sstr[0]
info = (sstr[1]).split()
c = collections.Counter()

for word in info:
    c[word] += 1

print(c.most_common(3))

Output:

Counter({'communication': 1, 'was': 1, 'sent': 1})
Counter({'message': 1, 'originated': 1, 'from': 1, 'an': 1, 'industrial': 1, 'area': 1, 'in': 1, 'pacoima': 1})
Counter({'your': 1, 'message': 1, 'will': 1, 'never': 1, 'be': 1, 'delivered': 1})
Counter({'message': 1, 'has': 1, 'been': 1, 'delivered': 1})

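I want to merge all the rows into one and find the top 3 words with the highest frequency. Also, how do I find the "sum of ids" for those top 3 words, that is, the ids of the rows in which each word appears?

I would like to get the following result

Result: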

top 3 words with highest frequency:

message :3 
delivered:2    
communication:1

sum of id in which there are top 3 words with highest frequency:

message:3       Is included (22582857,22585166,22585424 )     
delivered:2     Is included(22585166,22585424)
communication:1 Is included (22574999)

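【Question comments】:

So... what is stopping you from writing it?
Loop over the values of sstr and add all the words to a single Counter, instead of creating a separate Counter for each row.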
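
A minimal sketch of that suggestion, using the four rows from the question. The names rows, total and top3 are illustrative and not from the original post, and the rows are hard-coded here because the post does not show how sstr is produced.

```python
from collections import Counter

# Rows copied from the question; each row is [id, text].
rows = [
    ['22574999', 'communication was sent'],
    ['22582857', 'message originated from an industrial area in pacoima'],
    ['22585166', 'your message will never be delivered'],
    ['22585424', 'message has been delivered '],
]

# One Counter shared by every row, so counts accumulate across rows
# instead of restarting at zero for each row.
total = Counter()
for _row_id, text in rows:
    total.update(text.split())

top3 = total.most_common(3)
for word, count in top3:
    print(f'{word}: {count}')
# message: 3
# delivered: 2
# communication: 1
```

Note that 'communication' only makes third place because it is the first of many words that occur once; on Python 3.7+ most_common() lists tied words in the order they were first seen.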

Tags: python python-3.x

【Solution 1】:

from collections import Counter, defaultdict

messages = [
    ['364616', 'baa baa black sheep'],
    ['364617', 'have you any wool'],
    ['364618', 'yes sir yes sir'],
    ['364619', 'three bags full'],
    ['364620', 'one for the master'],
    ['364621', 'and one for the dame'],
    ['364622', 'and one for the little boy'],
    ['364623', 'who lives down the lane']]

word_counts = Counter()
word_to_msgids = defaultdict(set)

for msgid, msg in messages:
    for word in msg.split(): # use set(msg.split()) to drop duplicates
        word_counts[word] += 1
        word_to_msgids[word].add(msgid)

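# Report the 8 most common words along with the messages that contain them.
for word, count in word_counts.most_common(8):
    msgids = ', '.join(word_to_msgids[word])  # a set, so the id order is arbitrary
    print('"{}" appears {} times in messages {}'.format(word, count, msgids))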

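Output (the order of tied words and of the ids on each line can differ between runs and Python versions):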

"the" appears 4 times in messages 364621, 364620, 364623, 364622
"one" appears 3 times in messages 364621, 364620, 364622
"for" appears 3 times in messages 364621, 364620, 364622
"and" appears 2 times in messages 364621, 364622
"yes" appears 2 times in messages 364618
"sir" appears 2 times in messages 364618
"baa" appears 2 times in messages 364616
"down" appears 1 times in messages 364623
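
For reference, the same approach applied to the rows from the question might look like the sketch below. The names rows, word_to_ids and report are illustrative, and sorted() is used on each id set only to make the printed order repeatable; it is not part of the answer above.

```python
from collections import Counter, defaultdict

# Rows copied from the question; each row is [id, text].
rows = [
    ['22574999', 'communication was sent'],
    ['22582857', 'message originated from an industrial area in pacoima'],
    ['22585166', 'your message will never be delivered'],
    ['22585424', 'message has been delivered '],
]

word_counts = Counter()          # word -> total occurrences over all rows
word_to_ids = defaultdict(set)   # word -> ids of the rows that contain it

for row_id, text in rows:
    for word in text.split():
        word_counts[word] += 1
        word_to_ids[word].add(row_id)

# Pair each of the top 3 words with a sorted list of the ids it appears in.
report = [
    (word, count, sorted(word_to_ids[word]))
    for word, count in word_counts.most_common(3)
]

for word, count, ids in report:
    print('{}: {}  is included in ({})'.format(word, count, ', '.join(ids)))
```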

Note: I don't think you need to count the words in each message separately. If you really do need that:

msgid_to_word_counts = {msgid:Counter(s.split()) for msgid, s in messages}

If you want to count 'baa' once rather than twice in 'baa baa black sheep', use set to remove the duplicates from the result of split():

msgid_to_word_counts = {msgid:Counter(set(s.split())) for msgid, s in messages}
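
A small illustration of the difference between the two variants, on a single message. The variable names are hypothetical.

```python
from collections import Counter

text = 'baa baa black sheep'

# Counting every occurrence: 'baa' is seen twice.
per_occurrence = Counter(text.split())

# Counting each distinct word once per message: set() drops the repeat first.
per_message = Counter(set(text.split()))

print(per_occurrence['baa'])  # 2
print(per_message['baa'])     # 1
```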

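【Comments】:

I want to merge all the rows into one and find the top 3. How do I do that?
The union of all the rows is the variable I named word_counts, which I built from scratch. If you want to build it from the msgid_to_word_counts I showed at the end instead, first create combined_word_counts = Counter(), then run for cntr in msgid_to_word_counts.values(): combined_word_counts.update(cntr). (combined_word_counts will be identical to the word_counts shown earlier in my answer, just derived more indirectly.)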
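
Spelled out as a runnable sketch, the merge described in that comment could look like this. Only three of the answer's messages are used here to keep the example short.

```python
from collections import Counter

# A subset of the messages from the answer.
messages = [
    ['364616', 'baa baa black sheep'],
    ['364618', 'yes sir yes sir'],
    ['364620', 'one for the master'],
]

# Per-message counters, as in the answer's note.
msgid_to_word_counts = {msgid: Counter(s.split()) for msgid, s in messages}

# Merge them: Counter.update() adds counts rather than replacing them.
combined_word_counts = Counter()
for cntr in msgid_to_word_counts.values():
    combined_word_counts.update(cntr)

# The direct route, counting all words in one pass.
word_counts = Counter()
for _msgid, s in messages:
    word_counts.update(s.split())

print(combined_word_counts == word_counts)  # True
print(combined_word_counts.most_common(3))
```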