Creating a dictionary of the words in text

【Question title】: Creating a dictionary of the words in text
【Posted】: 2015-11-05 19:15:47
【Question】:

I want to create a dictionary containing all the unique words in a text. The key is the word and the value is the word's frequency.

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_listT = str(' '.join(dtt)).split()
wordsT = {v:k for (k, v) in enumerate(word_listT)}
print(wordsT)

I expected this:

{'we': 2, 'is': 1, 'peace': 2, 'at': 2, 'want': 2, 'our': 3, 'home': 4, 'you': 1, 'went': 1, 'nice': 1}

However, I got this:

{'we': 14, 'is': 12, 'peace': 16, 'at': 17, 'want': 15, 'our': 10, 'home': 18, 'you': 0, 'went': 7, 'nice': 13}

Clearly I am misusing a function or doing something else wrong.

Please help.

【Comments】:

Tags: python dictionary text enumerate

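【Answer 1】:

The problem with what you are doing is that you are storing the list index at which each word is located, not a count of those words.

To get the counts, you can simply use collections.Counter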

from collections import Counter

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
counted_words = Counter(' '.join(dtt).split())
# print the Counter to inspect the counted words
print(counted_words)

>>> Counter({'home': 4, 'our': 3, 'we': 2, 'peace': 2, 'at': 2, 'want': 2, 'is': 1, 'you': 1, 'went': 1, 'nice': 1})

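SOME CLEANUP: as mentioned in the comments

The str() call around ' '.join(dtt) is unnecessary, because join() already returns a string.

You can also drop the list assignment and build the Counter on the same line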

Counter(' '.join(dtt).split())
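
Counter is a subclass of dict, so the result can be used wherever a plain dictionary is expected. The following is a minimal sketch, assuming you want exactly the plain-dict shape shown in the question; the names word_counts and plain are only illustrative and do not come from the original post.

```python
from collections import Counter

dtt = ['you want home at our peace', 'we went our home',
       'our home is nice', 'we want peace at home']

# Count every whitespace-separated word across all sentences.
word_counts = Counter(' '.join(dtt).split())

# Counter subclasses dict, so dict() gives a plain dictionary of counts.
plain = dict(word_counts)
print(plain['home'])  # 4

# most_common(n) lists the n highest counts first.
print(word_counts.most_common(2))  # [('home', 4), ('our', 3)]
```

If only the counts are needed, the conversion to dict is optional, since lookups such as word_counts['home'] already work on the Counter itself.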

Some more detail about your list indexes. First you have to understand what your code is doing.

dtt = [
    'you want home at our peace', 
    'we went our home', 
    'our home is nice', 
    'we want peace at home'
]

Note that there are 19 words here; len(word_listT) returns 19. On the next line, word_listT = str(' '.join(dtt)).split(), you are building a list of all the words, like this:

word_listT = [
    'you', 
    'want', 
    'home', 
    'at', 
    'our', 
    'peace', 
    'we', 
    'went', 
    'our', 
    'home', 
    'our', 
    'home', 
    'is', 
    'nice', 
    'we', 
    'want', 
    'peace', 
    'at', 
    'home'
]

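Count again: 19 words. The last word is 'home'. List indexes start at 0, so 0 through 18 makes 19 elements, and yourlist[18] is 'home'. A dictionary keeps only one value per key, so each repeated word ends up with the index of its last occurrence. This has nothing to do with string positions or anything like that, it is just the index into the new list. :)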
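
The index behaviour described above can be checked with a short sketch. It rebuilds word_listT from the question's data; the names pairs, last_index and home_positions are illustrative only.

```python
dtt = ['you want home at our peace', 'we went our home',
       'our home is nice', 'we want peace at home']
word_listT = ' '.join(dtt).split()

# enumerate yields (index, word) pairs, starting at index 0.
pairs = list(enumerate(word_listT))
print(pairs[:3])  # [(0, 'you'), (1, 'want'), (2, 'home')]

# A later pair with the same word overwrites the earlier entry,
# so every word ends up mapped to the index of its last occurrence.
last_index = {word: index for index, word in pairs}
print(last_index['home'])  # 18

# 'home' appears at four positions; only the last one survives in the dict.
home_positions = [index for index, word in pairs if word == 'home']
print(home_positions)  # [2, 9, 11, 18]
```

The dictionary has 10 keys while the list has 19 entries, which is why an index as high as 18 can show up next to a word.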

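【Comments】:

@Toly Of course! Glad I could help! You should take a look at the collections module, it has a lot of useful tools. Counter is one of them, and I also use defaultdict all the time. If you have any more questions, feel free to ask and I will help if I can :)
@JohnRuddell join() returns a string, so why convert it to a string again? Counter(' '.join(dtt).split()) will do
@helloV Sorry, I just copied what the OP did there without really reading it. The Counter part is what I meant to add. But yes, str() is completely unnecessary there
@JohnRuddell - one question. I understand that the numbers are indexes (in my code). But why is an index 18 when the dictionary has fewer than 18 words? Is it taking the index from the original string? If so, how do I make sure the index comes from the dictionary rather than the original string?
@Toly It is easier to explain with code, so take a look at the edit I made. I hope it helps. If you still don't understand or have other questions, let me know :)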

【Answer 2】:

Try this:

from collections import defaultdict

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_list = ' '.join(dtt).split()  # join() already returns a str
d = defaultdict(int)  # missing keys start at 0
for word in word_list:
    d[word] += 1
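
The loop needs no explicit initialization because defaultdict(int) calls int() for a missing key, which returns 0. A minimal sketch of that behaviour, using a short made-up sentence rather than the question's data (the name counts is illustrative):

```python
from collections import defaultdict

counts = defaultdict(int)

# Looking up a missing key calls int(), stores the result and returns it.
print(counts['home'])  # 0

# Because missing keys start at 0, += 1 works on the first occurrence too.
for word in 'we went our home our home'.split():
    counts[word] += 1

print(counts['our'])  # 2
```

Note that a plain lookup on a defaultdict inserts the key, so membership tests such as 'home' in counts are the safer way to check for a word without adding it.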

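【Comments】:

【Answer 3】:

enumerate yields each word together with its index, not its frequency. That is, when you build the wordsT dictionary, each value is actually the index in word_listT of the last occurrence of its key. To do what you want, a for loop is probably the most straightforward approach.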

wordsT = {}
for word in word_listT:
    try:
        # the word has been seen before: bump its count
        wordsT[word] += 1
    except KeyError:
        # first occurrence: start the count at 1
        wordsT[word] = 1
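
The same loop can also be written without try/except by using dict.get with a default of 0. This is an alternative sketch rather than the answer's own code; word_listT is rebuilt from the question's data so the block runs on its own.

```python
dtt = ['you want home at our peace', 'we went our home',
       'our home is nice', 'we want peace at home']
word_listT = ' '.join(dtt).split()

wordsT = {}
for word in word_listT:
    # get() returns 0 when the word has not been seen yet.
    wordsT[word] = wordsT.get(word, 0) + 1

print(wordsT['our'])  # 3
```

Both versions produce the same counts; which one reads better is largely a matter of taste.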

【Comments】: