Counting the number of words in a file with Python: answers

【Question title】: Counting number of words in Python file
【Posted】: 2018-01-24 11:38:04
【Question description】:

I am trying to count the number of occurrences of several words in a file.

Here is my code:

#!/usr/bin/env python

file = open('my_output', 'r')

word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))

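The problem is that the code only counts the occurrences of word1. How can it be fixed so that it also counts word2 and word3?

Thanks!

【Discussion】:

Tags: python python-3.x

【Solution 1】:

I think this code will work better if you do it this way, rather than reading and splitting the file repeatedly [this way you get the frequency of every word found in the file]: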

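# Read and split the file once; "with" closes it afterwards.
with open('my_output', 'r') as file:
    s = file.read().split()

# Count each distinct word once instead of recounting every duplicate.
tf = {}
for i in set(s):
    tf[i] = s.count(i)
print(tf)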

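【Discussion】:

【Solution 2】:

The main problem is that file.read() consumes the file. As a result, the second and third calls return an empty string, so you end up searching empty text. The simplest solution is to read the file once (if it is not too large) and then search only the text that was read: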

#!/usr/bin/env python

with open('my_output', 'r') as file:
    text = file.read()

word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))

To improve performance, you can also split only once:

#!/usr/bin/env python

with open('my_output', 'r') as file:
    split_text = file.read().split()

word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))

Using with also ensures that the file is closed properly once it has been read.
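A minimal sketch of that closing behaviour, assuming a throwaway temporary file in place of my_output (the file content and the names closed_inside and closed_after are made up for illustration):

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('wordA wordB wordA')
    path = tmp.name

with open(path, 'r') as file:
    split_text = file.read().split()
    closed_inside = file.closed   # still open inside the with block

closed_after = file.closed        # closed automatically when the block exits

print(closed_inside, closed_after)
print('wordA', split_text.count('wordA'))

os.remove(path)
```

The list split_text remains usable after the block because it was built while the file was still open.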

【Discussion】:

【Solution 3】:

You can try this:

# Read and split the file once; "with" closes it afterwards.
with open('my_output', 'r') as file:
    splitFile = file.read().split()

lst = ['wordA', 'wordB', 'wordC']

for wrd in lst:
    print(wrd, splitFile.count(wrd))

【Discussion】:

【Solution 4】:

A short solution using a collections.Counter object:

import collections

with open('my_output', 'r') as f:    
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])

For the following sample line of text:

'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'

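we would get the following output (the order of the lines can differ between Python versions, since older versions do not preserve dictionary insertion order):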

wordB 1
wordC 2
wordA 3

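【Discussion】:

【Solution 5】:

In your code the file is consumed (exhausted) by the first read, so the following lines cannot return any counts: the first file.read() reads the entire content of the file and returns it as a string. The second file.read() has nothing left to read and just returns the empty string '', and the same goes for the third file.read().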
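The exhaustion described here can be observed directly. The following is only an illustrative sketch, assuming a temporary file with made-up content; the final seek(0) call is shown as one way to rewind the file, not as something the original code uses.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('wordA wordB wordA')
    path = tmp.name

with open(path, 'r') as file:
    first = file.read()    # the whole content
    second = file.read()   # '' because the read position is at the end
    file.seek(0)           # move the read position back to the start
    third = file.read()    # the whole content again

print(repr(first))
print(repr(second))
print(repr(third))

os.remove(path)
```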

Here is a version that should do what you want:

from collections import Counter

counter = Counter()

with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)

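You may need to do some preprocessing (to get rid of special characters such as , and . and so on).

Counter is in the Python standard library and is very useful for this kind of thing.

Note that this way you iterate over the file only once and never have to hold the whole file in memory.

If you only want to track certain words, you can select just those instead of passing the whole line to the counter: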

from collections import Counter
import string

counter = Counter()

words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)

with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)

I have also included an example of preprocessing: punctuation is removed before counting.
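To see what the translation table does in isolation, here is a small sketch on a made-up line (the sample string and the names line and cleaned are assumptions for illustration):

```python
import string

# Table that deletes every character listed in string.punctuation.
chars_to_remove = str.maketrans('', '', string.punctuation)

line = 'wordA, wordB. (wordC) wordA!'
cleaned = line.translate(chars_to_remove)

print(cleaned)
print(cleaned.split())
```

Without this step, split() would produce tokens such as 'wordA,' and '(wordC)', which would not match the tracked words. Note that string.punctuation covers ASCII punctuation only.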

【Discussion】:

【Solution 6】:

from collections import Counter

# Create an empty word_list which stores each of the words from a line.
word_list = []

# file_handle refers to the file object (read-only access is enough here)
file_handle = open(r'my_file.txt', 'r')

# read all the lines in the file
for line in file_handle.readlines():

    # get each line,
    # split each line into a list of words,
    # extend word_list with the returned words

    word_list.extend(line.split())

# close the file object
file_handle.close()

# Pass word_list to Counter() and get a dictionary of word counts
dictionary_of_words = Counter(word_list)

# print() call syntax, since the question is tagged python-3.x
print(dictionary_of_words)

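【Discussion】:

Please consider improving your answer for future readers by adding some explanation rather than only code.
@etov, good suggestion. I have included an explanation of each step.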