Get word count from a file ignoring comment lines in Python: answers

【Question title】: Get word count from a file ignoring comment lines in python
【Posted】: 2017-07-26 09:34:34
【Question description】:

I am trying to count the occurrences of a word in a file using Python, but I have to ignore the comments in the file.

I have a function like this:

def getWordCount(file_name, word):
  count = file_name.read().count(word)
  file_name.seek(0)
  return count

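How can I ignore the lines that start with #?

I know this can be done by reading the file line by line, as described in this question. Is there a faster, more Pythonic way to do it?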

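【Question discussion】:

Can a line contain content followed by a comment? Like foo # comment?
file_name.read() is not very Pythonic. file_name suggests this is a string holding a file name, but .read() suggests it is a file object. As for your question: have you considered reading the file line by line?
@WillemVanOnsem Sorry for the mistake. Yes, they can.
@kazemakase I am passing a file object but could not name it file, so I named it file_name.
Well, you cannot count faster than by looking at every word. Whether you do it line by line or in bulk has some impact on performance, but in terms of big O, every approach is at least O(n)...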

Tags: python algorithm file file-io io

【Solution 1】:

You can use a regular expression to filter out the comments:

import re

text = """ This line contains a word. # empty
This line contains two: word word  # word
newline
# another word
"""

filtered = ''.join(re.split('#.*', text))
print(filtered)
#  This line contains a word. 
# This line contains two: word word  
# newline

print(text.count('word'))  # 5
print(filtered.count('word'))  # 3

Just replace text with your file_name.read().
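As a minimal sketch of that substitution, assuming the caller passes an open text-mode file object as in the question: the helper name count_word_ignoring_comments and the in-memory sample are hypothetical, introduced only for illustration. Note that this treats every # as the start of a comment, including one that appears inside ordinary content.

```python
import io
import re


def count_word_ignoring_comments(file_obj, word):
    """Count substring occurrences of `word`, ignoring text after '#' on each line.

    `file_obj` is assumed to be an open, seekable text-mode file object.
    """
    # Remove everything from '#' to the end of the line, on every line.
    filtered = re.sub(r'#.*', '', file_obj.read())
    file_obj.seek(0)  # rewind so the caller can read the file again
    return filtered.count(word)


# io.StringIO stands in for a real file here (an assumption for the demo).
sample = io.StringIO("a word  # word\n# another word\nword word\n")
print(count_word_ignoring_comments(sample, 'word'))
```

Like the original function, this counts substrings, so 'word' is also found inside 'words' or 'keyword'.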

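【Discussion】:

The OP stated that comments are lines starting with a hash. You are also filtering out comments that start in the middle of a line (which is of course more typical for real comments).
@Alfe Right. The OP clarified in the comments that content lines can also be followed by a comment.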

【Solution 2】:

A more Pythonic version would look like this:

def getWordCount(file_name, word):
  with open(file_name) as wordFile:
    return sum(line.count(word)
      for line in wordFile
      if not line.startswith('#'))

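Faster (which has nothing to do with being Pythonic) might be to read the whole file into one string and then use a regular expression to find the word in lines that do not start with a hash.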
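One way that whole-file idea could look is sketched below. It assumes the text is already in memory as a string; the names COMMENT_LINE and count_word_skip_hash_lines are hypothetical. Whether it is actually faster than the generator version would need to be measured on real data.

```python
import re

# With re.MULTILINE, '^' and '$' match at the start and end of every line,
# so this pattern matches a whole line whose first character is '#'.
COMMENT_LINE = re.compile(r'^#.*$', re.MULTILINE)


def count_word_skip_hash_lines(text, word):
    """Count substring occurrences of `word` outside lines that start with '#'."""
    return COMMENT_LINE.sub('', text).count(word)


text = "word here  # word\n# word word\nword\n"
print(count_word_skip_hash_lines(text, 'word'))
```

Unlike the regex in Solution 1, this keeps trailing comments on content lines, so the word after the # on the first sample line is still counted.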

【Discussion】:

Since Python comments may be preceded by whitespace before the '#', you should probably use line.strip().startswith('#').
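A sketch of that suggestion applied to the line-by-line approach, assuming the argument is any iterable of lines (such as an open file object); the function name is hypothetical. lstrip() is used here because only leading whitespace matters for the check, and strip() would work equally well.

```python
import io


def count_word_skip_indented_comments(lines, word):
    """Sum substring counts over lines whose first non-blank character is not '#'."""
    return sum(
        line.count(word)
        for line in lines
        if not line.lstrip().startswith('#')
    )


# io.StringIO stands in for a real file here (an assumption for the demo).
sample = io.StringIO("word\n    # indented word\n# word\nfoo word word\n")
print(count_word_skip_indented_comments(sample, 'word'))
```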

【Solution 3】:

One option is to create a copy of the file without the comment lines and then run your code on it, for example:

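# Copy every line that is not a comment line into a new file.
# open() is used because the file() builtin no longer exists in Python 3.
with open('./file_with_comment.txt') as infile, open('./newfile.txt', 'w') as newopen:
    for line in infile:
        li = line.strip()
        if not li.startswith("#"):
            newopen.write(line)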

This drops every line that starts with # (leading whitespace is ignored). Then run your function on newfile.txt:

def getWordCount(file_name, word):
  count = file_name.read().count(word)
  file_name.seek(0)
  return count

【Discussion】: