Counting number of words and unique words from txt file - Python (answers)

【Question title】: Counting number of words and unique words from txt file - Python
【Posted】: 2015-06-02 10:43:48
【Question description】:

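I am trying to read a text file, strip the punctuation, make everything lowercase, and then print the total number of words, the total number of unique words (e.g. if "a" appears in the text 20 times, it is counted only once), and finally the most frequent words together with their frequency (i.e. a:20).

I realize there are similar questions on StackOverflow, but I am a beginner trying to solve this with as few imports as possible, and I would like to know whether there is a way to code this without importing something like collections.

My code is below, but I don't understand why I am not getting the answer I need. It prints the whole text file (one word per line, with all punctuation removed) and then prints: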

e 1
n 1
N 1
o 1

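I think the word "None" is being split into characters, each printed with its frequency. Why is my code giving me this answer, and what can I do to change it?

The code is as follows: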

file=open("C:\\Users\\Documents\\AllSonnets.txt", "r")


def strip_sonnets():
    import string
    new_file=file.read().split()
    for words in new_file:
        data=words.translate(string.punctuation)
        data=data.lower()
        data=data.strip(".")
        data=data.strip(",")
        data=data.strip("?")
        data=data.strip(";")
        data=data.strip("!")
        data=data.replace("'","")
        data=data.replace('"',"")
        data=data.strip(":")
        print(data)

new_file=strip_sonnets()
new_file=str(new_file)

count={}
for w in new_file:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print (word, times)

【Question discussion】:

Tags: python file text strip

【Solution 1】:

If you only want to remove punctuation from the end of each word, you don't need translate. A collections.Counter dict will also count the words for you:

from collections import Counter
from string import punctuation


with open("in.txt") as f:
    c = Counter(word.rstrip(punctuation) for line in f for word in line.lower().split())

# print each word and how many times it appears
for k, freq in c.items():
    print(k, freq)

To see the words ordered from most to least frequent, you can use .most_common():

for k,v in c.most_common():
    print(k,v)
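
The question also asks for the total word count and the number of unique words, and both can be read off the Counter. The following is a minimal sketch, assuming an in-memory sample string (hypothetical, standing in for the file contents) so that it runs without any file on disk:

```python
from collections import Counter
from string import punctuation

# Hypothetical sample text standing in for the contents of the file
text = "To be, or not to be: that is the question. Be!"

c = Counter(word.rstrip(punctuation) for word in text.lower().split())

total_words = sum(c.values())             # every occurrence counts
unique_words = len(c)                     # each distinct word counts once
top_word, top_freq = c.most_common(1)[0]  # single most frequent word

print(total_words, unique_words, top_word, top_freq)
```

With a real file, the same three expressions apply to the Counter built from the file in the snippet above.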

Without importing Counter, using dict.get (punctuation is still the name imported from string above):

c = {}
with open("in.txt") as f:
    for line in f:
        for word in line.lower().split():
            key = word.rstrip(punctuation)
            c[key] = c.get(key, 0) + 1

Then sort by frequency:

from operator import itemgetter

for k,v in sorted(c.items(),key=itemgetter(1),reverse=True):
    print(k,v)
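
For the "no imports at all" constraint in the question, the pieces above can be combined roughly as follows. This is only a sketch: the helper name count_words, the PUNCT constant and the sample lines are assumptions made for illustration, and PUNCT covers just the characters the question's code tried to remove.

```python
# Assumed subset of punctuation, taken from the characters the question strips
PUNCT = ".,?;!:'\""

def count_words(lines):
    """Return a dict mapping each cleaned, lowercased word to its count."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            key = word.strip(PUNCT)  # removes leading/trailing punctuation only
            if key:                  # skip tokens that were pure punctuation
                counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical sample lines; an open file object can be passed the same way
sample = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely, and more temperate:",
]

counts = count_words(sample)
total = sum(counts.values())   # total number of words
unique = len(counts)           # number of distinct words
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(total, unique, ranked[0])
```

Using a lambda as the sort key avoids the operator import, at the cost of being slightly less terse than itemgetter(1).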

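The reason you see None is that you set new_file=strip_sonnets() and your function returns nothing; any function without an explicit return value returns None by default.

You then set new_file=str(new_file), so when you iterate with for w in new_file you are iterating over each character of the string "None".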
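
The effect can be reproduced in isolation. In this small sketch, no_return is a hypothetical stand-in for a function that does its work but never returns anything:

```python
def no_return():
    data = "some word"  # computed, but never returned

result = no_return()    # result is None
as_text = str(result)   # the four-character string "None"

counts = {}
for ch in as_text:      # iterates over 'N', 'o', 'n', 'e'
    counts[ch] = counts.get(ch, 0) + 1

print(counts)
```

Each of the four characters is counted once, which matches the output shown in the question.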

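You need to return the data (and drop the new_file=str(new_file) line, so that you iterate over words rather than characters):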

def strip_sonnets():
    cleaned = []  # collect every processed word
    new_file=file.read().split()
    for words in new_file:
        # translate(string.punctuation) removed: a plain string is not a valid translation table
        data=words.lower()
        data=data.strip(".")
        data=data.strip(",")
        data=data.strip("?")
        data=data.strip(";")
        data=data.strip("!")
        data=data.replace("'","")
        data=data.replace('"',"")
        data=data.strip(":")
        cleaned.append(data)
    return cleaned # return all the words, not just the last one

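I would simplify your function into a generator that yields every word lowercased and with trailing punctuation stripped. Writing it as a generator function keeps the file open while the words are being consumed:

path = "C:\\Users\\Documents\\AllSonnets.txt"

def strip_sonnets():
    # the with block stays active until the generator is exhausted
    with open(path, "r") as f:
        for line in f:
            for word in line.split():
                yield word.lower().rstrip(punctuation)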

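.rstrip(punctuation) does essentially what you were trying to do in your code with the repeated strip and replace calls, although it only removes punctuation from the end of each word.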
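
To make the difference between the available options concrete, here is a short comparison on a made-up token (the sample word is an assumption chosen to carry punctuation at both ends and in the middle):

```python
from string import punctuation

word = '"hello-world!",'  # hypothetical token

trailing_only = word.rstrip(punctuation)  # strips from the right end only
both_ends = word.strip(punctuation)       # strips from both ends
# str.maketrans with a third argument builds a table that deletes those characters
everywhere = word.translate(str.maketrans("", "", punctuation))

print(trailing_only)  # "hello-world   (leading quote kept)
print(both_ends)      # hello-world
print(everywhere)     # helloworld
```

Which one fits depends on whether internal characters such as apostrophes and hyphens should survive; strip and rstrip keep them, while translate removes them.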

【Discussion】:

@NizamMohamed, probably because of your code dump. You did not answer the question, you just dumped code.