【Question title】: How to find the number of common words in a text file and delete them in Python?
【Posted】: 2019-09-17 07:37:53
【Question】:

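The task is:

First, find the total number of words in a text file.
Second, remove the common words: a, an, and, to, in, at, but, ... (writing out a list of these words is allowed).
Third, find the number of remaining words (unique words).
Finally, list them.

The file name should be passed as an argument to the function.

I have already done the first part of the task: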

import re

file = open('text.txt', 'r', encoding = 'latin-1')

word_list = file.read().split()

for x in word_list:
    print(x)

res = len(word_list)
print ('The number of words in the text:' + str(res))


def uncommonWords (file):
    uncommonwords = (list(file))
    for i in uncommonwords:
        i += 1
        print (i)

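The code prints everything up to the word count, and after that nothing appears.

【Comments on the question】:

Well, you define a function (uncommonWords) but never call it, so this is expected.
If you mean that I should try "return file" at the end, I tried that too, but it did not work.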
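
To illustrate the comment above: a function body only runs when the function is called, and in the question's code `i += 1` would raise a `TypeError` anyway, because `i` is a string. Below is a minimal sketch of a function that takes the file name as its argument and is then actually called. The name `count_words` and the temporary sample file are assumptions made for illustration and are not part of the original post.

```python
import os
import tempfile

def count_words(filename):
    """Return the list of whitespace-separated words in the file."""
    with open(filename, encoding="latin-1") as f:
        return f.read().split()

# Hypothetical sample file so that the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="latin-1") as tmp:
    tmp.write("a cat and a dog")
    sample_path = tmp.name

words = count_words(sample_path)  # the call is what makes the body run
print("The number of words in the text:", len(words))
os.remove(sample_path)
```

Returning a value (instead of printing inside the function) lets the caller reuse the word list for the later steps.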

Tags: python-3.x

【Solution 1】:

You can do it like this:

# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words

print(list(unique_words))
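
One thing to watch for with the set difference above is that it is case-sensitive, so "The" is not removed by a stop word "the". The sketch below compares the raw result with a lowercased one. The sample sentence and the variable names are made up for illustration, and lowercasing everything is an assumption that may not suit every text.

```python
stop_words = {"the", "and", "in"}

# Made-up sample text standing in for the file contents
text = "The cat and the dog in The hat"
words = text.split()

# Case-sensitive difference: "The" survives
raw_unique = set(words) - stop_words

# Lowercase each word first so "The" and "the" are treated alike
normalized_unique = {w.lower() for w in words} - stop_words

print(sorted(raw_unique))
print(sorted(normalized_unique))
```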

【Comments】:

【Solution 2】:

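First, you may want to get rid of punctuation. As shown in this answer, you can do the following:

import re
# text is assumed to be the list of words, e.g. word_list from the question
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]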
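
Note that the pattern above keeps any token containing at least one letter or digit. It therefore drops tokens that consist only of punctuation, but leaves punctuation that is attached to a word. The sketch below shows both cases on made-up sample strings. Stripping with `string.punctuation` is an addition for illustration, not part of the original answer.

```python
import re
import string

nonPunct = re.compile('.*[A-Za-z0-9].*')

# Made-up sample where punctuation stands alone as separate tokens
spaced = "Hello , world ! -- it's 42 .".split()
kept = [w for w in spaced if nonPunct.match(w)]
print(kept)

# Made-up sample where punctuation is glued to the words
glued = "Hello, world!".split()
still_glued = [w for w in glued if nonPunct.match(w)]
print(still_glued)

# Assumption: stripping leading/trailing punctuation is acceptable here
stripped = [w.strip(string.punctuation) for w in glued]
print(stripped)
```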

Then you can count the words:

from collections import Counter
counts = Counter(filtered)

You can then get the list of unique words with list(counts.keys()) and skip the words you do not want:

# common_words is your own list (or set) of words to ignore
[word for word in counts if word not in common_words]
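
Because a `Counter` stores how often each word occurs, it can also report how many occurrences of common words were removed, which the question asks about. The following is a small sketch on a made-up sentence. The contents of `common_words` and the names `removed` and `remaining` are assumptions for illustration.

```python
from collections import Counter

common_words = {"a", "an", "and", "to", "in", "at", "but"}

# Made-up sample standing in for the filtered word list
filtered = "a cat and a dog went to a park and sat".split()
counts = Counter(filtered)

total = sum(counts.values())  # all words in the text
# Occurrences that belong to common words
removed = sum(n for w, n in counts.items() if w in common_words)
# Unique words that are left, in order of first appearance
remaining = [w for w in counts if w not in common_words]

print(total, removed, len(remaining))
print(remaining)
```

Keeping `common_words` as a set makes each membership test cheap, which helps on larger files.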

Hope this answers your question.

【Comments】: