Remove stopwords from a set of files: answers

【Question title】: Remove stopwords from set of files
【Posted】: 2021-03-21 20:38:18
【Question description】:

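I am trying to read in all the files in a directory, open a file that contains stopwords, go through each file, remove the stopwords from it, and then produce a copy of every file with the stopwords removed. I can read all the files and print them as an array, but I am stuck on two steps: removing the stopwords, and writing the results to a new set of files. The last three lines of code are an example that produces only a single text file, but I need some kind of loop to produce all of them.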

import pathlib

stop_words = open("StopWordList.txt")
stop_words.read()

for path in pathlib.Path(r'C:\Users\Usuario\Desktop\HelloWorld\emails').iterdir():
    if path.is_file():
        current_file = open(path, "r")
        lines = current_file.read()
        words = lines.split()

        for y in stop_words:
            if not y in stop_words:

                appendFile = open('filteredtext.txt', 'a')
                appendFile.write(" "+y)
                appendFile.close()

【Comments on the question】:

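See this link, geeksforgeeks.org/removing-stop-words-nltk-python; I believe it is what you are looking for.
You take y from stop_words and then check whether y is in stop_words? That makes no sense. You should take y from words.
Maybe you should take words, use it to build a new list without the stopwords, join all the words into one string, and then save that string to the file. That is probably faster than writing each word separately.
By the way: what if a word ends with a character such as ., ,, ! or ?. You have to strip it before checking against the stopwords. You should also compare in lowercase. Maybe you should use the nltk module for this.
You should use path to generate a new name for the result instead of 'filteredtext.txt'.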
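
The last comment's suggestion, deriving each output name from the input path, could be sketched as follows. The helper name `filtered_name` and the `-filtered` tag are illustrative assumptions, not part of the question's code.

```python
from pathlib import Path


def filtered_name(path: Path, tag: str = "-filtered") -> Path:
    """Return a sibling path such as 'mail-filtered.txt' for 'mail.txt'."""
    # path.stem is the file name without its last extension,
    # path.suffix is that extension (empty if there is none)
    return path.with_name(path.stem + tag + path.suffix)


print(filtered_name(Path("emails") / "mail.txt"))
```

Because the new name is built from `path`, every input file gets its own output file instead of all results being appended to one `filteredtext.txt`.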

Tags: python

【Solution 1】:

I haven't tested this (I don't have a file with stopwords), but I would do it like this:

import pathlib
import sys


if len(sys.argv) > 1:
    folder = sys.argv[1] # get folder as parameter
else:
    folder = r'C:\Users\Usuario\Desktop\HelloWorld\emails'


with open("StopWordList.txt") as stop_file:
    # set of lowercase stopwords, so `in` tests whole words rather than substrings
    stop_words = set(stop_file.read().lower().split())

for path in pathlib.Path(folder).iterdir():
    if path.is_file():

        # --- read all text at once ---

        input_file = open(path)  # mode defaults to 'r'
        text = input_file.read()
        input_file.close()

        original_words = text.split()  # text -> words

        # --- remove stopwords ---

        filtered_words = []

        for word in original_words:
            temp_word = word.lower().rstrip('.,!?')
            if temp_word not in stop_words:   # check lowercase without `.,!?`
                filtered_words.append(word)  # keep original word

        # --- save all text at once ---

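        # a Path cannot be concatenated with a str, so build the new name from path.name
        output_path = path.with_name(path.name + '-filtered')  # create new filename

        text = " ".join(filtered_words)  # words -> text

        output_file = open(output_path, 'w')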
        output_file.write(text)
        output_file.close()

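The saved text keeps the original words: .,!? are not removed from the output and nothing is converted to lowercase; both are applied only for the comparison.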
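
To illustrate that point, here is a small self-contained check of the same filtering rule on a made-up sentence. The sample text and the three-word stopword set are assumptions for demonstration only.

```python
def filter_words(words, stop_words):
    """Keep words whose lowercase, punctuation-stripped form is not a stopword."""
    kept = []
    for word in words:
        key = word.lower().rstrip('.,!?')  # used only for the comparison
        if key not in stop_words:
            kept.append(word)  # the original spelling is what gets kept
    return kept


stop_words = {"the", "is", "a"}  # hypothetical stopword set
sample = "The meeting is TODAY, not tomorrow!"
result = " ".join(filter_words(sample.split(), stop_words))
print(result)  # meeting TODAY, not tomorrow!
```

"The" is dropped even though it is capitalized, while "TODAY," keeps both its case and its trailing comma.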

Eventually you can split the code into functions:

import pathlib
import sys

# --- functions ---

def read_words(path):

    input_file = open(path)  # mode defaults to 'r'
    text = input_file.read()
    input_file.close()

    words = text.split()  # text -> words

    return words

def filter_words(words, stopwords):

    filtered_words = []

    for word in words:
        temp_word = word.lower().rstrip('.,!?')  # compare lowercase without `.,!?`
        if temp_word not in stopwords:
            filtered_words.append(word)  # keep original word

    return filtered_words

def write_words(path, words):

    text = " ".join(words)  # words -> text

    output_file = open(path, 'w')
    output_file.write(text)
    output_file.close()

# --- main ---

if len(sys.argv) > 1:
    folder = sys.argv[1] # get folder as parameter
else:
    folder = r'C:\Users\Usuario\Desktop\HelloWorld\emails'

with open("StopWordList.txt") as stop_file:
    # set of lowercase stopwords, so `in` tests whole words rather than substrings
    stop_words = set(stop_file.read().lower().split())

for path in pathlib.Path(folder).iterdir():
    if path.is_file():
        words = read_words(path)
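        words = filter_words(words, stop_words)
        # a Path cannot be concatenated with a str, so build the new name from path.name
        write_words(path.with_name(path.name + '-filtered'), words)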

【Comments】:

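You can get the stopwords from the nltk package: from nltk.corpus import stopwords, and then get the English stopwords with stop_words = set(stopwords.words('english'))
@SachinRajput I know nltk (I probably even have it installed), but I was too lazy to test this code :)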
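
The NLTK suggestion from the comment could look like the sketch below. It assumes `nltk` is installed and its `stopwords` corpus has been downloaded (for example with `nltk.download('stopwords')`). The function name `load_english_stop_words` and the small fallback set are illustrative assumptions, added only so the sketch still runs when NLTK is unavailable.

```python
def load_english_stop_words():
    """Return English stopwords from NLTK, or a tiny fallback set without it."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # ImportError: nltk is not installed
        # LookupError: the stopwords corpus has not been downloaded
        return {"a", "an", "and", "is", "of", "the", "to"}  # hypothetical fallback


stop_words = load_english_stop_words()
print("the" in stop_words)
```

The resulting set can be used in place of the contents of StopWordList.txt in the code above.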

【Solution 2】:

The following code is based on yours. Apart from a few modifications, I renamed some of the variables in your original code to be more meaningful:

import os
import pathlib

# preparing the StopWordList to be used with each file
stop_words = open("StopWordList.txt")
StopWordList_raw = stop_words.read()
stop_words.close()

StopWordList = StopWordList_raw.split()

# Paths of the source files and of the folder for the modified files
# (raw strings, because '\U' in a normal string literal is a syntax error)
files_path = r'C:\Users\Usuario\Desktop\HelloWorld\emails'
# another folder, which you should create first, to store the files after modification
files_after_path = r'C:\Users\Usuario\Desktop\HelloWorld\emails2'

for path in pathlib.Path(files_path).iterdir():
    if path.is_file():
        current_file = open(path, "r")
        # get the whole contents of the file
        original_file_content = current_file.read()
        current_file.close()

        # get a copy for modifications if stop words exist
        file_content_after = original_file_content

        # remove stop words by replacing each of them with '' - if they exist
        # (note: str.replace matches substrings, so a stop word inside a longer word is removed too)
        for w in StopWordList:
            file_content_after = file_content_after.replace(w, '')

        # if the original content was modified, save the file in the target path
        if original_file_content != file_content_after:
            file_name_after = os.path.join(files_after_path, path.name)
            print(file_name_after)
            TargetFile = open(file_name_after, 'w')
            TargetFile.write(file_content_after)
            TargetFile.close()
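
One caveat of plain `str.replace` is that it also removes stopwords that occur inside longer words, and it is case-sensitive. A possible alternative, not part of the answer above, is a regular expression with word boundaries. The function name `strip_stop_words` and the sample data are assumptions for illustration.

```python
import re


def strip_stop_words(text, stop_words):
    """Remove whole-word, case-insensitive matches of stop_words from text."""
    if not stop_words:
        return text
    # re.escape guards against stopwords that contain regex metacharacters
    alternatives = "|".join(re.escape(w) for w in stop_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    without = pattern.sub("", text)
    # collapse the runs of spaces left behind by the removed words
    return re.sub(r"[ \t]{2,}", " ", without).strip()


sample = "The other day is there"
print(sample.replace("the", ""))                # The or day is re
print(strip_stop_words(sample, ["the", "is"]))  # other day there
```

With `str.replace`, "other" and "there" are damaged and the capitalized "The" survives; with word boundaries only the standalone words "The" and "is" are removed.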

【Comments】: