Remove Stopwords from List and Read to TXT with NLTK

[Question title]: Remove Stopwords from List and Read to TXT with NLTK
[Posted]: 2018-04-09 11:22:58
[Question]:

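Hi all. I've been trying to remove stopwords from a list built by reading a PDF, but whenever I use nltk to remove the stopwords, whether from that list or from a new one, the TXT file still contains the original list. I wrote a separate program to test whether the stopword removal works, and it works fine there, but for some reason not in this case.

Is there a better way to do this? Any help would be greatly appreciated.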

import PyPDF2 as pdf

import nltk
from nltk.corpus import stopwords

stopping_words = set(stopwords.words('english'))

stop_words = list(stopping_words)

# open the PDF file in binary mode
file = open("C:\\Users\\Name\\Documents\\Data Analytics Club\\SampleBook-English2-Reading.pdf", "rb")

# creating a pdf reader object
fileReader = pdf.PdfFileReader(file)

# collect the extracted text of every page
textData = []

for pages in fileReader.pages:
    theText = pages.extractText()

    # for char in theText:
    #   theText.replace(char, "\n")

    textData.append(theText)

final_list = []

for i in textData:
    if i in stopwords.words('english'):
        textData.remove(i)
    final_list.append(i.strip('\n'))

# filtered_word_list = final_list[:] #make a copy of the word_list

# for word in final_list: # iterate over word_list
#   if word in stopwords.words('english'):
#       final_list.remove(word) # remove word from filtered_word_list if it is a stopword

# filtered_words = [word for word in final_list if word not in stop_words]

# [s.strip('\n') for s in theText]
# [s.replace('\n', '') for s in theText]


# text_data = []

# for elem in textData:
#         text_data.extend(elem.strip().split('n'))  

# for line in textData:
#     textData.append(line.strip().split('\n'))
#--------------------------------------------------------------------

import os.path

save_path = "C:\\Users\\Name\\Documents\\Data Analytics Club"

name_of_file = input("What is the name of the file: ")

completeName = os.path.join(save_path, name_of_file + ".txt")   

file1 = open(completeName, "w")

# file1.write(str(final_list))

for line in final_list:
    file1.write(line)

file1.close()

[Question comments]:

Tags: python nltk stop-words

[Solution 1]:

The problem is in these lines:

if i in stopwords.words('english'):
    textData.remove(i)

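You are removing only one occurrence of the word. According to the Python documentation, list.remove deletes only the first matching item.

To remove every occurrence, you probably want this instead: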

Python 2

textData = filter(lambda x: x != i, textData)

Python 3

textData = list(filter(lambda x: x != i, textData))
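To illustrate the difference between list.remove and filtering, here is a minimal sketch. The sample list is made up for illustration and does not come from the question's PDF.

```python
# Hypothetical sample data, not taken from the question's PDF.
words = ["the", "cat", "the", "hat"]

# list.remove deletes only the first matching item.
removed_once = list(words)
removed_once.remove("the")

# Filtering builds a new list without any of the matching items.
filtered = [w for w in words if w != "the"]

print(removed_once)  # ['cat', 'the', 'hat']
print(filtered)      # ['cat', 'hat']
```

Note that filter and list comprehensions return a new list instead of modifying the original, so the result has to be assigned to a name.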

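Edit

I realized only belatedly that you are removing elements from the list while iterating over it. You probably don't want to do that, because the loop then skips elements.

Instead, what you want to do is: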

for i in set(textData):
    if i in stopwords.words('english'):
        pass  # skip stopwords
    else:
        final_list.append(i.strip('\n'))
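The pitfall of removing items from a list while looping over it can be reproduced with a small example. This is only a sketch: the token list and the three-word stopword set are invented stand-ins, not NLTK's English list.

```python
# Stand-in data for illustration only.
stop_words = {"the", "a", "of"}
tokens = ["the", "a", "cat", "of", "the", "hat"]

# Buggy: removing while iterating shifts the remaining items left,
# so the loop skips the element that follows each removal.
buggy = list(tokens)
for t in buggy:
    if t in stop_words:
        buggy.remove(t)

# Safer: build a new list and leave the original untouched.
clean = [t for t in tokens if t not in stop_words]

print(buggy)  # ['a', 'cat', 'the', 'hat']
print(clean)  # ['cat', 'hat']
```

Unlike iterating over set(textData), the list comprehension keeps the original word order and keeps repeated non-stopwords.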

Edit 2

So apparently the problem comes from this part, which needs to be fixed:

for pages in fileReader.pages:
    theText = pages.extractText()
    words = theText.splitlines()
    # add the split pieces, not the whole page string
    textData.extend(words)

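However, for the file I tested, there were still spacing problems, with the words of a sentence merged together. It gave me "words" such as 'sameuserwithinacertaintimeinterval(typicallysettoa' and 'bedirectionaltocapturethefactthatonestorywasclicked'.

That said, the problem lies with the PyPDF2 class. You may want to turn to a different reader. If that still doesn't help, please leave a comment.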
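For reference, here is a minimal sketch of word-level filtering. It assumes each element of textData is a whole page string, in which case a test like `i in stopwords.words('english')` never matches and nothing is removed. The function name remove_stopwords, the sample pages, and the small stopword set are hypothetical stand-ins; in the question's script the real set would be stopping_words. It does not fix the merged-word output from PyPDF2 described above.

```python
import string


def remove_stopwords(page_texts, stop_words):
    """Split each page string into words and drop the stopwords."""
    kept = []
    for page in page_texts:
        for raw in page.split():  # split on any whitespace, including newlines
            word = raw.strip(string.punctuation)  # drop surrounding punctuation
            if word and word.lower() not in stop_words:
                kept.append(word)
    return kept


# Stand-in data; the real input would be the extracted page strings.
sample_pages = ["The cat sat on the mat.\n", "It is a hat\nof the cat."]
sample_stop_words = {"the", "on", "it", "is", "a", "of"}

filtered_words = remove_stopwords(sample_pages, sample_stop_words)
print(" ".join(filtered_words))  # cat sat mat hat cat
```

In the question's script, the equivalent call would presumably be remove_stopwords(textData, stopping_words), with the result joined by spaces before being passed to file1.write so the words are not run together in the TXT file.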

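[Comments]:

Thanks for the input, but the txt file still looks the same as before, with no stopwords removed. I've tried many approaches, and yours looked promising, but I guess Python doesn't like me much. Could the problem be related to the part that writes the txt file? It works with a regular print(), but not in the txt. Thanks.
Give me a moment to test it. I believe pages.extractText() returns one long string rather than words. If that is the case, you will need to use split(" ") to break it into words.
Don't lose hope. We have all been down this road, feeling that the programming language isn't being kind to us. You have done well so far, kudos :)
I edited my solution. Sadly, the problem seems to lie in the pypdf2 library.
Wow, that's a pity. I don't currently know of any other library or method for reading words from a pdf. Anyway, thanks for all the help and the kind words of encouragement!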