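【Posted】: 2018-04-09 11:22:58
【Question】:
Hi all. I've been trying to remove stop words from a list built by reading a PDF, but whenever I use nltk to remove the stop words, whether from that list or from a new one, the TXT file I write still contains the original list. I wrote a separate program to test that the stop-word removal works, and it works fine there, but for some reason it does not work in this case.
Is there a better way to do this? Any help would be appreciated.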
import PyPDF2 as pdf
import nltk
from nltk.corpus import stopwords
stopping_words = set(stopwords.words('english'))
stop_words = list(stopping_words)
# creating an object
file = open("C:\\Users\\Name\\Documents\\Data Analytics Club\\SampleBook-English2-Reading.pdf", "rb")
# creating a pdf reader object
fileReader = pdf.PdfFileReader(file)
# collect the extracted text of every page in the pdf file
textData = []
for pages in fileReader.pages:
    theText = pages.extractText()
    # for char in theText:
    #     theText.replace(char, "\n")
    textData.append(theText)
final_list = []
for i in textData:
    if i in stopwords.words('english'):
        textData.remove(i)
    final_list.append(i.strip('\n'))
# filtered_word_list = final_list[:]  # make a copy of the word_list
# for word in final_list:  # iterate over word_list
#     if word in stopwords.words('english'):
#         final_list.remove(word)  # remove word from filtered_word_list if it is a stopword
# filtered_words = [word for word in final_list if word not in stop_words]
# [s.strip('\n') for s in theText]
# [s.replace('\n', '') for s in theText]
# text_data = []
# for elem in textData:
#     text_data.extend(elem.strip().split('n'))
# for line in textData:
#     textData.append(line.strip().split('\n'))
#--------------------------------------------------------------------
import os.path
save_path = "C:\\Users\\Name\\Documents\\Data Analytics Club"
name_of_file = input("What is the name of the file: ")
completeName = os.path.join(save_path, name_of_file + ".txt")
file1 = open(completeName, "w")
# file1.write(str(final_list))
for line in final_list:
    file1.write(line)
file1.close()
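One likely reason the output is unchanged: each element of textData is the full text of a page, so the test `i in stopwords.words('english')` compares a whole page against single words and probably never matches. Below is a minimal sketch of splitting each page into words before filtering. The tiny `stop_words` set, the sample `pages`, and the helper `remove_stop_words` are hypothetical stand-ins so the snippet runs without the PDF or the nltk corpus; in the script above they would correspond to `set(stopwords.words('english'))` and the strings returned by `extractText()`. The regex tokenizer is also just an assumption, not the only way to split text.

```python
import re

# Hypothetical stand-ins for the real data: a tiny stop-word set and two
# "pages" of text, each stored as one long string like extractText() returns.
stop_words = {"the", "is", "a", "on", "and", "of"}
pages = ["The cat is on the mat.\n", "A story of cats and dogs.\n"]

# The original test checks whether a WHOLE page equals a stop word,
# so nothing is ever matched or removed.
whole_page_hits = [page for page in pages if page in stop_words]


def remove_stop_words(text, stop_words):
    """Split text into lowercase word tokens and drop the stop words."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


# Filter word by word, one list of kept words per page.
filtered_pages = [remove_stop_words(page, stop_words) for page in pages]

print(whole_page_hits)
print(filtered_pages)
```

To write the result to the TXT file, each page's words could be joined back together, for example with `" ".join(words)`, before calling `write`.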
【Discussion】:
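A separate pitfall in the posted loop is calling `remove` on the list that is being iterated. Even once the items are single words, this can skip elements, because every removal shifts the remaining items one position to the left. The sketch below uses a made-up word list and a made-up two-word stop set purely for illustration, and compares in-place removal with building a new list.

```python
# Hypothetical sample data for illustration only.
words = ["the", "the", "cat", "is", "is", "here"]
stop_words = {"the", "is"}

# Removing from the list while looping over it: after each removal the
# next item slides into the current position and is never examined.
mutated = list(words)
for w in mutated:
    if w in stop_words:
        mutated.remove(w)

# Building a new list leaves the iteration untouched and keeps every
# non-stop word exactly once per occurrence.
filtered = [w for w in words if w not in stop_words]

print(mutated)
print(filtered)
```

Checking membership against a `set` rather than calling `stopwords.words('english')` inside the loop also avoids rebuilding the stop-word list on every iteration.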
Tags: python nltk stop-words