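【Posted】: 2018-07-20 12:53:37
【Question】:
I have a .txt file containing four strings, each separated by a newline.
When I tokenize the file, every line is processed, which is exactly what I want.
However, when I try to remove stop words from the file, they are only removed from the last string.
I want to process everything in the file, not just the last sentence.
My code: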
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenize and print every line of the file
with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))

words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
Output:
['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED: ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']
As shown above, it only processes the last line.
Edit: the contents of my txt file:
this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.
smile smiling smiled
there are multiple words here that you should be able to use for lemmas/synonyms.
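One likely explanation for the behavior described above: the stop-word filter appears to run after the loop has finished, when tkn still holds only the last line read from the file. Below is a minimal sketch of moving the filter inside the loop so that each line is handled in turn. The names remove_stop_words and process_file are hypothetical, and two details are assumptions not present in the original code: tokens are lowercased before the stop-word check (NLTK's English list is lowercase), and blank lines are skipped, which is presumably where the empty [] entries in the output come from.

```python
def remove_stop_words(lines, tokenize, stop_words):
    """Return one filtered token list per non-blank line.

    `tokenize` and `stop_words` are passed in so the per-line logic
    does not depend on how the tokens are produced.
    """
    results = []
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            # Assumption: skip lines that yield no tokens (e.g. blank lines).
            continue
        # The filter runs inside the loop, so every line is processed,
        # not only the last one. Lowercasing is an assumption (see above).
        results.append([tok for tok in tokens if tok.lower() not in stop_words])
    return results


def process_file(path="example.txt"):
    """Hypothetical driver that applies the helper to a file using NLTK."""
    # Imported here so remove_stop_words stays usable without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Build the stop-word set once, before iterating over the file.
    stop_words = set(stopwords.words("english"))
    with open(path) as fin:
        for kept in remove_stop_words(fin, word_tokenize, stop_words):
            print("STOP WORDS REMOVED:", kept)
```

Calling process_file('example.txt') should then print one "STOP WORDS REMOVED" line per non-blank line of the file, assuming the NLTK stop-word corpus and tokenizer data have already been downloaded.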
【Comments】:
Tags: python nltk tokenize stop-words