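[Posted]: 2015-04-11 08:26:53
[Question]:
I have to remove stop words from a text file containing 50K tweets. When I run this code, it removes the stop words successfully, but it also removes the spaces between words. I want to keep the whitespace in the text.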
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import codecs
import nltk
stopset = set(stopwords.words('english'))
writeFile = codecs.open("outputfile", "w", encoding='utf-8')
with codecs.open("inputfile", "r", encoding='utf-8') as f:
    line = f.read()
    tokens = nltk.word_tokenize(line)
    tokens = [w for w in tokens if not w in stopset]
    for token in tokens:
        writeFile.write(token)
[Discussion]:
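The spaces most likely disappear because the tokenizer returns only the words and `writeFile.write(token)` then writes them back to back with no separator. A minimal sketch of one way around this, assuming single spaces between the remaining words are acceptable: rejoin the kept tokens with `" ".join(...)`. The `remove_stopwords` helper and the tiny `STOPSET` are hypothetical names used only for illustration, and `str.split()` stands in for `nltk.word_tokenize` so the snippet runs without the NLTK corpora.

```python
# Hypothetical tiny stop list for illustration only; the question's code
# builds its set with set(stopwords.words('english')) from NLTK instead.
STOPSET = {"the", "is", "a", "to", "of"}


def remove_stopwords(line, stopset=STOPSET):
    """Drop stop words from one line and rejoin what is left with single spaces."""
    # str.split() is a stand-in here; nltk.word_tokenize(line) could be used instead.
    tokens = line.split()
    kept = [w for w in tokens if w not in stopset]
    # Joining with " " puts back the separator that write(token) was missing.
    return " ".join(kept)


print(remove_stopwords("the cat is on a mat"))  # prints "cat on mat"
```

In the question's loop this would probably mean iterating over the file line by line and writing `remove_stopwords(line) + "\n"` for each line, so that each tweet stays on its own line. Note that the original spacing is normalized to single spaces rather than preserved exactly.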
Tags: python-2.7 nltk stop-words