【Posted】: 2019-01-21 05:07:56
【Question】:
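Is there a better (faster) way to remove stop words from a csv file?
Here is the simple code. After more than an hour I am still waiting for the result, so I don't even know whether it actually works: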
import nltk
from nltk.corpus import stopwords
import csv
import codecs
f = codecs.open("agenericcsvfile.csv","r","utf-8")
readit = f.read()
f.close()
filtered = [w for w in readit if not w in stopwords.words('english')]
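The csv file has 50,000 rows and about 15 million words in total. Why does it take so long? Sadly, this is only a sub-corpus. I will have to do this with more than 1 million rows and more than 300 million words. So is there a way to speed this up? Or more elegant code?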
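Two things in the snippet above probably explain most of the waiting time. Iterating over the string returned by f.read() yields single characters rather than words, and stopwords.words('english') is called again for every one of them, each time returning a list that then has to be scanned linearly. A minimal sketch of the usual fix follows: split the text into words and test membership against a set built once. STOP_WORDS here is a small hypothetical stand-in so the example runs without the NLTK corpus; in the real script it would presumably be set(stopwords.words('english')).

```python
# Hypothetical stand-in for set(stopwords.words('english')).
# The point is that the set is built once, not once per word.
STOP_WORDS = frozenset({"the", "is", "a", "of", "and", "to", "in", "this", "that"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with stop words dropped (case-insensitive comparison)."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)


sample = "The movie is a mess"

chars = [c for c in sample]   # iterating a str yields characters, not words
words = sample.split()        # whitespace split yields words
cleaned = remove_stop_words(sample)
```

Set membership is a hash lookup, so the cost per word no longer grows with the length of the stop word list. Splitting on whitespace leaves punctuation attached to words; a tokenizer would be needed if that matters.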
Example of the CSV file:
1 text,sentiment
2 Loosely based on The Decameron, Jeff Baena's subversive film takes us behind the walls of a 13th century convent and squarely in the midst of a trio of lustful sisters, Alessandra (Alison Brie), Fernanda (Aubrey Plaza), and Ginerva (Kate Micucci) who are "beguiled" by a new handyman, Massetto (Dave Franco). He is posing as a deaf [...] and it is coming undone from all of these farcical complications.,3
3 One might recommend this film to the most liberally-minded of individuals, but even that is questionable as [...] But if you are one of the ribald loving few, who likes their raunchy hi-jinks with a satirical sting, this is your kinda movie. For me, the satire was lost.,5
4 [...]
[...]
50.000 The movie is [...] tht is what I ahve to say.,9
The desired output would be the same csv file without the stop words.
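One way to produce "the same csv file without the stop words" is to stream rows through the csv module instead of reading the whole file into memory. The sketch below assumes the text column is properly quoted wherever it contains commas, and that the header names the column text as in the sample. The function name filter_csv and the small STOP_WORDS set are hypothetical; the set stands in for set(stopwords.words('english')).

```python
import csv
import io

# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = frozenset({"the", "is", "a", "of", "and", "to", "in", "this", "for", "was"})


def filter_csv(src, dst, stop_words=STOP_WORDS, text_column="text"):
    """Copy CSV rows from src to dst, dropping stop words from one column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:  # one row at a time, so memory use stays flat
        words = row[text_column].split()
        row[text_column] = " ".join(w for w in words if w.lower() not in stop_words)
        writer.writerow(row)


# In-memory demonstration; with real files, open both with newline="" and encoding="utf-8".
src = io.StringIO('text,sentiment\n"For me, the satire was lost.",5\n')
dst = io.StringIO()
filter_csv(src, dst)
result = dst.getvalue()
```

Because each row is written as soon as it is cleaned, the same function should scale to the larger corpus without holding it all in memory.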
【Comments】:
-
Which Python version is this? Can you add the appropriate tag?
-
Also, please add a meaningful sample of the input CSV and the desired output for that sample.
-
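By the way, shouldn't the strings in the CSV be quoted? Otherwise, how do you tell a , inside the text apart from the , that separates text and sentiment? Also, readit seems to be just a single string containing all the characters in the file, not a list of words. (You import the csv module but never use it.)
-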
@tobias_k I tried that, but wouldn't it then be one string without line breaks? Is there a way to do this properly?
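If the file really is unquoted, as the sample suggests, one possible workaround is to process it line by line and split each line on its last comma only, which keeps the line structure and treats everything before that comma as text. This is a sketch under the assumption that the sentiment value is always the final field and never contains a comma itself; filter_line and STOP_WORDS are hypothetical names, with the set standing in for set(stopwords.words('english')).

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = frozenset({"the", "is", "a", "of", "and", "to", "in", "this", "for", "was"})


def filter_line(line, stop_words=STOP_WORDS):
    """Clean one 'text,sentiment' line, splitting on the last comma only."""
    text, sep, sentiment = line.rstrip("\n").rpartition(",")
    kept = " ".join(w for w in text.split() if w.lower() not in stop_words)
    return kept + sep + sentiment + "\n"


# Stand-in for iterating over an open file, which yields lines with their newlines.
lines = ["text,sentiment\n", "For me, the satire was lost.,5\n"]
header, *rows = lines
output = [header] + [filter_line(r) for r in rows]
```

With a real file, the same loop can run directly over the file object and write each cleaned line to an output file, so the result keeps one review per line.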
Tags: python python-3.x csv stop-words