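【Posted】: 2017-05-01 16:36:52
【Question】:
I am trying to do some simple processing of Twitter data: I want to count the most frequent words in the dataset.
However, I keep getting the following error on line 45: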
IndexError                                Traceback (most recent call last)
<ipython-input 346-f03e745247f4> in <module>()
     43     for line in f:
     44         parts = re.split("^\d+\s", line)
     45         tweet = re.split("\s(Status)", parts[-1])[10]
     46         tweet = tweet.replace("\\n"," ")
     47         terms_all = [term for term in process_tweet(tweet)]
IndexError: list index out of range
I have added my full code below for review. Can anyone advise?
import codecs
import re
from collections import Counter
from nltk.corpus import stopwords

word_counter = Counter()

def punctuation_symbols():
    return [".", "", "$","%","&",";",":","-","&","?"]

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    # Remove one trailing and one leading double quote, if present.
    if word.endswith("\""):
        word = word[0:-1]
    if word.startswith("\""):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))
【Discussion】:
-
The traceback you shared refers to different code from what you pasted below it. Specifically, tweet = re.split("\s(Status)", parts[-1])[10] versus tweet = re.split("\s(Status)", parts[1])[0]. Can you clarify?
-
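@etemple1: Apologies, the traceback should also read [1] and [0]. I was trying different combinations, and the traceback was generated by an earlier iteration. Any ideas why [1], [0] does not work? Also, to clarify: n = 0 sets the index, and [1] defines where the line starts. Is that correct?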
-
BTW, [term for term in process_tweet(tweet)] is equivalent to list(process_tweet(tweet)), which in your case is equivalent to process_tweet(tweet).
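On the question of why [1] and [0] can still fail: re.split returns a one-element list when its pattern does not match, so parts[1] raises IndexError for any line that does not begin with digits followed by whitespace (a header row or a continuation line would be possible causes, though the actual file contents are not shown here). A minimal sketch of a guard, assuming made-up sample lines and a hypothetical helper named extract_tweet that is not part of the original post:

```python
import re

def extract_tweet(line):
    """Return the tweet text from one line, or None if the line has no leading numeric id.

    Hypothetical helper for illustration; the line format is an assumption.
    """
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        # The pattern did not match, so re.split returned [line]
        # and parts[1] would raise IndexError.
        return None
    # Keep everything before the first " Status" marker.
    return re.split(r"\s(Status)", parts[1])[0]

# Invented sample lines, not taken from the real CSV file.
good_line = "123 hello world Status(x)"
bad_line = "header line without a leading id"

print(re.split(r"^\d+\s", good_line))  # two elements: '' and the rest of the line
print(re.split(r"^\d+\s", bad_line))   # one element: the whole line
print(extract_tweet(good_line))
print(extract_tweet(bad_line))
```

In the reading loop, lines for which the helper returns None could be skipped with continue before word_counter.update is called. The n = 0 counter only limits how many lines are processed; it has no effect on the list indices.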
Tags: python csv indexing jupyter-notebook