List index out of bounds when reading a CSV file: answers

【Question title】: List index out of bounds when reading CSV File
【Posted】: 2017-05-01 16:36:52
【Question】:

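I am trying to do some simple processing of Twitter data: I want to count the most frequent words in the dataset.

However, I keep getting the following error on line 45: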

IndexError Traceback (most recent call last) <ipython-input 346-f03e745247f4> in <module>()
 43 for line in f:
 44 parts = re.split("^\d+\s", line)
 45 tweet = re.split("\s(Status)", parts[-1])[10]
 46 tweet = tweet.replace("\\n"," ")
 47 terms_all = [term for term in process_tweet(tweet)]
 IndexError: list index out of range

I have included my full code for review. Can anyone advise?

import codecs
import re
from collections import Counter
from nltk.corpus import stopwords

word_counter = Counter()

def punctuation_symbols():
    return [".", "", "$","%","&",";",":","-","&amp;","?"]

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    # Strip one trailing and one leading quote character (assumed: ' or ")
    if word.endswith(("'", '"')):
        word = word[0:-1]
    if word.startswith(("'", '"')):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f: 
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)

        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))

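【Comments】:

The traceback you shared references different code from what you pasted below it. Specifically, tweet = re.split("\s(Status)", parts[-1])[10] versus tweet = re.split("\s(Status)", parts[1])[0]. Can you clarify?
@etemple1: Apologies, it should also be 1, 0. I was trying different combinations, and the traceback was generated by an earlier iteration. Any idea why [1], [0] doesn't work? Also, to clarify: is it correct that n = 0 sets the index and [1] defines where the line starts?
BTW, [term for term in process_tweet(tweet)] is equivalent to list(process_tweet(tweet)), which in your case is equivalent to process_tweet(tweet).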

Tags: python csv indexing jupyter-notebook

【Solution 1】:

parts = re.split("^\d+\s", line)
tweet = re.split("\s(Status)", parts[1])[0]

These are probably the problematic lines.

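You are assuming that parts was actually split and has more than one element. The split may not find the pattern in line, in which case parts is just [line], and parts[1] then raises IndexError.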
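
To make that concrete, here is a small demonstration of how re.split behaves with and without a match. The two sample strings are made up for illustration; they are not taken from the asker's file.

```python
import re

# A line that starts with a numeric id followed by whitespace: the pattern
# matches at position 0, so the result has two elements.
matched = re.split(r"^\d+\s", "12 some tweet text Status ...")
print(matched)

# A line without the numeric prefix (hypothetical, e.g. the continuation of a
# multi-line tweet): nothing to split on, so the result has one element.
unmatched = re.split(r"^\d+\s", "continuation of a multi-line tweet")
print(unmatched)
print(len(unmatched))  # indexing unmatched[1] would raise IndexError
```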

Add a check before the second line, and print the value of line to get a better idea of what is going on.
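
A minimal sketch of such a check, assuming that lines without the numeric prefix can simply be skipped. The helper name extract_tweet and the sample_lines list are hypothetical and stand in for the loop over the file; the two regular expressions are the ones from the question, written as raw strings.

```python
import re

def extract_tweet(line):
    """Return the tweet text from one line, or None if the numeric prefix is missing."""
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        # The pattern did not match, so parts == [line] and parts[1] would fail.
        return None
    tweet = re.split(r"\s(Status)", parts[1])[0]
    return tweet.replace("\\n", " ")

# Hypothetical sample input; in the question these lines come from the CSV file.
sample_lines = [
    "1 hello world Status(...)",
    "no leading id here",
]

for line_number, line in enumerate(sample_lines, start=1):
    tweet = extract_tweet(line)
    if tweet is None:
        # Print the offending line to see what the data actually looks like.
        print(f"skipping line {line_number}: {line!r}")
        continue
    print(tweet)
```

Whether skipping is the right reaction depends on the data; if the unmatched lines turn out to be continuations of multi-line tweets, they may need to be joined to the previous record instead.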

【Discussion】: