UnicodeDecodeError on Python 2.7

【Question title】: UnicodeDecodeError on Python 2.7
【Posted】: 2015-02-13 07:54:57
【Question description】:

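I'm running into some trouble. I'm doing Twitter sentiment analysis on a dataset of 1.6 million tweets. Since my PC couldn't do the work (too much computation), my professor told me to use the university server.

I just realised that the Python version on the server is 2.7, which doesn't let me pass the encoding parameter when opening the file for the csv reader.

Every time I get a UnicodeDecodeError I have to remove the offending tweet from the dataset by hand, otherwise I can't do the rest of the work. I've tried the solutions to similar questions on this site, but none of them fixed it.

I just want to skip the line that raises the error, since the set is large enough to give me a good analysis anyway.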

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8", errors='replace') for s in row]
    def __iter__(self):
        return self

def extraction(file, textCol, sentimentCol):
    "The function reads the tweets"
    #fp = open(file, "r",encoding="utf8")
    fp = open(file, "r")
    tweetreader = UnicodeReader(fp)
    #tweetreader = csv.reader( fp, delimiter=',', quotechar='"', escapechar='\\' )
    tweets = []
    for row in tweetreader:
        # It takes the column in which the tweets and the sentiment are
        if row[sentimentCol]=='positive' or row[sentimentCol]=='4':
            tweets.append([remove_stopwords(row[textCol]), 'positive']);
        else:
            if row[sentimentCol]=='negative' or row[sentimentCol]=='0':
                tweets.append([remove_stopwords(row[textCol]), 'negative']);
            else:
               if row[sentimentCol]=='irrilevant' or row[sentimentCol]=='2' or row[sentimentCol]=='neutral':
                   tweets.append([remove_stopwords(row[textCol]), 'neutral']);

    tweets = filterWords(tweets)
    fp.close()
    return tweets;

Error:

Traceback (most recent call last):
  File "sentimentAnalysis_v4.py", line 165, in <module>
    newTweets = extraction("sentiment2.csv",5,0)
  File "sentimentAnalysis_v4.py", line 47, in extraction
    for row in tweetreader:
  File "sentimentAnalysis_v4.py", line 29, in next
    row = self.reader.next()
  File "sentimentAnalysis_v4.py", line 19, in next
    return self.reader.next().encode("utf-8", errors='ignore')
  File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd9 in position 48: invalid continuation byte

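【Comments】:

Are you 100% certain the data is encoded as UTF-8?
Hrm, the traceback you show suggests that the object returned by codecs.getreader(encoding)(f) doesn't return unicode when you call next.
@MartijnPieters If it weren't, would it be able to read the other tweets at all? The tweets I have to remove by hand are of the kind “à€€€€” and similar strange stuff.
@MartijnPieters Unfortunately I'm not the only one using the server, so I can't ask them to update Python for me. With another dataset I have no problems. So, is there a way to skip these lines and go on to the next one? I have to deliver the project within a week and I'm quite frustrated.
I gave you a workaround; the exact Python 2.7 version doesn't matter here.

Tags: python python-2.7 unicode sentiment-analysis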

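【Solution 1】:

If your input data is not cleanly encoded, I wouldn't use codecs to do the reading here.

Use the newer io.open() function and specify an error-handling strategy; 'replace' should do: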

import io

class ForgivingUTF8Recoder:
    def __init__(self, filename, encoding):
        # errors='replace' turns undecodable bytes into U+FFFD instead of raising
        self.reader = io.open(filename, newline='', encoding=encoding, errors='replace')
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')

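I set the newline handling to '' to make sure the CSV module gets to handle newlines inside values correctly.

Instead of passing in an open file, just pass in the filename: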

tweetreader = UnicodeReader(file)

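This won't let you skip faulty lines; instead it handles them by replacing any bytes that cannot be decoded with the U+FFFD REPLACEMENT CHARACTER. If you want to skip whole lines, you can still look for that character in your columns.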
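
As a rough illustration of that last point, the sketch below filters out any row in which a cell contains U+FFFD. The names skip_damaged_rows and REPLACEMENT_CHAR are made up for this example and are not part of the csv or io modules. It assumes the rows are lists of unicode strings, as produced by the UnicodeReader in the question, and it would also drop a tweet that legitimately contained U+FFFD.

```python
# -*- coding: utf-8 -*-
REPLACEMENT_CHAR = u"\ufffd"  # U+FFFD, inserted by errors='replace'


def skip_damaged_rows(rows):
    """Yield only rows in which no cell contains U+FFFD.

    `rows` is assumed to be an iterable of lists of unicode strings
    (hypothetical helper, not part of any library).
    """
    for row in rows:
        if any(REPLACEMENT_CHAR in cell for cell in row):
            continue  # at least one byte could not be decoded; drop the row
        yield row


# Small demonstration with hand-made rows instead of a real CSV file.
# The byte 0xd9 followed by a space is not valid UTF-8, so it is replaced.
decoded = b"caf\xd9 latte".decode("utf-8", "replace")
sample_rows = [[u"4", u"good day"], [u"0", decoded]]
clean_rows = list(skip_damaged_rows(sample_rows))
```

In extraction() this could be used as `for row in skip_damaged_rows(tweetreader):`, so that damaged tweets never reach remove_stopwords.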

【Comments】: