UnicodeDecodeError: 'ascii' codec can't decode, with gensim, python3.5

【Question title】: UnicodeDecodeError: 'ascii' codec can't decode, with gensim, python3.5
【Posted】: 2016-12-26 04:56:52
【Question description】:

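I am using Python 3.5 on both Windows and Linux, but I get the same error on each: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128). The error log is as follows:

Reloaded modules: lazylinker_ext
Traceback (most recent call last):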

  File "<ipython-input-2-d60a2349532e>", line 1, in <module>
    runfile('C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py', wdir='C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622')

  File "C:\Users\YZC\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py", line 70, in <module>
    model = gensim.models.Word2Vec.load('model_all_no_lemma')

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1485, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 248, in load
    obj = unpickle(fname)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 912, in unpickle
    return _pickle.loads(f.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)

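1. I checked and found that the default encoding is utf-8:

import sys
sys.getdefaultencoding()
Out[2]: 'utf-8'

2. When reading the file, I also added .decode('utf-8').
3. I did add a shebang line at the beginning and declared utf-8, so I really don't know why Python can't read the file. Can anybody help me out?

Here is the code: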

# -*- coding: utf-8 -*-
import gensim
import csv
import numpy as np
import math
import string
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word



class SpeechParser(object):

    def __init__(self, filename):
        self.filename = filename
        self.lemmatize = WordNetLemmatizer().lemmatize
        self.cached_stopwords = stopwords.words('english')

    def __iter__(self):

        with open(self.filename, 'rb', encoding='utf-8') as csvfile:
            file_reader = csv.reader(csvfile, delimiter=',', quotechar='|', )
            headers = file_reader.next()
            for row in file_reader:
                parsed_row = self.parse_speech(row[-2])
                yield parsed_row

    def parse_speech(self, row):

        speech_words =  row.replace('\r\n', ' ').strip().lower().translate(None, string.punctuation).decode('utf-8', 'ignore')         

        return speech_words.split()

    # -- source: https://github.com/prateekpg2455/U.S-Presidential-Speeches/blob/master/speech.py --
    def pos(self, tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

if __name__ == '__main__':

    # instantiate object
    sentences = SpeechParser("sample.csv")

    # load an existing model
    model = gensim.models.Word2Vec.load('model_all_no_lemma')



    print('\n-----------------------------------------------------------')
    print('MODEL:\t{0}'.format(model))

    vocab = model.vocab

    # print log-probability of first 10 sentences
    row_count = 0
    print('\n------------- Scores for first 10 documents: -------------')
    for doc in sentences: 
        print(sum(model.score(doc))/len(doc))
        row_count += 1
        if row_count > 10:
            break
    print('\n-----------------------------------------------------------')
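
A side note on the listing above, separate from the traceback: a few lines in SpeechParser are Python 2 idioms that would raise under Python 3.5 once the model loads. open() rejects an encoding argument in binary mode, csv readers have no .next() method, and str.translate() no longer accepts None as a first argument (nor does str have .decode()). A minimal Python 3 sketch of the same parsing steps follows. The function names iter_speeches and parse_speech are illustrative rewrites, not the asker's code, and the lemmatizer and stop-word setup are left out.

```python
import csv
import string

# Translation table that deletes ASCII punctuation; this is the Python 3
# form of the Python 2 call str.translate(None, string.punctuation).
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def parse_speech(text):
    """Lower-case one speech, strip punctuation, and split it into tokens."""
    cleaned = text.replace('\r\n', ' ').strip().lower()
    return cleaned.translate(_STRIP_PUNCT).split()


def iter_speeches(filename):
    """Yield token lists from the second-to-last CSV column, skipping the header."""
    # Text mode with an explicit encoding; newline='' follows the csv module docs.
    with open(filename, 'r', encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        next(reader)  # Python 3 replacement for reader.next()
        for row in reader:
            yield parse_speech(row[-2])
```

Because the file is opened in text mode, each row already holds str values, so no .decode() call is needed afterwards.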

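【Question comments】:

That's a lot of code, and you haven't told us which line the error is on. Posting the stack trace would make it easier to spot. Can you cut this down? If the problem is at with open(self.filename, 'rb', encoding='utf-8') as csvfile:, you could reduce it to a simple open('whateverthefilenamewas', 'r', encoding="utf-8").read(). If that fails, your file is not utf-8 encoded. Just because the default file system encoding is utf-8 doesn't mean this file is.
@tdelaney My bad, I just added the log.
@tdelaney Thanks, but I checked the file and it is encoded as utf-8. I tried the following, and it raised the same error.

Tags: encoding utf-8 python-3.5 gensim word2vec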

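【Solution 1】:

This looks like a bug in Gensim that shows up when you try to use a Python 2 pickle file containing non-ASCII characters with Python 3.

The unpickle happens when you call:

model = gensim.models.Word2Vec.load('model_all_no_lemma')

In Python 3, during the unpickle, it wants to convert legacy byte strings to (Unicode) strings. The default action is to decode using 'ASCII' in strict mode.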
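
The behavior can be reproduced without gensim. The sketch below uses a hand-built protocol-2 pickle of the kind Python 2 writes for a byte string (the payload and the helper name try_load are made up for illustration) and loads it with the standard pickle module under several encoding settings.

```python
import pickle

# Hand-built protocol-2 pickle of the Python 2 byte string 'caf\xc3\xa9'
# (the UTF-8 bytes of "café"): PROTO 2, SHORT_BINSTRING of length 5,
# BINPUT 0, STOP. This stands in for a file written by Python 2.
PY2_PICKLE = b'\x80\x02U\x05caf\xc3\xa9q\x00.'


def try_load(data, **kwargs):
    """Return the unpickled object, or the UnicodeDecodeError that was raised."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return exc


# Default: legacy byte strings are decoded as ASCII in strict mode, which fails.
default_result = try_load(PY2_PICKLE)
# Decode legacy byte strings as UTF-8 instead.
utf8_result = try_load(PY2_PICKLE, encoding='utf-8')
# Leave legacy byte strings as bytes objects.
bytes_result = try_load(PY2_PICKLE, encoding='bytes')

for label, result in [('default', default_result),
                      ('utf-8', utf8_result),
                      ('bytes', bytes_result)]:
    print(label, ascii(result))
```

The default call fails with the same kind of 'ascii' codec error as in the traceback, while the two explicit settings return a str and a bytes object respectively.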

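The fix will depend on the encoding used in your original pickle file, and it will require you to patch the gensim code.

I'm not familiar with gensim, so you will have to try the following two options:

Force UTF-8

Chances are your non-ASCII data is in UTF-8 format.

Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
Go to line 912

Change the line to read: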

return _pickle.loads(f.read(), encoding='utf-8')

Byte mode

Gensim in Python 3 may happily work with byte strings:

Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
Go to line 912

Change the line to read:

return _pickle.loads(f.read(), encoding='bytes')
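
Before patching site-packages, it may help to check which encoding the pickle accepts. The helper below is a hypothetical diagnostic, not part of gensim: it tries each candidate with the standard pickle module and records the outcome. One detail worth knowing is that byte 0xc1, the one named in the traceback, can never appear in valid UTF-8, so the UTF-8 option may simply fail with a different UnicodeDecodeError. The 'latin1' candidate is an addition not mentioned above; the Python pickle documentation lists it as the setting needed for NumPy arrays pickled by Python 2.

```python
import pickle


def probe_pickle_encodings(path, encodings=('utf-8', 'latin1', 'bytes')):
    """Map each candidate encoding to True (loads) or the error text it raised.

    Only run this on pickle files you trust: unpickling executes code.
    """
    with open(path, 'rb') as f:
        data = f.read()
    results = {}
    for enc in encodings:
        try:
            pickle.loads(data, encoding=enc)
            results[enc] = True
        except Exception as exc:
            # Keep the message so the failing byte and position stay visible.
            results[enc] = '{}: {}'.format(type(exc).__name__, exc)
    return results
```

Calling probe_pickle_encodings('model_all_no_lemma') needs gensim to be importable, since the pickle refers to its classes. It only tests the main pickle file, and a result of True means the load did not raise, not that gensim will work correctly with the resulting object.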

【Discussion】: