Imdb review encoding error

【Question title】: Imdb review encoding error
【Posted】: 2017-10-09 10:04:32
【Question】:

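I am trying to build an RNN model that classifies reviews as having positive or negative sentiment.

There is a vocabulary dictionary, and during preprocessing I convert each review into a sequence of indices.
For example,

"this movie is the best" --> [2,5,10,3]

When I try to get the most frequent vocabulary and look at its contents, I get this error: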

num of reviews 100
number of unique tokens : 4761
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)

The code is as follows:

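import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are defined elsewhere in the script (not shown).
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/"+item,'r',encoding='utf-8') as f:
        # Strip HTML markup from the raw review text.
        sample = BeautifulSoup(f.read(), 'html.parser').get_text()
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))
# Count how often each token occurs across all reviews.
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d"%(len(word_freq.items())))
# Keep the most frequent tokens and reserve one slot for unknown words.
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w,i) for i,w in enumerate(index_to_word))
print(vocab)  # raises UnicodeEncodeError when stdout is ASCII-only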

The question is: when working on NLP problems in Python, how can I get rid of this UnicodeEncodeError? Especially when using the open function to read some text.

【Comments】:

Tags: python nlp rnn

【Answer 1】:

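Your terminal appears to be configured for ASCII. Because the character '\xe9' is outside the ASCII range (0x00-0x7F), it cannot be printed on an ASCII terminal. It also cannot be encoded as ASCII: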

>>> s = '\xe9'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

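You can work around this by explicitly encoding the string when printing, and handling the encoding error by replacing unsupported characters with ?: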

>>> print(s.encode('ascii', errors='replace'))
b'?'
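
The failure and the workaround can be reproduced without depending on how the terminal is configured by wrapping an in-memory byte buffer in an ASCII text stream. This is only a sketch: the helper `make_ascii_stream` and the sample `vocab` list are made up for illustration, and it uses the `backslashreplace` error handler instead of `replace` so that the offending character stays recognizable.

```python
import io

# Hypothetical sample resembling the question's vocab: (token, count) pairs.
vocab = [("the", 120), ("fianc\xe9", 2)]

def make_ascii_stream(errors):
    """Return (text stream, underlying byte buffer) that encodes as ASCII."""
    raw = io.BytesIO()
    return io.TextIOWrapper(raw, encoding="ascii", errors=errors), raw

# errors="strict" mimics an ASCII-configured stdout: printing fails.
strict_out, _ = make_ascii_stream("strict")
try:
    print(vocab, file=strict_out)
    strict_out.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# errors="backslashreplace" keeps the output ASCII-only but still readable.
safe_out, raw = make_ascii_stream("backslashreplace")
print(vocab, file=safe_out)
safe_out.flush()
printed = raw.getvalue().decode("ascii")
```

With `errors="replace"` the é would be written as ? instead, exactly as in the interactive example above.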

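The character is the lowercase letter e with an acute accent (é). Its code point, U+00E9, has the same value as its single-byte ISO-8859-1 encoding.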
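
As a small illustration of how this one character relates to the encodings involved, the snippet below compares its code point with its ISO-8859-1 and UTF-8 byte forms. The variable names are chosen only for this example.

```python
import unicodedata

s = '\xe9'  # one character: é

name = unicodedata.name(s)     # official Unicode character name
code_point = ord(s)            # 233, i.e. 0xE9
latin1 = s.encode('latin-1')   # ISO-8859-1: a single byte
utf8 = s.encode('utf-8')       # UTF-8: two bytes
```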

You can check which encoding is used for standard output. In my case it is UTF-8, and printing the character works without problems:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print('\xe9')
é

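You may be able to force Python to use a different default encoding; there is some discussion of that elsewhere, but the best approach is to use a terminal that supports UTF-8.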
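
One possible way to force the output encoding from inside a script is sketched below. It assumes Python 3; `as_utf8` is a hypothetical helper name, not something from the question. It relies on `TextIOWrapper.reconfigure`, which exists from Python 3.7 onward, and falls back to re-wrapping the underlying byte stream on older versions. Setting the environment variable `PYTHONIOENCODING=utf-8` before starting Python is another option that needs no code changes.

```python
import io

def as_utf8(stream):
    """Return a text stream that writes UTF-8 to the same destination as `stream`."""
    if hasattr(stream, "reconfigure"):
        # Python 3.7+: change the encoding of the existing TextIOWrapper in place.
        stream.reconfigure(encoding="utf-8")
        return stream
    # Older Python 3: wrap the underlying byte stream in a new UTF-8 text layer.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8", line_buffering=True)

# Typical use near the top of a script (after `import sys`):
#     sys.stdout = as_utf8(sys.stdout)
```

This only changes the bytes Python writes. If the terminal itself cannot display UTF-8, the é will still show up garbled, which is why a UTF-8 capable terminal remains the better fix.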

【Comments】: