'utf-8' codec can't decode byte 0x80: answers

【Question title】: 'utf-8' codec can't decode byte 0x80
【Posted】: 2016-04-24 16:40:25
【Question description】:

I am trying to download a BVLC-trained model, but I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it is caused by the following function (complete code):

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any idea how to fix this?

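【Comments】:

The error message is clear. Either your file is not UTF-8 at all, or it is corrupted.
This is what I get when I try to print f: <_io.TextIOWrapper name='models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel' mode='r' encoding='utf8'>
Interesting. So what happens when you specify the file encoding explicitly? Something like open(filename, 'r', encoding='utf8')?
I tried changing the second line to with open(filename, 'r', encoding='utf8') as f: but I get the same error.
No, don't tell Python it is UTF-8 unless you are sure it should be. Python is telling you it is not valid UTF-8 but something else. Open the file in a good code editor and see what is in it.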
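
One way to follow the last suggestion without an editor is to read the raw bytes around the offset reported in the error. This is only a minimal sketch: the helper name bytes_around and the window width are assumptions made for illustration, and it assumes the position in the error message matches the offset in the file.

```python
def bytes_around(path, position, width=8):
    """Return raw bytes near `position` (hypothetical helper for inspection)."""
    start = max(0, position - width)
    # Binary mode skips decoding entirely, so any byte value can be read.
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(2 * width)
```

Calling print(bytes_around(filename, 110)) would show the bytes on either side of position 110; values that are not printable ASCII appear as escapes such as \x80.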

Tags: python utf-8 caffe

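【Solution 1】:

You are opening a file that is not UTF-8 encoded, while your system's default encoding is set to UTF-8.

Since you are computing a SHA1 hash, you should read the data as binary instead. The hashlib functions require you to pass in bytes: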

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

Note the b added to the file mode.

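See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

And from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.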
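
Model files can be large, so here is a minimal sketch of the same check that feeds update() in chunks rather than reading the whole file at once. The helper name file_sha1 and the 1 MiB chunk size are illustrative assumptions, not part of the original download script.

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha1()
    # 'rb' yields bytes, so no text decoding (and no UnicodeDecodeError) happens.
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns b'' at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Under these assumptions, the body of model_checks_out could become return file_sha1(filename) == sha1.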

【Discussion】:

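【Solution 2】:

You did not open the file in binary mode, so f.read() tries to read it as a UTF-8 encoded text file, which fails. But since we are hashing bytes, not strings, it does not matter what the encoding is, or even whether the file is text at all: just open it and read it as a binary file.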

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

But:

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
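
The session above depends on a local file, so here is a self-contained sketch of the same contrast that can be run anywhere. It creates a throwaway file whose first byte is 0x80; the file contents and variable names are made up for illustration.

```python
import hashlib
import os
import tempfile

# Write a tiny file whose first byte (0x80) can never start a UTF-8 sequence.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'\x80\x03binary payload')

try:
    # Text mode with UTF-8: read() has to decode the bytes, and that fails.
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: read() returns the bytes untouched, which is what sha1() expects.
with open(path, 'rb') as f:
    binary_digest = hashlib.sha1(f.read()).hexdigest()

os.remove(path)
print(type(text_mode_error).__name__)  # UnicodeDecodeError
print(binary_digest)
```

Even for a file that happens to decode cleanly, text mode would still be the wrong choice here: in Python 3, hashlib.sha1() raises a TypeError when given a str instead of bytes.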

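【Discussion】:

【Solution 3】:

There is no hint in the documentation or the source code, so I don't know why, but using the b character (binary, I guess) works perfectly (tf-version: 1.1.0):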

image_data = tf.gfile.FastGFile(filename, 'rb').read()

For more information, check out: gfile

【Discussion】: