Python3 UnicodeDecodeError with readlines() method: answers

【Question title】: Python3 UnicodeDecodeError with readlines() method
【Posted】: 2023-03-03 06:54:21
【Question description】:

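I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python3 and tweepy via a virtualenv on my shared server space. This is the part of the code that seems to have trouble: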

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The error specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

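【Question comments】:

See this post; it has two very useful answers that you should try.
I used encoding='iso-8859-1', and it solved my problem.
@hsinghal: ISO-8859-1 (a.k.a. latin-1) will always work, but it's usually wrong. The problem is that it can decode any byte from any encoding, but if the original text isn't really latin-1, it decodes to garbage. You need to know the real encoding, not just guess; UTF-8 is mostly self-checking, so it's unlikely to decode binary gibberish, but latin-1 will happily decode binary gibberish to text gibberish and never so much as whisper a complaint.
@ShadowRanger Thanks for the explanation. It added to my current knowledge.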
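
To make the latin-1 point above concrete, here is a minimal sketch. The sample string and the "binary junk" bytes are made up for illustration; the idea is only that latin-1 accepts any byte sequence while UTF-8 tends to reject data that isn't really UTF-8.

```python
# Hypothetical sample: the text "café" stored on disk as UTF-8 bytes.
raw = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

# latin-1 maps every byte 0x00-0xFF to a code point, so decoding never fails,
# but the two UTF-8 bytes of "é" come out as two unrelated characters.
as_latin1 = raw.decode("latin-1")
print(repr(as_latin1))

# UTF-8 validates byte sequences, so arbitrary binary data is usually rejected.
binary_junk = b"\xfe\xff\x00\x80"
try:
    binary_junk.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

# latin-1 accepts the same junk without complaint.
junk_as_latin1 = binary_junk.decode("latin-1")
print(utf8_rejected, len(junk_as_latin1))
```

So "it stopped raising" is not evidence that latin-1 was the right choice; it only means the bytes were turned into some characters.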

Tags: python python-3.x unicode tweepy sys

【Solution 1】:

I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

Note that errors= can be replace or ignore. Here's what ignore looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
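
One caveat worth sketching: both replace and ignore throw away the offending byte, so the original data can't be recovered afterwards. The snippet below is an aside rather than part of the answer. It decodes the same sample bytes directly with bytes.decode and also tries backslashreplace, another standard error handler (as far as I know it works for decoding only on Python 3.5 and later, so it would not run on the 3.4 interpreter in the traceback above).

```python
# Same sample bytes as in the demonstration above.
s = b'\xe5abc\nline2\nline3'

replaced = s.decode('utf-8', errors='replace')          # bad byte -> U+FFFD
ignored = s.decode('utf-8', errors='ignore')            # bad byte dropped
escaped = s.decode('utf-8', errors='backslashreplace')  # bad byte kept as text "\xe5"

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

The same handler names can be passed to open(..., errors=...). Which one is appropriate depends on whether silently losing characters is acceptable for the tweets being posted.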

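【Discussion】:

【Solution 2】:

Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When you hit a non-ASCII byte in the input, it raises the exception. It's not so much that readlines itself is responsible for the problem; rather, it causes the read+decode to occur, and the decode fails.

It's an easy fix though; the default open in Python 3 lets you provide the known encoding of the input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as str (rather than the significantly different raw binary data bytes objects), while letting Python do the work of converting from raw disk bytes to true text data: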

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
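
To check what "default encoding" means on a given server, something like the following may help. It is a sketch under assumptions: read_lines and the temporary tweets.txt file are hypothetical names for illustration, and locale.getpreferredencoding(False) is used as an approximation of what open() falls back to when no encoding is given (newer Python versions and UTF-8 mode can behave differently).

```python
import locale
import os
import tempfile

# Roughly the encoding open() uses when none is passed. On a bare "C"/POSIX
# locale this may report an ASCII variant such as 'ANSI_X3.4-1968'.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)


def read_lines(path, encoding="utf-8"):
    """Read all lines from path, decoding with an explicit encoding."""
    with open(path, encoding=encoding) as inf:
        return inf.readlines()


# Demo with a throwaway file containing one non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tweets.txt")
    with open(demo_path, "wb") as out:
        out.write("na\u00efve\nsecond line\n".encode("utf-8"))
    lines = read_lines(demo_path)

print(lines)
```

Passing the encoding explicitly makes the script behave the same regardless of how the hosting environment's locale is configured.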

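【Discussion】:

I like the simplicity of this solution, but I just tried it in Python 3.6.8 and it failed.
@M.H.: It will work on UTF-8 data. If it's not UTF-8, you need to figure out what it is. This works on 3.6.8 just as well as on any other 3.x version (and on Python 2.6+, if you use from io import open to replace the Py2 open with the Py3 version). If you don't know the encoding, you're left guessing.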
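
When the encoding really is unknown, one cheap first check before guessing is to look for a byte order mark at the start of the file. The 0xfe at position 0 in the question's traceback is at least consistent with a UTF-16 BOM (FE FF), though that is speculation from a single byte. sniff_bom below is a hypothetical helper, not a standard library function, and a file without a BOM tells you nothing.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# Made-up sample: "hello" as big-endian UTF-16 with a BOM in front.
sample = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")
guess = sniff_bom(sample)
text = sample.decode(guess)   # the "utf-16" codec consumes the BOM itself
print(guess, repr(text))
```

In practice one would read the first four bytes with open(argfile, 'rb') and, if a codec name comes back, pass it as encoding= when reopening the file in text mode.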

【Solution 3】:

I eventually found an answer that works for me:

filename=open(argfile, 'rb')

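This post helped me a lot.

【Discussion】:

If you're actually using Python 3, this changes behavior dramatically; opening in binary mode means you not only don't get newline translation (admittedly only an issue on Windows), but you also get back bytes objects instead of str (which you must manually decode if you want to use them as str). I posted an answer that avoids this (assuming you know the encoding, which you'd need to know anyway to perform the decode).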
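
A small sketch of the difference described in this comment, using a made-up temporary file with Windows-style line endings so the newline behavior is visible on any platform:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as out:
        out.write("caf\u00e9\r\nline2\r\n".encode("utf-8"))

    # Binary mode: bytes objects, line endings left untouched.
    with open(path, "rb") as fb:
        raw_lines = fb.readlines()

    # Text mode with an explicit encoding: str objects, "\r\n" becomes "\n".
    with open(path, encoding="utf-8") as ft:
        text_lines = ft.readlines()

# To use the binary lines as text, each one has to be decoded by hand,
# which still requires knowing the encoding.
decoded = [line.decode("utf-8") for line in raw_lines]

print(raw_lines)
print(text_lines)
print(decoded)
```

Binary mode avoids the exception only because no decoding happens at read time; the same UnicodeDecodeError can still show up later, at the point where the bytes are finally decoded with the wrong codec.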