How does the UnicodeDecodeError exception work? Answers

【Question title】: How does UnicodeDecodeError exception work?
【Posted】: 2020-06-28 10:49:51
【Question】:

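If I take a file that contains invalid UTF-8 characters (I saved this page as file.txt: https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html) and try to check whether each line in the file contains valid UTF-8 (if it does not, the line should be ignored), I get this error message: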

  File "test.py", line 13, in <module>
    for line in file:
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 240: character maps to <undefined>

Code:

file = open("file.txt", "r")

for line in file:
    line = line.strip()
    try:
        print(line)
    except UnicodeDecodeError:
        print("UnicodeDecodeError " + line)
        pass

Shouldn't the except line with pass ensure that, if a UnicodeDecodeError occurs, it is ignored and the script continues with the next line?

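【Comments】:

It would, if the exception actually came from print(line), but the stack trace shows quite clearly that it does not.
Also, use code formatting for error messages. Don't use quote formatting.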

Tags: python python-3.x utf-8

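【Solution 1】:

The exception is raised while a line is being read, and that happens inside the for iterator. You need to wrap the read itself in an exception-handling block, which requires more manual handling than the implicit iterator of for gives you: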

with open('file.txt', 'r', encoding='utf-8') as fh:
    while True:
        try:
            line = fh.readline()
        except UnicodeDecodeError:
            print('error')  # line is obviously not available to output here, since it failed to decode
            continue

        if not line:
            break  # end the loop when the file is at its end

        print(line)

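Note that I'm not sure how this actually behaves with a corrupted file, or whether reading can even continue after a bad byte has been encountered. If this doesn't work, you will need to go more manual: open the file in 'rb' mode to get raw bytes, then try .decode('utf-8') on them yourself. That approach also lets you output the undecodable lines as raw bytes, which is impossible when the file is read as text.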
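The 'rb' fallback described above could look roughly like the sketch below. This is only an illustration under some assumptions: the helper name read_valid_lines and its return shape are made up for this example, and it assumes lines are separated by b"\n", which is how iteration over a binary file splits them.

```python
def read_valid_lines(path, encoding="utf-8"):
    """Return (valid, invalid): decoded lines, and raw bytes of lines that failed.

    Hypothetical helper for illustration; not part of the original answer.
    """
    valid, invalid = [], []
    # In binary mode no decoding happens during iteration,
    # so the for loop itself cannot raise UnicodeDecodeError.
    with open(path, "rb") as fh:
        for raw_line in fh:
            try:
                # Decoding is now an explicit step inside the try block.
                valid.append(raw_line.decode(encoding).strip())
            except UnicodeDecodeError:
                # The raw bytes are still available for logging or inspection.
                invalid.append(raw_line)
    return valid, invalid
```

Calling read_valid_lines("file.txt") would then give the decodable lines as str and the rejected ones as bytes, so a bad line can be printed with repr() instead of being lost.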

【Comments】: