Convert file to Ascii is throwing exceptions

【Question title】: Convert file to Ascii is throwing exceptions
【Posted】: 2015-10-29 11:36:29
【Question description】:

Following on from my previous question, I wrote this code:

def ConvertFileToAscii(args, filePath):
    try:
        # Firstly, make sure that the file is writable by all, otherwise we can't update it
        os.chmod(filePath, 0o666)

        with open(filePath, "rb") as file:
            contentOfFile = file.read()

        unicodeData = contentOfFile.decode("utf-8")
        asciiData = unicodeData.encode("ascii", "ignore")

        asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')

        temporaryFile = tempfile.NamedTemporaryFile(mode='wt', delete=False)
        temporaryFileName = temporaryFile.name

        with open(temporaryFileName, 'wb')  as file:
            file.write(asciiData)

        if ((args.info) or (args.diagnostics)):
            print(filePath + ' converted to ASCII and stored in ' + temporaryFileName)


        return temporaryFileName

    #
    except KeyboardInterrupt:
        raise

    except Exception as e:
        print('!!!!!!!!!!!!!!!\nException while trying to convert ' + filePath + ' to ASCII')
        print(e)
        exc_type, exc_value, exc_traceback = sys.exc_info()
        print(traceback.format_exception(exc_type, exc_value, exc_traceback))

        if args.break_on_error:
            sys.exit('Break on error\n')

When I run it, I get exceptions like this:

['Traceback (most recent call last):
', '  File "/home/ker4hi/tools/xmlExpand/xmlExpand.py", line 99, in ConvertFileToAscii
    unicodeData = contentOfFile.decode("utf-8")
    ', "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1081: invalid start byte"]

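What am I doing wrong?

I really don't care about the data loss from converting them to ASCII.

0x9C is Ü, a U with a diacritic mark (umlaut); I can live without it.

How can I convert such files so that they contain only pure ASCII characters? Do I really need to open them as binary and check every byte?

【Question comments】:

Interesting. It seems it does not work for the byte at position 1081. You might want to check what is there (if 1080 is EOF, that could be a good reason for the UTF reader to expect a start byte. Also, if the conversion of the previous byte did not work properly, that could affect this one).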
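To follow up on the suggestion to look at the offending byte, here is a minimal sketch for inspecting the data around the offset reported by the decoder. The helper names `first_bad_offset` and `show_bytes_around` are hypothetical and not part of the original code; the sketch relies only on the `start` attribute that Python sets on `UnicodeDecodeError`.

```python
def first_bad_offset(data, encoding="utf-8"):
    """Return the offset of the first undecodable byte, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start
    return None


def show_bytes_around(data, position, context=8):
    """Return a hex dump of the bytes surrounding `position` (hypothetical helper)."""
    start = max(0, position - context)
    end = min(len(data), position + context + 1)
    return " ".join(f"{byte:02x}" for byte in data[start:end])


if __name__ == "__main__":
    sample = "L\u00f6wis".encode("latin-1")   # b'L\xf6wis', not valid UTF-8
    offset = first_bad_offset(sample)
    print(offset)                              # 1
    print(show_bytes_around(sample, offset, context=1))  # 4c f6 77
```

A lone byte such as `f6` sitting between plain ASCII letters usually points to a single-byte encoding (for example ISO-8859-1) rather than damaged UTF-8.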

Tags: python python-3.x unicode

【Solution 1】:

Use:

contentOfFile.decode('utf-8', 'ignore')

The exception comes from the decode stage, where you were not ignoring errors.
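As a rough illustration of this suggestion, the sketch below decodes with `'ignore'` and then encodes to ASCII, again with `'ignore'`, so that the result is a `bytes` object. The function name `to_ascii_lossy` is made up for this example.

```python
def to_ascii_lossy(raw):
    """Decode bytes as UTF-8, dropping undecodable bytes, then drop all non-ASCII characters."""
    text = raw.decode("utf-8", "ignore")    # invalid bytes are silently discarded
    return text.encode("ascii", "ignore")   # non-ASCII characters are silently discarded


if __name__ == "__main__":
    print(to_ascii_lossy("L\u00f6wis".encode("utf-8")))    # b'Lwis'
    print(to_ascii_lossy("L\u00f6wis".encode("latin-1")))  # b'Lwis'
```

Note that the dropped character leaves no trace in the output, which is the readability concern with `'ignore'`.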

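【Comments】:

This answer is on the right track, but it is wrong in several ways ;-) The output is unicode rather than ASCII bytes. Also, the request was to ignore "ascii" decoding errors, not "utf-8" decoding errors. In general, "replace" tends to be easier to understand than "ignore". If possible, trying to find out the original encoding is a better strategy than insisting on decoding with a codec that is known to be incorrect.

【Solution 2】:

0x00f6 is ö (ouml) encoded in ISO-8859-1. My guess is that you are using the wrong Unicode decoder.

Try: unicodeData = contentOfFile.decode("ISO-8859-1")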
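A small sketch of how this could combine with the NFKD normalization already present in the question's code, assuming the input really is ISO-8859-1 (that is this answer's guess, not a verified fact). The function name `latin1_to_ascii` is hypothetical.

```python
import unicodedata


def latin1_to_ascii(raw):
    """Decode ISO-8859-1 bytes and strip accents by decomposing, then dropping non-ASCII."""
    text = raw.decode("iso-8859-1")                  # cannot fail: every byte maps to a character
    decomposed = unicodedata.normalize("NFKD", text)  # 'ö' becomes 'o' + combining diaeresis
    return decomposed.encode("ascii", "ignore")       # the combining mark is dropped


if __name__ == "__main__":
    print(latin1_to_ascii(b"L\xf6wis"))  # b'Lowis'
```

With the right codec the base letter survives, whereas decoding with the wrong codec and `'ignore'` loses the whole character.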

【Comments】:

【Solution 3】:

You don't need to load the whole file into memory and call .decode() on it. open() has an encoding parameter (use io.open() on Python 2):

with open(filename, encoding='ascii', errors='ignore') as file:
    ascii_char = file.read(1)

If you need an ASCII transliteration of Unicode text, consider unidecode.
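Building on the `open()` call above, the following sketch copies a file to a temporary file in chunks, dropping every non-ASCII byte on the way. The function name `copy_as_ascii` and the chunk size are assumptions made for illustration; `newline=""` is passed so that line endings are copied unchanged.

```python
import shutil
import tempfile


def copy_as_ascii(src_path, chunk_size=64 * 1024):
    """Stream src_path into a new temporary file, ignoring bytes that are not ASCII."""
    with open(src_path, encoding="ascii", errors="ignore", newline="") as src, \
            tempfile.NamedTemporaryFile(mode="w", encoding="ascii",
                                        newline="", delete=False) as dst:
        shutil.copyfileobj(src, dst, chunk_size)  # reads and writes chunk by chunk
        return dst.name
```

The caller is responsible for removing the temporary file, since it is created with `delete=False` as in the question's code.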

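【Comments】:

【Solution 4】:

"I really don't care about the data loss from converting them to ASCII. ... How can I convert such files so that they contain only pure ASCII characters?"

One way is to use the replace option of the decode method. The advantage of replace over ignore is that you get a placeholder for each missing value, which helps prevent the text from being misread.

Be sure to use the ASCII encoding rather than UTF-8. Otherwise, you may lose adjacent ASCII characters while the decoder tries to resynchronize.

Finally, run encode('ascii') after the decode step. Otherwise, you end up with a unicode string instead of a byte string.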

>>> string_of_unknown_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_unknown_encoding.decode('ascii', 'replace')
>>> back_to_bytes = now_in_unicode.replace('\ufffd', '?').encode('ascii')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'
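To make the difference between the error handlers concrete, here is a small comparison sketch using the same sample bytes. The helper name `decode_with` is hypothetical; `'backslashreplace'` is included as a further standard handler, on the assumption of Python 3.5 or later, where it is supported for decoding.

```python
def decode_with(raw, handler):
    """Decode bytes as ASCII using the given standard error handler."""
    return raw.decode("ascii", handler)


if __name__ == "__main__":
    sample = "L\u00f6wis".encode("latin-1")  # b'L\xf6wis'
    for handler in ("ignore", "replace", "backslashreplace"):
        # ascii() keeps the printed result readable on any terminal encoding
        print(handler, ascii(decode_with(sample, handler)))
    # ignore 'Lwis'
    # replace 'L\ufffdwis'
    # backslashreplace 'L\\xf6wis'
```

Only `'ignore'` hides the fact that a character was present in the input.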

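That said, TheRightWay™ to do this is to start caring about data loss and to use the correct encoding (clearly your input is not UTF-8, otherwise the decode would not have failed):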

>>> string_of_known_latin1_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_known_latin1_encoding.decode('latin-1')
>>> back_to_bytes = now_in_unicode.encode('ascii', 'replace')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'
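Putting the pieces together, a file-level version of the question's function might look like the sketch below. It is only one possible arrangement: the name `convert_file_to_ascii`, the `source_encoding` parameter, and its ISO-8859-1 default are assumptions (the real encoding of the files is not confirmed anywhere in this thread), and the logging and `args` handling from the original are left out.

```python
import tempfile
import unicodedata


def convert_file_to_ascii(file_path, source_encoding="iso-8859-1"):
    """Write an ASCII-only copy of file_path to a temporary file and return its name."""
    with open(file_path, "rb") as src:
        raw = src.read()

    # Decode with the (assumed) real encoding; undecodable bytes become U+FFFD.
    text = raw.decode(source_encoding, "replace")
    # Split accented letters into base letter + combining mark, as in the question.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep a visible '?' for undecodable input, then drop remaining non-ASCII.
    ascii_bytes = decomposed.replace("\ufffd", "?").encode("ascii", "ignore")

    # Open the temporary file once, in binary mode, and close it before returning.
    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as dst:
        dst.write(ascii_bytes)
        return dst.name
```

Opening the temporary file a single time in binary mode avoids the second `open()` on the same name in the question's code, which left the first handle unclosed.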

【Comments】: