【Posted】: 2015-10-29 11:36:29
【Question】:
Following on from my previous question, I wrote this code:
# Imports this excerpt relies on
import os
import sys
import tempfile
import traceback
import unicodedata

def ConvertFileToAscii(args, filePath):
    try:
        # Firstly, make sure that the file is writable by all, otherwise we can't update it
        os.chmod(filePath, 0o666)
        with open(filePath, "rb") as file:
            contentOfFile = file.read()
        unicodeData = contentOfFile.decode("utf-8")
        asciiData = unicodeData.encode("ascii", "ignore")
        asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')
        temporaryFile = tempfile.NamedTemporaryFile(mode='wt', delete=False)
        temporaryFileName = temporaryFile.name
        with open(temporaryFileName, 'wb') as file:
            file.write(asciiData)
        if ((args.info) or (args.diagnostics)):
            print(filePath + ' converted to ASCII and stored in ' + temporaryFileName)
        return temporaryFileName
    #
    except KeyboardInterrupt:
        raise
    except Exception as e:
        print('!!!!!!!!!!!!!!!\nException while trying to convert ' + filePath + ' to ASCII')
        print(e)
        exc_type, exc_value, exc_traceback = sys.exc_info()
        print(traceback.format_exception(exc_type, exc_value, exc_traceback))
        if args.break_on_error:
            sys.exit('Break on error\n')
When I run it, I get an exception like this:
['Traceback (most recent call last):
', ' File "/home/ker4hi/tools/xmlExpand/xmlExpand.py", line 99, in ConvertFileToAscii
unicodeData = contentOfFile.decode("utf-8")
', "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 1081: invalid start byte"]
What am I doing wrong?
I really don't care about data loss when converting these files to ASCII.
0x9C is Ü, a U with a diacritical mark (umlaut); I can live without it.
How can I convert such files so that they contain only plain ASCII characters? Do I really need to open them as binary and check every byte?
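For illustration only, and not part of the original post: the byte 0xf6 can never start a UTF-8 sequence, which suggests the file is not UTF-8 at all. A minimal sketch of one possible approach tries UTF-8 first and falls back to a single-byte encoding before stripping non-ASCII characters. The helper name `to_ascii_bytes` and the choice of `latin-1` as the fallback are assumptions; the real encoding of the files is not known from the post.

```python
import unicodedata


def to_ascii_bytes(raw: bytes, fallback_encoding: str = "latin-1") -> bytes:
    """Decode raw bytes, then drop everything that is not plain ASCII.

    Hypothetical helper: the fallback encoding is an assumption and should
    be replaced by whatever encoding the files actually use.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a character, so this cannot fail,
        # but it may mis-interpret bytes if the file uses another encoding.
        text = raw.decode(fallback_encoding)
    # NFKD splits an accented letter into base letter + combining mark;
    # encode(..., "ignore") then discards the non-ASCII combining mark.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


print(to_ascii_bytes(b"sch\xf6n"))  # b'schon'
```

If the data loss really does not matter, `raw.decode("utf-8", errors="ignore")` is a shorter alternative, but it silently drops the undecodable bytes instead of mapping them to their base letters.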
【Comments】:
-
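Interesting. It seems to fail on the byte at position 1081, so you might want to check what is there (if 1080 is EOF, that could be a good reason for the UTF reader to expect a start byte; also, if the conversion of the previous byte did not work properly, it could affect this one).
Tags: python python-3.x unicode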
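The check suggested in this comment could be sketched roughly as follows. The function name `describe_decode_failure` and the `context` parameter are made up for this example; `start` is a standard attribute of `UnicodeDecodeError`.

```python
def describe_decode_failure(data: bytes, encoding: str = "utf-8", context: int = 8):
    """Return (offset, surrounding bytes) for the first undecodable byte.

    Returns None when the data decodes cleanly. Hypothetical helper for
    inspecting what sits around the failing position.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the codec rejected.
        begin = max(0, err.start - context)
        window = data[begin:err.start + context + 1]
        return err.start, window
    return None


# Usage with made-up sample data (not the asker's file):
print(describe_decode_failure(b"abc\xf6def", context=2))  # (3, b'bc\xf6de')
```

Seeing ordinary ASCII text on both sides of a lone high byte usually points to a single-byte encoding rather than a truncated UTF-8 sequence.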