Converting file from cp1251 to utf8

【Question title】: Converting file from cp1251 to utf8
【Posted】: 2023-03-15 20:16:01
【Question description】:

I have seen similar questions, but their answers did not help. This code:

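import codecs

# Read the whole file, decoding it from the assumed source encoding
with codecs.open(sourceFileName, "r", sourceEncoding) as sourceFile:
    contents = sourceFile.read()

# Overwrite the same file with the text re-encoded as UTF-8
with codecs.open(sourceFileName, "w", "utf-8") as targetFile:
    if contents:
        targetFile.write(contents)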

raises the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 1: character maps to <undefined>"

This code:

# Read raw bytes and decode them explicitly
with open(sourceFileName, "rb") as sourceFileBin:
    contents = sourceFileBin.read().decode(sourceEncoding)

# Write the text back to the same file as UTF-8 bytes
with open(sourceFileName, "wb") as targetFile:
    targetFile.write(contents.encode("utf-8"))

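produces the same error. The troublesome symbol is the Cyrillic letter "И" (which, as far as I know, is represented by 0xc8 in cp1251, not 0x98). I am using Python 2.7 on Windows.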
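
For reference, a minimal sketch of how this letter is encoded in each of the two encodings, which hints at why a byte 0x98 at position 1 may mean the data is already UTF-8. The variable names are illustrative, and the letter is written as an escape so the snippet should run on both Python 2.7 and Python 3.

```python
# U+0418 is CYRILLIC CAPITAL LETTER I ("И")
letter = u"\u0418"

# bytearray yields integers on both Python 2 and Python 3
cp1251_bytes = bytearray(letter.encode("cp1251"))
utf8_bytes = bytearray(letter.encode("utf-8"))

print([hex(b) for b in cp1251_bytes])  # ['0xc8']
print([hex(b) for b in utf8_bytes])    # ['0xd0', '0x98']
```

In UTF-8 the letter takes two bytes and the second one is 0x98, which has no mapping in Python's cp1251 codec.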

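UPD: It turns out the original file encoding may not be cp1251, and these errors may be the result of a bug in a text editor. However, all of my text editors read this file correctly. So I am looking for some workaround, since files without this particular letter are converted correctly.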

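【Question discussion】:

I know. The script may work in Python 3, because it handles unicode objects directly. But in version 2.7 there are two types of string objects, str and unicode, and sadly str is the default :)
chr(0x98) is ≤, are you sure it is a cp1251 error?

Tags: python encoding cp1251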

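【Solution 1】:

I found that, because of some bug (or just my own stupidity), I was trying to convert files that had already been converted.

Very sorry for wasting your time.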

【Discussion】:

Useful for recognizing files that have already been converted: u'И'.encode('utf-8').decode('cp1251') (it reproduces your error)
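
The expression from the comment above can be wrapped in a small, hedged reproduction that captures the exception and inspects where decoding failed. The names `letter` and `error` are illustrative only.

```python
# U+0418 is the Cyrillic letter "И", written as an escape for portability
letter = u"\u0418"

error = None
try:
    # Encode to UTF-8, then wrongly decode the result as cp1251
    letter.encode("utf-8").decode("cp1251")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

if error is not None:
    # Offset and value of the byte that could not be decoded
    bad_byte = bytearray(error.object)[error.start]
    print(error.start, hex(bad_byte))
```

The reported offset and byte value should match the ones in the error message from the question.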