替换编码无法识别的字符答案

【问题标题】：Replace character that is not recognised by encoding替换编码无法识别的字符
【发布时间】：2019-12-03 13:51:00
【问题描述】：

我有一个要导入的大文件。该文件由数百万行客户创建的数据组成。因此，一些用户使用了编码无法识别的字符（每 100,000 个字符少于 1 个字符）。

但是，这会导致代码中断，因为它无法识别字符，并给我以下错误：

UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 619: character maps to <undefined>

在上述特定情况下，编码无法识别长连字符。

我目前用来读取文件并进行一些转换的代码是：

def conversion(path, source, count):
    file = open(path, "w")
    iFile = open(source, 'r', encoding="utf-8")
    len_text = 1
    file.write("[\n")

    for line in iFile:                          # For all the lines in the file
        line = line.strip()                     # Remove newline/whitespace from begin and end of line
        line = line.replace('"newDetails":{','')
        line = line.replace('},"addrDate"',',"addrDate"')
        line = line.replace('},"open24Id"',',"open24Id"')

        if len_text != count:                   # While len_text does not equal line_count
            line+= r","                         # Add , to end of the line
            line+= "\n"                         # Add \n to end of line
            file.write(line)                    # Write line to file
        else:
            line += "\n"                        # Add \n to end of line
            file.write(line)                    # Write line to file

        len_text += 1                           # Increment len_text by 1

    file.write("]")                             # Write ] to end of file
    file.close()                                # Close file
    return

中断发生在file.write(line)。

如何让脚本搜索并将字符 \x96 替换为另一个字符？

【问题讨论】：

您可以在翻译它的位上尝试一下，然后循环它以使其继续尝试以便捕获错误，然后在 except 中您可以删除或替换它。跨度>
不确定我是否关注。不试一试，如果失败，跳过该行？
输出文件使用您的平台默认编码，在您的情况下是一些（Windows？）不支持所有字符的 8 位代码页（但如果您运行它可能是不同的编码另一台机器上的脚本）。为什么不对输出文件也使用像 UTF-8 这样的通用编码？
如果您绝对必须使用 8 位编码，您应该明确指定它，并且您可以添加一个错误处理程序来替换不支持的字符，例如。 open(path, 'w', encoding='cp1252', errors='replace').
顺便说一句，'\x96' 不是破折号，而是控制字符。看起来您可能已经输入了乱码。

标签： python python-3.x utf-8 character-encoding

【解决方案1】：

根据我的评论：尝试将捕获消息的错误部分，除了是你如何处理它，所以如果你说

try:
    your code
except UnicodeEncodeError:
    break

会跳过它，但会做类似的事情

try:
    your code
except UnicodeEncodeError:
    file.write("Your character")

这将允许您使用您的代码，并且当它遇到该错误时，它将用您要替换的字符替换它。使用代码将其更改为您希望它的工作方式，我只是做了一个通用示例。

【讨论】：

这对您有帮助吗？
它让我在这个过程中走得更远。发现新的错误，然后解决这些问题。谢谢
很高兴能帮到你。
如果您给我这些错误，我可以更新答案以尝试修复它。