How can I write to a file with the same formatting as print? Answers

【Question title】: How can I write to a file with the same formatting as print?
【Posted】: 2021-09-23 20:48:51
【Question】:

TL;DR

Trying to write a string to a file raises the following error:

Code

logfile.write(cli_args.last_name)

Output

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

But this works:

Code

print(cli_args.last_name)

Output

Pérez

Why?

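Full context

I wrote a script that receives data from the Linux CLI, processes it, and finally creates a Zendesk ticket from the data provided. It is a kind of CLI API: in front of my script sits a larger system with a web interface and a form where users fill in field values, which are then substituted into the CLI call. For example: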

myscript.py --first_name '_first_name_' --last_name '_last_name_'

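The script worked fine until the web application was updated yesterday. I think they changed something related to the character set or encoding.

I do some simple logging by opening a file and writing informational messages with f-strings, so that if anything fails I can go back and see where it happened. The CLI arguments are read with the argparse module. Example: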

logfile.write(f"\tChecking for opened tickets for user '{cli_args.first_name} {cli_args.last_name}'\n")

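After the web update I started getting errors like this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

While troubleshooting I found that this happens because some users enter names with accents, such as Carlos Pérez.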

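I need the script running again and able to cope with input like this, so I looked for an answer by inspecting the HTTP headers of the input form in the web console, and found that it uses Content-Type: text/html; charset=UTF-8. My first attempt was to encode the str passed in the CLI argument to utf-8 and decode it again with the same codec, but that did not work.

For my second attempt I read the Python documentation for str.encode() and bytes.decode(), and tried this: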

logfile.write(
    "\tChecking for opened tickets for user "
    f"'{cli_args.first_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')} "
    f"{cli_args.last_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')}'"
)

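It worked, but it dropped the accented letters, so Carlos Pérez became Carlos Prez. That is no use to me here, because I need the complete input.

As a desperate move, I tried printing the same f-string I was trying to write to the log file, and to my surprise it worked. It printed Carlos Pérez to the console without any encode/decode step.

How does print work? Why does writing to the file fail? And most importantly, how can I write to a file with the same formatting as print?

Edit 1 @MarkTolonen

I tried the following: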

logfile = open("/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/755bug.txt", mode="a", encoding="utf8")
logfile.write(cli_args.body)
logfile.close()

Output:

Traceback (most recent call last):
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 414, in <module>
    main()
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 81, in main
    logfile.write(cli_args.body)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed

Edit 2

I managed to get hold of the text that causes the problem:

if __name__ == "__main__":
    string = (
        "Buenos d\udcc3\udcadas,\r\n\r\n"
        "Mediante  monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
        "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
        "Causas sugeridas del evento: _snmp_f14_\r\n"
        "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
        "Validaciones de bajo impacto: _snmp_f16_\r\n"
        "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
        "Saludos."
    )

    # Output: Text with the unicodes translated
    print(string)

    # Output: "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed"
    with open(file="test.log", mode="w", encoding="utf8") as logfile:
        logfile.write(string)

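【Comments】:

How are you creating logfile?
Is this running on Windows or Linux? You can specify that the file should be UTF-8 when you open it. You can also use print("string string string", file=logfile) instead of write.
Please create a minimal reproducible example.
Edit your question with the information above. Use the encoding='utf8' option when opening the file to support all Unicode characters.
For what it's worth, @MarkTolonen has the right answer. Just declare the file as UTF-8 when you open it. Problem solved. That is how stdin/stdout are opened on Linux.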
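
Since several comments ask how the file and the standard streams are opened, a quick diagnostic can help. The following is a minimal sketch, not part of the original thread; the function name describe_text_io is hypothetical. It only reports the encodings Python has chosen, assuming a CPython 3 interpreter. On Python versions before 3.11 (with UTF-8 mode off), locale.getpreferredencoding(False) is the default that open() uses in text mode when no encoding is given.

```python
import locale
import sys


def describe_text_io():
    """Collect the encodings Python picked for stdout, files, and argv."""
    return {
        # Encoding and error handler used by print()
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "stdout_errors": getattr(sys.stdout, "errors", None),
        # Default encoding for open() in text mode when none is passed
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used to decode command-line arguments and file names
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_text_io().items():
        print(f"{key}: {value}")
```

If preferred_encoding reports an ASCII codec (for example ANSI_X3.4-1968), that would be consistent with the 'ascii' codec named in the first error message.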

Tags: python python-3.x string file write

【Solution 1】:

The answer is the encoding parameter of open. Observe:

Last login: Wed Jul 14 15:05:24 2021 from 50.126.68.34
[timrprobocom@jared-ingersoll ~]$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('x.txt','a')
>>> g = open('y.txt','a',encoding='utf-8')
>>> s = "spades \u2660 spades"
>>> f.write(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2660' in position 7: ordinal not in range(128)
>>> g.write(s)
15
>>>
[timrprobocom@jared-ingersoll ~]$ hexdump -C y.txt
00000000  73 70 61 64 65 73 20 e2  99 a0 20 73 70 61 64 65  |spades ... spade|
*
00000011
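
The same check can be scripted instead of typed into the REPL. This is a minimal sketch under the assumption that a temporary directory is acceptable for the test file; the helper names append_text and demo are hypothetical. It writes with an explicit encoding and reads the raw bytes back.

```python
import os
import tempfile


def append_text(path, text, encoding="utf-8"):
    """Append text to path using an explicit encoding; return characters written."""
    with open(path, mode="a", encoding=encoding) as f:
        return f.write(text)


def demo():
    """Write a string containing U+2660 and return (chars written, raw bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "y.txt")
        written = append_text(path, "spades \u2660 spades")
        with open(path, "rb") as f:
            raw = f.read()
    return written, raw


if __name__ == "__main__":
    written, raw = demo()
    print(written, raw)
```

The character count (15) differs from the byte count (17) because U+2660 takes three bytes in UTF-8: e2 99 a0, as in the hexdump above.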

【Comments】:

logfile = open("test.txt", mode="a", encoding="utf8") logfile.write(cli_args.body) logfile.close() Output: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
@CarlosPérez If you are on Windows, you may want to use 'utf-8-sig' so that Windows editors pick the right encoding when opening the file.
@CarlosPérez You need to get rid of those surrogates first. I'm sure there is a question on SO that shows how.

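【Solution 2】:

It looks like something upstream is misconfigured. Your string appears to be the result of a decode operation that used the wrong encoding together with the errors='surrogateescape' error handler. From the data shown, the decode apparently treated UTF-8 encoded text as ASCII.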

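errors='surrogateescape' is a way for an encoding to handle invalid bytes during a decode operation. When converting to a Unicode string, the error handler replaces each invalid byte with a lone surrogate in the range U+DC80..U+DCFF, and the process can be reversed to get the original byte string back by doing an encode with errors='surrogateescape' and the same encoding.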
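
The round trip described above can be seen on a single word. This is a small illustrative sketch; the variable names are chosen for the example only.

```python
# UTF-8 bytes for "días": the accented letter is the two bytes c3 ad
raw = "días".encode("utf-8")

# Decoding as ASCII fails on bytes >= 0x80; surrogateescape maps each such
# byte 0xNN to the lone surrogate U+DCNN instead of raising an error.
escaped = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same codec and handler turns the surrogates back into
# the original bytes, so nothing is lost.
restored = escaped.encode("ascii", errors="surrogateescape")

# With the original bytes back, decoding with the right codec gives the text.
fixed = restored.decode("utf-8")

# ascii() is used because printing lone surrogates directly may itself fail.
print(ascii(escaped), restored == raw, fixed)
```

Here escaped equals "d\udcc3\udcadas", the same pattern that appears in the string from the question.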

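The lone surrogates in string match the pattern that a decode(encoding='ascii', errors='surrogateescape') call would produce when the data is actually UTF-8: the surrogates all fall in the range used by surrogateescape, and the bytes they stand for form valid UTF-8. The code below recovers the original bytes and then decodes them correctly as UTF-8. Once the Unicode string is valid, it can be written to the log file with encoding='utf8'.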

string = (
    "Buenos d\udcc3\udcadas,\r\n\r\n"
    "Mediante  monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
    "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
    "Causas sugeridas del evento: _snmp_f14_\r\n"
    "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
    "Validaciones de bajo impacto: _snmp_f16_\r\n"
    "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
    "Saludos."
)

fixed = string.encode('ascii',errors='surrogateescape').decode('utf8')
print(fixed)

with open(file="test.log", mode="w", encoding="utf8") as logfile:
    logfile.write(fixed)

You can read more about surrogate escapes in PEP 383.
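
If the input may also contain ordinary non-ASCII characters next to the escaped bytes, encoding with 'ascii' would raise on those characters. A possible variant, offered as a sketch and not as the answer's method, is to encode with 'utf-8' and surrogateescape instead: valid characters are encoded normally and only the escaped surrogates become raw bytes. The helper name repair_surrogates is hypothetical, and the fallback to errors='replace' is an assumption about what a logging script would want.

```python
def repair_surrogates(text: str) -> str:
    """Undo a wrong surrogateescape decode, assuming the bytes were UTF-8.

    Surrogates outside U+DC80..U+DCFF are not handled by surrogateescape
    and will still raise UnicodeEncodeError.
    """
    # Ordinary characters encode as UTF-8; U+DCNN surrogates become byte 0xNN.
    raw = text.encode("utf-8", errors="surrogateescape")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The escaped bytes were not valid UTF-8 after all; substitute
        # U+FFFD for them so the result can still be written to a file.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repair_surrogates("Buenos d\udcc3\udcadas"))
    print(repair_surrogates("ü"))
```

Strings without any surrogates pass through unchanged, so the helper can be applied to every argument before logging.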

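【Comments】:

This is the solution that worked for me, but I still don't understand how the print function displays these characters correctly. Or is stdout doing the heavy lifting here?
string.encode('ascii',errors='surrogateescape') makes no sense. It fails on most Unicode input, including perfectly valid Unicode input. For example, it fails on the input 'ü'.
surrogateescape is designed so that broken_bytes.decode(some_encoding, errors='surrogateescape').encode(some_encoding, errors='surrogateescape') == broken_bytes holds even when broken_bytes does not actually represent Unicode text encoded with some_encoding. It is meant for handling errors when decoding, not when encoding, and using surrogateescape to encode text that was not decoded with surrogateescape makes no sense. The encode behavior exists only to reverse the transformation that surrogateescape applies on decode.
@user2357112supportsMonica The data the OP posted looks like it was decoded with surrogateescape, which is why I encoded with it to reverse the process and then decoded with the correct utf8 codec. Note that the surrogates are all lone surrogates in the range the handler uses, which is how the codec represents bad bytes. It looks like wherever that string came from, it was decoded with that error handler.
@MarkTolonen: Then the answer would be better if it included that explanation.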
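
On the open question of why print worked: a likely explanation, assuming the script ran on Python 3.5 or 3.6 under the C/POSIX locale (the 'ascii' codec in the first error suggests this, but the thread does not confirm it), is that in that setup sys.stdout uses the surrogateescape error handler while files opened with open() use strict. The sketch below imitates both behaviors with an in-memory stream; write_like_stream is a hypothetical helper, not something from the thread.

```python
import io


def write_like_stream(text, encoding="ascii", errors="surrogateescape"):
    """Encode text the way a text stream with these settings would; return bytes."""
    buf = io.BytesIO()
    # newline="" disables newline translation so the bytes are predictable.
    stream = io.TextIOWrapper(buf, encoding=encoding, errors=errors, newline="")
    stream.write(text)
    stream.flush()
    data = buf.getvalue()
    stream.detach()  # keep buf from being closed along with the wrapper
    return data


# "Pérez" as it would arrive if its UTF-8 bytes were decoded as ASCII
# with surrogateescape.
escaped = "P\udcc3\udca9rez"

# stdout-like stream: the surrogates are written back as the original bytes,
# which a UTF-8 terminal then renders as the accented letter.
stdout_bytes = write_like_stream(escaped, errors="surrogateescape")

# File-like stream with the default strict handler: the same text is rejected.
try:
    write_like_stream(escaped, errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(stdout_bytes, strict_failed)
```

Under that assumption, print never translated anything: it passed the original bytes through unchanged, and the terminal did the decoding.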