Bytes object stored in "repr format" as b'foo' instead of encode()-ing to string -- how to fix?

【Question title】: Bytes object stored in "repr format" as b'foo' instead of encode()-ing to string -- how to fix?
【Posted】: 2018-12-11 18:26:59
【Question description】:

Some hapless coworker saved some data to a file like this:

s = b'The em dash: \xe2\x80\x94'
with open('foo.txt', 'w') as f:
    f.write(str(s))

when they should have used

s = b'The em dash: \xe2\x80\x94'
with open('foo.txt', 'w') as f:
    f.write(s.decode())

Now foo.txt looks like

b'The em dash: \xe2\x80\x94'

instead of

The em dash: —

I have already read the file in as a string:

with open('foo.txt') as f:
    bad_foo = f.read()

How do I now convert bad_foo from the incorrectly saved format into the string that should have been saved?

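【Question discussion】:

.decode without an encoding name makes no sense. Anyway, why were you using byte strings in the first place? The idiomatic way to do this is to use Unicode strings and let Python encode them when writing to the file.
@tripleee Someone else did this, and my job is to undo it :)
I suspect there is no more useful suggestion than eval.
@tripleee This is a self-answer. See stackoverflow.com/a/53730411/2954547
@shadowtalker There is an "Answer your own question" checkbox below the "Post Your Question" button on the page, which lets you get your answer in ahead of the competition ;-)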

Tags: python python-3.x unicode character-encoding

【Solution 1】:

You can try literal_eval:

from ast import literal_eval
test = r"b'The em-dash: \xe2\x80\x94'"
print(test)
res = literal_eval(test)
print(res.decode())
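
Applied to the file from the question, the same idea might look like the sketch below. It recreates the damaged file in a temporary directory first so that it runs on its own; the scratch path is hypothetical, and the final decode assumes the original bytes were UTF-8, as the em dash bytes in the question suggest.

```python
import ast
import os
import tempfile

# Reproduce the mistake: str() on bytes writes the repr, not the decoded text.
s = b'The em dash: \xe2\x80\x94'
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')  # hypothetical scratch location
with open(path, 'w') as f:
    f.write(str(s))

# Read the damaged file back as text, as in the question.
with open(path) as f:
    bad_foo = f.read()

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
recovered = ast.literal_eval(bad_foo.strip())
if not isinstance(recovered, bytes):
    raise ValueError('file did not contain a bytes literal')

fixed = recovered.decode('utf-8')  # assumption: the original bytes were UTF-8
print(fixed)
```

The isinstance check is a small guard: literal_eval would just as happily return a list or a number if the file held some other literal.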

【Discussion】:

【Solution 2】:

If you trust that the input is not malicious, you can use ast.literal_eval on the broken string.

import ast

# Create a sad broken string
# (raw string, so the backslash escapes stay literal text, as they are in the file)
s = r"b'The em-dash: \xe2\x80\x94'"

# Parse and evaluate the string as raw Python source, creating a `bytes` object
s_bytes = ast.literal_eval(s)

# Now decode the `bytes` as normal
s_fixed = s_bytes.decode()

Otherwise, you will have to parse the string manually and remove or replace the offending repr'd escapes.
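
One way the manual route could look, as a sketch only: strip the b'...' wrapper and let the standard unicode_escape codec undo the backslash escapes. The helper name unrepr_bytes is hypothetical, and the code assumes the text is the complete repr of a single bytes object whose content was UTF-8.

```python
def unrepr_bytes(text, encoding='utf-8'):
    """Recover the text from a str(bytes) dump without evaluating it.

    Hypothetical helper; assumes `text` is the full repr of one bytes object.
    """
    text = text.strip()
    # Expect b'...' or b"..." (repr switches to double quotes when the
    # bytes contain a single quote).
    if len(text) < 3 or text[0] != 'b' or text[1] not in '\'"' or text[-1] != text[1]:
        raise ValueError('not a bytes repr')
    inner = text[2:-1]
    # repr(bytes) is pure ASCII. unicode_escape turns \xNN, \n, \\ and \'
    # into code points 0-255, and latin-1 maps those back to raw bytes.
    raw = inner.encode('ascii').decode('unicode_escape').encode('latin-1')
    return raw.decode(encoding)


bad_foo = r"b'The em dash: \xe2\x80\x94'"
fixed = unrepr_bytes(bad_foo)
print(fixed)
```

This never evaluates the input as Python source, which is the point of avoiding literal_eval here, but it is also stricter: anything that is not a well-formed bytes repr raises ValueError (UnicodeEncodeError is a subclass of it).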

【Discussion】:

【Solution 3】:

This code works fine on my computer. However, if you still get an error, this may help you:

with open('foo.txt', 'r', encoding="utf-8") as f:
    print(f.read())

【Discussion】:

No, the problem is that the file contains the repr() of a byte string.
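
A short sketch of why the encoding argument cannot help, using only built-ins: str() on a bytes object returns its repr(), which is plain ASCII, so every ASCII-compatible codec reads the file back as the same escaped text.

```python
s = b'The em dash: \xe2\x80\x94'
written = str(s)  # what the coworker's code put in the file

# str() on bytes falls back to repr(), so the file holds Python source text.
print(written == repr(s))    # True
# That text is plain ASCII, so the codec used to read it makes no difference.
print(written.isascii())     # True
roundtrip = written.encode('ascii').decode('utf-8')
print(roundtrip == written)  # True
```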