Converting Unicode sequences to a string in Python 3: answers

【Question title】: Converting Unicode sequences to a string in Python 3
【Posted】: 2016-02-01 18:44:34
【Question】:

While parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 in the Bash CLI, print() gives me output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How would I output the actual text itself in my application?

This is the code generating the string:

import json
import requests

response = requests.get(url)  # url is defined elsewhere
messages = json.loads(extract_json(response.text))

for k,v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

This is the function that returns the JSON from the HTML page:

def extract_json(input_):

    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

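While googling the issue I found quite a bit of information relating to Python 2, but Python 3 has completely changed how strings, and especially Unicode, are handled in Python.

How can I convert the example string (\u05ea) to characters (ת) in Python 3?

Addendum:

Here is some information regarding message['body']: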

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

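Note that the last line does work as expected, but it has a few issues:

Decoding string literals with unicode-escape is the wrong thing to do, as Python escapes differ from JSON escapes for many characters. (Thank you bobince)
encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".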
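
To make the first issue concrete, here is a minimal sketch of how the unicode-escape round trip can corrupt text that already contains real non-ASCII characters. The sample string `mixed` is made up for illustration and does not come from the original page; wrapping it in quotes for json.loads only works because it contains no quotes or control characters.

```python
import json

# A real 'é' followed by an escaped sequence: the six characters \ u 0 5 e a.
mixed = 'caf\u00e9 \\u05ea'

# encode('utf-8') turns 'é' into two bytes; 'unicode-escape' then reads every
# non-escape byte as Latin-1, so 'é' comes back as two wrong characters.
via_escape = mixed.encode('utf-8').decode('unicode-escape')

# Treating the same text as the body of a JSON string keeps 'é' intact.
via_json = json.loads('"' + mixed + '"')

print(ascii(via_escape))  # 'caf\xc3\xa9 \u05ea'
print(ascii(via_json))    # 'caf\xe9 \u05ea'
```

Both results contain the decoded ת, but only the JSON route preserves the character that was already correct.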

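【Comments】:

What is print(ascii(message['body']))? Unrelated: use messages = response.json().
If the input is not JSON then what is it? print(response.content[:50]); print(response.headers['Content-Type']). Can you change the upstream format that the service returns?
That is not what I asked. Run the code from the comment as is.
@J.F.Sebastian: b'\r\n\n\n<!DOCTYPE html> <html lang="en"> <head> <meta ' and text/html; charset=utf-8. Thank you.
Now we are getting somewhere. Could you post the real code that you use to get messages? (Between requests.get() and json.loads(), inclusive.)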

Tags: python python-3.x string unicode python-3.4

【Solution 1】:

Your input appears to use backslash as an escape character, so you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
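
The same two steps can be wrapped in a small helper. This is only a sketch: the name `loads_backslash_escaped` and the sample input are hypothetical, and it assumes the text carries exactly one extra level of backslash escaping, as in the example above.

```python
import json
import re


def loads_backslash_escaped(text):
    """Undo one level of backslash escaping, then parse the result as JSON."""
    # Replace every backslash-plus-character pair with just the character,
    # so \" becomes " and \\ becomes a single backslash.
    unescaped = re.sub(r'\\(.)', r'\1', text)
    return json.loads(unescaped)


# Hypothetical input in the same shape as the `foobar` line on the page.
raw = r'{\"body\": \"\\u05ea\\u05d4\"}'
print(loads_backslash_escaped(raw)['body'])
```

After unescaping, the remaining \u05ea sequences are ordinary JSON escapes, so json.loads decodes them without any encode/decode step.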

Don't use the 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
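
As a rough check of that difference, the sketch below compares the two results by length and then tries to encode the unicode-escape result. The variable names are illustrative only.

```python
import json

json_text = r'["\ud83d\ude02"]'

# json.loads combines the surrogate pair into a single code point.
via_json = json.loads(json_text)[0]

# unicode-escape decodes each \uXXXX separately and leaves two lone surrogates.
via_escape = json_text.encode('ascii').decode('unicode-escape')

print(len(via_json))    # 1
print(len(via_escape))  # 6: brackets, quotes, and two surrogate code points

try:
    via_escape.encode('utf-8')
except UnicodeEncodeError as exc:
    print(exc.reason)   # surrogates not allowed
```

The lone surrogates are valid inside a Python str but cannot be encoded as UTF-8, which is why the error only appears when the text is printed or written out.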

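【Comments】:

Thank you very, very much! I really appreciate your patience in getting to the root of the issue in the comments.
@dotancohen I'm not sure I understand the actual answer to this question. To convert a Unicode sequence to its string representation, we have to use JSON? Is that it? Is there no encode/decode trick that solves this?
Also, is this any different from s.encode('utf-8').decode('unicode-escape')?
It is hard to explain my goal in a single comment, so please see my separate question.
@BramVanroy [reposting the comment to fix a typo] No. If you already have plain Unicode text, you don't need to do anything with it. If you have Unicode text in JSON format, just use result = json.loads(json_text). If you have garbled input, try to fix it upstream; if you can't, use whatever means necessary to fix your specific broken input. Note: '\u2603' and r'\u2603' are completely different things in Python (your question suggests you don't see the difference).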
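
The distinction in the last comment can be seen directly. A minimal sketch, with variable names chosen only for illustration:

```python
import json

escaped = '\u2603'   # one character: U+2603 SNOWMAN
raw = r'\u2603'      # six characters: backslash, 'u', '2', '6', '0', '3'

print(len(escaped), len(raw))  # 1 6
print(escaped == raw)          # False

# Wrapping the six-character text in quotes makes it a JSON string,
# which json.loads then decodes to the single character.
print(json.loads('"' + raw + '"') == escaped)  # True
```

In the first literal Python itself processes the escape when it reads the source code; in the raw literal the backslash survives as data, which is the situation the question started with.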