How to replace escaped unicode characters with proper unicode characters?

【Question title】: How to replace escaped unicode characters with proper unicode characters?
【Posted】: 2018-12-16 08:41:49
【Question description】:

I have a string like this:

'https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'

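I need to replace every escaped unicode sequence ('\\uXXXX') with its unescaped equivalent ('\uXXXX'). I have regular expressions that extract all the necessary parts (the '\\uXXXX' part and the 'XXXX' part for re.sub()), but I can't find a way to build the replacement with \u{}, because Python raises a Unicode error: the escape needs its hex digits already filled in, as in '\u003d'. Using a raw string doesn't work, because r'\u{}' is just '\\u{}' again and we end up back where we started.

Is there a way to do this? If you want a code example, here it is: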

# data loaded from a https://www.google.com/search image search

results_source = urllib.request.urlopen(url_request).read().decode()
searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

for count, unicode in enumerate(re.findall(r"(?<=\\u)....", searched_results[i])):
    searched_results[i] = re.sub(re.findall(r"\\u....", searched_results[i])[count], r"\u{}".format(unicode), searched_results[i])

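searched_results is the list of returned results. An example of an item in the list is the string given above.

【Question comments】:

Please reduce your code to a minimal example. We don't need this much code.
Sorry, I just wanted to give the question some context. The actual question can be answered without any code as an example. Either way, I'll cut the code down :)
It looks like you are extracting Javascript / JSON strings from a Google results page, so just treat the data as JSON. Use json.loads(). In that case, though, keep the " quotes around the string.
Also, consider using an HTML parsing library such as BeautifulSoup, and requests for your HTTP needs.
@ShadowRanger: for this case that is actually the wrong advice; it produces incorrect results for non-BMP code points, because the syntax here is Javascript / JSON specific, not Python specific.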

Tags: python regex unicode

【Solution 1】:

Your regular expression extracts JSON strings from the web page:

searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

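Those " characters you stripped off actually matter. The \uxxxx escape syntax here is specific to JSON (and Javascript); it is closely related to the syntax Python uses, but it is not the same (the difference is small, but it matters when you have non-BMP code points).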
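To make the non-BMP difference concrete, here is a minimal sketch. The sample value (U+1F600 written as a surrogate pair) is an assumption chosen for illustration and does not come from the question's data. JSON writes a code point outside the BMP as two \uXXXX escapes; json.loads joins them into one character, while Python's unicode_escape codec decodes each escape on its own.

```python
import json

# Hypothetical sample: U+1F600 written the JSON / Javascript way,
# as a surrogate pair of two \uXXXX escapes.
escaped = r"\ud83d\ude00"

# json.loads needs the surrounding double quotes to see a JSON string.
via_json = json.loads('"' + escaped + '"')

# Python's unicode_escape codec decodes each \uXXXX independently.
via_codec = escaped.encode("ascii").decode("unicode_escape")

print(len(via_json))   # a single code point
print(len(via_codec))  # two separate surrogate code points
```

The second result is not a usable character: it cannot be encoded to UTF-8 without special error handling.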

If you keep the quotes in the match, you can simply decode each one as JSON:

# a list (rather than a lazy map object) so the results can be indexed and reused
searched_results = [json.loads(s) for s in re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source)]
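As a rough, self-contained illustration of that approach, the sketch below runs the quote-preserving pattern on a made-up fragment. The results_source value and the example.com URL are assumptions standing in for the real page source, which is not shown in the question.

```python
import json
import re

# Hypothetical fragment shaped like the metadata the question scrapes;
# the raw string keeps \u003d as six literal characters.
results_source = r'{"oh":352,"ou":"https://example.com/a.png,qanchor\u003dcenter","ow":473}'

# Same pattern as above: the double quotes are part of the match.
pattern = r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")"
quoted = re.findall(pattern, results_source)

# Each match is a complete JSON string literal, so json.loads can decode it.
searched_results = [json.loads(s) for s in quoted]
print(searched_results)
```

Because each match still carries its quotes, json.loads handles every escape JSON allows, not only \u003d.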

Better still, use an HTML library to parse the page. With BeautifulSoup, you can get the data like this:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]

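This loads the text content of each <div class="rg_meta" ...> element as JSON data and extracts the ou key from each resulting dictionary. No regular expressions needed.

【Comments】:

I was going to use BeautifulSoup (I've used it before), but I figured I was only doing minimal parsing, so I'd stick with good old regex. In hindsight, it would have been better to just suck it up and use Soup, so I'll edit my code a bit. Thanks!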

【Solution 2】:

You can do it like this.

>>> url = (
...    'https://www.jobtestprep.co.uk/media/24543/xnumber-series-'
...    'big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,'
...    'aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'
... )
>>> url = url.encode('utf-8').decode('unicode_escape')
>>> print(url)
https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor=center,amode=crop,awidth=473,aheight=352,arnd=131255524960000000.pagespeed.ic.YolXsWmhs0.png
>>>
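One caveat: the encode/decode round trip works for the URL above because it is pure ASCII. With non-ASCII text, unicode_escape reads the UTF-8 bytes as Latin-1 and garbles them. The sketch below shows this and one possible alternative that only touches \uXXXX sequences, using re.sub with a function as the replacement. The helper name unescape_u and the sample string are hypothetical, and the helper does not join surrogate pairs, so it shares the non-BMP limitation.

```python
import re

def unescape_u(text):
    # Hypothetical helper: replace each literal backslash-u-XXXX sequence
    # with the character it names; everything else is left untouched.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# Made-up sample mixing a real non-ASCII character with an escaped one.
sample = "café,qanchor\\u003dcenter"

via_codec = sample.encode("utf-8").decode("unicode_escape")
via_regex = unescape_u(sample)

print(via_codec)  # the é comes out garbled
print(via_regex)  # the é survives and \u003d becomes =
```

Passing a function to re.sub avoids having to construct a '\uXXXX' literal at all, which is where the question's attempt got stuck.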

【Comments】: