Python: Convert String with Unicode to HTML numeric code

【Question title】: Python: Convert String with Unicode to HTML numeric code
【Posted】: 2021-07-10 08:14:45
【Question】:

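Hi all, I'm looking for a way to convert every Unicode code contained in a string to the corresponding HTML entity.

For example:

Input: "This is \u+0024. a string with \u+0024. random \u+0024. unicode"
Output: "This is &#36; a string with &#36; random &#36; unicode"

My current attempt at this problem is as follows: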

if "\\u+" in my_string:
  unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode_string = f"U+{unicode_code}"
  html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
  my_string = label_content.replace(unicode_to_replace,  html_code)

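However, the Unicode string is not converted correctly. Any suggestions?

Thanks in advance!

【Comments】:

Welcome to Stack Overflow. Please take the 2-minute tour. Moreover, open the Help center and read at least How to Ask. Then, edit your question to provide a minimal reproducible example. Questions that only ask for code are too broad and are likely to be put on hold or closed. Sorry, Stack Overflow is not a free code-writing service…
Thanks, I updated the question.

Tags: python html unicode converters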

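【Solution 1】:

I found a solution myself, for anyone who is interested. It differs slightly from what I asked: the output does not show the Unicode codes as HTML entities but converts them to the corresponding characters, because that works better in my case.

So the final part of the code looks like this: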

# e.g. of an input string containing some sort of unicodes.
# This is how they are formatted in my input file.
my_string =  "This is \u+0024. a string with \u+0024. random \u+0024. unicodes" 

if "\\u+" in my_string:
  # Take the hex digits between the first "\u+" and the following "."
  unicode_code = my_string.split("\\u+")[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode = f"\\u{unicode_code}"
  # Decode the \uXXXX escape into the actual character (not an HTML entity)
  html_entity = unicode.encode('utf-8').decode('raw-unicode-escape')
  my_string = my_string.replace(unicode_to_replace, html_entity)

print(my_string)
# my_string is now: "This is $ a string with $ random $ unicodes"
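
For readers who need the numeric HTML entity that the question originally asked for, here is a minimal sketch; it is not part of the original answer. It illustrates that the `xmlcharrefreplace` error handler only rewrites characters that cannot be encoded as ASCII, which would explain why encoding the literal text "U+0024" leaves it unchanged. The helper names `char_to_numeric_entity` and `codepoint_to_entity` are hypothetical.

```python
# Encoding the literal text "U+0024" changes nothing: every character in it
# is plain ASCII, so xmlcharrefreplace has nothing to replace. The result is
# also bytes, not str, so it cannot be passed to str.replace() directly.
print("U+0024".encode("ascii", "xmlcharrefreplace"))  # b'U+0024'


def char_to_numeric_entity(ch: str) -> str:
    """Return ASCII text unchanged; turn non-ASCII characters into &#NNN;."""
    return ch.encode("ascii", "xmlcharrefreplace").decode("ascii")


def codepoint_to_entity(hex_code: str) -> str:
    """Hypothetical helper: build a decimal entity from hex digits."""
    return f"&#{int(hex_code, 16)};"


print(char_to_numeric_entity(chr(0x042F)))  # &#1071;
print(char_to_numeric_entity("$"))          # $ (ASCII stays as it is)
print(codepoint_to_entity("0024"))          # &#36;
```

Because `$` is ASCII, only the second helper produces an entity for it; which of the two fits depends on whether ASCII characters should be escaped as well.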

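【Comments】:

【Solution 2】:

I would rather apply Regular expression operations (the re module). The pattern variable covers

all valid Unicode values (see e.g. U+042F in place of the middle U+0024),
all syntax variants of the input string: the input in the original question was edited three times (with/without the leading backslash and/or the trailing dot), and
the my_string variable from the OQ's self-answer, which is incorrect as posted: '\u+0024' raises a truncated \uXXXX escape error.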
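
The escape error mentioned in the last point can be reproduced without crashing the script by compiling the offending source text at run time. This is only an illustrative sketch; the names `bad_source`, `raw_form` and `escaped_form` are made up for the example.

```python
# Source text of a normal (non-raw) literal that contains \u+0024. It is held
# in a raw string here so that this example itself can be parsed.
bad_source = r'my_string = "This is \u+0024."'

try:
    compile(bad_source, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    # "\u" must be followed by exactly four hex digits in a normal literal.
    error_message = exc.msg

print(error_message)

# Two equivalent ways to write the text safely:
raw_form = r"This is \u+0024."        # raw string, backslash kept as typed
escaped_form = "This is \\u+0024."    # doubled backslash
print(raw_form == escaped_form)       # True
```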

Script:

import re

def UPlusHtml(matchobj):
    return re.sub(r"^\\?[uU]\+", '&#x',
                  re.sub(r'\.$', '', matchobj.group(0))) + ';'

def UPlusRepl(matchobj):
    return chr(int(re.sub(r"^\\?[uU]\+", '',
                          re.sub(r'\.$', '', matchobj.group(0))), 16))

pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

input = "This is U+0024. a string with U+042f random U+0024. unicode"

print( input )
print( re.sub( pattern, UPlusHtml, input ) )
print( re.sub( pattern, UPlusRepl, input ) )

print('--')

my_string =  "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"

print( my_string )
print( re.sub( pattern, UPlusHtml, my_string ) )
print( re.sub( pattern, UPlusRepl, my_string ) )

Output: \SO\67105976.py

This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes

Please note that I'm a regex beginner myself, so I'm sure a more efficient regex-based solution must exist, no doubt…
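
One possible simplification, offered as a sketch rather than as the answer's own method: capture the hex digits in a group so that a single `re.sub` call does the work, without the nested substitutions. The names `U_PLUS`, `to_html_entity` and `to_char` are hypothetical, and limiting the digits to 4 to 6 is an assumption (the pattern above accepts any number of hex digits). `chr()` still raises `ValueError` for values above 0x10FFFF.

```python
import re

# Optional backslash, u or U, a plus sign, 4 to 6 hex digits (captured),
# and an optional trailing dot that is dropped from the result.
U_PLUS = re.compile(r"\\?[uU]\+([0-9a-fA-F]{4,6})\.?")


def to_html_entity(text: str) -> str:
    """Replace each U+XXXX style code with a hexadecimal HTML entity."""
    return U_PLUS.sub(lambda m: f"&#x{m.group(1)};", text)


def to_char(text: str) -> str:
    """Replace each U+XXXX style code with the character it names."""
    return U_PLUS.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = r"This is \u+0024. a string with U+042f random \u+0024. unicodes"
print(to_html_entity(sample))
print(to_char(sample))
```

As with the pattern above, a dot directly after a code is always consumed, so a genuine full stop in that position would be lost.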

【Comments】: