Get rid of invalid unicode character in string variable

【Question title】: Get rid of invalid unicode character in string variable
【Posted】: 2020-01-16 21:28:45
【Question description】:

I issued a python3 requests get call (not sure if that's the right wording), converted the response to json, and parsed it to get this name:

'Harrison Elementary School \U0001f3eb'

I looked it up, and the unicode character represents a school (Unicode School Character). But when I print the string, I get:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3eb' in position 27: character maps to <undefined>

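I really don't care about keeping that unicode character. It isn't important for my purposes.

How can I remove that unicode character, and any other invalid characters, from this string or any other string I run across?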

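【Question comments】:

How and where are you printing? The OS and the terminal/IDE in use matter.
Report the Python version too... For example, Python 3.6+ on Windows will print all Unicode characters in a terminal window without raising an exception, although a replacement character is shown if the font doesn't support the character.

Tags: python-3.x unicode string-parsing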

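【Solution 1】:

The character isn't really invalid, it's just undefined in the target encoding, so when encoding you can usually tell the encoder how to handle errors:

import codecs

school_name = "Harrison Elementary School \U0001f3eb"
# 'ignore' drops any character the codec cannot encode
# charmap_encode returns a tuple: (encoded bytes, number of characters consumed)
encoded_name = codecs.charmap_encode(school_name, 'ignore')
print(encoded_name)

Result: (b'Harrison Elementary School ', 28)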
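
For comparison, here is a minimal sketch of the same idea using the built-in str.encode and its standard error handlers. The choice of 'ascii' as the target codec is an assumption made for illustration; it stands in for any encoding that cannot represent the emoji.

```python
school_name = "Harrison Elementary School \U0001f3eb"

# 'ascii' is only an illustrative target; any codec lacking the emoji behaves similarly.
ignored = school_name.encode("ascii", errors="ignore")            # drop the character
replaced = school_name.encode("ascii", errors="replace")          # substitute '?'
escaped = school_name.encode("ascii", errors="backslashreplace")  # keep an escape sequence

print(ignored)   # b'Harrison Elementary School '
print(replaced)  # b'Harrison Elementary School ?'
print(escaped)   # b'Harrison Elementary School \\U0001f3eb'
```

Unlike codecs.charmap_encode, str.encode returns only the bytes, so there is no tuple to unpack.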

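【Discussion】:

【Solution 2】:

First, you have to determine why the character is invalid. The error message appears to have been generated when you tried to print the string, which means the Unicode character can't be encoded with the default output encoding. For print, that is sys.stdout.encoding.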
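
One way to confirm this diagnosis is to try the encoding step yourself. The sketch below is only an illustration: can_encode is a hypothetical helper name, and the UTF-8 fallback for a missing stdout encoding is an assumption.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

name = "Harrison Elementary School \U0001f3eb"

# stdout may have no encoding when it is redirected or replaced; UTF-8 is an assumed fallback.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
print(stdout_encoding, can_encode(name, stdout_encoding))

print(can_encode(name, "cp1252"))  # False: cp1252 has no mapping for the emoji
print(can_encode(name, "utf-8"))   # True: UTF-8 can represent every code point
```

If the first line prints False for your stdout encoding, print will raise the same UnicodeEncodeError shown in the question.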

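You can encode the string yourself and ignore the invalid characters, but that leaves you with a byte string. You then need to decode those bytes back into a Unicode string.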

def sanitize(s, encoding, errors='ignore'):
    return s.encode(encoding, errors=errors).decode(encoding)

>>> import sys
>>> print(sanitize('Harrison Elementary School \U0001f3eb', sys.stdout.encoding))
Harrison Elementary School
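
The output above depends on what sys.stdout.encoding happens to be. As a rough illustration, the sketch below repeats the same sanitize function with explicit encodings so the behavior is reproducible; 'cp1252' is an assumed stand-in for a typical Windows console encoding.

```python
def sanitize(s, encoding, errors="ignore"):
    # Same round trip as above: encode with an error handler, then decode back to str.
    return s.encode(encoding, errors=errors).decode(encoding)

name = "Harrison Elementary School \U0001f3eb"

print(repr(sanitize(name, "cp1252")))             # 'Harrison Elementary School '
print(repr(sanitize(name, "cp1252", "replace")))  # 'Harrison Elementary School ?'
print(sanitize(name, "utf-8") == name)            # True: UTF-8 keeps the emoji
print(repr(sanitize(name, "cp1252").rstrip()))    # 'Harrison Elementary School'
```

Two things are worth noting: the space before the removed character is kept unless you strip it, and with a UTF-8 output encoding nothing is removed because the character is encodable.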

【Discussion】: