Handle wrongly encoded character in Python unicode string: answers

【Question title】: Handle wrongly encoded character in Python unicode string
【Posted】: 2011-08-11 06:46:21
【Question】:

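I am dealing with unicode strings returned by the python-lastfm library.

I assume that somewhere along the way the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the original string I am expecting in the variable a is "Glück":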

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

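\xfc is the escape for the value 252, which corresponds to the latin1 encoding of "ü". Somehow it ends up embedded in the unicode string in a way that Python cannot handle on its own.

How can I convert this back into a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError or a string containing the sequence \xfc.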

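【Comments on the question】:

What version of Python are you using?
What operating system? What is sys.stdout.encoding?
Possible duplicate of BeautifulSoup findall with class attribute- unicode encode error
@RestRisiko: and of a few dozen other questions

Tags: python string unicode character-encoding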

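【Solution 1】:

I stumbled upon this error myself while processing a file containing German words; I was not aware that it had been encoded in UTF-8. The problem showed up once I started processing the words, because some of them could not be printed: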

# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40) 
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

I solved the error by calling the encode method on the string:

>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück

【Comments】:

【Solution 2】:

At the beginning of your code, after the imports, add these 3 lines.

import sys  # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')

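This overrides the system default encoding (ascii) for the duration of your program.

Edit: You should not do this unless you are sure of the consequences; see the comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')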
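A less invasive alternative to changing the process-wide default is to give the output stream an explicit encoding. The following is only a sketch, not something this answer proposed: make_utf8_writer is a hypothetical helper name, and an in-memory io.BytesIO stands in for the real output stream so the result can be inspected. With a real console one would wrap the actual stream instead (for example with codecs.getwriter('utf-8') on Python 2), or set the PYTHONIOENCODING environment variable before starting the interpreter.

```python
import io


def make_utf8_writer(byte_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8.

    Hypothetical helper for illustration; newline="\n" turns off newline
    translation so the bytes are identical on every platform.
    """
    return io.TextIOWrapper(byte_stream, encoding="utf-8", newline="\n")


# An in-memory buffer stands in for the real output stream (assumption).
buffer = io.BytesIO()
writer = make_utf8_writer(buffer)

writer.write(u"Gl\xfcck\n")  # text goes in ...
writer.flush()

raw = buffer.getvalue()      # ... UTF-8 encoded bytes come out
print(repr(raw))
```

Only this one stream is affected; implicit str/unicode conversions elsewhere in the program keep their normal behaviour.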

【Comments】:

Never do this. stackoverflow.com/questions/3828723/…

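【Solution 3】:

Don't use str() to convert what you get from the model fields into a string, as long as it is already a unicode string. (Oops, I completely missed that this has nothing to do with django.)

【Comments】:

【Solution 4】:

You have to convert your unicode string into a standard string using some encoding, e.g. UTF-8: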

some_unicode_string.encode('utf-8')
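To make concrete what that call does, here is a small sketch using the string from the question. The variable names are illustrative only. It shows that the same text yields different bytes under different codecs, and that the ASCII codec cannot represent "ü" at all, which is the failure reported in the question.

```python
word = u"Gl\xfcck"  # the string from the question; U+00FC is "ü"

# The text itself is well formed: five characters, the third has code point 252.
print(len(word))
print(ord(word[2]))

# Encoding turns text into bytes; the result depends on the codec chosen.
utf8_bytes = word.encode("utf-8")      # "ü" becomes two bytes: 0xC3 0xBC
latin1_bytes = word.encode("latin-1")  # "ü" becomes one byte: 0xFC
print(len(utf8_bytes))
print(len(latin1_bytes))

# ASCII has no representation for "ü", so this is where the error comes from.
try:
    word.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii cannot encode it: %s" % exc.reason)

# Decoding with the matching codec gives the original text back.
round_trip = utf8_bytes.decode("utf-8")
```

Which codec is the right one depends on what the consumer of the bytes (terminal, file, network peer) expects; UTF-8 is a common choice but not a universal one.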

Besides that: this is a duplicate of

BeautifulSoup findall with class attribute- unicode encode error

and of at least ten other related questions on SO. Do some research first.

【Comments】:

【Solution 5】:

Your unicode string is fine:

>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'

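The problem you are seeing at the interactive prompt is that the interpreter does not know which encoding to use to output the string to your terminal, so it falls back to the "ascii" codec, and that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me, probably because my environment variable settings differ from yours):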

>>> print u'Gl\xfcck'
Glück
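The dependence on the stream's encoding can be sketched without touching the real terminal. The helper below is hypothetical (encode_for_stream is not a standard function); it only models the behaviour described in this answer, under the assumption that a stream reporting no encoding is treated as ASCII, as Python 2 does.

```python
import sys


def encode_for_stream(text, stream_encoding, errors="strict"):
    """Encode text roughly the way print would for a stream with this encoding.

    Hypothetical helper: a missing encoding (None) is modelled as ASCII,
    mirroring the fallback described above.
    """
    codec = stream_encoding or "ascii"
    return text.encode(codec, errors)


word = u"Gl\xfcck"

# What the current interpreter reports for its own standard output.
print("sys.stdout.encoding = %r" % (getattr(sys.stdout, "encoding", None),))

# A UTF-8 terminal can represent the word.
as_utf8 = encode_for_stream(word, "UTF-8")

# With no known encoding the strict ASCII fallback fails, as in the question.
try:
    encode_for_stream(word, None)
    fallback_failed = False
except UnicodeEncodeError:
    fallback_failed = True

# A lossy but non-raising alternative for diagnostics.
as_replaced = encode_for_stream(word, None, "replace")
as_escaped = encode_for_stream(word, None, "backslashreplace")
```

The "replace" and "backslashreplace" error handlers are useful for logging, since they never raise, but they do not preserve the original character.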

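【Comments】:

Indeed, Mac and modern Linux desktops use a UTF-8 console by default, so this works there. On Windows, on the other hand, the standard C library interface to the console is tied to a handful of locale-specific code pages, none of which is UTF-8. Non-ASCII characters on the Windows console are always a trial.
@Bobince: The most prevalent of the "handful of locale-specific code pages" for the Windows console is cp850, which, like the other Latin-based code pages, happily displays u-umlaut correctly. If you do send an unmapped Unicode character to a Windows console, you do not get the OP's message about the ascii codec. You get e.g. UnicodeEncodeError: 'charmap' codec can't encode character u'\u9876' in position 2: character maps to <undefined> ... We still don't know why the OP is getting that error.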