Error on converting Latin-1 to UTF-8 String in Python 3

【Question title】: Error on converting Latin-1 to UTF-8 String in Python 3
【Posted】: 2021-07-29 00:00:48
【Question】:

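I have a text dataset that pandas can only import with the Latin-1 encoding; any other encoding I try raises an error. I want to strip the special characters from this dataset. However, those special characters show up as hex escapes, like this: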

AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84

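I then read in another thread that I could get rid of them by decoding the text as Latin-1 and then encoding it as UTF-8. But that raised an error, as shown below.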

x = data.iloc[5, 0].decode('iso-8859-1').encode('utf8')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-91-c80119246806> in <module>()
      1 print(data.iloc[5, 0])
----> 2 x = data.iloc[5, 0].decode('iso-8859-1').encode('utf8')
      3 if True:
      4   x = re.sub("[\n\t]", ' ', x)
      5   x = re.sub("\d+", ' ', x)

AttributeError: 'str' object has no attribute 'decode'

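Basically, how do I convert it to UTF-8 for the subsequent text processing? Or is there another way to get rid of those characters without converting at all? Thanks.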

【Comments】:

It is not latin1 but utf-8, because 'AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84'.encode('latin1').decode('utf-8') returns 'AKU\n\nKU 😄😄😄'. Try utf-8-sig when importing with pandas...
When I tried importing with the utf-8-sig encoding, it returned an error. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 1389: invalid continuation byte.
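
The round trip described in the first comment can be sketched as follows. This is a minimal illustration that assumes the column holds UTF-8 bytes that were read as Latin-1; the variable names are made up for the example. It also shows why the original call fails: in Python 3 a str has .encode() but no .decode().

```python
# UTF-8 bytes for "AKU\n\nKU " followed by three U+1F604 emoji,
# standing in for what the file contains on disk (an assumption).
raw = "AKU\n\nKU \U0001F604\U0001F604\U0001F604".encode("utf-8")

# Decoding as Latin-1 never fails, because every byte value maps to
# exactly one character. The emoji become 4 unrelated characters each.
misread = raw.decode("latin-1")

# A Python 3 str only has .encode(); calling .decode() on it raises
# AttributeError, which is the error in the question.
has_decode = hasattr(misread, "decode")

# Undo the wrong decode (back to the original bytes), then decode
# those bytes with the encoding they were actually written in.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)
```

Note that the UnicodeDecodeError quoted in the reply suggests the real file is not valid UTF-8 throughout, so this round trip may still fail on some rows.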

Tags: python encoding utf-8 text-processing iso-8859-1

【Solution 1】:

You can use

import codecs
print(codecs.decode(data.iloc[5, 0], 'unicode-escape').encode('latin1').decode('utf-8'))

See the online Python demo:

import codecs
text = r'AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84'
print(codecs.decode(text, 'unicode-escape').encode('latin1').decode('utf-8'))
# => AKU\n\nKU 😄😄😄
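
To make the one-liner easier to follow, here is the same chain split into its three steps, followed by one possible way to drop the emoji afterwards, which is what the question ultimately asks for. This is a sketch, not part of the original answer: strip_non_ascii is a hypothetical helper, and removing every non-ASCII character is an assumption about what "special characters" means here.

```python
import codecs
import re

# Sample value where the escapes are literal backslash sequences.
text = r'AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84'

# Step 1: interpret the backslash escapes, so r'\xf0' becomes the
# single character U+00F0 and r'\n' becomes a real newline.
unescaped = codecs.decode(text, 'unicode-escape')

# Step 2: map each character U+0000..U+00FF back to one byte.
raw = unescaped.encode('latin1')

# Step 3: decode those bytes as the UTF-8 they really are.
decoded = raw.decode('utf-8')


def strip_non_ascii(s):
    """Hypothetical cleanup: drop non-ASCII characters, collapse whitespace."""
    ascii_only = s.encode('ascii', errors='ignore').decode('ascii')
    return re.sub(r'\s+', ' ', ascii_only).strip()


cleaned = strip_non_ascii(decoded)
print(cleaned)
```

If the goal is only to remove the emoji, the cleanup step is the part that matters; the decoding steps just make sure each emoji is a single character rather than four stray ones.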

【Comments】: