How to convert a string from cp1251 to UTF-8 in Python3? (Answers)

【Question title】: How to convert a string from cp1251 to UTF-8 in Python3?
【Posted】: 2019-01-24 11:54:26
【Question】:

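I need help with a very simple Python 3.6 script.

First, it downloads an HTML file from an old-fashioned server that uses the cp1251 encoding.

Then I need to put the file contents into a UTF-8 encoded string.

Here is what I am doing: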

import requests
import codecs

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

#checking that it's in cp1251
print(ri.encoding)

#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')

#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')

print(text)

Here is the error:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte

Any help would be greatly appreciated.

【Comments】:

requests.get decodes everything for you. You don't have to do it manually.
@Tomalak Oh, really, it's that simple?!

Tags: python python-3.x utf-8 cp1251

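【Solution 1】:

You don't need to do any encoding or decoding.

"When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text."

So this will work: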

import requests

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

text = ri.text

print(text)

For non-text requests, you can also access the response body as bytes:

ri.content

See the requests documentation.
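
To make the difference between ri.content (bytes) and ri.text (str) concrete, here is a minimal sketch that skips the network entirely. The sample text and the names raw_cp1251 and utf8_bytes are made up for the example; raw_cp1251 merely stands in for what ri.content would hold for a cp1251 page.

```python
# Stand-in for ri.content: the bytes a cp1251 server would send (assumed sample text).
raw_cp1251 = "Привет, мир".encode("cp1251")

# Roughly what ri.text gives you: the bytes decoded into a Python str.
text = raw_cp1251.decode("cp1251")

# A str has no encoding. UTF-8 only matters when the str is turned back into
# bytes, for example to write it to a file or send it over a socket.
utf8_bytes = text.encode("utf-8")

print(text)
print(len(raw_cp1251), len(utf8_bytes))
```

So "converting cp1251 to UTF-8" is really two steps: decode the incoming bytes as cp1251, then encode the resulting str as UTF-8 only when bytes are needed again.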

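【Comments】:

【Solution 2】:

Not sure what you are trying to do.

.text is the text of the response, a Python string. Encodings play no role in Python strings.

Encodings only come into play when you have a stream of bytes that you want to convert to a string (or the other way around). The requests module already does this for you.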

import requests

ri = requests.get('http://old.moluch.ru/_python_test/0.html')
print(ri.text)

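For example, suppose you have a text file (that is, bytes). When you open() the file, you have to pick an encoding, and that choice determines how the bytes in the file are converted into characters. This manual step is necessary because open() has no way of knowing how the file's bytes were encoded.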
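
A minimal sketch of that open() step, assuming a legacy file saved as cp1251 that should be rewritten as UTF-8. The file names and the sample text are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

sample = "Пример текста"  # assumed sample content

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "page_cp1251.html")  # hypothetical file names
    dst = os.path.join(tmp, "page_utf8.html")

    # Create a cp1251 file to play the role of the legacy file.
    with open(src, "w", encoding="cp1251") as f:
        f.write(sample)

    # Reading: the encoding argument tells open() how to turn bytes into characters.
    with open(src, encoding="cp1251") as f:
        text = f.read()

    # Writing: pick UTF-8 for the new file.
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the result back as raw bytes to see what was actually stored.
    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(text == sample)
```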

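HTTP, on the other hand, sends this information in a response header (Content-Type), so requests can know it. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you were to use the lower-level urllib, you would have to do the decoding yourself.)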
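
For illustration, this is roughly what doing the decoding yourself could look like with a lower-level client. It is only a sketch: decode_body is a hypothetical helper, the header value and body are assumed samples, and this is not how requests is implemented internally.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode body using the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset)


# Assumed sample: bytes as a cp1251 server would send them, plus its header.
body = "Тест".encode("cp1251")
print(decode_body(body, "text/html; charset=windows-1251"))
```

Python accepts windows-1251 as an alias for the cp1251 codec, so the charset value from the header can be passed to decode() directly.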

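When you use the .text of the response, the .encoding attribute is mostly informational: it reports the encoding requests uses to decode the body (assigning to it overrides the guess). It can be relevant if you use the .raw attribute, though. For servers that return regular text responses, using .raw is rarely necessary.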

【Comments】:

Yes, now I get it. Thank you very much.
@Ildar I have expanded the answer a bit further.

【Solution 3】:

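You can simply ignore the errors by passing an errors argument to the decode function (note that any bytes that are not valid UTF-8 are then silently dropped):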

text = codecs.decode(text,'utf-8',errors='ignore')
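
Before relying on errors='ignore', it may help to see what it does to cp1251 data. The following is a small sketch with an assumed sample string: the Cyrillic bytes are not valid UTF-8, so they are discarded rather than converted.

```python
import codecs

original = "Привет, world"  # assumed sample text
cp1251_bytes = codecs.encode(original, "cp1251")

# Decoding cp1251 bytes as UTF-8 while ignoring errors drops the bytes
# that do not form valid UTF-8 sequences instead of raising an exception.
lossy = codecs.decode(cp1251_bytes, "utf-8", errors="ignore")

print(repr(lossy))
print(lossy == original)
```

The exception goes away, but so does the Cyrillic text, so this only suits cases where losing those characters is acceptable.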

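【Comments】:

【Solution 4】:

As others have already answered, requests.get gives you text that is already decoded. I'll address the error you are currently seeing.

This line:

text = codecs.encode(text,'cp1251')

encodes the text as cp1251, and you then try to decode it as utf-8, which raises the error here:

text = codecs.decode(text,'utf-8')

To detect the encoding you can use: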

import codecs
import chardet

# text is the str taken from ri.text
cp1251_bytes = codecs.encode(text, 'cp1251')
chardet.detect(cp1251_bytes)  # output {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}

# OR
utf8_bytes = codecs.encode(text, 'utf-8')
chardet.detect(utf8_bytes)  # output {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So encoding in one format and then decoding in another is what causes the error.
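
The mismatch can be reproduced without any network access. This is a minimal sketch with an assumed sample string; the names cp1251_bytes, mismatch_failed and round_trip are only for illustration.

```python
import codecs

text = "Кириллица"  # assumed sample text

# Encode the str into cp1251 bytes, as the script in the question does.
cp1251_bytes = codecs.encode(text, "cp1251")

# Decoding those bytes with a different codec fails for this sample.
try:
    codecs.decode(cp1251_bytes, "utf-8")
    mismatch_failed = False
except UnicodeDecodeError as exc:
    mismatch_failed = True
    print("decode failed:", exc.reason)

# Decoding with the same codec that produced the bytes gives the text back.
round_trip = codecs.decode(cp1251_bytes, "cp1251")
print(round_trip == text)
```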

【Comments】: