【Question title】: Python 'ascii' codec can't encode character with request.get
【Posted】: 2017-04-01 07:35:06
【Question description】:

I have a Python program that scrapes data from a site and returns JSON. The scraped site has the meta tag charset = ISO-8859-1. Here is the source code:

import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text  # text as decoded by requests

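After that I extract the information with Beautiful Soup and then build the JSON. The problem is that some symbols, e.g. the € sign, show up as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252'), so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).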

Edit

New code after @ChrisKoston's suggestion, using .content instead of .text:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work, but the character problem remains.

EDIT2

The solution is to use .content.decode('cp1252'):

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution.
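
A possible explanation of why cp1252 helps where ISO-8859-1 does not, sketched with hard-coded sample bytes instead of a real response (the byte string below is an assumption, not data from the site). Byte 0x80 maps to the control character U+0080 in ISO-8859-1 but to the euro sign in cp1252. The error quoted in the question most likely comes from Python 2, where calling .decode() on an already decoded unicode string first encodes it implicitly with the ASCII codec. Python 3 strings have no .decode(), so the sketch triggers that ASCII step explicitly.

```python
# Sample bytes standing in for a page body (an assumption for illustration).
raw = b'Preis: 5 \x80, sch\xf6n'

# The same bytes decoded with the two encodings discussed in the question.
as_latin1 = raw.decode('iso-8859-1')  # 0x80 becomes the control char U+0080
as_cp1252 = raw.decode('cp1252')      # 0x80 becomes the euro sign U+20AC

# ascii() shows an escaped view, so the difference is visible on any console.
print(ascii(as_latin1))
print(ascii(as_cp1252))

# Encoding non-ASCII text with the ASCII codec fails; this mirrors the
# implicit step Python 2 performs for unicode.decode().
try:
    as_cp1252.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)
```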

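【Question comments】:

Try using source_code.content instead of .text
@ChrisKoston Thanks! I can now decode and encode plain_text, but unfortunately it does not solve the character problem. I posted the new code above.
Hint: plain_text.decode('cp1252').encode('utf-8') does not change the value of plain_text.
@Tomalak Yes, you are right. I edited the source code again, but still nothing changed.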

Tags: python json encoding utf-8 ascii

【Solution 1】:

You have to actually store the result of decode() somewhere, because it does not modify the original variable.
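
A minimal sketch of that point, using a made-up byte string in place of response.content (an assumption): calling decode() without keeping its return value leaves the original bytes untouched.

```python
raw = b'5 \x80'              # stand-in for response.content (assumed sample)

raw.decode('cp1252')         # returns a new str, which is discarded here
still_bytes = raw            # raw itself is unchanged and still bytes

text = raw.decode('cp1252')  # keep the result by binding it to a name
```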

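Another thing:

decode() turns a list of bytes into a string.
encode() does the opposite: it turns a string into a list of bytes.

BeautifulSoup is happy with strings; you don't need to use encode() at all.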
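
The two directions can be sketched as follows, assuming Python 3, where the string type is str and the byte type is bytes. The sample text is made up for illustration.

```python
text = 'Preis: 5 \u20ac'           # a str that ends with the euro sign

# encode(): str -> bytes. The byte sequence depends on the chosen encoding.
as_cp1252 = text.encode('cp1252')  # euro sign is the single byte 0x80
as_utf8 = text.encode('utf-8')     # euro sign is the three bytes E2 82 AC

# decode(): bytes -> str. The encoding must match the one the bytes were
# written with, otherwise the characters come out wrong.
round_trip = as_cp1252.decode('cp1252')
```

An HTML parser that accepts str can be given round_trip directly, without re-encoding it first.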

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')

Hint: for working with HTML you might want to look at pyquery instead of BeautifulSoup.

【Comments】:

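Thanks for the quick help. I edited the source code, but when I run the program the € character is still \x80
\x80 is the character code for the euro sign. Don't go by the IDLE console; that is just how it chooses to display characters. Write the string to a file and look again.
This works for the title! Thanks a lot for that. The description still doesn't work. I'll post the code in the question.
Everything works now. I also had to replace .text with .content. Thanks a lot for your help!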
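
The suggestion above to write the string to a file instead of trusting the console echo could look like this. It is a sketch assuming Python 3; the sample text, the temporary directory and the file name out.txt are placeholders.

```python
import os
import tempfile

text = 'Preis: 5 \u20ac'   # sample text ending with the euro sign

# An escaped view of the string, similar to what a console may echo.
escaped = ascii(text)

# Write the text with an explicit encoding so the bytes on disk are known.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with open(path, 'w', encoding='utf-8') as fh:
    fh.write(text)

# Inspect the raw bytes, then read the file back as text.
with open(path, 'rb') as fh:
    on_disk = fh.read()
with open(path, encoding='utf-8') as fh:
    read_back = fh.read()
```

If read_back equals the original text, the string was correct all along and only its escaped display looked wrong.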