Why is Python able to parse Amazon but not Google/Reddit? Answers

【Question title】: Why is Python able to parse Amazon but not Google/Reddit?
【Posted】: 2017-04-23 18:08:35
【Question】:

I have searched for a while without finding an answer. Python seems able to handle some, but not all, web pages:

import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print soup.prettify()

Surprisingly, it can print the Amazon.com home page but not Reddit. The error I get is:

Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd7' in position 37769: character maps to <undefined>

My question: how do I write a program that can handle the encoding of any web page? What am I doing wrong?

Edit: further testing shows that google.com does not work either. The error message is similar:

Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9651: character maps to <undefined>
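
Both tracebacks end in cp437.py, the codec for the Windows console code page, which suggests the failure happens when `print` encodes the text for the console rather than inside requests or BeautifulSoup. The following is a minimal sketch of that idea, written to run on both Python 2.7 and Python 3 and assuming a cp437 console. `page_text` and `can_encode` are hypothetical stand-ins, not part of the original script:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical stand-in for soup.prettify(): text containing the two
# characters named in the tracebacks (U+00D7 and U+00A9).
page_text = u"4 \xd7 4 \xa9 example"


def can_encode(text, encoding):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(u"plain ASCII page", "cp437"))  # True
print(can_encode(page_text, "cp437"))            # False
print(can_encode(page_text, "utf-8"))            # True
```

Under this reading, a page prints successfully only when all of its characters happen to exist in the console code page, which would explain why one site works and another does not.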

Edit 2: I tried decoding res.text as utf-8, but got this error:

Traceback (most recent call last):
  File "testweb.py", line 5, in <module>
    soup = bs4.BeautifulSoup(res.text.decode('utf-8'), 'html.parser')
  File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9358: ordinal not in range(128)
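
A decode call raising UnicodeEncodeError looks odd, but it is consistent with Python 2 behavior: `res.text` is already a unicode string, and calling `.decode()` on a unicode string first encodes it implicitly with the ASCII codec. The sketch below makes that hidden step explicit so it also runs on Python 3; `text` and `ascii_encode_error` are hypothetical names used only for illustration:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical decoded page text containing non-ASCII characters.
text = u"caf\xe9\xa0menu"


def ascii_encode_error(value):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# The explicit form of the implicit step Python 2 performs before decoding.
err = ascii_encode_error(text)
print(type(err).__name__, err.start)  # UnicodeEncodeError 3

# Decoding applies to bytes; text round-trips through an explicit encode.
raw = text.encode("utf-8")
print(raw.decode("utf-8") == text)  # True
```

In requests, the raw bytes are available as `res.content`, so that is the value to decode if manual decoding is wanted at all.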

Edit 3: I tried encoding res.text as utf-8, but got this error:

Traceback (most recent call last):
  File "testweb.py", line 8, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9622: character maps to <undefined>

【Comments】:

You could try decoding res.text as utf-8: res.text.decode('utf-8')
Just tried it, still getting an error :(. Edited the post.

Tags: python python-2.7 web-scraping beautifulsoup python-requests

【Solution 1】:

Change the output encoding to utf-8 so that the text is written out as utf-8, and try encoding the response text rather than decoding it.

Example:

# -*- coding: utf-8 -*-

import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text.encode('utf-8'), 'html.parser')
print (soup.prettify())

Try encoding directly in prettify:

print (soup.prettify('latin-1')) or print (soup.prettify('utf-8'))
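
Passing an encoding to `prettify` returns a byte string, so `print` no longer has to encode the text for the console itself. An alternative, sketched below under the assumption that losing a few characters on screen is acceptable, is to replace whatever the console cannot show and to save the full text to a UTF-8 file. `console_safe`, `save_utf8`, and `html` are hypothetical helpers and data, not part of bs4 or requests:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import sys


def console_safe(text, encoding=None):
    """Return text with characters missing from the codec replaced by '?'."""
    # Fall back to ASCII when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


def save_utf8(text, path):
    """Write text to path as UTF-8, which can represent every character."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical stand-in for soup.prettify().
html = u"4 \xd7 4 \xa9 2017"
print(console_safe(html, "cp437"))  # 4 ? 4 ? 2017
```

In the original script this would be used as `print(console_safe(soup.prettify()))`, with `save_utf8(soup.prettify(), "page.html")` for a lossless copy; the file name is only an example.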

【Comments】:

Still doesn't work. I've updated the post with this approach. Thanks.
Check the update to see whether it helps. @KendrickTV