【Posted】: 2017-04-23 18:08:35
【Question】:
I searched for a while without finding an answer. Python seems able to handle some web pages, but not all:
import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print soup.prettify()
Surprisingly, it can print the Amazon.com home page but not Reddit. The error I get is:
Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd7' in position 37769: character maps to <undefined>
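My question: how do I write a program that can handle the encoding of any web page? What am I doing wrong?
Edit: further testing shows that google.com does not work either. Here is a similar error message: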
Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9651: character maps to <undefined>
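Both tracebacks end inside cp437.py, which suggests that the failure happens when print encodes the text for the Windows console rather than when the page is downloaded or parsed. The sketch below reproduces that step without any network access. The helper name encode_for_console is hypothetical (it is not part of requests or bs4), and cp437 is assumed only because it is the code page named in the tracebacks.

```python
# Hypothetical helper for illustration; not part of requests or bs4.
def encode_for_console(text, encoding="cp437"):
    """Encode text for a console code page, substituting '?' for
    characters that the code page cannot represent."""
    return text.encode(encoding, "replace")


# Strict encoding reproduces the failure from the tracebacks:
# neither U+00D7 nor U+00A9 exists in code page 437.
for char in (u"\xd7", u"\xa9"):
    try:
        char.encode("cp437")
    except UnicodeEncodeError as exc:
        print("cannot encode %r: %s" % (char, exc.reason))

# The lenient version degrades the character instead of raising.
print(encode_for_console(u"3 \xd7 4"))
```

Pages that happen to contain only characters present in the console code page would print without error, which may be why some sites appear to work.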
Edit 2: I tried decoding res.text as utf-8, but got this error:
Traceback (most recent call last):
  File "testweb.py", line 5, in <module>
    soup = bs4.BeautifulSoup(res.text.decode('utf-8'), 'html.parser')
  File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9358: ordinal not in range(128)
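The odd part of this traceback is that a decode call raises a UnicodeEncodeError from the 'ascii' codec. In Python 2, res.text is already a unicode object, and calling .decode on a unicode object first encodes it implicitly with ASCII. The sketch below makes that hidden step explicit; is_ascii_encodable is a hypothetical helper name and the sample string is made up for illustration.

```python
# Hypothetical helper: checks whether the implicit ASCII encode that
# Python 2 performs before unicode.decode() would succeed.
def is_ascii_encodable(text):
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Made-up sample containing U+00A0 (no-break space), the character
# reported in the traceback above.
page_text = u"price:\xa0$5"

print(is_ascii_encodable(u"price: $5"))  # True
print(is_ascii_encodable(page_text))     # False
```

Under that reading, the text was never undecoded bytes, so decoding it again cannot help.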
Edit 3: I tried encoding res.text as utf-8, but got this error:
Traceback (most recent call last):
  File "testweb.py", line 8, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9622: character maps to <undefined>
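Since the error again comes from the console encoding step, changing the input to BeautifulSoup would not be expected to help. Two possible workarounds are sketched here, assuming the goal is simply to inspect the prettified HTML: replace characters the console cannot show, or bypass the console and write the text to a UTF-8 file. The function names to_console_safe and save_as_utf8 and the file name used in the usage note are hypothetical.

```python
import io
import sys


def to_console_safe(text, encoding=None):
    """Return text with every character the console encoding cannot
    represent replaced by '?'. Falls back to ASCII if the stream
    reports no encoding (for example when output is piped)."""
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, "replace").decode(enc)


def save_as_utf8(text, path):
    """Write text to a file as UTF-8, which can represent any character
    a web page may contain."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


print(to_console_safe(u"\xa9 2017 reddit", "cp437"))
```

In the script from the question, this would be used as print(to_console_safe(soup.prettify())) or save_as_utf8(soup.prettify(), "page.html"). The first loses the unsupported characters on screen; the second keeps them intact in the file.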
【Discussion】:
- You could try decoding res.text as utf-8: res.text.decode('utf-8')
- Just tried it, still getting an error :(. Edited the post.
Tags: python python-2.7 web-scraping beautifulsoup python-requests