【Question title】: Why is Python able to parse Amazon but not Google/Reddit?
【Posted】: 2017-04-23 18:08:35
【Question】:

I searched for a while without success. Python seems able to handle some web pages, but not all:

import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print soup.prettify()

Surprisingly, it can print the Amazon.com homepage but not Reddit. The error I get is:

Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd7' in position 37769: character maps to <undefined>

My question: how can I write a program that can encode any web page? Where did I go wrong?

Edit: further testing shows that google.com does not work either. Here is a similar error message:

Traceback (most recent call last):
  File "testweb.py", line 7, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9651: character maps to <undefined>

Edit 2: I tried decoding res.text as utf-8, but got this error:

Traceback (most recent call last):
  File "testweb.py", line 5, in <module>
    soup = bs4.BeautifulSoup(res.text.decode('utf-8'), 'html.parser')
  File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9358: ordinal not in range(128)

Edit 3: I tried encoding res.text as utf-8, but got this error:

Traceback (most recent call last):
  File "testweb.py", line 8, in <module>
    print soup.prettify()
  File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9622: character maps to <undefined>

【Comments】:

  • You could try decoding res.text as utf-8: res.text.decode('utf-8')
  • Just tried it, still getting an error :(. Edited the post.

Tags: python python-2.7 web-scraping beautifulsoup python-requests


【Solution 1】:

Change the output encoding to UTF-8 so that UTF-8 encoded text is written out, and try encoding the response text instead of decoding it.

Example:

# -*- coding: utf-8 -*-

import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text.encode('utf-8'), 'html.parser')
print (soup.prettify())
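
Some background on why the decode attempt in Edit 2 raised an encode error: in Python 2, res.text is already a unicode object, so calling .decode('utf-8') on it first encodes it implicitly with the ASCII codec, and that step fails. The sketch below is only an illustration, written so that it also runs on Python 3. The raw value is a made-up payload standing in for res.content, not real page data.

```python
# -*- coding: utf-8 -*-
# Hypothetical payload standing in for res.content (raw bytes from a server).
raw = u'<p>\xa9 2017 \xd7</p>'.encode('utf-8')

# Decoding turns bytes into text; requests decodes res.content in a
# similar way to produce res.text.
text = raw.decode('utf-8')

# Encoding goes the other way and fails when the target codec has no
# mapping for a character. cp437 is the console codec in the tracebacks.
try:
    text.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
```

The parsing step is not what fails here. The error appears only when the text is encoded for a console whose code page lacks characters such as u'\xa9' and u'\xd7'.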

Try encoding directly in prettify:

print (soup.prettify('latin-1'))
print (soup.prettify('utf-8'))
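
If the console code page cannot be changed, one possible workaround is to replace the characters it cannot represent before printing. This is a minimal sketch and not part of the original answer. console_safe is a hypothetical helper name, and the sample string is invented for illustration.

```python
# -*- coding: utf-8 -*-
import sys

def console_safe(text, encoding=None):
    """Hypothetical helper: return text with characters that the given
    (or detected) console encoding cannot represent replaced by '?'."""
    # Fall back to UTF-8 when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # Round-trip through the console codec, substituting unmappable characters.
    return text.encode(encoding, 'replace').decode(encoding)

# Invented sample containing the two characters named in the tracebacks.
sample = u'\xa9 2017 reddit \xd7'
print(console_safe(sample, 'cp437'))
```

With the code from the example, this would be used as print(console_safe(soup.prettify())). The trade-off is that unmappable characters are lost. To keep every character, writing the output to a file opened with io.open(path, 'w', encoding='utf-8') is an alternative, where path is a placeholder for a file name of your choice.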

【Comments】:

  • Still doesn't work. Updated the post with this method. Thanks.
  • Check the update to see whether it helps. @KendrickTV