【Question title】: Why are requests and urllib2 missing some text from webpages?
【Posted】: 2016-04-25 09:12:11
【Question】:

The following code fetches the contents of a web page:

from BeautifulSoup import BeautifulSoup
import requests
import urllib2

url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'

source_code = requests.get(url)
plain_text = source_code.text
print plain_text

site = urllib2.urlopen(url).read()
print site

The output from both libraries includes:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>

Unfortunately, this differs from what the actual web page shows:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;">4-5ft</div>

The 4-5ft text is missing from the fetched HTML, so BeautifulSoup cannot extract it.
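To make the symptom concrete, here is a minimal sketch (Python 3, standard library only, rather than the BeautifulSoup import used above) that parses the two snippets quoted in the question. The `DivTextParser` class and `text_of` helper are hypothetical names introduced for illustration, and the `style` attribute is shortened. It only shows that the parser is not at fault: the text is simply absent from the HTML the server returned.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect the text inside elements that carry a given id.

    Sketch only: assumes the target element contains no unclosed
    void tags such as <br>, which would throw off the depth count.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []    # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, target_id):
    """Return the stripped text inside the element with id == target_id."""
    parser = DivTextParser(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# What requests/urllib2 receive versus what the browser shows after scripts run.
fetched = '<div id="current-surf-range" style="font-size:21px;"></div>'
rendered = '<div id="current-surf-range" style="font-size:21px;">4-5ft</div>'

print(repr(text_of(fetched, "current-surf-range")))   # ''
print(repr(text_of(rendered, "current-surf-range")))  # '4-5ft'
```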

【Comments】:

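The data is probably loaded asynchronously after the HTTP/1.1 200 response has been sent back. PS: scraping data from a website is not always legal; check the license of the published data, or look for a REST service that provides similar data.
requests and urllib2 never execute JavaScript. But I can show you a solution using selenium.
@GeorgePetrov: Please do.
@boogie_bullfrog How is it going?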

Tags: python html web-scraping python-requests urllib2

【Solution 1】:

Install selenium; full instructions are in the docs.

pip3 install selenium

Download a driver. I prefer the chrome driver, but if you have firefox installed, the code below should work as is.

from selenium import webdriver

url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'
web = webdriver.Firefox()
# web = webdriver.Remote('http://localhost:9515', desired_capabilities=DesiredCapabilities.CHROME)

# get() navigates the browser and returns None, so read the HTML from the driver itself;
# calling .page_source on get()'s return value raises AttributeError.
web.get(url)
# Sometimes the page takes time to load; if so: from time import sleep; sleep(2)
plain_text = web.page_source
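As an alternative to a fixed sleep, the following is a hedged sketch that waits until the element from the question actually contains text. It assumes Selenium 4 with Firefox and geckodriver available, and that the page still exposes an element with id `current-surf-range` (the site may have changed since 2016). `element_text` and `fetch_surf_range` are hypothetical helper names, and running headless is an optional choice, not part of the original answer.

```python
def element_text(driver, by, value):
    """Return the element's stripped text ('' while it is still empty)."""
    return driver.find_element(by, value).text.strip()


def fetch_surf_range(url, element_id="current-surf-range", timeout=10):
    """Load url in headless Firefox and return the element's text once it is filled in."""
    # Imported here so element_text() stays usable without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # assumption: no visible browser window is wanted
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)  # returns None; the loaded page is accessed through the driver
        # until() polls the callable until it returns a truthy value (non-empty text here)
        # and raises TimeoutException if that does not happen within `timeout` seconds.
        return WebDriverWait(driver, timeout).until(
            lambda d: element_text(d, By.ID, element_id)
        )
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_surf_range(url)` with the URL above would be expected to return the rendered text (such as the 4-5ft value in the question) if the page fills the element in via JavaScript. When scraping many pages, creating the driver once and reusing it across URLs, instead of starting a browser per call as this sketch does, would likely reduce the overhead.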

【Discussion】:

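I used web = webdriver.Chrome() instead. Unfortunately, I got the error AttributeError: 'NoneType' object has no attribute 'page_source', regardless of the sleep length I tried. Besides, opening a browser page and waiting for it to load seems impractical when scraping many pages. Similar question here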