Prevent CSS/other resource download in PhantomJS/Selenium driven by Python: answers

【Question title】: Prevent CSS/other resource download in PhantomJS/Selenium driven by Python
【Posted】: 2013-10-06 14:27:45
【Question】:

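I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need are the src and alt attributes of the img tags. I found this code: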

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

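via: How can I control PhantomJS to skip download some kind of resource?

How and where can I implement this code in Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?

Note: I have already found how to prevent image downloads by editing the service_args variable: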

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?

But service_args can't help me with resources like CSS. Thanks!
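
For reference, the image-blocking approach mentioned in the note can be sketched as follows. This is a minimal sketch, not the code from the linked questions: the helper name `phantomjs_service_args` is hypothetical, and it assumes the PhantomJS command-line switches `--load-images` and `--proxy`, passed through the `service_args` parameter of `webdriver.PhantomJS` (available only in older Selenium releases).

```python
def phantomjs_service_args(load_images=False, proxy=None):
    """Build a PhantomJS command-line argument list (hypothetical helper)."""
    # --load-images=false tells PhantomJS not to fetch inline images.
    args = ['--load-images=' + ('true' if load_images else 'false')]
    if proxy:
        # Optional proxy in host:port form, for example '127.0.0.1:8080'.
        args.append('--proxy=' + proxy)
    return args


# Usage (requires an older Selenium release and a PhantomJS binary):
# from selenium import webdriver
# driver = webdriver.PhantomJS(service_args=phantomjs_service_args())
```

As the question notes, these switches only cover images; PhantomJS has no comparable switch for CSS.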

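【Comments】:

If you only need the HTML and want to select elements from the page, is Selenium/PhantomJS the best choice? Have you considered using python-requests?
@brechin, that's a good idea, thanks! Unfortunately, I don't think python-requests can get content injected by JavaScript. For example, see the main image on this page: everlane.com/collections/mens-luxury-tees/products/…. Everything inside <div id="content" class="clearfix"> is injected by backbone.js, and in the python-requests output I just get an empty div with a comment... Am I missing something?
I would look at the requests the page makes and fetch everlane.com/api/collections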
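
Whichever way the HTML is obtained (python-requests for static pages, or `driver.page_source` after PhantomJS has rendered the page), pulling out the img src and alt attributes needs only the standard library. The following is a minimal sketch; the names `ImgCollector` and `extract_images` are hypothetical and do not come from the thread.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; missing attributes yield None.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def extract_images(html):
    """Return a list of (src, alt) tuples found in the given HTML string."""
    collector = ImgCollector()
    collector.feed(html)
    return collector.images


# Usage with Selenium (assumes an existing driver):
# images = extract_images(driver.page_source)
```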

Tags: python selenium web-scraping phantomjs headless-browser

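【Answer 1】:

A bold young soul named "watsonmw" recently added functionality to Ghostdriver (which Phantom.js uses to interface with Selenium) that allows access to Phantom.js API calls which require a page object, like the onResourceRequested one you cited.

For a solution at all costs, consider building from source (the developers note that it "takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine") and integrating the patch linked above.

Then this (untested) Python code should work as a proof of concept: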

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

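Until then, you'll just get a Can't find variable: page exception.

Good luck! There are plenty of great alternatives, such as working in a JavaScript environment, driving Gecko, using a proxy, and so on.

【Comments】:

That patch appears to be in Ghostdriver 1.1.0 already, but when I launch it (with phantomjs /path/to/ghostdriver/1.1.0/src/main.js) and connect to it (with driver = webdriver.PhantomJS(port=8910)), I still get Can't find variable: page.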

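【Answer 2】:

Will's answer put me on the right track. (Thanks, Will!)

The current PhantomJS (1.9.8) includes Ghostdriver 1.1.0, which already contains watsonmw's patch.

You need to download the latest PhantomJS and run the following (sudo may be required):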

ln -s path/to/bin/phantomjs  /usr/local/share/phantomjs
ln -s path/to/bin/phantomjs  /usr/local/bin/phantomjs
ln -s path/to/bin/phantomjs  /usr/bin/phantomjs
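
To check which binary Selenium will pick up once the symlinks are in place, a small lookup like the one below may help. It is a sketch under assumptions: the helper name `resolve_phantomjs` is hypothetical, and it relies on `executable_path` being the first parameter of `webdriver.PhantomJS` in older Selenium releases, so the path can also be passed explicitly instead of relying on symlinks.

```python
import shutil


def resolve_phantomjs(name="phantomjs"):
    """Return the absolute path of the binary found on PATH, or None."""
    return shutil.which(name)


# Usage (requires an older Selenium release that still ships PhantomJS support):
# from selenium import webdriver
# path = resolve_phantomjs()
# if path is None:
#     raise RuntimeError("phantomjs not found on PATH")
# driver = webdriver.PhantomJS(executable_path=path)
```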

Then try this:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    var page = this; // won't work otherwise
    page.onResourceRequested = function(requestData, request) {
        // ...
    };
''', 'args': []})
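
The callback body is elided above. One way to fill it in for the original question (aborting CSS requests) is sketched below. The helper names `build_block_script` and `install_blocker` are hypothetical, the extension-based URL match is an assumption rather than the asker's original regex, and the whole sketch is untested against a live PhantomJS; it also assumes a Selenium release that still ships `webdriver.PhantomJS`.

```python
import json
import re


def build_block_script(extensions):
    """Return PhantomJS-side JavaScript that aborts matching requests.

    A request is aborted when its URL ends in one of the given extensions,
    optionally followed by a query string or fragment.
    """
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    pattern = r"\.(?:" + alternatives + r")(?:[?#].*)?$"
    # json.dumps produces a valid JavaScript string literal for the pattern.
    return (
        "var page = this;\n"
        "var blocked = new RegExp(" + json.dumps(pattern) + ", 'i');\n"
        "page.onResourceRequested = function(requestData, request) {\n"
        "    if (blocked.test(requestData.url)) {\n"
        "        request.abort();\n"
        "    }\n"
        "};\n"
    )


def install_blocker(driver, extensions=("css",)):
    """Register the GhostDriver endpoint and install the blocking callback."""
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    driver.execute('executePhantomScript',
                   {'script': build_block_script(extensions), 'args': []})


# Usage (requires PhantomJS and an older Selenium release):
# from selenium import webdriver
# driver = webdriver.PhantomJS('phantomjs')
# install_blocker(driver, extensions=("css", "woff"))
# driver.get("http://example.com/")
```

Matching on the URL extension is a simplification: stylesheets served from URLs without a .css suffix would still be downloaded.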

【Comments】:

【Answer 3】:

The suggested solutions didn't work for me, but this one does (it uses driver.execute_script):

driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute_script('''
    this.onResourceRequested = function(request, net) {
        console.log('REQUEST ' + request.url);
    };
''')

【Comments】: