Nutch Selenium cannot connect to Firefox

【Question title】: Nutch Selenium cannot connect to Firefox
【Posted】: 2019-08-19 17:50:25
【Question】:

I am trying to crawl dynamic pages with Apache Nutch 1.15. When I run the crawler, I see the following exception:

failed with: java.lang.RuntimeException:org.openqa.selenium.WebDriverException: Failed to connect to binary FirefoxBinary(/usr/bin/firefox) on port 7055;

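I have tried various versions of the Selenium standalone server (3.14, 3.5, 3.1, 2.39) and of Firefox (68, 58, 57, ...), but I always get this error.

I have searched SO and other sites and found similar questions, but no solution yet.

Any ideas? Is it simply a version incompatibility between Selenium and Firefox?

The exception also mentions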

Build info: version: '2.48.2'

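but I am not sure what this refers to. Selenium?

Update

So I tried the master branch and built it successfully, but during the crawl I saw a NullPointerException in hadoop.log.

So I went back to Nutch 1.15 (since others online report that Selenium works with this version) and downloaded Firefox 31.4.0 and Selenium 2.48.2.

I am following the instructions at https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium to set up Xvfb and export the display, and then I start Selenium with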

java -jar selenium-server-standalone-2.48.2.jar

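in a separate terminal window (not sure whether that makes a difference). I also tried using the same terminal and starting Selenium as a background process. I also made sure that all the selenium properties are set correctly as described at the link above (in both $NUTCH_HOME/conf and $NUTCH_HOME/runtime/local/conf).

I have also tried starting Selenium with chromedriver (-D flag), but I always see the same Firefox error, even after changing the driver in the selenium properties in nutch-site.xml (I have replaced the real URL below with a test one):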

    FetcherThread 41 fetch of https://test failed with: java.lang.RuntimeException: org.openqa.selenium.WebDriverException: Failed to connect to binary FirefoxBinary(/usr/bin/firefox) on port 7055; process output follows:
null

    Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
    System info: host: 'ip-172-30-0-146', ip: '172.30.0.146', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-157-generic', java.version: '1.8.0_222'
    Driver info: driver.version: FirefoxDriver

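I am running out of ideas now and am just trying different versions of firefox/geckodriver/selenium to see whether it makes a difference, or whether the error even changes...

Update 2

Using the parsechecker command given in the comments below with the nutch-master branch, I managed to parse some dynamically generated pages successfully.

But when I try to crawl with the following command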

runtime/local/bin/crawl -i -D plugin.includes='protocol-selenium|parse-tika' -D selenium.grid.binary=/usr/bin/chromedriver -D selenium.enable.headless=true -s ~/seed ~/sites/test -1

I get

    java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally
 (unknown error: DevToolsActivePort file doesn't exist)
      (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

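Following this link, I checked that the chromedriver and google-chrome versions match.

I had followed the steps here but ran into the Firefox problem above, so I switched to Chrome, which works with parsechecker but returns an error when crawling. Do I need to start Chrome itself as a background process? I tried that too, but the error persists. I also checked that the google-chrome binary being launched is the same one I am starting myself.

Could you verify whether the following steps are correct, all run from the same terminal: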

/usr/bin/Xvfb :11 -screen 0 1024x768x24 &
export DISPLAY=:11
Start Selenium (java -jar selenium-server-standalone-3.14.0.jar &)
Start Chrome (/etc/alternatives/google-chrome --headless --no-sandbox &) (not sure whether this is needed...)
Start the crawl from nutch-master: runtime/local/bin/crawl -i -D plugin.includes='protocol-selenium|parse-tika' -D selenium.grid.binary=/usr/bin/chromedriver -D selenium.enable.headless=true -s ~/seed ~/sites/test -1

Many thanks

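【Comments】:

Until 1.16 is available, you may want to use a recent Nutch master. The Selenium-based plugins were upgraded recently.
Thanks @SebastianNagel, I will try that.
See github.com/apache/nutch/tree/master/src/plugin/… (README.md) for detailed instructions on how to set everything up.
Many thanks @SebastianNagel - I followed that guide. Please see the update above. It feels like I always see the same error whenever I change the Firefox version, the Selenium version or the geckodriver version. I have added more details above. Many thanks
A test you could use, for example: .../runtime/local/bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-html' -Dselenium.grid.binary=.../selenium-installation/chromedriver -Dselenium.enable.headless=true -followRedirects -dumpText https://nutch.apache.org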

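Tags: selenium selenium-webdriver nutch

【Solution 1】:

So I managed to get this working - crawling dynamic pages and using AWS CloudSearch as the indexer. Many thanks to @SebastianNagel for the help in the comments. I am not sure that all of the steps/configuration settings below are necessary, but they may help others who run into a similar problem.

I used the following versions and steps:

Versions: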

ChromeDriver 76.0.3809.68
Google Chrome 76.0.3809.100
Selenium 2.53.0
Nutch master branch (github link)
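
Since the "Chrome failed to start" error in the question prompted a check that chromedriver and google-chrome match, here is a minimal sketch of such a check. It assumes that comparing only the major version is enough, and the helper names major_version and versions_match are made up for illustration. The banners are hard-coded from the list above; on a real machine they would come from `chromedriver --version` and `google-chrome --version`.

```shell
#!/bin/sh
# Extract the major version (the digits before the first dot) from a
# version banner such as "ChromeDriver 76.0.3809.68".
major_version() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\)\..*$/\1/p'
}

# Return 0 when both banners carry the same, non-empty major version.
# Assumption: matching major versions is a sufficient compatibility check.
versions_match() {
    left=$(major_version "$1")
    right=$(major_version "$2")
    [ -n "$left" ] && [ "$left" = "$right" ]
}

# Banners copied from the version list in this answer.
driver_banner='ChromeDriver 76.0.3809.68'
chrome_banner='Google Chrome 76.0.3809.100'

if versions_match "$driver_banner" "$chrome_banner"; then
    echo "major versions match: $(major_version "$driver_banner")"
else
    echo "major version mismatch" >&2
fi
```

With the two versions listed above, both banners yield 76, so the check passes.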

Setup:

cd to the nutch-master root directory.

See section 2 of the following link for the Selenium setup.

In conf/nutch-site.xml (I set this in both conf and runtime/local/conf), set:

selenium.driver to chrome
selenium.grid.binary to the path of chromedriver (/usr/bin/chromedriver in my case)
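
To try these two settings before editing nutch-site.xml, the same properties can probably be passed as -D overrides to parsechecker, in the style of the command from the comments. This is only a sketch: run_parsechecker is a made-up helper name, the default paths are assumptions (nutch-master root as the working directory, chromedriver in /usr/bin), and when no Nutch build is found it just prints the command instead of running it.

```shell
#!/bin/sh
# Assumption: run from the nutch-master root; both paths can be overridden.
NUTCH_BIN=${NUTCH_BIN:-runtime/local/bin/nutch}
CHROMEDRIVER=${CHROMEDRIVER:-/usr/bin/chromedriver}

# run_parsechecker is a hypothetical helper, not part of Nutch.
# It fetches one URL through protocol-selenium with Chrome selected.
run_parsechecker() {
    url=$1
    set -- parsechecker \
        "-Dplugin.includes=protocol-selenium|parse-html" \
        "-Dselenium.driver=chrome" \
        "-Dselenium.grid.binary=$CHROMEDRIVER" \
        "-Dselenium.enable.headless=true" \
        -followRedirects -dumpText "$url"
    if [ -x "$NUTCH_BIN" ]; then
        "$NUTCH_BIN" "$@"
    else
        # No Nutch build found here, so only show what would be run.
        echo "dry run: $NUTCH_BIN $*"
    fi
}

run_parsechecker https://nutch.apache.org
```

If the page text comes back with the dynamic content, the driver settings are likely fine and any remaining problem is in the crawl setup.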

For the user running the crawl, make sure the following Nutch environment settings are set:

NUTCH_HOME=/nutch-master
NUTCH_CONF_DIR=/nutch-master/conf
Make sure /nutch-master/src/bin is on your PATH

Crawl steps
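
In shell terms, the environment settings above could look roughly like the following (for example in that user's ~/.profile). The /nutch-master location is the one from this answer; adjust it to wherever the checkout actually lives.

```shell
#!/bin/sh
# Location of the Nutch checkout, as used in this answer.
export NUTCH_HOME=/nutch-master
export NUTCH_CONF_DIR="$NUTCH_HOME/conf"

# Prepend src/bin to PATH only if it is not already there.
case ":$PATH:" in
    *":$NUTCH_HOME/src/bin:"*) ;;
    *) PATH="$NUTCH_HOME/src/bin:$PATH"; export PATH ;;
esac
```

The case guard only keeps PATH from growing when the file is sourced more than once.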

cd /nutch-master
/usr/bin/Xvfb :11 -screen 0 1024x768x24 &
export DISPLAY=:11
java -jar /selenium-server-standalone-2.53.0.jar -debug &
runtime/local/bin/crawl -i -D selenium.enable.headless=true -s ~/seed ~/test_dir -1

After this I could see the dynamic content in AWS CloudSearch, and I could also query the segments in test_dir to see the dynamic content:

<DIR_TO_NUTCH>/nutch-master/runtime/local/bin/nutch readseg -get /test_dir/segments/20190816092351/ "https://<URL_TO_DYNAMIC_PAGE>"

【Comments】:

Thanks! It would be great if you could provide a PR to update the protocol-selenium README!