Getting No Urls to Fetch error on Nutch, even though there are Urls to fetch

【Question title】: Getting No Urls to Fetch error on Nutch, even though there are Urls to fetch
【Posted】: 2013-07-03 21:08:55
【Question】:

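I'm still getting used to Nutch. I managed a test crawl of nutch.apache.org using bin/nutch crawl urls -dir crawl -depth 6 -topN 10, and indexed it into Solr using bin/nutch crawl urls -solr http://<domain>:<port>/solr/core1/ -depth 4 -topN 7

Leaving aside that it times out on my own site, I can't get it to crawl again, or to crawl any other site (e.g. wiki.apache.org). I have deleted all the crawl directories in the nutch home directory, but I still get the following error (saying there are no more URLs to crawl):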

<user>@<domain>:/usr/share/nutch$ sudo sh nutch-test.sh
solrUrl is not set, indexing will be skipped...
crawl started in: crawl 
rootUrlDir = urls
threads = 10
depth = 6
solrUrl=null
topN = 10
Injector: starting at 2013-07-03 15:56:47
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-03 15:56:50, elapsed: 00:00:03
Generator: starting at 2013-07-03 15:56:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

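My urls/seed.txt file contains http://nutch.apache.org/.

My regex-urlfilter.txt contains +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*.

I have also increased -depth and -topN to index more content, but it always gives this error after the first crawl. How do I reset it so that it crawls again? Is there a URL cache somewhere in Nutch that needs to be cleared?

Update: the problem with our own site seems to be that I wasn't using www; the domain does not resolve without it. Pinging www.ourdomain.org does resolve.

I have put the www form into the necessary files, but the problem persists. The line Injector: total number of urls rejected by filters: 1 looks like the problem in every case, yet it did not appear on the first crawl. Which filter is rejecting the URL, and why, when it should not be rejected?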

【Comments】:

Tags: solr web-crawler nutch

【Answer 1】:

This is embarrassing. But the old nutch-not-crawling-because-it's-dismissing-urls adage, 'check your *-urlfilter.txt file', applies here.

In my case, I had an extra / in the URL regex:

+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*

It should have been +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
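
To see why the extra slash rejects the seed, here is a minimal Python sketch that tests both patterns against the seed URL from the question. It is only an approximation: Nutch evaluates these rules with Java regular expressions, and the sketch assumes an unanchored search (like Java's Matcher.find()) with the leading + stripped. The names BROKEN, FIXED, SEED and accepts are made up for this illustration.

```python
import re

# Patterns copied from the answer, without the leading '+' that marks an accept rule.
BROKEN = r"^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*"
FIXED = r"^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*"

# Seed URL from urls/seed.txt in the question.
SEED = "http://nutch.apache.org/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL.

    Assumption: an unanchored search approximates how the rule is applied.
    """
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    # The broken rule needs "nutch.apache.org//", which the seed does not contain.
    print("broken rule accepts seed:", accepts(BROKEN, SEED))
    print("fixed rule accepts seed: ", accepts(FIXED, SEED))
```

With no accept rule matching, the seed is filtered out at injection, which is consistent with the "rejected by filters: 1" and "injected ... 0" lines in the log.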

【Discussion】:

Ahem, I had a space before the + that I couldn't see. Then I found your answer, checked the file again, and now it works.
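
An invisible leading space like the one in this comment is easy to miss by eye. A small Python sketch such as the following can flag it. It assumes that every rule line should start with +, - or #; I believe Nutch's rule reader skips a line that starts with a space, but treat that as an assumption. The helper suspicious_rule_lines and the SAMPLE text are hypothetical and not part of Nutch.

```python
from typing import List, Tuple


def suspicious_rule_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for non-blank lines not starting with '+', '-' or '#'."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            continue  # blank lines are harmless
        if line[0] not in "+-#":
            flagged.append((number, line))
    return flagged


# Hypothetical filter file contents; line 2 has a stray leading space.
SAMPLE = "# accept the seed host\n +^http://nutch.apache.org/\n-.\n"

if __name__ == "__main__":
    for number, line in suspicious_rule_lines(SAMPLE):
        print(f"line {number}: {line!r}")
```

To check a real file, read its contents (for example conf/regex-urlfilter.txt, if that is where your install keeps it) and pass the text to the function.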