【Question title】: Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"
【Posted】: 2017-05-18 15:40:44
【Question】:

I followed a tutorial on web crawling with Nutch, using cygwin, Tomcat, Nutch 1.4, and Solr 3.4. I was able to crawl a URL once, but somehow it no longer works, no matter which URL I try. My regex-urlfilter.txt in runtime/local/conf is as follows:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
 +^http://([a-z0-9]*\.)*nutch.apache.org/

The only URL in my seed.txt, located in runtime/local/bin/urls, is http://nutch.apache.org/

To crawl, I use the command

$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3

The console output is:

cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3

I know there are some similar questions, but most of them were never resolved. Can anyone help?

Thank you very much in advance!

【Comments】:

    Tags: apache solr lucene web-crawler nutch


    【Answer 1】:

    Why are you using such an old version of Nutch? In any case, the problem you are facing is the space at the beginning of this line:

     _+^http://([a-z0-9]*\.)*nutch.apache.org/
    

    (I highlighted the space with an underscore.) Every line that starts with a space, \n, or # is ignored by the config parser; see: https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
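
To check a regex-urlfilter.txt for this kind of problem, here is a minimal Python sketch that mimics the first-character handling described above. It is not Nutch's code: the function name `classify_rules` and the labels it returns are made up for illustration, and it only models the `+`, `-`, space, and `#` cases.

```python
def classify_rules(text):
    """Label each non-empty line of a regex-urlfilter.txt-style file.

    Assumption (taken from the answer above, not from Nutch's source):
    a line is classified by its FIRST character only. '+' accepts,
    '-' rejects, and a leading space or '#' means the line is ignored.
    """
    results = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # empty line, nothing to classify
        first = line[0]
        if first == "+":
            results.append((number, "accept", line[1:]))
        elif first == "-":
            results.append((number, "reject", line[1:]))
        elif first in (" ", "#"):
            results.append((number, "ignored", line))
        else:
            results.append((number, "other", line))
    return results


if __name__ == "__main__":
    # Same rule twice: once with the leading space, once without.
    sample = (
        "# accept anything else\n"
        " +^http://([a-z0-9]*\\.)*nutch.apache.org/\n"
        "+^http://([a-z0-9]*\\.)*nutch.apache.org/\n"
    )
    for number, kind, rule in classify_rules(sample):
        print(number, kind, repr(rule))
```

In the file from the question, the only `+` rule is the one with the leading space. If that line is ignored, no accept rule is left, which would explain why the seed URL is filtered out and the Generator selects 0 records.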

    【Comments】:

      【Answer 2】:

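      You could try deleting the newCrawl3 directory. Nutch will not crawl a URL again if it was crawled recently, and that state is kept in the crawl directory (newCrawl3/crawldb in the log above).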
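
A minimal Python sketch of that cleanup step, for anyone who wants to script it before re-running the crawl. The helper name `remove_crawl_dir` is hypothetical; `newCrawl3` is the `-dir` value from the command in the question. Note that this deletes everything under the directory, so only use it on crawl data you can afford to lose.

```python
import shutil
from pathlib import Path


def remove_crawl_dir(crawl_dir):
    """Delete a crawl output directory so the next crawl starts clean.

    Returns True if the directory was deleted, False if it did not exist.
    """
    path = Path(crawl_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)  # removes everything under crawl_dir, including crawldb
    return True


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains newCrawl3.
    print(remove_crawl_dir("newCrawl3"))
```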

      【Comments】:
