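【Posted】: 2014-12-08 16:34:16
【Question】:
I am trying to crawl a web page at a URL such as http://def.com/xyz/. The page has more than 2000 outgoing URLs, but when I query Solr it shows fewer than 50 documents, whereas I expected about 2000. I am using the following command: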
./crawl urls TestCrawl http://localhost:8983/solr/ -depth 2 -topN 3000
The console output is:
Injector: starting at 2014-12-08 21:36:15
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2014-12-08 21:36:18, elapsed: 00:00:02
My assumption is that Nutch is not picking up the topN value from the crawl script.
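Before blaming topN, it may help to look at what the inject step itself reported. The following is only a minimal sketch in shell: it copies a few of the Injector lines quoted above into a temporary file and extracts their counters with awk. The function name `injector_count` and the variable `log_file` are made up for illustration; in practice `log_file` would point at wherever the full console output was saved.

```shell
#!/usr/bin/env bash
# Assumption: the console output has been saved to a file. Here a temporary
# file is filled with Injector lines copied from the output above.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: URLs merged: 1
Injector: Total new urls injected: 0
EOF

# Hypothetical helper: print the value of the first "Injector: <label>: <value>"
# line whose label matches the argument exactly.
injector_count() {
  awk -F': ' -v label="$1" \
    '$1 == "Injector" && $2 == label { print $3; exit }' "$log_file"
}

normalized=$(injector_count "Total number of urls after normalization")
injected=$(injector_count "Total new urls injected")

echo "seed URLs after normalization: $normalized"
echo "new URLs injected: $injected"

# Flag the case where the inject job added nothing new to the crawl db.
if [ "$injected" -eq 0 ]; then
  echo "no new URLs were injected in this run"
fi

rm -f "$log_file"
```

This only summarizes the inject job. The output of the generate, fetch, and update steps would be needed to see how many URLs were actually selected and fetched.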
【Comments】:
- "Injector: Total new urls injected: 0". Can you show the complete console output? You have only shown the output of the inject job.
Tags: solr web-crawler nutch