How to Set topN via nutch crawl SCRIPT

【Question Title】: How to Set topN via nutch crawl SCRIPT
【Posted】: 2014-12-08 16:34:16
【Question】:

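I am trying to crawl a web page at http://def.com/xyz/ (say) that has more than 2000 outgoing URLs, but when I query Solr it shows fewer than 50 documents, whereas I expected about 2000. I am using the following command: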

./crawl urls TestCrawl http://localhost:8983/solr/ -depth 2 -topN 3000

The console output is:

Injector: starting at 2014-12-08 21:36:15
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2014-12-08 21:36:18, elapsed: 00:00:02

I assume Nutch is unable to pick up the topN value from the crawl script.

【Comments】:

"Injector: Total new urls injected: 0". Can you show the complete console output? What you posted is only the output of the inject job.

Tags: solr web-crawler nutch

【Solution 1】:

Please check the property db.max.outlinks.per.page in your Nutch configuration. Change this value to a larger number, or to -1, so that all URLs get crawled and indexed.
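
A minimal sketch of that change, assuming the override is placed in conf/nutch-site.xml (the usual file for local overrides of nutch-default.xml; the file location is an assumption, not something stated in this thread). If the file already contains a <configuration> element, only the <property> block would need to be added inside it:

```xml
<!-- conf/nutch-site.xml (assumed location): local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- A negative value removes the per-page cap on processed outlinks -->
    <value>-1</value>
  </property>
</configuration>
```

Because the limit is applied when a page's outlinks are processed, pages that were crawled before the change would likely need to be fetched and parsed again before the additional URLs show up.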

Hope this helps,

乐国岛

【Comments】: