Using Nutch to crawl a specified URL list: answers

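【Question title】: Using Nutch to crawl a specified URL list
【Posted】: 2012-02-27 14:13:16
【Question】:

I have a list of one million URLs to fetch. I use this list as the Nutch seed and run Nutch's basic crawl command to fetch them. However, I found that Nutch also automatically fetches URLs that are not in the list. I did set the crawl parameters to -depth 1 -topN 1000000, but that does not help. Does anyone know how to do this?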

【Comments】:

Tags: nutch web-crawler

【Solution 1】:

Set this property in nutch-site.xml. (It defaults to true, which is why updatedb adds outlinks to the crawldb.)

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
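
For context, the property has to sit inside the `<configuration>` root element of nutch-site.xml. The following is a minimal sketch of writing such a file and checking the value. `CONF_DIR` and its demo default are assumptions made for illustration; in a real install the file normally lives in the Nutch conf directory, and an existing nutch-site.xml should be edited rather than overwritten.

```shell
#!/bin/sh
# Sketch: write a minimal nutch-site.xml that disables CrawlDb additions.
# CONF_DIR is a hypothetical location used only for this demo; point it at
# your real Nutch conf directory only if you have no other overrides there,
# because the file is overwritten.
CONF_DIR="${CONF_DIR:-./nutch-conf-demo}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>
EOF

# Sanity check: show the value that follows the property name.
grep -A1 '<name>db.update.additions.allowed</name>' "$CONF_DIR/nutch-site.xml" \
  | grep -o '<value>[^<]*</value>'
```

If the check prints a value other than false, the crawl will still add newly discovered URLs to the crawldb.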

【Comments】:

【Solution 2】:

Delete the crawl and urls directories (if they were created earlier).
Create or update the seed file (one URL per line).
Restart the crawl process.
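
The seed file step above can be sketched as follows. The file names `raw_urls.txt` and `seed.txt` are assumptions made for illustration, and the demo input is only there so the snippet runs on its own. It trims whitespace, drops blank lines, and removes duplicates so that the seed file holds exactly one URL per line.

```shell
#!/bin/sh
# Sketch: build a clean seed file (one URL per line) from a raw list.
# RAW and the seed file name are hypothetical; adjust to your own layout.
RAW="${RAW:-raw_urls.txt}"
SEED_DIR="${SEED_DIR:-urllist}"

# Demo input so the sketch is self-contained; skipped if RAW already exists.
[ -f "$RAW" ] || printf '%s\n' \
  'http://example.com/a' \
  '' \
  '  http://example.com/b  ' \
  'http://example.com/a' > "$RAW"

mkdir -p "$SEED_DIR"

# Trim leading/trailing whitespace, drop empty lines, remove duplicates.
# sort -u does not preserve the original order of the list.
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$RAW" \
  | grep -v '^$' \
  | sort -u > "$SEED_DIR/seed.txt"

# Report how many distinct URLs ended up in the seed file.
wc -l < "$SEED_DIR/seed.txt"
```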

Command:

nutch crawl urllist -dir crawl -depth 3 -topN 1000000

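urllist - the directory that contains the seed file (the URL list)
crawl - the name of the directory passed to -dir, where the crawl data is stored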
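
As a rough alternative to the one-shot crawl command, a single round can be run step by step. This is only a sketch and assumes Nutch 1.x, where, to my knowledge, the individual inject, generate, fetch, parse and updatedb commands exist and updatedb accepts -noAdditions (the per-run counterpart of db.update.additions.allowed=false). The paths, the `run` helper and the `seed_only_round` function are hypothetical names for illustration. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch: one fetch round over the seeds only (assumes Nutch 1.x commands).
# NUTCH, CRAWL_DIR and SEED_DIR are assumed paths, not taken from the answer.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SEED_DIR="${SEED_DIR:-urllist}"
TOPN="${TOPN:-1000000}"

# Print each command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

seed_only_round() {
  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR"
  run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN"

  if [ "${RUN:-0}" = "1" ]; then
    # Segment names are timestamps, so the last one sorted is the newest.
    SEGMENT=$(ls -d "$CRAWL_DIR"/segments/* | sort | tail -n 1)
  else
    SEGMENT="$CRAWL_DIR/segments/<timestamp>"  # placeholder in dry-run mode
  fi

  run "$NUTCH" fetch "$SEGMENT"
  run "$NUTCH" parse "$SEGMENT"
  # -noAdditions: update known URLs only, do not add discovered outlinks.
  run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$SEGMENT" -noAdditions
}

seed_only_round
```

Running only one such round means that generate never sees any URL beyond the injected seeds.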

If the problem still persists, try deleting your Nutch folder and starting the whole process over.

【Comments】:

I don't want Nutch to crawl the outlinks of the seeds, only the URLs I supplied as seeds.