Nutch Crawler taking very long

【Question title】: Nutch Crawler taking very long
【Posted】: 2015-05-13 15:35:41
【Question】:

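I just want Nutch to give me a list of the URLs it crawled and the status of each link. I don't need the full page content or any other fluff. Is there a way to do this? Crawling and parsing a seed list of 991 URLs at depth 3 takes more than 3 hours. I'm hoping this would speed things up.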

The nutch-default.xml file contains

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value> 
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>

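I think these properties might have something to do with it, but I'm not sure. Can someone give me some help and clarification? Also, I'm getting a lot of URLs with status code 38. I couldn't find what that status code means in this document. Thanks for your help!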

【Question comments】:

Tags: apache web-crawler nutch

【Solution 1】:

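Nutch parses each URL after fetching it in order to extract all of its outlinks. Those outlinks become the new fetchlist for the next round.

If you skip parsing, no new URLs may be generated, so no further fetching takes place. One approach I can think of is to configure the parse plugins to include only the content you actually need parsed (the outlinks, in your case). There is an example here: https://wiki.apache.org/nutch/IndexMetatags

This link describes the parser's features: https://wiki.apache.org/nutch/Features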

Now, to get just the list of fetched URLs and their statuses, you can use the command

$ bin/nutch readdb crawldb -stats

As for status code 38: judging from the document you linked, it appears to be the URL status

public static final byte STATUS_FETCH_NOTMODIFIED = 0x26

since hexadecimal 0x26 equals decimal 38.
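If you want to double-check the conversion yourself, here is a minimal shell sketch. The helper names hex_to_dec and dec_to_hex are made up for this illustration and are not part of Nutch; they only wrap the standard printf utility.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nutch) for moving between the
# hexadecimal constants in the Java source and the decimal codes you see.

# Print the decimal value of a hex literal such as 0x26.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Print a decimal status code as a two-digit hex literal.
dec_to_hex() {
  printf '0x%02x\n' "$1"
}

hex_to_dec 0x26   # prints 38
dec_to_hex 38     # prints 0x26
```

Going the other way (decimal to hex) is handy when you start from a code in your crawl data and want to look up the matching constant.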

Hope this answer gives you some direction :)

【Comments】:

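Wow, I don't know why I didn't think of converting it from hex to decimal to get the status ID, thank you. By increasing the number of threads I use, I have been able to speed things up significantly: the same crawl now takes 6 minutes. However, the extra "fluff" is still there. I couldn't figure out the parsing setup you describe here. In my database I only want to see 2 fields: an id (the URL being tested) and a status (the URL's status after fetching).
The bin/nutch readdb crawldb -stats command only shows me overall statistics, broken down by status ID with the number of URLs for each. That isn't my end goal, but it is still informative.
If you want the status broken down per URL, use bin/nutch readdb crawldb -stats -sort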
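If what you are after is one line per URL with its status, another route, assuming a Nutch 1.x crawldb, is to write a text dump with bin/nutch readdb crawldb -dump <out_dir> and reduce it to two columns. The sketch below rests on an assumption about the dump layout: each record starts with a line "<url><TAB>Version: N", followed by a line "Status: N (name)". Check this against your own dump before relying on it. The function name url_status and the sample record are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): reduce a plain-text crawldb dump
# to "url<TAB>status" lines. Reads the files given as arguments, or stdin.
# ASSUMED layout: "<url>\tVersion: N" opens a record and the line
# "Status: N (name)" follows it.
url_status() {
  awk '
    # Start of a record: remember the URL that precedes the tab.
    /\tVersion: / { split($0, parts, "\t"); url = parts[1]; next }
    # Status line of the current record: emit the url and status pair.
    /^Status: /   { sub(/^Status: /, ""); print url "\t" $0 }
  ' "$@"
}

# Demo on a made-up record; prints the URL, a tab, then "2 (db_fetched)".
printf 'http://example.com/\tVersion: 7\nStatus: 2 (db_fetched)\nScore: 1.0\n' | url_status

# With a real dump (the directory name crawl_dump is an assumption):
#   bin/nutch readdb crawldb -dump crawl_dump
#   url_status crawl_dump/part-*
```

The result is easy to load into a two-column table (id and status) without keeping any page content around.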