【Posted】:2015-05-13 15:35:41
【Question】:
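I just want Nutch to give me a list of the URLs it crawled and the status of each link. I don't need the full page content or any other fluff. Is there a way to do this? Crawling and parsing a seed list of 991 URLs at depth 3 takes more than 3 hours, and I'm hoping this would speed things up.
The nutch-default.xml file contains: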
<property>
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content will be saved during fetch.
And it is probably what we want to set most of time, since file:// URLs
are meant to be local and we can always use them directly at parsing
and indexing stages. Otherwise file contents will be saved.
!! NO IMPLEMENTED YET !!
</description>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>ftp.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
Caution: classical ftp RFCs never defines partial transfer and, in fact,
some ftp servers out there do not handle client side forced close-down very
well. Our implementation tries its best to handle such situations smoothly.
</description>
</property>
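I think these properties may be relevant, but I'm not sure. Could someone offer some help and clarification? Also, I'm getting many URLs with status code 38, and I couldn't find what that status code means in this documentation. Thanks for your help!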
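For reference, here is a minimal sketch of the kind of override I have in mind. It assumes local overrides go in conf/nutch-site.xml rather than in nutch-default.xml itself, and the value 1024 is an arbitrary example. I have not verified that this shortens the crawl, and truncating pages may also cut off links that a depth-3 crawl needs to follow.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical override: truncate HTTP downloads to 1024 bytes
       instead of the default 65536 shown above. -->
  <property>
    <name>http.content.limit</name>
    <value>1024</value>
  </property>
</configuration>
```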
【Discussion】:
Tags: apache web-crawler nutch