【Question title】: Nutch crawl stopped after parsing one page
【Posted】: 2014-04-15 03:40:28
【Question】:

When crawling with Nutch, only one page gets parsed and the crawl does not move forward. Can anyone please help? The Nutch output is below.

After parsing the first page, it stops and does not continue. The parse step never completes.

[Naveen@01hw5189 apache-nutch-1.7]$ bin/nutch crawl urls -dir crawlwiki -depth 10 -topN 10
solrUrl is not set, indexing will be skipped...
crawl started in: crawlwiki
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-09-12 15:51:45
Injector: crawlDb: crawlwiki/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-12 15:51:47, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155149
Generator: finished at 2013-09-12 15:51:50, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:51:50
Fetcher: segment: crawlwiki/segments/20130912155149
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://en.wikipedia.org/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:51:53, elapsed: 00:00:03
ParseSegment: starting at 2013-09-12 15:51:53
ParseSegment: segment: crawlwiki/segments/20130912155149
ParseSegment: finished at 2013-09-12 15:51:54, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-12 15:51:54
CrawlDb update: db: crawlwiki/crawldb
CrawlDb update: segments: [crawlwiki/segments/20130912155149]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-12 15:51:56, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:56
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155159
Generator: finished at 2013-09-12 15:52:00, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:52:00
Fetcher: segment: crawlwiki/segments/20130912155159
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://en.wikipedia.org/wiki/Main_Page (queue crawl delay=5000ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:52:02, elapsed: 00:00:02
ParseSegment: starting at 2013-09-12 15:52:02
ParseSegment: segment: crawlwiki/segments/20130912155159
Parsed (8ms):http://en.wikipedia.org/wiki/Main_Page

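【Question comments】:

  • Having the same problem. Did you figure it out?
  • I am currently hitting the same problem with Nutch 2.3. Unfortunately, I had to switch to Nutch 1.9. I am still looking for a way to solve this.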

Tags: web-crawler nutch


【Answer 1】:

Take a look at Wikipedia's robots.txt file:

http://en.wikipedia.org/robots.txt

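The robots.txt may be disallowing crawling beyond the first page. The robots file defines what web crawlers are allowed to access, and Nutch obeys that convention.

Hope this helps.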
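
The robots.txt check suggested in this answer can be tried offline before re-running the crawl. Below is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the agent names `BadBot` and `MyNutchCrawler`, and the helper `is_allowed` are all hypothetical and exist only for illustration; they are not Wikipedia's actual rules. To test the real site, paste in the content of the robots.txt URL above and use the agent name configured in the `http.agent.name` property.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# this is NOT a copy of Wikipedia's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /w/
Allow: /wiki/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://en.wikipedia.org/wiki/Main_Page"
    # An agent that is blocked outright by its own User-agent section.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "BadBot", page))          # False
    # An agent that falls through to the wildcard section.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyNutchCrawler", page))  # True
```

If the check returns False for the configured agent name, the crawler is being refused by the site's rules rather than failing on its own, which would fit the explanation given in this answer.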

【Comments】:
