【Question title】:ERROR* Adding 2 documents java.io.IOException: Job failed! (Solr 3.4, Nutch 1.4 bin on Windows using Cygwin)
【Posted】:2014-01-15 17:06:57
【Question description】:
$ ./nutch crawl urls -solr http://localhost:8080/solr/ -depth 2 -topN 3
cygpath: can't convert empty path
crawl started in: crawl-20140115213017
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2014-01-15 21:30:17
Injector: crawlDb: crawl-20140115213017/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-01-15 21:30:21, elapsed: 00:00:03
Generator: starting at 2014-01-15 21:30:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20140115213017/segments/20140115213024
Generator: finished at 2014-01-15 21:30:26, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-01-15 21:30:26
Fetcher: segment: crawl-20140115213017/segments/20140115213024
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.parkinson.org/
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-01-15 21:30:32, elapsed: 00:00:06
ParseSegment: starting at 2014-01-15 21:30:32
ParseSegment: segment: crawl-20140115213017/segments/20140115213024
Parsing: http://www.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:34, elapsed: 00:00:01
CrawlDb update: starting at 2014-01-15 21:30:34
CrawlDb update: db: crawl-20140115213017/crawldb
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213024]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-01-15 21:30:36, elapsed: 00:00:01
Generator: starting at 2014-01-15 21:30:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20140115213017/segments/20140115213038
Generator: finished at 2014-01-15 21:30:39, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-01-15 21:30:39
Fetcher: segment: crawl-20140115213017/segments/20140115213038
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://forum.parkinson.org/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://twitter.com/ParkinsonDotOrg
fetching http://www.youtube.com/user/NPFGuru
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-01-15 21:30:44, elapsed: 00:00:04
ParseSegment: starting at 2014-01-15 21:30:44
ParseSegment: segment: crawl-20140115213017/segments/20140115213038
Parsing: http://forum.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:45, elapsed: 00:00:01
CrawlDb update: starting at 2014-01-15 21:30:45
CrawlDb update: db: crawl-20140115213017/crawldb
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213038]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-01-15 21:30:46, elapsed: 00:00:01
LinkDb: starting at 2014-01-15 21:30:46
LinkDb: linkdb: crawl-20140115213017/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213024
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213038
LinkDb: finished at 2014-01-15 21:30:47, elapsed: 00:00:01
SolrIndexer: starting at 2014-01-15 21:30:47
Adding 2 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2014-01-15 21:30:52
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
SolrDeleteDuplicates: finished at 2014-01-15 21:30:53, elapsed: 00:00:01
crawl finished: crawl-20140115213017

I am new to Apache and need some help. I am trying to send the crawled data to Solr for searching, but I get the error "java.io.IOException: Job failed!"

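【Comments】:

  • I think this is related to your Solr configuration. Check your Solr logs as well (see whether there are errors there, or post them here). Also check your Nutch logs (in the nutch/logs directory).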
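The console output above only says "Job failed!", so the underlying cause has to be read from the log files. A minimal sketch of pulling the relevant lines out of a log follows. The function name `show_log_errors` is hypothetical, and the path `logs/hadoop.log` in the usage comment is an assumption about where a Nutch 1.x local runtime writes its log; adjust it to your installation.

```shell
#!/bin/sh
# Print every line mentioning an exception or error, with its line number
# and a few lines of trailing context (usually the start of the stack trace).
# Usage: show_log_errors <logfile> [context_lines]
show_log_errors() {
    logfile="$1"
    context="${2:-10}"   # default: 10 lines after each match
    if [ ! -f "$logfile" ]; then
        echo "log file not found: $logfile" >&2
        return 1
    fi
    grep -n -i -A "$context" -E 'exception|error' "$logfile"
}

# Assumed log location, relative to the Nutch runtime directory:
# show_log_errors logs/hadoop.log 20
```

The same function can be pointed at the Solr or servlet container log, since the indexing error is often reported in more detail on the Solr side.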

Tags: java solr nutch


【Answer 1】:

It sounds like the Solr and Nutch schema files do not match. Have a look at this post. I am using Solr 4.3, but I don't think it should be much different:

http://amac4.blogspot.com/2013/07/configuring-nutch-to-crawl-urls.html

The log files contain more detailed information about the problem, so you could also post them here.
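If the schema files really are the cause, the usual remedy is to make Solr use the `schema.xml` that ships in Nutch's `conf` directory. A rough sketch of that step follows. The function name `sync_schema` is hypothetical, and both directory arguments are assumptions that depend on where Nutch and the Solr home are installed on your machine.

```shell
#!/bin/sh
# Copy Nutch's schema.xml over the one Solr uses, keeping a backup.
# Usage: sync_schema <nutch_conf_dir> <solr_conf_dir>
sync_schema() {
    nutch_schema="$1/schema.xml"
    solr_schema="$2/schema.xml"
    if [ ! -f "$nutch_schema" ]; then
        echo "missing $nutch_schema" >&2
        return 1
    fi
    # Nothing to do when the two files are already byte-identical.
    if [ -f "$solr_schema" ] && cmp -s "$nutch_schema" "$solr_schema"; then
        echo "schemas already match"
        return 0
    fi
    # Keep the previous Solr schema so the change can be undone.
    if [ -f "$solr_schema" ]; then
        cp "$solr_schema" "$solr_schema.bak"
    fi
    cp "$nutch_schema" "$solr_schema"
    echo "copied $nutch_schema to $solr_schema; restart Solr to load it"
}

# Example with assumed paths (adjust both to your setup):
# sync_schema /home/nutch/runtime/local/conf /cygdrive/c/solr/conf
```

Solr only reads the schema at startup, so the servlet container has to be restarted before re-running the indexing step.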

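【Comments】:

  • I still don't know what the problem was, but it is solved now. I just moved the Solr directory from cygwin/home/solr to C:/solr and that fixed it. Now, can anyone give me a link that would help me integrate Tika with Solr 3.4 and Nutch 1.4 bin, and tell me which version is suitable or compatible?
  • The same site has information on setting up Tika with Solr. It covers Solr 4.3 rather than 3.4, but most of it will be the same.
【Answer 2】: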

Your command seems to be wrong. It should be: $ ./nutch crawl urls -dir newCrawl -solr http://localhost:8080/solr/ -depth 3 -topN 5

Your mistake: you left out "-dir".

【Comments】:
