【Question title】: Nutch: Job Failed
【Posted】: 2014-04-02 07:38:19
【Description】:

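I am having a problem running the Nutch inject step. This is the command I am running:

bin/nutch inject bin/crawl/crawldb bin/urls

After running the command above, I get the following error: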

Injector: starting at 2014-04-02 13:02:29
Injector: crawlDb: bin/crawl/crawldb
Injector: urlDir: bin/urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 2
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

This is my first time running it. I have checked that Solr and Nutch are installed correctly.

The following details are from the log file:

java.io.IOException: The temporary job-output directory file:/usr/share/apache-nutch-1.8/bin/crawl/crawldb/1639805438/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:46)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:449)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:491)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-04-02 12:54:46,251 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

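【Comments】:

  • Judging by your log, you have a permissions problem. The job probably does not have permission to create folders under /usr/...
  • @Mysterion Thanks for your reply. I changed the permissions as you suggested, but I still get the same error.
  • Solved the above error.
  • But Nutch is not picking up the URLs from the seed file. Can anyone help?
  • How did you solve it? Please update the question.

Tags: ruby-on-rails solr web-crawler nutch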


【Solution 1】:

I was running the injection with the command bin/nutch inject bin/crawl/crawldb bin/urls

instead of bin/nutch inject crawl/crawldb bin/urls

Switching to the second command resolved the error.
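
The log complains that a temporary directory under the crawldb path does not exist, and a comment above suspects permissions. One quick check before injecting is whether the crawldb location can be created at all. The following is only a minimal sketch, assuming a POSIX shell; can_create is a hypothetical helper, not part of Nutch, and it does not explain why the shorter path worked.

```shell
#!/bin/sh
# can_create PATH
# Hypothetical helper (not a Nutch command): succeeds when the nearest
# existing ancestor of PATH is a directory the current user can write to,
# i.e. when PATH could be created with mkdir -p.
can_create() {
  dir=$1
  # Walk upwards until we reach something that already exists.
  while [ ! -e "$dir" ]; do
    dir=$(dirname "$dir")
  done
  # The existing ancestor must be a writable directory.
  [ -d "$dir" ] && [ -w "$dir" ]
}
```

It could be used from the Nutch home directory as: can_create crawl/crawldb && bin/nutch inject crawl/crawldb bin/urls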

To get the URLs fetched, I changed the regex-urlfilter.txt file, and the URLs are now picked up.
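
I have not listed the exact change here, so the following is only a sketch of how the rules in regex-urlfilter.txt are evaluated, which is what produces the "rejected by filters" count in the injector output. Each non-comment line is a regular expression prefixed with + or -, the first matching rule decides, and a URL that matches no rule is rejected. url_verdict is a hypothetical helper, and grep -E only approximates the Java regular expressions that Nutch actually uses.

```shell
#!/bin/sh
# url_verdict RULES_FILE URL
# Hypothetical helper (not a Nutch command): prints "accept" or "reject"
# for URL, following the first-match-wins rule of regex-urlfilter.txt.
# Assumption: patterns are simple enough to behave the same under grep -E.
url_verdict() {
  rules=$1
  url=$2
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '' | '#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${line%"${line#?}"}          # first character: + or -
    pattern=${line#?}                 # the rest of the line is the regex
    if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
      case $sign in
        +) echo accept; return 0 ;;
        -) echo reject; return 0 ;;
      esac
    fi
  done < "$rules"
  echo reject                         # no rule matched: URL is ignored
}
```

Running each line of the seed file through url_verdict conf/regex-urlfilter.txt shows which rule order would reject a seed URL; a seed containing ? or = is typically caught by the stock -[?*!@=] rule.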

【Comments】:

【Solution 2】:

Make sure that none of your Nutch configuration files contains a syntax error.

【Comments】:
