【Question title】: Apache Nutch indexing using Elasticsearch
【Posted】: 2016-04-20 18:37:41
【Question】:

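I am building a search engine on an Apache Nutch and Elasticsearch stack. I am using Apache Nutch 2.1 and Elasticsearch 1.7.3.

I am trying to index directly from Nutch by following the instructions here: https://www.mind-it.info/2013/09/26/integrating-nutch-1-7-elasticsearch/. Both Nutch and Elasticsearch run on my localhost, and the cluster name is "elasticsearch".

These are the parts of nutch-site.xml that I changed: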

<property>
    <name>plugin.includes</name>
    <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
</property>

After running ant runtime, I tried to issue the command

bin/nutch elasticindex elasticsearch -all

but it returned this:

Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)

I'm not sure where I went wrong. Here is my hadoop.log:

2016-01-15 15:46:24,106 INFO  elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO  plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO  elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN  elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO  elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO  elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN  elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO  elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO  elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO  elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:153)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:137)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Can anyone help me solve this? Thanks!

【Comments】:

    Tags: apache indexing elasticsearch nutch


    【Answer 1】:

    Make sure the Elasticsearch version in the Nutch dependencies is the same as the version your local server is running.
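One way to compare the two versions is to ask the running server for its version over HTTP and compare it with the Elasticsearch revision declared in the Nutch build. The sketch below is not from the thread. It assumes the server answers on the default HTTP port 9200 on localhost, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_root(base_url="http://localhost:9200"):
    """GET / on the server and return the raw JSON text (assumed URL)."""
    with urlopen(base_url + "/") as response:
        return response.read().decode("utf-8")


def server_version(root_response):
    """Extract version.number from the JSON returned by GET /."""
    return json.loads(root_response)["version"]["number"]


def check_versions(client_version, root_response):
    """Return (versions_are_identical, server_version)."""
    server = server_version(root_response)
    return client_version.strip() == server, server


# Usage against a live server (not executed here):
#   same, server = check_versions("1.7.3", fetch_root())
#   print("match" if same else "client and server differ, server is " + server)
```

Here `"1.7.3"` stands for whatever revision the build actually declares. An exact string comparison is used because the answer asks for the same version on both sides.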

    If they are not the same, don't waste time on it: push from Nutch to Elasticsearch over HTTP instead of through the Java API.
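As a rough illustration of pushing documents over HTTP, the following sketch builds a request body for the Elasticsearch `_bulk` endpoint and posts it. It is not part of Nutch. The index name `globe_search` is taken from the comments in this thread, while the type name `doc`, the document fields, and the helper names are hypothetical. It targets the 1.7.3 server described in the question and assumes the default HTTP port 9200.

```python
import json
from urllib.request import Request, urlopen


def build_bulk_body(docs, index, doc_type):
    """Serialize (id, source) pairs into the newline-delimited _bulk format.

    Each document becomes two lines: an action line and the source line.
    The body must end with a trailing newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


def send_bulk(body, base_url="http://localhost:9200"):
    """POST the body to /_bulk and return the parsed JSON response."""
    request = Request(base_url + "/_bulk", data=body.encode("utf-8"), method="POST")
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against a live server (not executed here):
#   docs = [("http://example.com/", {"title": "Example", "content": "..."})]
#   result = send_bulk(build_bulk_body(docs, "globe_search", "doc"))
#   print("bulk errors:", result.get("errors"))
```

Because this path only uses HTTP, the client does not have to join the cluster as a node, which is the step that times out in the log above.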

    【Comments】:

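    • Sorry, I only just saw your comment. What do you mean by "save version"? You mean "same", right? If so, what I did was change this in my ivy.xml: to this: My local Nutch is 2.1 and my local Elasticsearch is 1.7.3. Where did I go wrong? I would like to stick with the default Nutch indexing as far as possible, because I need Nutch's ParseMetatags feature.
    • Also, in the log I noticed that instead of the cluster name I use (i.e. cluster1), a different name is shown [in the log above, it shows "Layla Miller"]. The inet address is also different from the one I assigned. I wonder whether this helps in solving the problem.
    • The default cluster name of a local Elasticsearch server is elasticsearch. Can you share your Nutch Elasticsearch configuration? Finally, run bin/crawl instead of bin/nutch.
    • I barely touched the ES settings, so the default cluster name is unchanged. As for Nutch, I changed nothing except my nutch-site.xml, where I explicitly set the following: elastic.cluster = elasticsearch (description: "The cluster name to discover. Either host and port must be defined or cluster.") and elastic.index = globe_search (description: "The default index to send documents to.").
    • I also set elastic.host to localhost and elastic.port to 9300, but Nutch does not seem to follow that. It points to the wrong Elasticsearch host, as shown above.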
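To double-check which elastic.* values are really present in the configuration file that Nutch reads, a small script can list them. This is only a sketch under assumptions: it expects the usual `<configuration><property>` layout of nutch-site.xml, the sample values are the ones quoted in the comments above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def elastic_properties(xml_text):
    """Return every elastic.* property of a Hadoop-style configuration as a dict."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name.startswith("elastic."):
            found[name] = (prop.findtext("value") or "").strip()
    return found


# Sample mirroring the settings described in the comments above.
SAMPLE = """<configuration>
  <property><name>elastic.host</name><value>localhost</value></property>
  <property><name>elastic.port</name><value>9300</value></property>
  <property><name>elastic.cluster</name><value>elasticsearch</value></property>
  <property><name>elastic.index</name><value>globe_search</value></property>
  <property><name>plugin.includes</name><value>indexer-elastic</value></property>
</configuration>"""

for key, value in elastic_properties(SAMPLE).items():
    print(key, "=", value)
```

To inspect a real file, read its text and pass it in place of `SAMPLE`. Checking the copy inside the built runtime directory, rather than only the source tree, may show whether the edited values survived `ant runtime`.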