【Posted】:2017-10-18 01:54:49
【Question】:
I'm new to Solr search and am currently trying to get Solr Cell to work with Tika. Consider the following text file:
Name: Popeye
Nationality: American
I want Solr to return two fields, name and nationality, with the values Popeye and American. To that end, I defined the two fields in schema.xml as
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="nationality" type="text_general" indexed="true" stored="true"/>
The text_general field type is defined as
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
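For intuition, the index-time chain above can be approximated in a few lines of Python. This is only a rough sketch under stated assumptions: the regex is a crude stand-in for StandardTokenizerFactory (which really follows Unicode text segmentation rules), and the STOPWORDS set is a hypothetical placeholder for stopwords.txt, whose contents are not shown here. What it illustrates is that analysis breaks the value already assigned to a field into tokens; it does not route text into other fields.

```python
import re

# Hypothetical placeholder for stopwords.txt; the real file's contents are not shown.
STOPWORDS = frozenset()

def analyze(text, stopwords=STOPWORDS):
    """Rough approximation of the index-time chain: tokenize, drop stop words, lowercase."""
    tokens = re.findall(r"\w+", text)  # crude substitute for StandardTokenizerFactory
    kept = [t for t in tokens if t.lower() not in stopwords]  # StopFilterFactory, ignoreCase="true"
    return [t.lower() for t in kept]  # LowerCaseFilterFactory

tokens = analyze("Name: Popeye\r\nNationality: American")
print(tokens)  # ['name', 'popeye', 'nationality', 'american']
```

All four tokens stay in whichever single field received the text, which is why the field type alone cannot split the file into name and nationality.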
In solrconfig.xml, I defined the /update/extract request handler:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
Finally, I index the document with the following command:
curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&commit=true' -F "myfile=@/tmp/popeye_bio.txt"
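The query string of that request can also be assembled programmatically. The sketch below only builds the URL and sends nothing; build_extract_url and its literals argument are hypothetical names introduced for illustration. It relies on the same literal.<field> parameter style the command already uses for literal.id, so additional field values could in principle be supplied the same way, assuming they are known before the upload.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl command above.
BASE_URL = "http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract"

def build_extract_url(doc_id, literals=None, commit=True):
    """Build the /update/extract URL; `literals` maps field names to values sent as literal.<field>."""
    params = {"literal.id": doc_id}
    for field, value in (literals or {}).items():
        params[f"literal.{field}"] = value
    if commit:
        params["commit"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"

# Same URL as in the curl command.
print(build_extract_url("doc1"))
# Variant that also passes the two schema fields as literals (an assumption, not part of the original command).
print(build_extract_url("doc1", {"name": "Popeye", "nationality": "American"}))
```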
The document is indexed correctly. When I query it with
curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/select?q=*%3A*&wt=json&indent=true'
I get the following output:
{
"responseHeader":{
"status":0,
"QTime":3,
"params":{
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":1,"start":0,"docs":[
{
"attr_meta":["stream_source_info",
"myfile",
"stream_content_type",
"text/plain",
"stream_size",
"206",
"Content-Encoding",
"windows-1252",
"stream_name",
"popeye_bio.txt",
"Content-Type",
"text/plain; charset=windows-1252"],
"id":"doc1",
"attr_stream_source_info":["myfile"],
"attr_stream_content_type":["text/plain"],
"attr_stream_size":["206"],
"attr_content_encoding":["windows-1252"],
"attr_stream_name":["popeye_bio.txt"],
"attr_content_type":["text/plain; charset=windows-1252"],
"attr_content":[" \n \n \n \n \n \n \n \n \n \n Name: Popeye\r\nNationality: American\r\n \n "],
"_version_":1567726521681969152}]
}}
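As you can see, Popeye and American are not indexed into the fields I defined in schema.xml. What am I doing wrong here? I tried changing the tokenizer of the text_general field type to <tokenizer class="solr.PatternTokenizerFactory" pattern=": "/>, but that made no difference. I would appreciate any help with this!
【Comments】:
Tags: indexing solr apache-tika cloudera-manager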
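In the response above, the whole body of the file ends up as one string in attr_content, with the "Key: value" lines intact but not split into fields. One possible way to get at them, shown here only as a client-side sketch and not as something Solr Cell does itself, is to split those lines before indexing. parse_key_values is a hypothetical helper, and the sample string is the attr_content value from the response with its leading whitespace shortened.

```python
def parse_key_values(content):
    """Split 'Key: value' lines into a dict with lowercased keys."""
    fields = {}
    for line in content.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():  # skip blank lines and lines without a key
            fields[key.strip().lower()] = value.strip()
    return fields

# attr_content copied from the response (runs of " \n" shortened).
attr_content = " \n \n Name: Popeye\r\nNationality: American\r\n \n "
print(parse_key_values(attr_content))  # {'name': 'Popeye', 'nationality': 'American'}
```

The resulting keys match the name and nationality fields declared in schema.xml, so the values could then be sent to Solr as ordinary field values.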