Indexing rich documents using Solr Cell and Tika: answers

【Question title】: Index rich documents using solrcell and tika
【Posted】: 2017-10-18 01:54:49
【Question】:

I am new to Solr search and am currently trying to get Solr Cell working with Tika. Consider the following text file:

Name:                    Popeye
Nationality:             American

I want Solr to return two fields named "name" and "nationality", with the values Popeye and American. To do this, I defined two fields in schema.xml as

   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="nationality" type="text_general" indexed="true" stored="true"/>

The text_general field type is defined as

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
                 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In solrconfig.xml, I defined the /update/extract request handler

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <str name="captureAttr">true</str>
    </lst>
</requestHandler>

Finally, I ran the following command to index the document

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&commit=true' -F "myfile=@/tmp/popeye_bio.txt"

The document was indexed correctly. When I ran the query

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/select?q=*%3A*&wt=json&indent=true'

I got the following output

    {
    "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
      "response":{"numFound":1,"start":0,"docs":[
      {
        "attr_meta":["stream_source_info",
          "myfile",
          "stream_content_type",
          "text/plain",
          "stream_size",
          "206",
          "Content-Encoding",
          "windows-1252",
          "stream_name",
          "popeye_bio.txt",
          "Content-Type",
          "text/plain; charset=windows-1252"],
        "id":"doc1",
        "attr_stream_source_info":["myfile"],
        "attr_stream_content_type":["text/plain"],
        "attr_stream_size":["206"],
        "attr_content_encoding":["windows-1252"],
        "attr_stream_name":["popeye_bio.txt"],
        "attr_content_type":["text/plain; charset=windows-1252"],
        "attr_content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Name:                    Popeye\r\nNationality:             American\r\n \n  "],
        "_version_":1567726521681969152}]
  }}

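As you can see, Popeye and American are not indexed into the fields I defined in schema.xml. What am I doing wrong here? I tried changing the tokenizer of the text_general field type to <tokenizer class="solr.PatternTokenizerFactory" pattern=": "/>, but that made no difference. I would appreciate any help with this!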

【Question comments】:

Tags: indexing solr apache-tika cloudera-manager

【Solution 1】:

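When you define a tokenizer, you are only telling Solr that all data sent to that field should be tokenized and processed with your configuration. In the end, you are still sending all of your information into a single field.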

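Solr assumes your data is already structured (one document containing fields). An analyzer/tokenizer therefore cannot create additional fields. Its job is essentially just to tokenize and transform the text that goes into the inverted index for searching.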

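What you can do is use a ScriptUpdateProcessor and define a pipeline that modifies the document (splitting one field into several) before the text reaches the tokenizer. For example: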

<processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">splitField.js</str>
</processor>

The splitField.js file could contain something like:

function processAdd(cmd) {
    var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
    var field = doc.getFieldValue("attr_content");

    // Placeholder: split the attr_content text held in `field` into two
    // variables, `name` and `nationality`. They are not defined in this
    // sketch, so the calls below fail until that logic is added.

    doc.setField("name", name);
    doc.setField("nationality", nationality);
}

Ideally this would be handled outside Solr, but with a ScriptUpdateProcessor you can accomplish what you want.
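
As a rough illustration of the "outside Solr" option, the sketch below parses the labelled lines on the client and builds a JSON document with separate fields. buildSolrDoc is a hypothetical helper, not part of any Solr client library, and it assumes one "Label: value" pair per line, with field names taken from the lowercased labels.

```javascript
// Sketch only: build a flat document from "Label: value" lines before
// anything is sent to Solr. buildSolrDoc is a hypothetical helper name.
function buildSolrDoc(id, text) {
    var doc = { id: id };
    // One "Label: value" pair per line; surrounding padding is dropped.
    var linePattern = /^[ \t]*([A-Za-z_][A-Za-z0-9_ ]*?)[ \t]*:[ \t]*(.*?)[ \t]*$/gm;
    var match;
    while ((match = linePattern.exec(text)) !== null) {
        // Assumption: field names are the lowercased labels, spaces as "_".
        var fieldName = match[1].toLowerCase().replace(/ /g, "_");
        doc[fieldName] = match[2];
    }
    return doc;
}

var text = "Name:                    Popeye\r\nNationality:             American\r\n";
var payload = JSON.stringify([buildSolrDoc("doc1", text)]);
```

A payload like this could then be posted as JSON to the collection's /update handler instead of going through /update/extract. That step is not shown here, and whether it suits your setup is an assumption.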

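【Comments】:

Glad it helped! The only downside of this approach is that you have to split the values yourself in JavaScript :)
Hi Jorge, I implemented the idea you suggested. When indexing, I get the error ReferenceError: "name" is not defined. Any idea what might be causing this? How does doc.setField know about name and nationality? Perhaps we should be using the field variable somewhere. I am calling the JavaScript from the update/extract handler.
Yes, but you are sending the data after extraction, right? The name and nationality fields need to be defined in your schema.xml. In my example, of course, the variables name and nationality stand for values extracted from the field: field = doc.getFieldValue("attr_content") fetches the content of the attr_content field, and you then split that text value into two separate variables, name and nationality.
Hi Jorge! Please find my reply in the answers section.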

【Solution 2】:

My current approach is to define an "update.chain" in the update/extract handler

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
        <str name="update.chain">mychain</str>
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <str name="captureAttr">true</str>
    </lst>
</requestHandler>

where mychain is

<updateRequestProcessorChain name="mychain">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">splitField.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

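I included it in the update/extract handler so that the processor gets called. If I understand correctly, the processor should run after update/extract and before the text is sent to the tokenizer. If so, how do I invoke the processor?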

I also tried removing the <str name="update.chain">mychain</str> line from update/extract and then calling

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&update.chain=mychain&commit=true' -F "myfile=@/tmp/popeye_bio.txt"

I get the same error. splitField.js is defined as

function processAdd(cmd) {
    doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
    field = doc.getFieldValue("attr_content");
    // split your attr_content text into two variables:
    // name and nationality, then
    doc.setField("name", name);
    doc.setField("nationality", nationality);
}

function processDelete(cmd) {
}

function processMergeIndexes(cmd) {
}

function processCommit(cmd) {
}

function processRollback(cmd) {
}

function finish() {
}

The error occurs on the setField lines. Is there any way to print "field" to the console? Maybe a "console.log" method?
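
On the logging question, a minimal sketch, assuming Solr binds a global `logger` object for scripts run by StatelessScriptUpdateProcessorFactory (as the Solr reference documentation describes). The output goes to the Solr log rather than a console. describeField is a hypothetical helper added here for illustration, and console.log is, as far as I know, not defined in the JVM script engine.

```javascript
// Sketch for splitField.js. Assumption: `logger` is a global provided by
// Solr's script update processor; describeField is a hypothetical helper.
function describeField(doc, fieldName) {
    var value = doc.getFieldValue(fieldName);
    if (value === null || value === undefined) {
        return fieldName + " = null";
    }
    // JSON.stringify makes \r, \n and padding visible in the log line.
    return fieldName + " = " + JSON.stringify(String(value));
}

function processAdd(cmd) {
    var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
    logger.info("splitField.js: " + describeField(doc, "attr_content"));
}
```

Logging the escaped string shows exactly which line breaks and padding the extractor left in attr_content, which is what the splitting logic has to cope with.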

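【Comments】:

Check [this documentation]: you always need to end your chain with RunUpdateProcessorFactory so that the document is pushed to the index; otherwise the document is never added. Can you share the code of your splitField.js? You should have added the logic that parses the text into name and nationality.
Hi Jorge, I figured out RunUpdateProcessorFactory, and I also added LogUpdateProcessorFactory. I cannot see the link you sent me; could you send it again? I am currently following this link. Please find the splitField.js code in the answers section.
Here is the information: cwiki.apache.org/confluence/display/solr/… and lucene.apache.org/solr/6_5_0/solr-core/org/apache/solr/update/… But your error is that you need to define the name and nationality variables in your script. The field variable contains " \n \n \n \n \n \n \n \n \n \n Name: Popeye\r\nNationality: American\r\n \n ". You need to split this string into the variables name and nationality; just do some regex/splitting and define the variables.
Something like: parts = field.split('\r\n'); var name = parts[0].trim().split(":")[1].trim(); var nationality = parts[1].trim().split(":")[1].trim() But keep in mind that this has to work for every document you send to Solr
Glad it worked! Keep in mind that I only used your sample document to make a point; you should generalize it into a solution for more general documents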
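
One way the splitting might be generalized, offered as a sketch rather than a definitive splitField.js: scan every line for a "Label: value" pair and copy only whitelisted labels into schema fields. FIELD_MAP and parseLabelledLines are hypothetical names introduced for this example. The sketch assumes the extracted text sits in attr_content as in the question, and that only the first value of that field matters.

```javascript
// Sketch of a more general splitField.js. FIELD_MAP and parseLabelledLines
// are hypothetical names; adjust the map to match your schema.xml.
var FIELD_MAP = { "name": "name", "nationality": "nationality" };

function parseLabelledLines(text) {
    var result = {};
    // String(...) guards against the engine handing back a Java string object.
    var lines = String(text).split(/\r\n|\r|\n/);
    for (var i = 0; i < lines.length; i++) {
        var colon = lines[i].indexOf(":");
        if (colon < 1) { continue; } // no label before a colon on this line
        var label = lines[i].substring(0, colon).trim().toLowerCase();
        var value = lines[i].substring(colon + 1).trim();
        if (FIELD_MAP.hasOwnProperty(label) && value !== "") {
            result[FIELD_MAP[label]] = value;
        }
    }
    return result;
}

function processAdd(cmd) {
    var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
    var content = doc.getFieldValue("attr_content");
    if (content === null || content === undefined) { return; }
    var fields = parseLabelledLines(content);
    for (var key in fields) {
        if (fields.hasOwnProperty(key)) {
            doc.setField(key, fields[key]);
        }
    }
}
```

Splitting at the first colon only, instead of split(":")[1], keeps values that themselves contain a colon intact. Documents without the expected labels pass through unchanged instead of raising an error.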