elasticsearch: how to index terms which are stopwords only?

[Question title]: elasticsearch: how to index terms which are stopwords only?
[Posted]: 2013-01-16 10:29:01
[Question]:

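I have been using elasticsearch very successfully as the backend for my own small search engine. But there is one thing I could not find in the documentation.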

I am indexing the names of musicians and bands. There is a band called "The The", and because of the stop words list this band never gets indexed.

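I know I can ignore the stop words list completely, but this is not what I want, since the results when searching for other bands like "The Who" would explode.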

So, is it possible to store "The The" in the index without disabling stop words altogether?

[Comments]:

Hi Carsten. My previous answer was incorrect: I thought the synonym filter could not work on multiple tokens, but it can. The answer has been updated.

Tags: indexing elasticsearch stop-words

[Solution 1]:

You can use the synonym filter to convert The The into a single token, e.g. thethe, which will not be removed by the stop filter.

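First, configure the analyzer. Note that the syn filter is listed before the stop filter, so the synonym is applied while both "the" tokens are still present: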

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "syn" : {
               "synonyms" : [
                  "the the => thethe"
               ],
               "type" : "synonym"
            }
         },
         "analyzer" : {
            "syn" : {
               "filter" : [
                  "lowercase",
                  "syn",
                  "stop"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then test it with the string "The The The Who":

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn' 

{
   "tokens" : [
      {
         "end_offset" : 7,
         "position" : 1,
         "start_offset" : 0,
         "type" : "SYNONYM",
         "token" : "thethe"
      },
      {
         "end_offset" : 15,
         "position" : 3,
         "start_offset" : 12,
         "type" : "<ALPHANUM>",
         "token" : "who"
      }
   ]
}

"The The" has been tokenized as "thethe", and "The Who" as "who", because its preceding "the" was removed by the stop filter.

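The order of the filters is what makes this work, and it can be sketched outside Elasticsearch. The following is a minimal Python illustration, not Elasticsearch's implementation: the `analyze` function, the whitespace tokenization, and the small `STOPWORDS` set are assumptions made for the example, and only the `the the => thethe` rule comes from the settings above.

```python
from typing import List

# Assumption: a tiny stopword set for illustration, not Elasticsearch's default list.
STOPWORDS = {"the", "a", "an", "and", "of"}

# Two-word synonym rule mirroring "the the => thethe" from the settings above.
SYNONYMS = {("the", "the"): "thethe"}


def analyze(text: str) -> List[str]:
    """Mimic the filter order lowercase -> syn -> stop on whitespace tokens."""
    tokens = text.lower().split()

    # Synonym step: replace matching two-token sequences, scanning left to right.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            merged.append(SYNONYMS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # Stop step: runs last, so "thethe" survives while a lone "the" is dropped.
    return [t for t in merged if t not in STOPWORDS]


print(analyze("The The The Who"))  # ['thethe', 'who']
```

If the stop step ran before the synonym step in this sketch, every "the" would already be gone and the rule would never match, which is why "syn" precedes "stop" in the analyzer definition.
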
To stop or not to stop

This brings us back to whether stopwords should be removed at all. You said:

I know I can ignore the stop words list completely 
but this is not what I want since the results searching 
for other bands like "the who" would explode.

What do you mean by this? Explode how? Index size? Performance?

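Stopwords were originally introduced to improve search engine performance by removing common words that are likely to have little effect on the relevance of a query. However, we have come a long way since then. Our servers are far more capable than they were back in the 80s.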

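Indexing stopwords will not have a huge impact on index size. For instance, indexing the word the means adding a single term to the index. You already have thousands of terms; indexing the stopwords as well will not make much difference to size or performance.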

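Actually, the bigger problem is that the is very common and thus has a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms. This can be mitigated by using a shingle filter, which would produce these tokens: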

['the the','the concert','concert madrid']

While the may be very common, the the is not, and so it will rank higher.
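To make the shingle idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the `shingles` and `idf` helpers, the four-document corpus, and the smoothed IDF formula are made up for the example and are not what Elasticsearch or Lucene compute internally. It emits only the two-word shingles listed above.

```python
import math
from typing import List


def shingles(tokens: List[str], size: int = 2) -> List[str]:
    """Return word n-grams ("shingles") of the given size, joined by spaces."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def idf(term: str, docs: List[List[str]]) -> float:
    """Simplified, smoothed inverse document frequency (an illustrative formula)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1))


# Hypothetical corpus, used only to compare term rarity.
corpus = [
    "the the concert madrid",
    "the who live",
    "the rolling stones",
    "the concert hall",
]
unigram_docs = [text.split() for text in corpus]
shingle_docs = [shingles(tokens) for tokens in unigram_docs]

print(shingles("the the concert madrid".split()))
# ['the the', 'the concert', 'concert madrid']

# "the" occurs in every document, "the the" in only one, so the shingle is rarer.
print(idf("the", unigram_docs) < idf("the the", shingle_docs))  # True
```

In this toy corpus the single word "the" carries no weight at all, while the shingle "the the" is rare and therefore contributes much more to a match.
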

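You would not query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (with stopword removal disabled) with a query against the shingled field.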

We can use a multi-field to analyze the text field in two different ways:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "text" : {
               "fields" : {
                  "shingle" : {
                     "type" : "string",
                     "analyzer" : "shingle"
                  },
                  "text" : {
                     "type" : "string",
                     "analyzer" : "no_stop"
                  }
               },
               "type" : "multi_field"
            }
         }
      }
   },
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_stop" : {
               "stopwords" : "",
               "type" : "standard"
            },
            "shingle" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "shingle"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

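Then use a multi_match query to search both versions of the field, giving the shingled version more "boost"/relevance. In this example, text.shingle^2 means that we want to boost that field by a factor of 2: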

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "multi_match" : {
         "fields" : [
            "text",
            "text.shingle^2"
         ],
         "query" : "the the concert madrid"
      }
   }
}
'
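If the request body is built in application code rather than typed by hand, the `field^boost` notation can be generated from a mapping of field names to boosts. The helper below is a hypothetical sketch (the name `multi_match_body` and its signature are not part of any Elasticsearch client); it only assembles the same JSON that the curl command above sends.

```python
import json
from typing import Dict


def multi_match_body(query: str, boosts: Dict[str, float]) -> dict:
    """Build a multi_match request body; boosts other than 1 become field^boost."""
    fields = [
        name if boost == 1 else f"{name}^{boost:g}"
        for name, boost in boosts.items()
    ]
    return {"query": {"multi_match": {"query": query, "fields": fields}}}


body = multi_match_body("the the concert madrid", {"text": 1, "text.shingle": 2})
print(json.dumps(body, indent=3))
```

The resulting `fields` list is `["text", "text.shingle^2"]`, so the printed JSON can be passed as the `-d` payload of the search request shown above.
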

[Discussion]: