Elasticsearch Completion Suggester: Real tokenization possible?

【Question title】: Elasticsearch Completion Suggester: Real tokenization possible?
【Posted】: 2019-07-28 10:09:30
【Question description】:

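As described here, a field in Elasticsearch that is defined with type "completion" together with some analyzer and tokenizer is first split according to the underlying logic of those components and then "stitched" back together. I am very unhappy with this behavior.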

Here is my current mapping setup (excerpt):

"mappings": {
    "movie": {
      "properties": {
        "title": {
          "analyzer": "standard",
          "fields": {
            "autocomplete": {
              "type": "completion",
              "analyzer": "whitespace"
            }
          },
          "type": "string"
        }
      }
    }
}

Take a movie with the title Harry Potter as an example:

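When I type the prefix Har, I get the suggestion Harry Potter. When I type Pot instead, I get no results at all, because after analysis/tokenization the individual tokens Harry and Potter are immediately stitched back together into Harry Potter.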

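What I want instead is the following behavior: when I type Pot, I want the Completion Suggester to return Potter. Not Harry Potter, just Potter. Is that possible? Caveat: I don't even need a reference to the document the suggestion was created from. So if it were possible to somehow put all generated tokens into one container and then retrieve suggestions from there, that would be great (because I have to do some other things as well).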
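To make the requested behavior concrete, here is a minimal Python sketch of the "one container of tokens" idea. It is not Elasticsearch code and says nothing about how the Completion Suggester works internally. It assumes plain whitespace tokenization and case-sensitive matching; the function names `build_token_pool` and `suggest` and the second sample title are hypothetical, added only for illustration.

```python
import bisect


def build_token_pool(titles):
    """Split every title on whitespace and collect the unique tokens, sorted."""
    tokens = set()
    for title in titles:
        tokens.update(title.split())
    return sorted(tokens)


def suggest(pool, prefix):
    """Return all tokens in the sorted pool that start with the given prefix."""
    # Tokens sharing a prefix sit next to each other in sorted order,
    # so a binary search finds where that run begins.
    start = bisect.bisect_left(pool, prefix)
    matches = []
    for token in pool[start:]:
        if not token.startswith(prefix):
            break
        matches.append(token)
    return matches


# "Hot Fuzz" is a made-up second title so the pool has more than two tokens.
pool = build_token_pool(["Harry Potter", "Hot Fuzz"])
print(suggest(pool, "Pot"))  # ['Potter']
print(suggest(pool, "H"))    # ['Harry', 'Hot']
```

The suggestions carry no reference back to the originating document, which matches the caveat above.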

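【Question discussion】:

Did you get anywhere with this? I want to do the same thing.
Yes, I did. I will tell you tomorrow at work, when I have the code at hand ;)
@XDAF Could you share the solution? I asked the same question here - stackoverflow.com/questions/70355182/…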

Tags: elasticsearch search autocomplete

【Solution 1】:

I am doing something very similar using the edge_ngram tokenizer. Here is the official documentation.

Your settings need to include the following:

{
  "settings" : {
    "index" : {
      "number_of_shards" : "5",
      "analysis" : {
        "analyzer" : {
          "autocomplete": {
            "type": "custom",
            "tokenizer": "autocomplete",
            "filter": [
                "lowercase"
            ]
          }
        },
        "tokenizer": {
          "autocomplete": {
            "type": "edge_ngram",
            "min_gram": 3,
            "max_gram": 20,
            "token_chars": [
              "letter",
              "digit"
            ]
          }
        }
      }
    }
  }
}

Your mapping needs to be changed so that the field uses "analyzer": "autocomplete".

【Discussion】:

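But then the movie with the title Harry Potter is split into the tokens Har, Harr, Harry, Pot, Pott, Potte, Potter. I only want to split it into two tokens and then use the logic of the Completion Suggester's binary search tree.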
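The token list in this comment can be reproduced with a small Python approximation of the tokenizer settings from the answer (min_gram 3, max_gram 20, token_chars letter and digit). This is a rough sketch of what an edge n-gram tokenizer emits, not Elasticsearch's own implementation; the function name `edge_ngrams` is hypothetical. The answer's analyzer also applies a lowercase filter, which is left out here so the output matches the listing above.

```python
import re


def edge_ngrams(text, min_gram=3, max_gram=20):
    """Emit the prefixes of each letter/digit run, min_gram to max_gram characters long."""
    grams = []
    # Runs of letters and digits approximate token_chars ["letter", "digit"].
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, min(len(word), max_gram) + 1):
            grams.append(word[:size])
    return grams


print(edge_ngrams("Harry Potter"))
# ['Har', 'Harr', 'Harry', 'Pot', 'Pott', 'Potte', 'Potter']
```

Seven tokens come out instead of the two (Harry, Potter) that the asker wants, which is the objection raised in the comment.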