What is the best practice of fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8

【Question title】: What is the best practice of fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8
【Posted】: 2020-09-16 02:51:27
【Question】:

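Background: I am using MySQL with several million rows, each with twenty columns. We have some complex searches, and some columns use fuzzy matching such as username LIKE '%aaa%'. A pattern like this cannot use a MySQL index unless the leading % is removed, but we need substring matching so that search behaves like the Stack Overflow search. I also looked at the MySQL fulltext index, but it does not support a complex search in a single SQL statement when other indexes are used as well.

My solution: add Elasticsearch as our search engine, write the data to both MySQL and ES, and search only in Elasticsearch.

I looked into fuzzy search in Elasticsearch. wildcard works, but many people advise against putting * at the start of the pattern, because it makes the search very slow.

For example: username: 'John_Snow'

wildcard works, but may be slow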

GET /user/_search
{
  "query": {
    "wildcard": {
      "username": "*hn*"
    }
  }
}
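To make the cost difference concrete, here is a rough Python sketch of why a leading wildcard is expensive. It models the term dictionary as a sorted list, which is a simplification and not how Lucene is actually implemented; the function names `prefix_scan` and `infix_scan` and the sample terms are made up for illustration.

```python
from bisect import bisect_left

def prefix_scan(sorted_terms, prefix):
    """Return (matches, terms_examined) for a pattern like 'jo*'."""
    # A known prefix lets us jump straight to the first candidate term.
    start = bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches, examined

def infix_scan(sorted_terms, fragment):
    """Return (matches, terms_examined) for a pattern like '*hn*'."""
    # With no fixed prefix, sorted order does not help: every term is checked.
    matches = [t for t in sorted_terms if fragment in t]
    return matches, len(sorted_terms)

terms = sorted(["alice", "bob", "carol", "john_snow", "johnny", "zed"])
print(prefix_scan(terms, "jo"))  # examines 3 terms
print(infix_scan(terms, "hn"))   # examines all 6 terms
```

In this model the work for '*hn*' grows with the number of distinct terms in the field, while the work for 'jo*' grows only with the number of matches.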

match_phrase does not work; it seems to apply only to phrases that the tokenizer splits into words, such as "John Snow"

{
  "query": {
    "match_phrase": {
      "dbName": "hn"
    }
  }
}
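The reason "hn" finds nothing here can be sketched in a few lines of Python. This is a simplified model, assuming a word-level analyzer that only lowercases and splits on whitespace; real analyzers do more, and match_phrase also checks token positions, which is left out. `simple_analyze` and `term_matches` are hypothetical names.

```python
def simple_analyze(text):
    """Lowercase and split on whitespace; a stand-in for a word-level analyzer."""
    return text.lower().split()

def term_matches(indexed_text, query_text):
    """A query matches only if every query token equals a whole indexed token."""
    indexed = set(simple_analyze(indexed_text))
    return all(tok in indexed for tok in simple_analyze(query_text))

print(term_matches("John Snow", "snow"))  # True: "snow" is a whole token
print(term_matches("John Snow", "hn"))    # False: "hn" is only part of "john"
print(term_matches("John_Snow", "hn"))    # False: "hn" is only part of a token
```

Matching happens on whole tokens, so a fragment inside a token is invisible to the query unless the fragment itself was indexed as a token.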

My question: is there a better solution for running complex queries that include fuzzy matching, such as '%no%' or '%hn_Sn%'?

【Question comments】:

Tags: elasticsearch wildcard

【Solution 1】:

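You can use the ngram tokenizer, which first breaks text into words whenever it encounters one of a list of specified characters, and then emits N-grams of each word of the specified length.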
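As a rough illustration of that two-step behavior, the following Python sketch mimics an ngram tokenizer whose token characters are letters and digits. It is an approximation written for this thread, not Elasticsearch code: the function name `ngram_tokenize` is made up, and the regex only covers ASCII letters and digits.

```python
import re

def ngram_tokenize(text, min_gram=2, max_gram=10):
    """Split on non-letter/digit characters, then emit every gram of each word."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(word):
                    break  # gram would run past the end of the word
                tokens.append(word[start:start + size])
    return tokens

print(ngram_tokenize("John_Snow"))
# ['Jo', 'Joh', 'John', 'oh', 'ohn', 'hn', 'Sn', 'Sno', 'Snow', 'no', 'now', 'ow']
```

The underscore is neither a letter nor a digit, so it acts as a separator and no gram spans "John" and "Snow".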

Adding a working example with index data, mapping, search query, and result.

Index mapping:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        },
        "max_ngram_diff": 50
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "standard"
            }
        }
    }
}

Analyze API

POST /test/_analyze

{
  "analyzer": "my_analyzer",
  "text": "John_Snow"
}

The tokens are:

{
    "tokens": [
        {
            "token": "Jo",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "Joh",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "John",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "oh",
            "start_offset": 1,
            "end_offset": 3,
            "type": "word",
            "position": 3
        },
        {
            "token": "ohn",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 4
        },
        {
            "token": "hn",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 5
        },
        {
            "token": "Sn",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 6
        },
        {
            "token": "Sno",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 7
        },
        {
            "token": "Snow",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 8
        },
        {
            "token": "no",
            "start_offset": 6,
            "end_offset": 8,
            "type": "word",
            "position": 9
        },
        {
            "token": "now",
            "start_offset": 6,
            "end_offset": 9,
            "type": "word",
            "position": 10
        },
        {
            "token": "ow",
            "start_offset": 7,
            "end_offset": 9,
            "type": "word",
            "position": 11
        }
    ]
}

Index data:

{
  "title":"John_Snow"
}

Search query:

{
    "query": {
        "match" : {
            "title" : "hn"
        }
    }
}

Search result:

"hits": [
    {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
            "title": "John_Snow"
        }
    }
]

Another search query:

{
    "query": {
        "match" : {
            "title" : "ohr"
        }
    }
}

The search query above returns no results.
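Why "hn" matches and "ohr" does not depends on which analyzer runs at search time. The Python sketch below is a simplified model of that, under these assumptions: `ngrams` approximates the ngram tokenizer for ASCII letters and digits, `whole_tokens` stands in for the standard analyzer (lowercase, split on whitespace only), and `match` models a match query with its default OR operator. All three names are hypothetical.

```python
import re

def ngrams(text, min_gram, max_gram):
    """Case-preserving grams of each letter/digit run (simplified ngram tokenizer)."""
    out = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for i in range(len(word)):
            for n in range(min_gram, max_gram + 1):
                if i + n <= len(word):
                    out.add(word[i:i + n])
    return out

def whole_tokens(text):
    """Stand-in for the standard analyzer: lowercase, split on whitespace."""
    return set(text.lower().split())

def match(indexed_terms, query_terms):
    """match query with the default OR operator: any shared term is a hit."""
    return bool(indexed_terms & query_terms)

doc = "John_Snow"

# Same ngram analyzer at index and search time, min_gram = max_gram = 2.
idx = ngrams(doc, 2, 2)
print(match(idx, ngrams("ohr", 2, 2)))   # True: the gram "oh" is shared

# Grams of length 2..10 at index time, whole tokens at search time.
idx = ngrams(doc, 2, 10)
print(match(idx, whole_tokens("hn")))    # True
print(match(idx, whole_tokens("ohr")))   # False
print(match(idx, whole_tokens("John")))  # False: query is "john", index holds "John"
```

The last line points at a side effect worth checking in a real index: the standard analyzer lowercases the query, while an ngram tokenizer without a lowercase filter keeps the original case, so queries containing letters that were uppercase in the source may stop matching.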

【Comments】:

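Hi @Bhavya, thanks for your answer. The ngram tokenizer works; I will test query performance on millions of rows later.
@KDFinal Glad it worked for you :) Could you please accept and upvote my answer, since it helped you solve your problem :)
Hi @Bhavya, I have another question. The n-gram tokenizer splits words according to its min_gram and max_gram. For example, with name="John_Snow" and min_gram=2, max_gram=2, if I search "hn", "John_Snow" is returned, but when I search "ohr" it also returns "John_Snow". With MySQL LIKE '%ohr%', 'ohr' should return an empty result. Do you have any idea how to change the tokenizer strategy?
@KDFinal Please check my updated index mapping. Now if you search for ohr, it gives an empty result (as it should). Let me know if you have further questions about this :)
After adding "search_analyzer": "standard", no data can be found at all. Could you help look into it? Thanks a lot!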