Getting data without matching the full string in an Elasticsearch query: answers

【Question title】: getting data without matching full string in elastic search query
【Posted】: 2019-12-20 16:18:22
【Question description】:

My data is stored in Elasticsearch in the following format:

 {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca963f84470d852",
            "_score": 0.69321066,
            "_source": {
                "email": "test20011@gmail.com",
                "wallet": "test20011@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "7916318809",
                "name": "test20011"
            }
        },
        {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca9634d1c70d856",
            "_score": 0.69321066,
            "_source": {
                "email": "test50011@gmail.com",
                "wallet": "test50011@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "3483330496",
                "name": "test50011"
            }
        },
        {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca96304b370d857",
            "_score": 0.69321066,
            "_source": {
                "email": "test110021@gmail.com",
                "wallet": "test110021@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "2744697207",
                "name": "test110021"
            }
        }

With the query below, no records should be found, yet records are returned:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet": {
                            "query": "operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                },
                {
                    "match": {
                        "email": {
                            "query": "operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                }
            ]
        }
    }
}

If I send the query below, the record should be found:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet": {
                            "query": "test20011@operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                },
                {
                    "match": {
                        "email": {
                            "query": "test20011@operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                }
            ]
        }
    }
}

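I have created indexes on the email and wallet fields. When a user searches by email or wallet, I cannot tell whether the string they sent is an email address or a wallet address, so I use bool.

A record should be found only if the user sends the complete email address or the complete wallet address. Please help me find a solution.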

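【Question discussion】:

You are trying to search for the same string (an email address) in two different fields (email and wallet). Unfortunately, I have trouble understanding "if we are using below string...". Can you rephrase your question and your expectations?
Could you add the mapping of your ES index and tell us which version of ES you are using?
Hi @DanielSchneiter, I have an index on the wallet and email fields. I receive "operatorqa2.akeodev.com" from the user and need to search for that value in the wallet or email field. This string should not match any data in my ES, but I am getting data back. My requirement is that if the user sends a complete string such as "test110011@operatorqa2.akeodev.com", the record must be returned, but if the user sends only "operatorqa2.akeodev.com", no record should be found.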

Tags: elasticsearch tokenize

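【Solution 1】:

As other community members have mentioned, when asking this kind of question you should state which version of Elasticsearch you are using and provide the mapping.

From Elasticsearch version 5 onward, with the default mapping, you only need to change your query so that it targets the exact version of the field instead of the analyzed one. By default, Elasticsearch maps strings to a multi-field of type text (analyzed, for full-text search) and keyword (not analyzed, for exact-match search). In your query you would therefore target the <fieldname>.keyword fields: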

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet.keyword": "test20011@operatorqa2.akeodev.com"
                    }
                },
                {
                    "match": {
                        "email.keyword": "test20011@operatorqa2.akeodev.com"
                    }
                }
            ]
        }
    }
}
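
As a minimal sketch, the same exact-match lookup can be built programmatically. It assumes the default dynamic mapping, so that wallet.keyword and email.keyword exist. The helper name build_exact_lookup is hypothetical, and term is used here in place of match; on a keyword subfield without a normalizer the two should behave the same, since the query string is not split into tokens.

```python
import json


def build_exact_lookup(value):
    """Build a search body that matches only when `value` equals the whole
    stored wallet or email string (hypothetical helper, not an ES API)."""
    return {
        "query": {
            "bool": {
                # With only `should` clauses, at least one of them must match.
                "should": [
                    # Assumes the default dynamic mapping, which adds a
                    # `.keyword` subfield to every string field.
                    {"term": {"wallet.keyword": value}},
                    {"term": {"email.keyword": value}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent to the index's _search endpoint.
    print(json.dumps(build_exact_lookup("test20011@operatorqa2.akeodev.com"), indent=2))
```

Because the keyword subfield stores the whole string as one term, a partial value such as operatorqa2.akeodev.com does not equal any stored term and returns no hits.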

If you are using a version of Elasticsearch earlier than 5, change the index property from analyzed to not_analyzed and reindex your data.

Mapping snippet:

{
  "email": {
    "type": "string",
    "index": "not_analyzed"
  }
}

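Your query still does not need the and operator. It looks the same as the query I posted above, except that you query the email and wallet fields rather than email.keyword and wallet.keyword.

I can recommend the following blog post from Elastic on this topic: Strings are dead, long live strings!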

【Discussion】:

Thanks, your solution works well for me. My ES version is 7.4.2.

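【Solution 2】:

Since I don't have the mapping of your index, I am assuming you use the ES defaults (you can retrieve the mapping with the mapping API). In that case the wallet and email fields are defined as text, and the default analyzer is the standard analyzer.

This analyzer does not recognize these values as email addresses and creates three tokens for test50011@operatorqa2.akeodev.com, which you can check with the analyze API: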

http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=standard

{
  "tokens": [
    {
      "token": "test50011",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "operatorqa2",
      "start_offset": 10,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "akeodev.com",
      "start_offset": 22,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
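
To see why the partial string still matches, a toy model in plain Python (not Elasticsearch itself) may help: a match query with "operator": "and" succeeds when every query token is among the tokens indexed for the field. The indexed tokens are copied from the output above. The token list for the partial string is an assumption that it is split by the same rules, and match_with_and is a hypothetical name.

```python
def match_with_and(indexed_tokens, query_tokens):
    """Toy model of `match` with "operator": "and": the document matches
    only if every query token is among the tokens indexed for the field."""
    return set(query_tokens) <= set(indexed_tokens)


# Tokens indexed for the stored value, copied from the _analyze output.
indexed = ["test50011", "operatorqa2", "akeodev.com"]

# Assumption: the partial string is split by the same rules as the stored value.
partial_query = ["operatorqa2", "akeodev.com"]

# Both query tokens are indexed, so the unwanted match occurs.
print(match_with_and(indexed, partial_query))  # True

# A different local part is not indexed, so this query would not match.
print(match_with_and(indexed, ["test99999", "operatorqa2", "akeodev.com"]))  # False
```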

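What you need is a custom analyzer for email addresses that uses the UAX URL Email tokenizer (uax_url_email), which is meant for fields that hold email addresses. It produces a single token for test50011@operatorqa2.akeodev.com, as shown below: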

http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=uax_url_email

{
  "tokens": [
    {
      "token": "test50011@operatorqa2.akeodev.com",
      "start_offset": 0,
      "end_offset": 33,
      "type": "<EMAIL>",
      "position": 1
    }
  ]
}

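Now you can see that test50011@operatorqa2.akeodev.com is not split, so a search for the same string produces the same single token, and ES matches token against token. A partial string such as operatorqa2.akeodev.com no longer matches, because it cannot produce that full-address token.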

Let me know if you need any help; it is very simple to set up and use.
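
The setup could look roughly like the sketch below, which builds the request bodies as Python dicts and prints them as JSON. It assumes Elasticsearch 7.x (typeless mappings) and a newly created index, since the analyzer of an existing field cannot be changed without reindexing. The analyzer name email_analyzer and the lowercase filter are choices made for this example, not something the question specifies.

```python
import json

# Hypothetical analyzer name chosen for this sketch.
ANALYZER_NAME = "email_analyzer"

# Body for creating a new index (PUT /<index-name>), typeless 7.x format.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    # Keeps email addresses and URLs as single tokens.
                    "tokenizer": "uax_url_email",
                    # Assumption: case-insensitive matching is wanted.
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "email": {"type": "text", "analyzer": ANALYZER_NAME},
            "wallet": {"type": "text", "analyzer": ANALYZER_NAME},
        }
    },
}

# Body for POST /<index-name>/_analyze, to check the tokens once the index exists.
analyze_body = {"analyzer": ANALYZER_NAME, "text": "test50011@operatorqa2.akeodev.com"}

if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
    print(json.dumps(analyze_body, indent=2))
```

With this mapping the original match queries on email and wallet can stay as they are, because the query string is analyzed with the same analyzer at search time.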

【Discussion】: