How to find similar tags from text using Elasticsearch

[Question title]: How to find similar tags from text using Elasticsearch
[Posted]: 2020-07-01 18:57:06
[Question]:

I am trying to use Elasticsearch to find the tags most similar to a piece of text.

For example, I create test_index and insert two documents:

POST test_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

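So I want to find the "software" tag (its text or ID) from the text "I am using some software and applications".

I hope someone can provide an example of how to do this, or at least point me in the right direction.

Thanks.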

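[Comments on the question]:

Tags: elasticsearch lucene text-mining

[Solution 1]:

What you are looking for is a concept called stemming. You need to create a custom analyzer that uses the stemmer token filter.

Below are the mapping, sample documents, query, and response:

Mapping: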

PUT my_stem_index
{
  "settings": {
      "analysis" : {
          "analyzer" : {
              "my_analyzer" : {
                  "tokenizer" : "standard",
                  "filter" : ["lowercase", "my_stemmer"]
              }
          },
          "filter" : {
              "my_stemmer" : {
                  "type" : "stemmer",
                  "name" : "english"
              }
          }
      }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "tags":{
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword":{
            "type": "keyword"
          }
        }
      }
    }
  }
}

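Judging from the comments, you are using mapping types (Elasticsearch 6.8), so the properties must be nested under the _doc type: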

PUT my_stem_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "my_stemmer"
               ]
            }
         },
         "filter":{
            "my_stemmer":{
               "type":"stemmer",
               "name":"english"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "id":{
               "type":"keyword"
            },
            "tags":{
               "type":"text",
               "analyzer":"my_analyzer",
               "fields":{
                  "keyword":{
                     "type":"keyword"
                  }
               }
            }
         }
      }
   }
}

Sample documents:

POST my_stem_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST my_stem_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

POST my_stem_index/_doc/21
{
  "id": 21,
  "tags": ["softwares and applications", "hardwares and storage devices"]
}

Request query:

POST my_stem_index/_search
{
  "query": {
    "match": {
      "tags": "software"
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5908618,
    "hits" : [
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.5908618,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "hardware"
          ]
        }
      },
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.35965496,
        "_source" : {
          "id" : 21,
          "tags" : [
            "softwares and applications",             <--- Note that `softwares` was also matched.
            "hardwares and storage devices"
          ]
        }
      }
    ]
  }
}

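Note that both documents (those with _id 20 and 21) are returned.

Additional notes:

If you are new to Elasticsearch, I suggest spending some time on the concept of analysis and on how Elasticsearch implements it through analyzers.

That will help you understand why a query for just software also returns the document containing softwares and applications, and vice versa.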
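
To make the effect of analysis concrete, here is a minimal Python sketch. It is not Elasticsearch code and it does not reproduce the english stemmer: analyze is a hypothetical toy analyzer that lowercases, splits on non-alphanumeric characters, and strips a trailing plural "s", and search is a hypothetical helper that returns the ids of documents sharing at least one token with the query. The only point is that the same analysis runs on both the indexed tags and the query text, so software and softwares end up as the same token.

```python
import re

def analyze(text):
    """Toy stand-in for an analyzer (NOT the Elasticsearch english stemmer):
    lowercase, split on non-alphanumerics, strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]

# The three sample documents from the answer, keyed by id.
docs = {
    17: ["it", "devops", "server"],
    20: ["software", "hardware"],
    21: ["softwares and applications", "hardwares and storage devices"],
}

def search(query):
    """Return ids of documents whose analyzed tags share a token with the query."""
    query_tokens = set(analyze(query))
    hits = []
    for doc_id, tags in docs.items():
        doc_tokens = {tok for tag in tags for tok in analyze(tag)}
        if query_tokens & doc_tokens:
            hits.append(doc_id)
    return hits

print(search("software"))
print(search("I am using some software and applications"))
```

Both calls return documents 20 and 21 in this toy version. With the full sentence, document 21 also matches on the common word and, so a real setup would probably need stop word handling or a score threshold; that is outside the scope of this sketch.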

Hope this helps!

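[Comments]:

Sorry, I don't follow. Even the default standard analyzer creates the token software for the text software, so why use a stemmer token filter when the text has nothing to stem? I think the OP's question is unclear.
@es-enthu If you look closely, even when he searches for software (singular), he wants documents containing softwares (plural) to be returned. Another example: if he searches for the word wait, he should also get documents containing waiting, waited, and so on. I think that is what the OP is looking for.
@OpsterESNinja-Kamal Your mapping does not work. It fails with "reason": "Failed to parse mapping [properties]: Root mapping definition has unsupported parameters: [id : {type=keyword}] [tags : {analyzer=my_analyzer, type=text, fields={keyword={type=keyword}}}]". I am using version 6.8.
@OpsterESNinja-Kamal I found the cause of the problem: you did not put the document type "_doc" after mappings. Please edit your answer. Thanks.
@alireza I tested the solution on Elasticsearch 7.6. Since version 7, types have been deprecated, so the default type is _doc. Please include all of these details in any future question so that we can answer more easily. In any case, I will update my answer accordingly.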

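[Solution 2]:

If you are searching for text that shares a base word or root with the tags, stemming is a good approach.

If you need to find the words most similar to those in the text, n-grams are more suitable.

If you are searching the tags for exact words or word sequences from the text, shingles are the better approach.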
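
As a rough illustration of the last two options, here is a small Python sketch of what character n-grams and word shingles look like. It is not Elasticsearch code: char_ngrams, word_shingles, and ngram_similarity are hypothetical helper names, the trigram size is an arbitrary choice, and the Jaccard overlap used for scoring is an assumption for the example, not how Elasticsearch scores matches.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def word_shingles(text, size=2):
    """List of word n-grams (shingles) of the given size, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Near-identical spellings share most trigrams; unrelated tags share few.
print(ngram_similarity("software", "softwares"))
print(ngram_similarity("software", "hardware"))

# Shingles keep word order, which is what makes exact phrase matching possible.
print(word_shingles("softwares and applications"))
```

Here software and softwares share six of their seven distinct trigrams, while software and hardware share only two of ten, which is why n-grams suit "most similar word" lookups. The shingle list for the third call is "softwares and" and "and applications".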

[Comments]: