Get the number of occurrences of a particular term in an Elasticsearch field: answers

【Question title】: Get the number of appearances of a particular term in an elasticsearch field
【Posted】: 2020-03-26 10:54:04
【Question description】:

I have an Elasticsearch index (posts) with the following mapping:

{
    "id": "integer",
    "title": "text",
    "description": "text"
}

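I want to find the number of occurrences of a particular term in the description field of one specific document only (I have the document ID and the term to look for).

For example, I have a post like this: {id: 123, title:"some title", description: "my city is LA, this post description has two occurrences of word city"}.

I have the document ID/post ID of this post and just want to know how many times the word "city" appears in the description of this particular post. (In this case, the result should be 2.)

I can't seem to find a way to do this search. I don't want the occurrences across all documents, just for a single document and within one of its fields. Please suggest a query for this. Thanks

Elasticsearch version: 7.5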

【Question discussion】:

Tags: elasticsearch elasticsearch-dsl

【Solution 1】:

You can use a terms aggregation on description, but you need to make sure its fielddata is set to true.

PUT kamboh/
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "fields": {
          "simple_analyzer": {
            "type": "text",
            "fielddata": true,
            "analyzer": "simple"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Ingesting a sample doc:

PUT kamboh/_doc/1
{
  "id": 123,
  "title": "some title",
  "description": "my city is LA, this post description has two occurrences of word city "
}

Aggregating:

GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_agg": {
      "terms": {
        "field": "description.simple_analyzer",
        "size": 20
      }
    }
  }
}

Yielding:

"aggregations" : {
    "terms_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "city",
          "doc_count" : 1
        },
        {
          "key" : "description",
          "doc_count" : 1
        },
        ...
      ]
    }
  }

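Now, as you can see, the simple analyzer split the string into words and lowercased them, but the duplicate city is gone too! The buckets only report doc_count, i.e. how many documents contain the term, not how often the term occurs inside one document. I couldn't think of an analyzer that would preserve the duplicates... That said,

it's recommended to do these word counts before you index!

You can split your string by whitespace and index the words as an array instead of one long string.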
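The "count before indexing" advice could look roughly like the sketch below. It is not part of the original answer: the helper names and the `description_word_counts` field are hypothetical, and the regex is only a rough stand-in for what the `simple` analyzer does (lowercase, split on non-letters).

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, split it on non-letters, and count each word."""
    # Rough stand-in for the `simple` analyzer: letters only, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


def build_document(post_id, title, description):
    """Return the post plus a hypothetical `description_word_counts` field."""
    return {
        "id": post_id,
        "title": title,
        "description": description,
        # Hypothetical field name, not part of the original mapping.
        "description_word_counts": dict(count_words(description)),
    }


doc = build_document(
    123,
    "some title",
    "my city is LA, this post description has two occurrences of word city ",
)
print(doc["description_word_counts"]["city"])  # prints 2
```

The resulting dict can then be sent as the document body when indexing. Note that an object keyed by word creates one mapped field per distinct word, so for a large vocabulary a list of word/count pairs may be the safer shape.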

This is also possible at search time, although it is very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:

GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_script": {
      "scripted_metric": {
        "params": {
          "word_of_interest": ""
        },
        "init_script": "state.map = [:];",
        "map_script": """
              // Skip documents that have no value in the keyword sub-field.
              if (!doc.containsKey('description.keyword') || doc['description.keyword'].size() == 0) return;

              def split_by_whitespace = / /.split(doc['description.keyword'].value);

              for (def word : split_by_whitespace) {
                 // When a word of interest is given, skip every other word.
                 if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
                   continue;
                 }

                 if (state.map.containsKey(word)) {
                   state.map[word] += 1;
                   continue;
                 }

                 state.map[word] = 1;
              }
""",
        "combine_script": "return state.map;",
        "reduce_script": "return states;"
      }
    }
  }
}

yielding

...
"aggregations" : {
    "terms_script" : {
      "value" : [
        {
          "occurrences" : 1,
          "post" : 1,
          "city" : 2,  <------
          "LA," : 1,
          "of" : 1,
          "this" : 1,
          "description" : 1,
          "is" : 1,
          "has" : 1,
          "my" : 1,
          "two" : 1,
          "word" : 1
        }
      ]
    }
  }
...
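Because the reduce_script simply returns states, the value array holds one map per shard, so on a multi-shard index the counts still have to be added up on the client. A minimal sketch of that step, assuming the response has already been parsed into a Python dict; the function name and the second shard entry in the sample are made up for illustration:

```python
from collections import Counter


def merge_shard_counts(response):
    """Sum the per-shard word maps returned under aggregations.terms_script.value."""
    total = Counter()
    for shard_map in response["aggregations"]["terms_script"]["value"]:
        # A shard may contribute nothing; skip empty or null entries.
        if shard_map:
            total.update(shard_map)
    return total


# Hand-written response in the shape shown above, with a second, made-up shard entry.
sample_response = {
    "aggregations": {
        "terms_script": {
            "value": [
                {"city": 2, "my": 1, "LA,": 1},
                {"city": 1, "is": 1},
            ]
        }
    }
}

counts = merge_shard_counts(sample_response)
print(counts["city"])  # prints 3
```

The same merge could also be done inside the reduce_script itself; doing it client-side just keeps the Painless part short.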

【Discussion】: