Search with filter by token count

【Question title】: Search with filter by token count
【Posted】: 2015-06-09 13:31:03
【Question】:

The fields of a document are analyzed, which produces tokens.

1. {"message":"hello world"} -> tokens: ["hello", "world"]
2. {"message":"hello"} -> tokens: ["hello"]
3. {"message":"world"} -> tokens: ["world"]
4. {"message":"hello java"} -> tokens: ["hello", "java"]
5. {"message":"java"} -> tokens: ["java"]

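Is it possible to search for all documents in which a specific field contains a given token plus one or more other tokens?

For the examples above, the result for the token "hello" would be:
- 1, 4
For "world":
- 1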

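As described under termvectors, it is possible to access the tokens, or statistics about them. However, this only works for a specific document, not as a search filter in a query or in aggregations.
It would be great if someone could help.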

【Question comments】:

Tags: elasticsearch token

【Solution 1】:

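Yes, you can use the token_count type for this. For example, in your mapping you can define message as a multi-field that contains both the message itself (i.e. "hello", "hello world", etc.) and the number of tokens in the message. You can then include a constraint on the word count in your queries.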

So the mapping for message should look like this:

# message is a normal analyzed string;
# message.word_count is a sub-field that holds the word (token) count
curl -XPUT localhost:9200/tests -d '
{
  "mappings": {
    "test": {
      "properties": {
        "message": {
          "type": "string",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}'
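
To make the role of the word_count sub-field concrete, here is a small Python sketch of what it would conceptually hold for the five sample documents from the question. It runs locally and does not talk to Elasticsearch. The `tokenize` helper is hypothetical: splitting on non-word characters and lowercasing is only a rough stand-in for the standard analyzer, which is more elaborate.

```python
import re

# Sample documents from the question, keyed by their position (1-5).
DOCS = {
    1: "hello world",
    2: "hello",
    3: "world",
    4: "hello java",
    5: "java",
}


def tokenize(text):
    # Assumption: lowercase + split on non-word characters approximates
    # the standard analyzer well enough for these simple messages.
    return [token for token in re.split(r"\W+", text.lower()) if token]


# Conceptually, message.word_count stores len(tokens) for each document.
word_counts = {doc_id: len(tokenize(msg)) for doc_id, msg in DOCS.items()}
print(word_counts)  # {1: 2, 2: 1, 3: 1, 4: 2, 5: 1}
```

Only documents 1 and 4 have a count greater than 1, which is what a range filter on message.word_count can select.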

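You can then query for all documents whose message contains hello, restricted to those whose message has more than one token. With the following query you will only get hello java and hello world, but not hello: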

curl -XPOST localhost:9200/tests/test/_search -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "message": "hello"
          }
        },
        {
          "range": {
            "message.word_count": {
              "gt": 1
            }
          }
        }
      ]
    }
  }
}'

Likewise, if you replace hello with world in the query above, you will only get hello world.
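
The same query can be parameterized by search term. The Python sketch below builds the request body for any term and, as a sanity check, emulates the match-plus-range logic locally on the sample documents from the question. `build_query` and `emulate` are hypothetical helper names, and the whitespace split in `emulate` is only an assumed approximation of the standard analyzer, so this is a way to reason about the expected hits, not a replacement for running the query.

```python
import json

# Sample documents from the question, keyed by their position (1-5).
DOCS = {
    1: "hello world",
    2: "hello",
    3: "world",
    4: "hello java",
    5: "java",
}


def build_query(term, min_tokens=1):
    """Build a search body: message matches `term` and has more than `min_tokens` tokens."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": term}},
                    {"range": {"message.word_count": {"gt": min_tokens}}},
                ]
            }
        }
    }


def emulate(term, docs, min_tokens=1):
    """Local emulation of the query above (assumption: whitespace tokenization)."""
    hits = []
    for doc_id, message in sorted(docs.items()):
        tokens = message.lower().split()
        if term in tokens and len(tokens) > min_tokens:
            hits.append(doc_id)
    return hits


print(json.dumps(build_query("hello"), indent=2))
print(emulate("hello", DOCS))  # [1, 4]
print(emulate("world", DOCS))  # [1]
```

The emulated hits agree with the results the question asks for: documents 1 and 4 for "hello", and document 1 for "world".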

【Comments】: