【Question title】: Elasticsearch query to get the list of documents with some minimum occurrence of a property
【Posted】: 2021-11-12 06:04:42
【Question description】:

I have an index containing documents like these:

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-23"
   ...
 },
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 },
 {
   "customer_id" : "345",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-25"
   ...
 }
]

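I want to get the list of all documents from a particular country (e.g. USA) for which the same customer_id occurs at least 2 times within a given time range. With the data above, it should return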

[
 {
   "customer_id" : "123",
   "country": "USA",
   "department": "IT",
   "creation_date" : "2021-06-24"
   ...
 }
]

So far, I have tried the ES query below

POST /index_name/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "creation_date": {
              "gte": "2021-06-23",
              "lte": "2021-08-23"
            }
          }
        },
        {
          "match": {
            "country": "USA"
          }
        }
      ]
    }
  },
  "aggs": {
    "customer_agg": {
      "terms": {
        "field": "customer_id",
        "min_doc_count": 2
      }
    }
  }
}

The query above returns the following result

"hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.5587491,
    "hits" : [...]
  },
  "aggregations" : {
    "person_agg" : {
      "doc_count_error_upper_bound" : 1,
      "sum_other_doc_count" : 1,
      "buckets" : [
        {
          "key" : "customer_id",
          "doc_count" : 2
        }
      ]
    }
  }

I don't need the list of buckets in the response, only the list of documents that satisfy the condition. How can I achieve this?

【Question comments】:

    Tags: elasticsearch elasticsearch-5


    【Solution 1】:

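    At first glance, I noticed that your search query targets a field named creation_timestamp, whereas in the documents' mapping the date field you want to range-check is named creation_date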

    I decided to test this locally on Elasticsearch 7.10; this is the setup I used

    PUT /test-index-v1
    
    PUT /test-index-v1/_mapping
    {
            "properties": {
                "customer_id": {
                    "type": "keyword"
                },
                "country": {
                    "type": "keyword"
                },
                "department": {
                    "type": "keyword"
                },
                "creation_date": {
                  "type": "date"
                }
            }
    }
    

    As you can see, I map the fields as keyword so that they can be used for sorting, aggregations, and so on.

    After creating the index, I indexed the documents you provided as examples

    POST /test-index-v1/_doc
     {
       "customer_id" : "345",
       "country": "USA",
       "department": "IT",
       "creation_date" : "2021-06-25"
    }
    
    POST /test-index-v1/_doc
     {
       "customer_id" : "123",
       "country": "USA",
       "department": "IT",
       "creation_date" : "2021-06-25"
    }
    
    POST /test-index-v1/_doc
     {
       "customer_id" : "123",
       "country": "USA",
       "department": "IT",
       "creation_date" : "2021-06-24"
    }
    

    Then I ran this search query, which includes a must match on customer_id

    POST /test-index-v1/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "creation_date": {
                  "gte": "2021-06-23",
                  "lte": "2021-08-23"
                }
              }
            },
            {
              "match": {
                "country": "USA"
              }
            },
            {
              "match": {
                "customer_id": "123"
              }
            }
          ]
        }
      },
      "aggs": {
        "customer_agg": {
          "terms": {
            "field": "customer_id",
            "min_doc_count": 2
          }
        }
      }
    }
    

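    This query returns the search hits as well. The aggregation on its own does not return search hits: its buckets contain only keys and document counts.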

    This is the response I received:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.6035349,
        "hits" : [
          {
            "_index" : "test-index-v1",
            "_type" : "_doc",
            "_id" : "vbVD9HsBRVWFAvvZTW-l",
            "_score" : 1.6035349,
            "_source" : {
              "customer_id" : "123",
              "country" : "USA",
              "department" : "IT",
              "creation_date" : "2021-06-25"
            }
          },
          {
            "_index" : "test-index-v1",
            "_type" : "_doc",
            "_id" : "vrVD9HsBRVWFAvvZU29q",
            "_score" : 1.6035349,
            "_source" : {
              "customer_id" : "123",
              "country" : "USA",
              "department" : "IT",
              "creation_date" : "2021-06-24"
            }
          }
        ]
      },
      "aggregations" : {
        "customer_agg" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "123",
              "doc_count" : 2
            }
          ]
        }
      }
    }
    

    I hope this helps with your problem. If you have other questions about Elastic, feel free to comment! :)

    Edit:

    To group by customer_id within a date range, I used this query:

    POST /test-index-v1/_search
    {
      "aggs": {
        "group_by_customer_id": {
          "terms": {
            "field": "customer_id"
          },
          "aggs": {
            "dates_between": {
              "filter": {
                "range": {
                  "creation_date": {
                    "gte": "2020-06-23",
                    "lte": "2021-06-24"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    The response is:

    "aggregations" : {
        "group_by_customer_id" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "123",
              "doc_count" : 2,
              "dates_between" : {
                "doc_count" : 1
              }
            },
            {
              "key" : "345",
              "doc_count" : 1,
              "dates_between" : {
                "doc_count" : 0
              }
            }
          ]
        }
      }
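
The response above still lists customer 345 with a dates_between count of 0, because a filter sub-aggregation only counts documents and does not drop buckets. A bucket_selector pipeline aggregation is one way to remove buckets whose in-range count is too low. The following is a minimal sketch, written as Python dicts so it can be checked without a cluster. The names with_having, surviving_customers and enough_in_range and the threshold values are assumptions for illustration, and the buckets_path "dates_between>_count" is worth verifying against your Elasticsearch version.

```python
import copy

# Aggregations from the query above (names as in the original answer).
EDIT_AGGS = {
    "group_by_customer_id": {
        "terms": {"field": "customer_id"},
        "aggs": {
            "dates_between": {
                "filter": {
                    "range": {
                        "creation_date": {"gte": "2020-06-23", "lte": "2021-06-24"}
                    }
                }
            }
        },
    }
}

# Buckets copied from the response above.
RESPONSE_BUCKETS = [
    {"key": "123", "doc_count": 2, "dates_between": {"doc_count": 1}},
    {"key": "345", "doc_count": 1, "dates_between": {"doc_count": 0}},
]


def with_having(aggs, min_in_range):
    """Return a copy of the aggs with a bucket_selector added (hypothetical helper)."""
    result = copy.deepcopy(aggs)
    result["group_by_customer_id"]["aggs"]["enough_in_range"] = {
        "bucket_selector": {
            # Read the document count of the sibling filter aggregation.
            "buckets_path": {"in_range": "dates_between>_count"},
            # Keep the customer bucket only if enough documents are in range.
            "script": f"params.in_range >= {int(min_in_range)}",
        }
    }
    return result


def surviving_customers(buckets, min_in_range):
    """Local stand-in for the selector: which customer keys would be kept."""
    return [b["key"] for b in buckets if b["dates_between"]["doc_count"] >= min_in_range]
```

Sending with_having(EDIT_AGGS, 2) as the "aggs" part of the search body would apply the threshold on the server side; surviving_customers only mimics that selection on an already returned response.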
    

    【Comments】:

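    • Thanks for the reply. This does work. But what I need is the list of customers that occur at least 2 times, and I don't have a customer ID to use in the match query. It is similar to a "GROUP BY" customer ID with "HAVING" count >= 2 in SQL. Can you suggest something?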
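
One possible way to get the GROUP BY / HAVING behaviour described in this comment is to drop the customer_id match, keep min_doc_count on the terms aggregation, and nest a top_hits sub-aggregation so that each surviving bucket carries its own documents. The sketch below uses Python only so it can be run without a cluster. build_request and collect_documents are hypothetical helper names, the aggregation name matching_docs and both size values are assumptions, and the term filter on country assumes the keyword mapping from the answer's test index. The resulting dict is the JSON body that would be sent to POST /test-index-v1/_search.

```python
import json


def build_request(country, date_from, date_to, min_count=2):
    """Build a _search body: filter first, then group by customer_id."""
    return {
        "size": 0,  # top-level hits are not needed; documents come from top_hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"creation_date": {"gte": date_from, "lte": date_to}}},
                    {"term": {"country": country}},
                ]
            }
        },
        "aggs": {
            "customer_agg": {
                "terms": {
                    "field": "customer_id",
                    "min_doc_count": min_count,  # plays the role of HAVING count >= N
                    "size": 100,  # assumed cap on the number of customers returned
                },
                "aggs": {
                    # assumed cap on the number of documents returned per customer
                    "matching_docs": {"top_hits": {"size": 10}}
                },
            }
        },
    }


def collect_documents(response):
    """Flatten the per-customer top_hits into one list of _source documents."""
    docs = []
    for bucket in response["aggregations"]["customer_agg"]["buckets"]:
        for hit in bucket["matching_docs"]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs


if __name__ == "__main__":
    print(json.dumps(build_request("USA", "2021-06-23", "2021-08-23"), indent=2))
```

Because the range and country conditions sit in the query, the terms aggregation only counts documents that already match them. top_hits returns a limited number of documents per bucket (the index setting index.max_inner_result_window caps it, 100 by default), so for customers with many documents a second search using a terms query on the collected customer_id keys may be a better fit.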