Answers: Elasticsearch nested object based filter and count operation

【Question title】: Elasticsearch nested object based filter and count operation
【Posted】: 2021-01-25 01:43:21
【Question description】:

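I'm new to Elasticsearch and am trying to use it for analytics calculations. I don't know whether this is possible, but I am trying to find the customers who made 0 purchases. I store the orders as an array of nested objects on each customer. Here are sample mapping properties for the customers index: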

"first_name" => [
    "type" => "text"
],
"last_name" => [
    "type"=> "text"
],
"email" => [
    "type"=> "text"
],
"total_spent" => [
    "type"=> "text"
],
"aov" => [
    "type"=> "float"
],
"orders_count" => [
    "type"=> "integer"
],
"orders" => [
    "type" => "nested",
    "properties" => [
        "order_id" => [
            "type"=>"text"
        ],
        "total_price" => [
            "type"=>"float"
        ]
    ]
]

Sample documents from the customers index:

[
   {
      "_index":"customers_index",
      "_type":"_doc",
      "_id":"1",
      "_score":1,
      "_source":{
         "first_name":"Stephen",
         "last_name":"Long",
         "email":"egnition_sample_91@egnition.com",
         "total_spent":"0.00",
         "aov":0,
         "orders":[]
      }
   },
   {
      "_index":"customers_index",
      "_type":"_doc",
      "_id":"2",
      "_score":1,
      "_source":{
         "first_name":"Reece",
         "last_name":"Dixon",
         "email":"egnition_sample_57@egnition.com",
         "total_spent":"0.10",
         "aov":"0.1",
         "orders":[
            {
               "total_price":"0.10",
               "placed_at":"2020-09-24T20:08:35.000000Z",
               "order_id":2723671867546
            }
         ]
      }
   },
   {
      "_index":"customers_index",
      "_type":"_doc",
      "_id":"3",
      "_score":1,
      "_source":{
         "first_name":"John",
         "last_name":"Marshall",
         "email":"egnition_sample_94@egnition.com",
         "total_spent":"0.10",
         "aov":"0.04",
         "orders":[
            {
               "total_price":"0.10",
               "placed_at":"2020-09-24T20:10:52.000000Z",
               "order_id":2723675930778
            },
            {
               "total_price":"0.30",
               "placed_at":"2020-09-24T20:09:45.000000Z",
               "order_id":2723673899162
            },
            {
               "total_price":"0.10",
               "placed_at":"2020-09-16T09:55:22.000000Z",
               "order_id":2704717414554
            }
         ]
      }
   }
]

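First, I'd like to ask: do you think this mapping fits the nature of Elasticsearch? For example, I can group customers by a specific date range and get the sum of total_spent as aggregated data. What I want to understand is whether I can find the customers with no orders by filtering the nested orders array on a specific date range. Do you think this kind of query has performance problems?

I'm not familiar with NoSQL databases; I'm an RDBMS person. So I'm trying to understand Elasticsearch's concepts as an analytics database.

Thanks for any replies.

Edit:

I am trying to count the nested objects that fall within a specified date-range filter. Is that feasible, and does it make sense, in Elasticsearch? Simply put, I want to see the customers who have one or more orders within the specified dates.

I know how to get the daily customer count, but what if I want to count the customers who have 1 order within a specified date range, as part of a set of daily reports?

The response I expect might look like: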

{
...
"aggregations":[
{
"date":"2020-09-01",
"total_customers_zero_purchased":15
}
...
]
}

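【Comments】:

Could you please share your index data and the expected search result?
@Bhavya Thanks for the feedback. I updated the question with the expected search result and a sample of the index.

Tags: php elasticsearch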

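【Solution 1】:

A lot of questions are raised here, so I'll focus on the most important parts.

First, it is customary to map the text fields you want to aggregate on as keyword (or to give them a .keyword sub-field), because aggregations such as cardinality are not available on an analyzed text field by default. This means: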

PUT customers_index
{
  "mappings": {
    "properties": {
      "email": {
        "type": "keyword"    <--
      }
    }
  }
}
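
To make the mapping advice concrete, here is a minimal Python sketch of a fuller mapping body. It is not from the original answer: the function name build_customer_mapping is hypothetical, mapping order_id as keyword is my own assumption, and the explicit date type for orders.placed_at is an assumption too (the question's mapping omits that field, yet the range filter used later depends on it being a date).

```python
import json


def build_customer_mapping():
    """Return a request body for `PUT customers_index` (a sketch, not the asker's real mapping)."""
    return {
        "mappings": {
            "properties": {
                "first_name": {"type": "text"},
                "last_name": {"type": "text"},
                # keyword keeps the exact value, so cardinality/terms aggregations can use it
                "email": {"type": "keyword"},
                "orders": {
                    "type": "nested",
                    "properties": {
                        # assumption: IDs are matched exactly, never full-text searched
                        "order_id": {"type": "keyword"},
                        "total_price": {"type": "float"},
                        # assumption: explicit date type so range filters apply to it
                        "placed_at": {"type": "date"},
                    },
                },
            }
        }
    }


# Serialize the body, e.g. to paste into a console or pass to an HTTP client
mapping_json = json.dumps(build_customer_mapping(), indent=2)
```

The field type of an existing field cannot be changed in place, so a mapping like this would normally be applied to a new index before reindexing.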

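After that we can move on to the query, keeping in mind that when we iterate over date ranges we need to specify a date field. This means:

The iterated ranges are built automatically from the values actually present (we can filter to restrict their range).
When a document contains no date in a given range, it is, understandably, skipped.

In effect, we cannot get daily rolling aggregations (because we don't know what we don't know), only metrics for a single day or range at a time. For example: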

GET customers_index/_search
{
  "size": 0,
  "aggs": {
    "multibucket_simulator": {
      "filters": {
        "filters": {
          "all": {
            "match_all": {}
          }
        }
      },
      "aggs": {
        "all_customers": {
          "cardinality": {
            "field": "email"
          }
        },
        "customers_who_purchased_at_date": {
          "filter": {
            "nested": {
              "path": "orders",
              "query": {
                "range": {
                  "orders.placed_at": {
                    "gte": "2020-09-16T00:00:00.000000Z",
                    "lt": "2020-09-26T00:00:00.000000Z"
                  }
                }
              }
            }
          },
          "aggs": {
            "customer_count": {
              "cardinality": {
                "field": "email"
              }
            }
          }
        },
        "total_customers_zero_purchased": {
          "bucket_script": {
            "buckets_path": {
              "all": "all_customers.value",
              "filtered": "customers_who_purchased_at_date>customer_count.value"
            },
            "script": "params.all - params.filtered"
          }
        }
      }
    }
  }
}
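
The request above hard-codes one date range. As a sketch of how it might be parameterized, the Python function below rebuilds the same body for any [gte, lt) range. The function name zero_purchase_query and the id_field parameter are hypothetical; the aggregation names are copied from the query above.

```python
def zero_purchase_query(gte, lt, id_field="email"):
    """Build the search body shown above for an arbitrary [gte, lt) date range."""
    # Customers having at least one nested order inside the range
    purchased = {
        "filter": {
            "nested": {
                "path": "orders",
                "query": {"range": {"orders.placed_at": {"gte": gte, "lt": lt}}},
            }
        },
        "aggs": {"customer_count": {"cardinality": {"field": id_field}}},
    }
    return {
        "size": 0,  # aggregations only, no hits
        "aggs": {
            # bucket_script needs a multi-bucket parent, hence the single-filter wrapper
            "multibucket_simulator": {
                "filters": {"filters": {"all": {"match_all": {}}}},
                "aggs": {
                    "all_customers": {"cardinality": {"field": id_field}},
                    "customers_who_purchased_at_date": purchased,
                    "total_customers_zero_purchased": {
                        "bucket_script": {
                            "buckets_path": {
                                "all": "all_customers.value",
                                "filtered": "customers_who_purchased_at_date>customer_count.value",
                            },
                            "script": "params.all - params.filtered",
                        }
                    },
                },
            }
        },
    }


body = zero_purchase_query("2020-09-16T00:00:00.000000Z", "2020-09-26T00:00:00.000000Z")
```

Note that cardinality is an approximate count, so on large indices the difference computed by the script may be slightly off.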

which yields:

"aggregations" : {
  "multibucket_simulator" : {
    ...
    "buckets" : {
      "all" : {
        ...
        "customers_who_purchased_at_date" : {
          ...
        },
        "all_customers" : {
          ...
        },
        "total_customers_zero_purchased" : {       <---
          "value" : 1.0
        }
      }
    }
  }
}

thereby answering the question:

How many customers did not purchase anything between 09/16 and 09/25?
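
The result can be sanity-checked without a cluster. The sketch below replays the same "all customers minus customers who purchased" arithmetic in plain Python over the three sample documents from the question, reduced to email and order timestamps. The helper names parse and zero_purchase_count are hypothetical, and this only mirrors the logic; it does not exercise Elasticsearch itself.

```python
from datetime import datetime, timezone

# The three sample customers from the question: email plus each order's placed_at
CUSTOMERS = [
    {"email": "egnition_sample_91@egnition.com", "orders": []},
    {"email": "egnition_sample_57@egnition.com",
     "orders": ["2020-09-24T20:08:35.000000Z"]},
    {"email": "egnition_sample_94@egnition.com",
     "orders": ["2020-09-24T20:10:52.000000Z",
                "2020-09-24T20:09:45.000000Z",
                "2020-09-16T09:55:22.000000Z"]},
]


def parse(ts):
    # The trailing "Z" is matched literally, so the UTC zone is attached by hand
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def zero_purchase_count(customers, gte, lt):
    """Distinct customers minus distinct customers with an order in [gte, lt)."""
    start, end = parse(gte), parse(lt)
    all_customers = {c["email"] for c in customers}
    purchased = {
        c["email"]
        for c in customers
        if any(start <= parse(o) < end for o in c["orders"])
    }
    return len(all_customers) - len(purchased)


result = zero_purchase_count(
    CUSTOMERS, "2020-09-16T00:00:00.000000Z", "2020-09-26T00:00:00.000000Z"
)
```

For the range used in the answer, only the customer with an empty orders array falls outside the purchased set, which agrees with the value of 1.0 reported above.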

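【Comments】:

What I'm asking is whether it's possible to count nested objects while filtering, the way you did in the customers_who_purchased_at_date aggregation?
Sorry, I'm not sure what you're asking. When you're in a nested filter context, each nested item is counted separately and the parent (top-level) document count is not available. Or do you mean you want to count orders rather than customers?
Yes, actually I want to learn the Elasticsearch way of looking at this. Simply put, is it possible to count a range-filtered orders array on the fly?
Yes and no. You can compute anything on the fly, but once you're inside nested you don't have access to the parent document, which seems to be what your original question was after.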
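
On the question of counting orders versus customers: Elasticsearch also provides a reverse_nested aggregation, which joins from nested documents back to their parents. The sketch below is not part of the original answer and has not been run against the asker's index. It assumes orders.placed_at is mapped as a date, and the function name and the aggregation names (orders, in_range, customers) are hypothetical.

```python
def orders_in_range_aggs(gte, lt):
    """Count matching orders and, via reverse_nested, the customers that own them."""
    return {
        "size": 0,
        "aggs": {
            "orders": {
                # step into the nested order documents
                "nested": {"path": "orders"},
                "aggs": {
                    "in_range": {
                        # doc_count here is the number of orders inside the range
                        "filter": {
                            "range": {"orders.placed_at": {"gte": gte, "lt": lt}}
                        },
                        "aggs": {
                            # doc_count here is the number of parent customer documents
                            "customers": {"reverse_nested": {}}
                        },
                    }
                },
            }
        },
    }


aggs_body = orders_in_range_aggs(
    "2020-09-16T00:00:00.000000Z", "2020-09-26T00:00:00.000000Z"
)
```

If this behaves as documented, in_range.doc_count would give the number of orders in the range and customers.doc_count the number of customers with at least one such order, so both figures come from one request.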