How to sum TOP N docs with terms in Elasticsearch? Answers

【Question title】: How to sum TOP N docs with terms in elasticsearch?
【Posted】: 2020-04-05 04:29:59
【Question description】:

Below are sample documents from Elasticsearch.

        {
            "_index": "social",
            "_type": "social",
            "_id": "1632560884596186633",
            "_score": 1,
            "_source": {
                "created_date": "2017-10-24",
                "reach": 1692,
                "social_id": 200
            }
        },
        {
            "_index": "social",
            "_type": "social",
            "_id": "1626693964184981799",
            "_score": 1,
            "_source": {
                "created_date": "2017-10-25",
                "reach": 1692,
                "social_id": 100
            }
        },
        {
            "_index": "social",
            "_type": "social",
            "_id": "162669396418498170",
            "_score": 1,
            "_source": {
                "created_date": "2017-10-25",
                "reach": 1692,
                "social_id": 50
            }
        },
        {
            "_index": "social",
            "_type": "social",
            "_id": "1626693964184981756",
            "_score": 1,
            "_source": {
                "created_date": "2017-10-25",
                "reach": 1692,
                "social_id": 25
            }
        }

Question: for each social_id, sum the reach of the top 2 docs, ranked by created_date.

What I tried:

{
"size": 0,
"aggs": {
    "reach_bucket": {
        "terms": {
            "size": 200,
            "field": "social_id"
        },
        "aggs": {
            "media_reach_bucket": {
                "terms": {
                    "field": "created_date",
                    "size": 200
                },
                "aggs": {
                    "top_sales_hits": {
                        "top_hits": {
                            "sort": [
                                {
                                    "created_date": {
                                        "order": "desc"
                                    }
                                }
                            ],
                            "_source": {
                                "includes": [
                                    "created_date",
                                    "reach"
                                ]
                            },
                            "size": 2
                        }
                    }
                }
            }
        }
    }
}
}

Problem:

I can't run a sub-aggregation on top_hits.

Any suggestions would be appreciated.

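【Question comments】:

Tags: elasticsearch

【Solution 1】:

You may want to use date_histogram instead of terms when bucketing by day (I'm assuming that is the intent). More importantly, you should sort the top_hits by reach rather than created_date, because created_date will be the same for every doc within a daily bucket.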

{
  "size": 0,
  "aggs": {
    "reach_bucket": {
      "terms": {
        "size": 200,
        "field": "social_id"
      },
      "aggs": {
        "media_reach_bucket": {
          "date_histogram": {
            "field": "created_date",
            "calendar_interval": "day"
          },
          "aggs": {
            "top_sales_hits": {
              "top_hits": {
                "sort": [
                  {
                    "reach": {
                      "order": "desc"
                    }
                  }
                ],
                "_source": {
                  "includes": [
                    "reach"
                  ]
                },
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}

This yields top hits like the following:

"aggregations" : {
    "reach_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 100,
          "doc_count" : 4,
          "media_reach_bucket" : {
            "buckets" : [
              {
                "key_as_string" : "2017-10-24T00:00:00.000Z",
                "key" : 1508803200000,
                "doc_count" : 4,
                "top_sales_hits" : {
                  "hits" : {
                    "total" : {
                      "value" : 4,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "kart",
                        "_type" : "_doc",
                        "_id" : "3iLJRnEBZbobBB0NiV8R",
                        "_score" : null,
                        "_source" : {
                          "reach" : 40
                        },
                        "sort" : [
                          40
                        ]
                      },
                      {
                        "_index" : "kart",
                        "_type" : "_doc",
                        "_id" : "3SLJRnEBZbobBB0Nhl-Y",
                        "_score" : null,
                        "_source" : {
                          "reach" : 30
                        },
                        "sort" : [
                          30
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }

You can then sum up the reach values in your post-processing function.
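As a minimal sketch of that post-processing step, assuming the search response has already been parsed into a Python dict (for example with `json.loads`), the function below walks the `reach_bucket` / `media_reach_bucket` / `top_sales_hits` structure shown above and adds up the `reach` of each bucket's hits. The names `sum_top_hits_reach` and `sample_response` are illustrative only and are not part of Elasticsearch.

```python
from typing import Any, Dict, Tuple


def sum_top_hits_reach(response: Dict[str, Any]) -> Dict[Tuple[Any, str], int]:
    """Sum `reach` over the top_hits of every (social_id, day) bucket."""
    totals: Dict[Tuple[Any, str], int] = {}
    for social in response["aggregations"]["reach_bucket"]["buckets"]:
        for day in social["media_reach_bucket"]["buckets"]:
            hits = day["top_sales_hits"]["hits"]["hits"]
            key = (social["key"], day["key_as_string"])
            totals[key] = sum(hit["_source"]["reach"] for hit in hits)
    return totals


# Trimmed copy of the response above: only the fields the function reads.
sample_response = {
    "aggregations": {
        "reach_bucket": {
            "buckets": [
                {
                    "key": 100,
                    "media_reach_bucket": {
                        "buckets": [
                            {
                                "key_as_string": "2017-10-24T00:00:00.000Z",
                                "top_sales_hits": {
                                    "hits": {
                                        "hits": [
                                            {"_source": {"reach": 40}},
                                            {"_source": {"reach": 30}},
                                        ]
                                    }
                                },
                            }
                        ]
                    },
                }
            ]
        }
    }
}

print(sum_top_hits_reach(sample_response))
# {(100, '2017-10-24T00:00:00.000Z'): 70}
```

Because the query asks top_hits for `"size": 2`, each total covers at most two docs per bucket; changing N only requires changing that size in the query.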

I'm not familiar with top-n sums, only with sums over docs that exceed some threshold. In that case I'd use filter aggregations.

【Comments】:

How to sum top_hits
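For comparison, the threshold variant could look roughly like the sketch below: a `filter` aggregation with a `range` query on `reach`, wrapping a `sum` aggregation, nested under the same `terms` on `social_id`. It is written as a Python helper that builds the request body. The helper name `build_threshold_sum_query`, the aggregation names `above_threshold` and `reach_sum`, and the threshold value are assumptions made for illustration. Note that this sums every doc above the threshold, which is not the same as summing the top N.

```python
import json


def build_threshold_sum_query(min_reach: int) -> dict:
    """Build a request body that sums `reach` per social_id,
    counting only docs whose reach is at least `min_reach`."""
    return {
        "size": 0,
        "aggs": {
            "reach_bucket": {
                "terms": {"field": "social_id", "size": 200},
                "aggs": {
                    # Hypothetical name: keeps only docs above the threshold.
                    "above_threshold": {
                        "filter": {"range": {"reach": {"gte": min_reach}}},
                        # Hypothetical name: sum of reach over the filtered docs.
                        "aggs": {"reach_sum": {"sum": {"field": "reach"}}},
                    }
                },
            }
        },
    }


# The threshold of 1000 is an arbitrary example value.
print(json.dumps(build_threshold_sum_query(1000), indent=2))
```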