How to build date histogram query which counts "new" properties

【Question title】: How to build date histogram query which counts "new" properties
【Posted】: 2017-09-06 14:38:39
【Question description】:

I am collecting data from devices, and I would like to know when new devices come online. The document format is:

{
  "device_id": "ue-0000"
}

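I can query for active devices over a time period using a date histogram aggregation with a nested terms aggregation, but I can't figure out how to express the logic "filter out matches from a bucket if the device_id appears earlier in the index".

Here is my current query: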

{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "2015/12/08",
            "lte": "2016/01/08"
          }
        }
      }
    }
  },
  "aggregations": {
    "over_time": {
      "aggregations": {
        "app_count": {
          "terms": {
            "field": "app"
          }
        }
      },
      "date_histogram": {
        "field": "timestamp",
        "interval": "day",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2015/12/08",
          "max": "2016/01/08"
        }
      }
    }
  }
}

I have documents like this:

{
    "timestamp": "2015/12/15",
    "device_id": "1"
}
{
    "timestamp": "2015/12/16",
    "device_id": "2"
}
{
    "timestamp": "2015/12/20",
    "device_id": "1"
}

I would like to get back something like:

{
  "aggregations": {
    "over_time": {
      "buckets": [
        {
          "key_as_string":"2015/12/15 00:00:00",
          "key":1449532800000,
          "doc_count":1,
          "new_devices":{
            "doc_count_error_upper_bound":0,
            "sum_other_doc_count":0,
            "buckets":[{"device_id": "1"}]}
        },
        {
          "key_as_string":"2015/12/16 00:00:00",
          "key":1449532800000,
          "doc_count":1,
          "new_devices":{
            "doc_count_error_upper_bound":0,
            "sum_other_doc_count":0,
            "buckets":[{"device_id": "2"}]}
        },
        // [[ SNIP ]]
        {
          "key_as_string":"2015/12/20 00:00:00",
          "key":1449532800000,
          "doc_count":0, // there are no new device_ids on this date
          "new_devices":{
            "doc_count_error_upper_bound":0,
            "sum_other_doc_count":0,
            "buckets":[]}
        }
      ]
    }
  }
}
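
The intended semantics can also be stated outside Elasticsearch. The following is a minimal Python sketch, not an Elasticsearch query: it assumes all documents fit in memory, and the function name `new_devices_per_day` is hypothetical. It records the earliest timestamp per `device_id` and reports a device only on the day of that first appearance.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"  # matches the timestamps in the sample documents


def new_devices_per_day(docs):
    """Map each day to the sorted device_ids first seen on that day."""
    first_seen = {}
    for doc in docs:
        day = datetime.strptime(doc["timestamp"], DATE_FORMAT).date()
        device = doc["device_id"]
        # Keep only the earliest day on which each device appears.
        if device not in first_seen or day < first_seen[device]:
            first_seen[device] = day

    buckets = defaultdict(list)
    for device, day in first_seen.items():
        buckets[day.strftime(DATE_FORMAT)].append(device)
    return {day: sorted(devices) for day, devices in buckets.items()}


docs = [
    {"timestamp": "2015/12/15", "device_id": "1"},
    {"timestamp": "2015/12/16", "device_id": "2"},
    {"timestamp": "2015/12/20", "device_id": "1"},
]
result = new_devices_per_day(docs)
print(result)
```

Days on which no device appears for the first time, such as 2015/12/20 here, are simply absent from the result, which corresponds to the empty `new_devices` bucket in the desired output.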

【Question comments】:

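What do you mean by "device_id appears earlier in the index"? Could you give an example, such as sample documents, and the kind of output you expect?
@ChintanShah25 Good idea, I've added example index documents and the desired output.
Did the solution help solve the problem?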

Tags: elasticsearch histogram rollup nosql

【Solution 1】:

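I think you need to add one more terms aggregation on timestamp, which will leave you with only the first appearance of each unique device. Try something like this: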

{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "2015/12/08",
            "lte": "2016/01/08"
          }
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "unique_device": {
      "terms": {
        "field": "device_id",
        "size": 10
      },
      "aggs": {
        "unique_date": {
          "terms": {
            "field": "timestamp",
            "size": 1,
            "order": {
              "_term": "asc"
            }
          },
          "aggs": {
            "latest_device": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "day",
                "min_doc_count": 0,
                "extended_bounds": {
                  "min": "2015/12/08",
                  "max": "2016/01/08"
                }
              }
            }
          }
        }
      }
    }
  }
}

Here, size and order on the timestamp terms aggregation keep only the earliest timestamp for each device, so the date histogram only sees new devices.
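
Because the outer aggregation is keyed by device rather than by day, the response still has to be pivoted into per-day buckets on the client. A minimal Python sketch of that step follows. The `response` dict is a hand-written stand-in for what the query above might return for the three sample documents (an assumption, not captured output), and `pivot_first_seen` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def pivot_first_seen(response):
    """Turn device -> earliest-timestamp buckets into day -> new device_ids."""
    per_day = defaultdict(list)
    for device in response["aggregations"]["unique_device"]["buckets"]:
        dates = device["unique_date"]["buckets"]
        if not dates:
            continue  # defensive: no timestamp bucket for this device
        # "size": 1 with ascending order leaves only the earliest timestamp.
        millis = dates[0]["key"]
        day = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        per_day[day.strftime("%Y/%m/%d")].append(device["key"])
    return dict(per_day)


# Assumed response shape; nested date_histogram sub-buckets omitted for brevity.
response = {
    "aggregations": {
        "unique_device": {
            "buckets": [
                {"key": "1", "doc_count": 2, "unique_date": {"buckets": [
                    {"key": 1450137600000, "doc_count": 1}]}},
                {"key": "2", "doc_count": 1, "unique_date": {"buckets": [
                    {"key": 1450224000000, "doc_count": 1}]}},
            ]
        }
    }
}
new_per_day = pivot_first_seen(response)
print(new_per_day)
```

Two caveats apply to the query itself. The outer terms aggregation uses "size": 10, so only ten devices come back unless that value is raised. The range filter also limits what "earliest" means: a device that was first seen before 2015/12/08 would still show up as new inside the queried window.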

Does this help?

【Comments】:

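This introduced me to the very useful order._term property of the terms aggregation, but it's not enough to make this a one-stop shop. First problem: the query is slow on my data set, taking 2 seconds even when called repeatedly (so it isn't using a cache?). It's also not compatible with bucketing, since it returns the oldest timestamp within the bucket. I've ended up building a post-processing step around the query and creating a new index to store first-seen values.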
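
The comment does not show the post-processing step. One way such a first-seen store might be maintained is sketched below in Python. The in-memory dict stands in for the separate index, and `merge_first_seen` is a hypothetical name rather than anything taken from the commenter's setup.

```python
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"  # matches the timestamps in the sample documents


def merge_first_seen(store, events):
    """Update store (device_id -> first-seen date) with a batch of events.

    Returns the device_ids that were not in the store before this batch.
    """
    newly_seen = []
    for event in events:
        device = event["device_id"]
        day = datetime.strptime(event["timestamp"], DATE_FORMAT).date()
        if device not in store:
            store[device] = day
            newly_seen.append(device)
        elif day < store[device]:
            # Late-arriving data: move the first-seen date back.
            store[device] = day
    return newly_seen


store = {}
first_batch = merge_first_seen(store, [
    {"timestamp": "2015/12/15", "device_id": "1"},
    {"timestamp": "2015/12/16", "device_id": "2"},
])
second_batch = merge_first_seen(store, [
    {"timestamp": "2015/12/20", "device_id": "1"},
])
print(first_batch, second_batch)
```

With one record per device, a plain date histogram over the first-seen field would count new devices per day without the nested terms aggregations.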