How to get the max _id from Elasticsearch (answers)

【Question title】: How to get max _id from elastic search
【Posted】: 2015-03-02 14:04:54
【Question】:

I created a river that runs every hour to fetch data from a database (using the JDBC river plugin).

select * from orders

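Instead of selecting all records, I want to select only the rows added since the last run, based on the primary key. The query would be: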

select * from orders where deviceid > '(Max Id in Elastic search)'

How do I get the max _id from Elasticsearch?

【Question comments】:

Tags: elasticsearch

【Solution 1】:

There does not seem to be a way to use the "_id" field directly, because ES insists on converting "_id" values to strings. But there is a way around it.

First I set up a simple index with a few documents, like this:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1
   }
}

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"first doc"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"second doc"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"third doc"}

Then I tried a max aggregation, but got an error because the "_id"s are strings:

POST /test_index/_search?search_type=count
{
   "aggs": {
      "max_id": {
         "max": {
            "field": "_id"
         }
      }
   }
}
...
{
   "error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[bQS7TqO9SfKSPQZYVXQBag][test_index][0]: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]}]",
   "status": 500
}

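So that doesn't work. But it does with a small modification: using the "path" parameter of the "_id" field, which tells ES to take each document's "_id" from a field in the document body.

So I redefined the index as: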

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1
   },
   "mappings": {
      "doc": {
         "_id": {
            "path": "doc_id"
         }
      }
   }
}

Then I indexed the documents using the "doc_id" path:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"first doc","doc_id":1}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"second doc","doc_id":2}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"third doc","doc_id":3}
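If the documents come from database rows rather than being typed by hand, the bulk body above can be generated. The following is only a minimal sketch: the function name `build_bulk_body` and the sample rows are made up for illustration, and the action lines simply mirror the request format used in this answer.

```python
import json


def build_bulk_body(rows, index="test_index", doc_type="doc"):
    """Build a newline-delimited bulk body: one action line plus one source line per row."""
    lines = []
    for row in rows:
        # No "_id" in the action line: with the "path" mapping above,
        # the id is taken from the "doc_id" field of the source line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(row))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, matching the three documents in this answer.
rows = [
    {"title": "first doc", "doc_id": 1},
    {"title": "second doc", "doc_id": 2},
    {"title": "third doc", "doc_id": 3},
]
print(build_bulk_body(rows), end="")
```

The resulting string can be sent as the body of POST /test_index/_bulk.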

Now if I search, I can see that "_id" is still a string, but "doc_id" is an integer:

POST /test_index/_search
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1,
            "_source": {
               "title": "first doc",
               "doc_id": 1
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 1,
            "_source": {
               "title": "second doc",
               "doc_id": 2
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1,
            "_source": {
               "title": "third doc",
               "doc_id": 3
            }
         }
      ]
   }
}

So now I can easily use a max aggregation to find the largest id value:

POST /test_index/_search?search_type=count
{
   "aggs": {
      "max_id": {
         "max": {
            "field": "doc_id"
         }
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "max_id": {
         "value": 3
      }
   }
}
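To use that number elsewhere, it has to be pulled out of the response. Below is a minimal sketch in Python, assuming the response shape shown above; the helper name `extract_max_id` is made up for illustration. It also covers the case where the aggregation value is null, which is assumed here to be what an empty index returns.

```python
import json
from typing import Optional


def extract_max_id(response_text: str, agg_name: str = "max_id") -> Optional[int]:
    """Return the max aggregation value as an int, or None when the value is null."""
    body = json.loads(response_text)
    value = body["aggregations"][agg_name]["value"]
    # The value is a JSON number (possibly a float such as 3.0),
    # or null when no document contributed to the aggregation.
    return None if value is None else int(value)


# Trimmed copy of the response shown above.
sample = '{"hits": {"total": 3}, "aggregations": {"max_id": {"value": 3}}}'
print(extract_max_id(sample))
```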

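【Comments】:

Thanks for your answer. Now I can get the max value. How should I use it in the query, e.g. select * from orders where deviceid > '(Max Id in Elastic search)'? How do I substitute the value for "Max Id in Elastic search"? Note: the river is scheduled to run every hour, so each time the river/query runs it should be able to get the max value from Elasticsearch.
You will probably have to write some kind of script to handle that part. Maybe a Python script run by a cron job or something.
How can I give a script as input to the parameter section of the JDBC river plugin?
Nice, but this seems to be deprecated (elastic.co/guide/en/elasticsearch/reference/current/…). But "copy_to" instead of "path" should work.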
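A rough sketch of the kind of script suggested in the comments, using only the Python standard library. Everything specific in it is an assumption: the URL, the index and field names (taken from the test index in the answer), and the function names. It sends "size": 0 in the request body instead of search_type=count, which is a substitution on my part, and it does not show how to hand the resulting query to the JDBC river plugin.

```python
import json
import urllib.request

# Assumed location of the index from the answer; adjust for a real cluster.
ES_SEARCH_URL = "http://localhost:9200/test_index/_search"


def fetch_max_id(url=ES_SEARCH_URL, field="doc_id"):
    """POST a max aggregation and return the largest value, or None if there is none."""
    body = {"size": 0, "aggs": {"max_id": {"max": {"field": field}}}}
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    value = result["aggregations"]["max_id"]["value"]
    return None if value is None else int(value)


def build_query(max_id):
    """Build the incremental SQL statement from the question."""
    if max_id is None:
        # Nothing indexed yet: fall back to the full select.
        return "select * from orders"
    # {:d} only accepts integers, so arbitrary text cannot end up in the SQL.
    return "select * from orders where deviceid > {:d}".format(max_id)


# Example with the value returned by the aggregation in the answer.
print(build_query(3))
```

A cron job could call fetch_max_id() once an hour and pass the result to build_query(). The number is left unquoted on the assumption that deviceid is a numeric column.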