【Question title】: How to get max _id from Elasticsearch
【Posted】: 2015-03-02 14:04:54
【Question description】:

I have created a river that runs every hour to fetch data from the database (using the JDBC river plugin).

select * from orders

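Instead of selecting all records, I want to select only the rows appended since the last run, based on the primary key. The query would be: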

select * from orders where deviceid > '(Max Id in Elastic search)'

How can I get the max _id from Elasticsearch?

【Question comments】:

    Tags: elasticsearch


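    【Solution 1】:

    There doesn't seem to be a way to use the "_id" field directly, because ES insists on converting "_id" values to strings. But there is a way to work around it.

    First I set up a simple index with a few documents, as follows: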

    PUT /test_index
    {
       "settings": {
          "number_of_shards": 1
       }
    }
    
    POST /test_index/_bulk
    {"index":{"_index":"test_index","_type":"doc","_id":1}}
    {"title":"first doc"}
    {"index":{"_index":"test_index","_type":"doc","_id":2}}
    {"title":"second doc"}
    {"index":{"_index":"test_index","_type":"doc","_id":3}}
    {"title":"third doc"}
    

    Then I tried a max aggregation, but got an error because the "_id" values are strings:

    POST /test_index/_search?search_type=count
    {
       "aggs": {
          "max_id": {
             "max": {
                "field": "_id"
             }
          }
       }
    }
    ...
    {
       "error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[bQS7TqO9SfKSPQZYVXQBag][test_index][0]: ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]}]",
       "status": 500
    }
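
As a side note on why a numeric field is needed at all: even where string ids can be compared, the ordering is lexicographic rather than numeric. The sketch below shows this in plain Python. It illustrates string ordering in general, not Elasticsearch internals, and the sample ids are made up for the illustration.

```python
# Hypothetical ids, stored as strings the way "_id" values are.
ids_as_strings = ["1", "2", "3", "10"]

# Comparing as strings is character by character, so "3" sorts after "10".
string_max = max(ids_as_strings)

# Comparing as integers gives the largest id in the numeric sense.
numeric_max = max(int(i) for i in ids_as_strings)

print(string_max)
print(numeric_max)
```

This is why the rest of the answer keeps the id in a separate integer field.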
    

    So that won't work. But it can be made to work with a slight modification, using the "path" parameter of the "_id" field.

    So I redefined the index as:

    DELETE /test_index
    
    PUT /test_index
    {
       "settings": {
          "number_of_shards": 1
       },
       "mappings": {
          "doc": {
             "_id": {
                "path": "doc_id"
             }
          }
       }
    }
    

    Then I indexed the documents using the "doc_id" path:

    POST /test_index/_bulk
    {"index":{"_index":"test_index","_type":"doc"}}
    {"title":"first doc","doc_id":1}
    {"index":{"_index":"test_index","_type":"doc"}}
    {"title":"second doc","doc_id":2}
    {"index":{"_index":"test_index","_type":"doc"}}
    {"title":"third doc","doc_id":3}
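
For anyone generating the bulk request from a script, here is a minimal sketch of how the body above could be built. The helper name `build_bulk_body` is hypothetical; the index and type names are taken from the example. The bulk format is newline-delimited JSON (one action line, then one source line per document), and the body has to end with a newline.

```python
import json


def build_bulk_body(docs, index="test_index", doc_type="doc"):
    """Return a newline-delimited bulk body: one action line, then one source line per doc."""
    lines = []
    for doc in docs:
        # Action line: no explicit "_id", so the mapping's "path" setting supplies it.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, including the numeric "doc_id".
        lines.append(json.dumps(doc))
    # The bulk API requires the final line to be terminated by a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"title": "first doc", "doc_id": 1},
    {"title": "second doc", "doc_id": 2},
    {"title": "third doc", "doc_id": 3},
]
body = build_bulk_body(docs)
print(body, end="")
```

The resulting string would be sent as the body of `POST /test_index/_bulk`; the HTTP call itself is left out here.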
    

    Now if I search, I can see that "_id" is still a string, but "doc_id" is an integer:

    POST /test_index/_search
    ...
    {
       "took": 1,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "failed": 0
       },
       "hits": {
          "total": 3,
          "max_score": 1,
          "hits": [
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_score": 1,
                "_source": {
                   "title": "first doc",
                   "doc_id": 1
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "2",
                "_score": 1,
                "_source": {
                   "title": "second doc",
                   "doc_id": 2
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "3",
                "_score": 1,
                "_source": {
                   "title": "third doc",
                   "doc_id": 3
                }
             }
          ]
       }
    }
    

    So now I can easily use a max aggregation to find the largest id value:

    POST /test_index/_search?search_type=count
    {
       "aggs": {
          "max_id": {
             "max": {
                "field": "doc_id"
             }
          }
       }
    }
    ...
    {
       "took": 1,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "failed": 0
       },
       "hits": {
          "total": 3,
          "max_score": 0,
          "hits": []
       },
       "aggregations": {
          "max_id": {
             "value": 3
          }
       }
    }
    

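    【Comments】:

    • Thanks for the answer. Now I can get the max value. How should I use it in the query, e.g. select * from orders where deviceid > '(Max Id in Elastic search)'? How do I substitute the actual value for "Max Id in Elastic search"? Note: the river is scheduled to run every hour, so the max value has to be fetched from Elasticsearch each time the river/query runs.
    • You will probably have to write some kind of script to handle that part. Maybe a Python script run by a cron job or something similar.
    • How do I pass a script as input to the parameter tab in the JDBC river plugin?
    • Nice, but this seems to be deprecated (elastic.co/guide/en/elasticsearch/reference/current/…). But "copy_to" instead of "path" should work.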
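
The script suggested in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `extract_max_id` and `build_sql` are hypothetical, the table and column names come from the question, and the sample response is the aggregation result shown in the answer. Fetching the response over HTTP and handing the resulting SQL to the river are deliberately left out.

```python
import json

# Request body for the max aggregation used in the answer.
MAX_ID_QUERY = {"aggs": {"max_id": {"max": {"field": "doc_id"}}}}


def extract_max_id(response):
    """Return the max doc_id as an int, or None if the aggregation has no value."""
    value = response.get("aggregations", {}).get("max_id", {}).get("value")
    return None if value is None else int(value)


def build_sql(max_id, table="orders", key="deviceid"):
    """Build the incremental select; fall back to a full select when max_id is None."""
    if max_id is None:
        return f"select * from {table}"
    # int() ensures only a number is interpolated into the SQL text.
    return f"select * from {table} where {key} > {int(max_id)}"


# Sample response, reduced to the part of the aggregation result that is used.
sample = json.loads('{"aggregations": {"max_id": {"value": 3}}}')
sql = build_sql(extract_max_id(sample))
print(sql)
```

A cron job could run something like this before each import and use the generated statement as the river's query. How that statement is passed to the JDBC river plugin depends on the plugin version and is not covered here.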