How to retrieve all the document ids from an Elasticsearch index: answers

【Question title】: How to retrieve all the document ids from an elasticsearch index
【Posted】: 2014-08-26 00:55:45
【Question】:

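How can I retrieve all the document ids (the internal document "_id") from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do it?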

【Comments】:

Are you using a specific language or client library to talk to Elasticsearch?
stackoverflow.com/questions/17497075/…

Tags: elasticsearch

【Answer 1】:

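I would export the whole index and read the file system. My experience with size/from and scan/scroll has been disastrous when dealing with query result sets in the millions. It simply takes too long.

If you can use a tool like Knapsack, you can export the index to the file system and iterate over the directories. Each document is stored under its own directory, named after its _id. There is no need to actually open the files. Just iterate over the directories.

Link to Knapsack: https://github.com/jprante/elasticsearch-knapsack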
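The directory walk described above could look roughly like the sketch below. It assumes the export has already been unpacked into a plain directory in which every immediate subdirectory is named after a document _id, as this answer describes; the real Knapsack archive layout has not been verified here, and `list_document_ids` and the export path are hypothetical names used only for illustration.

```python
import os
import sys


def list_document_ids(export_dir):
    """Return the names of the immediate subdirectories of export_dir.

    Assumption: each subdirectory name is a document _id. No file inside
    the subdirectories is opened; only the directory entries are read.
    """
    with os.scandir(export_dir) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_ids.py /path/to/unpacked/export
    for doc_id in list_document_ids(sys.argv[1]):
        print(doc_id)
```

Reading only directory entries keeps the cost proportional to the number of documents rather than to their size, which is the point of not opening the files.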

EDIT: Hopefully you are not doing this very often... otherwise this may not be a viable solution.

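【Comments】:

【Answer 2】:

For that number of documents, you probably want to use the scan and scroll API.

Many client libraries have ready-made helpers for that interface. For example, with elasticsearch-py you can do: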

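import elasticsearch
import elasticsearch.helpers

# eshost is the cluster address and idxname the index name (both defined by you)
es = elasticsearch.Elasticsearch(eshost)
# The "fields" key is the request syntax of the ES release current when this was written
scroll = elasticsearch.helpers.scan(es, query='{"fields": "_id"}', index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])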

【Comments】:

scan is deprecated as of ES 2.1.0, so we may just need to use the scroll API. elastic.co/guide/en/elasticsearch/reference/current/…
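For readers who want to drive the scroll API directly instead of going through the helper, a minimal sketch follows. It assumes an elasticsearch-py style client exposing `search`, `scroll` and `clear_scroll`; the exact keyword arguments (for example `body=`) differ between client versions, so treat them as assumptions. `iter_document_ids`, `page_size` and `keep_alive` are hypothetical names introduced for illustration.

```python
def iter_document_ids(client, index, page_size=1000, keep_alive="1m"):
    """Yield the _id of every document in `index` using the scroll API.

    Assumption: `client` behaves like an elasticsearch-py client.
    """
    body = {
        "query": {"match_all": {}},
        "_source": False,   # ids only, skip the document source
        "sort": ["_doc"],   # index order, the cheapest order for scrolling
    }
    page = client.search(index=index, body=body,
                         scroll=keep_alive, size=page_size)
    scroll_id = page["_scroll_id"]
    try:
        # An empty page of hits means the scroll is exhausted.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # Release the server-side scroll context as soon as we are done.
        client.clear_scroll(scroll_id=scroll_id)
```

Usage would be along the lines of `for doc_id in iter_document_ids(es, idxname): ...`, where `es` and `idxname` are the client and index name from the answer above.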

【Answer 3】:

First, you can issue a request to get the full count of records in the index.

curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'

{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

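Then you will want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter returns only the index and _id that you are interested in.

Find a good page size that you can consume without running out of memory, and increment from on each iteration.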

curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'

Example item in the response:

{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
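The paging loop described in this answer can be sketched as follows. The snippet only builds the sequence of request URLs from a known count; it does not send them. The base URL and the `fields=` parameter are taken from the curl commands above, while `page_offsets` and `search_urls` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Search endpoint taken from the curl example in this answer.
BASE_URL = "http://localhost:9200/documents/document/_search"


def page_offsets(total, page_size):
    """Return every `from` value needed to cover `total` documents."""
    return range(0, total, page_size)


def search_urls(total, page_size, base_url=BASE_URL):
    """Build one search URL per page; the empty `fields` keeps hits to ids."""
    return [
        base_url + "?" + urlencode({"fields": "", "size": page_size, "from": offset})
        for offset in page_offsets(total, page_size)
    ]


if __name__ == "__main__":
    # 1408 is the count returned by the _count request shown above.
    for url in search_urls(1408, 1000):
        print(url)
```

With the count of 1408 from the example and a page size of 1000, this yields two requests, one with from=0 and one with from=1000.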

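【Comments】:

Deep pagination with size and from is very heavy. You will realize that by the time you reach "?size=1000&from=19999000".
Thanks Anton, I haven't tried it on a data set that large. What do you recommend?
I recommend the scan and scroll API mentioned in my answer.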