Elasticsearch/Python - Re-index data after changing the mappings? (Answers)

【Question title】: Elasticsearch/Python - Re-index data after changing the mappings?
【Posted】: 2015-08-29 11:34:17
【Question】:

I'm a bit confused about how to re-index data in Elasticsearch after the mappings or data types have changed.

According to the Elasticsearch documentation:

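Use a scrolled search to pull documents from the old index and index them into the new index with the bulk API. Many of the client APIs provide a reindex() method that will do all of this for you. Once you are done, you can delete the old index.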

Here is my old mapping:

{
  "test-index2": {
    "mappings": {
      "business": {
        "properties": {
          "address": {
            "type": "nested",
            "properties": {
              "country": {
                "type": "string"
              },
              "full_address": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  }
}

Here is the new index mapping, in which I am renaming full_address -> location_address:

{
  "test-index2": {
    "mappings": {
      "business": {
        "properties": {
          "address": {
            "type": "nested",
            "properties": {
              "country": {
                "type": "string"
              },
              "location_address": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  }
}

I am using the Python client for Elasticsearch:

https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex
es = Elasticsearch(["es.node1"])

reindex(es, "source_index", "target_index")

However, this only transfers the data from one index to another.

How can I use it to change the mapping (data types, etc.) in the case above?

【Comments】:

Tags: python elasticsearch

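【Solution 1】:

This is simple if you use the scan&scroll and Bulk API helpers that are already implemented in the Python client for Elasticsearch.

First, fetch all documents with the scan&scroll method.

Loop over them and make the necessary changes to each document.

Insert the modified documents into the new index with the Bulk API.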

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Assumes "new_index" has already been created with the new mapping.

# Use the scan&scroll method to fetch all documents from your old index
res = helpers.scan(
    es,
    query={"query": {"match_all": {}}},
    size=1000,
    index="old_index",
)

new_insert_data = []

# Change the field name and everything else by looping through all your documents
for x in res:
    x['_index'] = 'new_index'
    # Rename "full_address" to "location_address" inside "address"
    # (a nested field may hold a single object or a list of objects)
    address = x['_source'].get('address') or []
    for entry in (address if isinstance(address, list) else [address]):
        if 'full_address' in entry:
            entry['location_address'] = entry.pop('full_address')
    # This is a useless field
    x.pop('_score', None)

    # Add the new data into a list
    new_insert_data.append(x)

print(new_insert_data)

# Use the Bulk API to insert the list of your modified documents into the new index
helpers.bulk(es, new_insert_data)

# Refresh after the bulk insert so the new documents become searchable
es.indices.refresh(index="new_index")
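
The list-based approach above keeps every document in memory before the bulk call. A minimal sketch of a streaming variant follows, assuming the nested address layout from the question. The function names rename_address_field, stream_actions, and reindex_with_rename are hypothetical, as are the index names passed to them; only helpers.scan and helpers.bulk come from the client library.

```python
def rename_address_field(hit, target_index):
    """Build one bulk action for target_index from a single scan hit.

    Hypothetical helper: renames address.full_address to
    address.location_address without mutating the original hit.
    """
    source = dict(hit["_source"])
    address = source.get("address")
    entries = address if isinstance(address, list) else [address]
    renamed = []
    for entry in entries:
        if isinstance(entry, dict) and "full_address" in entry:
            entry = dict(entry)  # copy so the original hit stays intact
            entry["location_address"] = entry.pop("full_address")
        renamed.append(entry)
    if address is not None:
        source["address"] = renamed if isinstance(address, list) else renamed[0]

    action = {"_index": target_index, "_id": hit["_id"], "_source": source}
    if "_type" in hit:  # mapping types only exist on older Elasticsearch versions
        action["_type"] = hit["_type"]
    return action


def stream_actions(hits, target_index):
    """Lazily turn scan hits into bulk actions, one at a time."""
    for hit in hits:
        yield rename_address_field(hit, target_index)


def reindex_with_rename(es, source_index, target_index):
    """Copy source_index into target_index, renaming the field on the way."""
    # Imported here so the pure helpers above work without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=source_index, query={"query": {"match_all": {}}})
    return helpers.bulk(es, stream_actions(hits, target_index))
```

Because helpers.bulk accepts any iterable and sends it in chunks, passing a generator means only one chunk of documents is held in memory at a time. The target index is still assumed to exist with the new mapping before reindex_with_rename is called.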

【Discussion】:

【Solution 2】:

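The reindex() API simply "moves" documents from one index to another. It cannot detect or infer that the field full_address in the old index's documents should become location_address in the new index's documents. I doubt that any API offered by the standard Elasticsearch clients does what you need. The only way I can think of to achieve this is custom client-side logic: maintain a dictionary that maps field names in the old index to field names in the new index, then read each document from the old index and index a corresponding document into the new index with the field names taken from that dictionary.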
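
The field-name dictionary described here could look like the following minimal sketch. FIELD_RENAMES and rename_fields are hypothetical names, and the sketch assumes a rename applies to a key wherever it appears in the document, which may be too broad if the same key name is used at several levels with different meanings.

```python
# Hypothetical dictionary: old field name -> new field name
FIELD_RENAMES = {"full_address": "location_address"}


def rename_fields(value, renames):
    """Return a copy of value with dict keys renamed at every nesting level."""
    if isinstance(value, dict):
        return {
            renames.get(key, key): rename_fields(child, renames)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [rename_fields(child, renames) for child in value]
    return value  # scalars are returned unchanged
```

Applied to the _source of each document read from the old index, the result can then be indexed into the new index. Keys that are not in the dictionary are passed through as they are.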

【Discussion】:

【Solution 3】:

After updating the mapping, this can be done by updating the existing documents with the bulk API.

POST /_bulk
{"update":{"_id":"59519","_type":"asset","_index":"assets"}}
{"doc":{"facility_id":491},"detect_noop":false}

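Note: 'detect_noop' controls whether no-op updates are detected. With it set to false, as here, the update is applied even when the document would not change.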
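
For more than a handful of documents, the newline-delimited body of such a request can be generated rather than written by hand. This is a minimal sketch that only builds the request body. The function name build_bulk_update_body and its parameters are hypothetical, and it assumes the document IDs and the partial documents are already known on the client.

```python
import json


def build_bulk_update_body(index, doc_type, updates, detect_noop=False):
    """Build the newline-delimited body for POST /_bulk.

    updates maps each document ID to the partial document to merge in.
    """
    lines = []
    for doc_id, partial_doc in updates.items():
        # Action line: which document to update
        lines.append(json.dumps(
            {"update": {"_id": doc_id, "_type": doc_type, "_index": index}}
        ))
        # Payload line: the partial document and the detect_noop flag
        lines.append(json.dumps({"doc": partial_doc, "detect_noop": detect_noop}))
    # The bulk API expects every line, including the last, to end with a newline
    return "\n".join(lines) + "\n"


body = build_bulk_update_body("assets", "asset", {"59519": {"facility_id": 491}})
```

The resulting string reproduces the two-line request shown above and can be sent to the _bulk endpoint with any HTTP client.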

【Discussion】: