Elasticsearch Troubleshooting Summary

[2016-12-15 14:53:21,496][WARN ][monitor.jvm ] [vsp4] [gc][old][94725][4389] duration [26.9s], collections [1]/[27s], total [26.9s]/[15.9h], memory [19.7gb]->[17gb]/[19.8gb], all_pools {[young] [1.1gb]->[43.1mb]/[1.1gb]}{[survivor] [130.2mb]->[0b]/[149.7mb]}{[old] [18.5gb]->[16.9gb]/[18.5gb]}
[2016-12-15 14:53:57,117][WARN ][monitor.jvm ] [vsp4] [gc][old][94731][4390] duration [29.9s], collections [1]/[30.4s], total [29.9s]/[15.9h], memory [18.6gb]->[18gb]/[19.8gb], all_pools {[young] [71.1mb]->[51.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[18gb]/[18.5gb]}
[2016-12-15 14:54:31,246][WARN ][monitor.jvm ] [vsp4] [gc][old][94735][4391] duration [30.6s], collections [1]/[31.1s], total [30.6s]/[15.9h], memory [18.5gb]->[17.9gb]/[19.8gb], all_pools {[young] [14.3mb]->[1.3mb]/[1.1gb]}{[survivor] [22.1mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[17.9gb]/[18.5gb]}
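To read warnings like these at a glance, it can help to pull out the pause duration and how full the old generation still is after the collection. The following is a minimal sketch, assuming the `monitor.jvm` line layout shown above; the function name `parse_gc_line` and the returned keys are hypothetical helpers, not part of Elasticsearch.

```python
import re

# Matches the "[gc][old]" monitor.jvm warnings in the layout shown above.
GC_LINE = re.compile(
    r"\[gc\]\[(?P<gen>\w+)\].*?duration \[(?P<duration>[\d.]+)s\]"
    r".*?\{\[old\] \[(?P<before>[\d.]+)(?P<before_unit>[a-z]+)\]->"
    r"\[(?P<after>[\d.]+)(?P<after_unit>[a-z]+)\]/"
    r"\[(?P<max>[\d.]+)(?P<max_unit>[a-z]+)\]\}"
)

UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def to_bytes(value, unit):
    """Convert a value such as ("18.5", "gb") to a byte count."""
    return float(value) * UNITS[unit]


def parse_gc_line(line):
    """Return pause duration and old-gen occupancy after GC, or None if no match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    after = to_bytes(m["after"], m["after_unit"])
    maximum = to_bytes(m["max"], m["max_unit"])
    return {
        "generation": m["gen"],
        "duration_s": float(m["duration"]),
        "old_used_after_pct": round(100 * after / maximum, 1),
    }


sample = (
    "[2016-12-15 14:53:57,117][WARN ][monitor.jvm ] [vsp4] [gc][old][94731][4390] "
    "duration [29.9s], collections [1]/[30.4s], total [29.9s]/[15.9h], "
    "memory [18.6gb]->[18gb]/[19.8gb], all_pools {[young] [71.1mb]->[51.8mb]/[1.1gb]}"
    "{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [18.4gb]->[18gb]/[18.5gb]}"
)
print(parse_gc_line(sample))
```

An old generation that stays almost full after a collection lasting tens of seconds suggests the heap is close to exhausted rather than merely busy.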

There are two rules for sizing the ES heap:

1. Do not exceed 50% of the available memory.

2. Do not exceed 32 GB.
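The two rules combine into a simple upper bound. This is only an arithmetic sketch; the function name `recommended_heap_gb` and its parameters are hypothetical, and the 32 GB cap is taken directly from the rule above.

```python
def recommended_heap_gb(total_ram_gb, max_fraction=0.5, cap_gb=32):
    """Upper bound for the ES heap: at most half of RAM, and never above the cap."""
    return min(total_ram_gb * max_fraction, cap_gb)


for ram in (16, 40, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap at most")
```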

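Fielddata is loaded into memory per index; ES does not load only the data for the documents that match a search. indices.fielddata.cache.size (for example 5gb or 20%) limits the memory available to fielddata, and when that limit is reached the oldest data is evicted. By default ES does not evict anything. Setting this value has a cost: when memory runs short, data has to be read from disk again each time, which causes heavy disk I/O. However, if you want ES to keep only the most recent data in memory, you need to configure it.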
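The trade-off described above (bounded memory versus repeated disk reads) can be illustrated with a toy cache. This is a sketch only, assuming eviction in least-recently-used order; `BoundedFielddataCache` is a hypothetical class and does not reflect how Elasticsearch implements its cache internally.

```python
from collections import OrderedDict


class BoundedFielddataCache:
    """Toy LRU cache that evicts the oldest entries once max_bytes is exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> size in bytes, oldest first
        self.evictions = 0
        self.disk_loads = 0

    def get(self, key, size_bytes):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return "hit"
        # Cache miss: the data has to be read from disk again.
        self.disk_loads += 1
        self.entries[key] = size_bytes
        self.used_bytes += size_bytes
        # Evict from the oldest end until we are back under the limit.
        while self.used_bytes > self.max_bytes and len(self.entries) > 1:
            _, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
            self.evictions += 1
        return "loaded"


cache = BoundedFielddataCache(max_bytes=10)
for key in ("a", "b", "a"):
    print(key, cache.get(key, size_bytes=6))
print("disk loads:", cache.disk_loads, "evictions:", cache.evictions)
```

With a limit too small for the working set, every access alternates between loading and evicting, which is the disk I/O pattern the paragraph warns about.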

Monitoring fielddata

per-index using the indices-stats API:
```
GET /_stats/fielddata?fields=*
```

per-node using the nodes-stats API:

GET /_nodes/stats/indices/fielddata?fields=*

Or even per-index per-node:

GET /_nodes/stats/indices/fielddata?level=indices&fields=*

By setting ?fields=*, the memory usage is broken down for each field.
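Once a nodes-stats response has been fetched, the per-field breakdown can be summed across nodes to find the fields that use the most memory. The sketch below works on an already-parsed JSON body and assumes the response nests the figures under `nodes.<id>.indices.fielddata.fields.<field>.memory_size_in_bytes`; the function name and the sample payload (node IDs, field names, sizes) are made up for illustration.

```python
def fielddata_by_field(nodes_stats):
    """Sum per-field fielddata bytes across all nodes, largest first."""
    totals = {}
    for node in nodes_stats.get("nodes", {}).values():
        fields = node.get("indices", {}).get("fielddata", {}).get("fields", {})
        for name, stats in fields.items():
            totals[name] = totals.get(name, 0) + stats.get("memory_size_in_bytes", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical response body, trimmed to the parts the function reads.
sample_response = {
    "nodes": {
        "node-1": {"indices": {"fielddata": {"fields": {
            "timestamp": {"memory_size_in_bytes": 4000},
            "user_id": {"memory_size_in_bytes": 1500},
        }}}},
        "node-2": {"indices": {"fielddata": {"fields": {
            "timestamp": {"memory_size_in_bytes": 3000},
        }}}},
    }
}

for field, size in fielddata_by_field(sample_response):
    print(field, size)
```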

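The fielddata circuit breaker estimates, before fielddata is loaded into memory, whether enough memory is available. If memory is insufficient and fielddata keeps being loaded anyway, the result is an out-of-memory error.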
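The idea of checking an estimate before loading can be sketched as follows. This is a conceptual model only: `FielddataBreaker`, `reserve`, and `CircuitBreakingError` are hypothetical names, and the byte figures are arbitrary.

```python
class CircuitBreakingError(Exception):
    """Raised when a load would push usage past the configured limit."""


class FielddataBreaker:
    """Refuses a load up front instead of running out of memory halfway through."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        # Check the estimate BEFORE anything is loaded into memory.
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {self.used_bytes + estimated_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimated_bytes


breaker = FielddataBreaker(limit_bytes=100)
breaker.reserve(60)
try:
    breaker.reserve(50)
except CircuitBreakingError as err:
    print("request aborted:", err)
```

The rejected request fails with an error, but the node itself keeps running, which is the point of the breaker.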

Link: https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html

2. Search errors

The error log is as follows:

Failed to execute phase [query], all shards failed; shardFailures {[7l4w6bMqTReFs68KxMe1LA][smart_metadata-2015010100-2015010800][0]: RemoteTransportException[[vsp4][10.17.139.128:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@3b702529 on EsThreadPoolExecutor[search, queue capacity = 10000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6a6693a8[Running, pool size = 37, active threads = 37, queued tasks = 10000, completed tasks = 269010]]]; }
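The interesting numbers in this message are the pool name, its queue capacity, and how many threads and queue slots are in use. A small sketch for extracting them is shown below; it assumes the message text in the form "queue capacity = N ... pool size = N, active threads = N, queued tasks = N", and `parse_rejection` and the `saturated` key are hypothetical helpers.

```python
import re

REJECTION = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+)"
)


def parse_rejection(message):
    """Extract thread pool figures from an EsRejectedExecutionException message."""
    m = REJECTION.search(message)
    if m is None:
        return None
    info = {k: (v if k == "pool" else int(v)) for k, v in m.groupdict().items()}
    # All workers busy and the queue full: the node cannot accept more work.
    info["saturated"] = (
        info["active"] == info["pool_size"] and info["queued"] >= info["capacity"]
    )
    return info


message = (
    "rejected execution of org.elasticsearch.transport.TransportService$4@3b702529 "
    "on EsThreadPoolExecutor[search, queue capacity = 10000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6a6693a8"
    "[Running, pool size = 37, active threads = 37, queued tasks = 10000, "
    "completed tasks = 269010]]"
)
print(parse_rejection(message))
```

In this log, all 37 search threads are active and all 10000 queue slots are taken, so the node rejects the next search request.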

Solution:

Modify elasticsearch.yml:

threadpool.bulk.type: fixed
threadpool.bulk.size: 120
threadpool.bulk.queue_size: -1
threadpool.search.queue_size: -1

EsRejectedExecutionException in elasticsearch for parallel search

Answer1:

Elasticsearch has a thread pool and a queue for search on each node. A thread pool has N workers ready to handle requests. When a request arrives and a worker is free, that worker handles it. By default, the number of workers equals the number of cores on that CPU. When all workers are busy and more search requests arrive, those requests go to the queue. The size of the queue is also limited. Its default size is, say, 100, and if more parallel requests arrive than that, the extra requests are rejected, as you can see in the error log.
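The behaviour described in this answer (workers first, then the queue, then rejection) can be modelled with simple counters. This is a counts-only sketch, not Elasticsearch code; `FixedPoolModel` and its methods are hypothetical, and the worker and queue sizes are chosen small so the rejection is easy to see.

```python
class FixedPoolModel:
    """Counts-only model of a fixed thread pool with a bounded queue."""

    def __init__(self, workers, queue_size):
        self.workers = workers
        self.queue_size = queue_size
        self.active = 0
        self.queued = 0
        self.rejected = 0

    def submit(self):
        if self.active < self.workers:
            self.active += 1          # a free worker takes the request
            return "running"
        if self.queued < self.queue_size:
            self.queued += 1          # all workers busy: wait in the queue
            return "queued"
        self.rejected += 1            # queue full: the request is rejected
        return "rejected"

    def finish_one(self):
        """A worker completes and picks up a queued request if one is waiting."""
        if self.queued > 0:
            self.queued -= 1
        elif self.active > 0:
            self.active -= 1


pool = FixedPoolModel(workers=2, queue_size=3)
print([pool.submit() for _ in range(6)])
```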

The solution to this would be to -

Increase the size of the queue or thread pool - The immediate solution is to increase the size of the search queue. We can also increase the size of the thread pool, but that might badly affect the performance of individual queries, so increasing the queue might be the better idea. Remember, though, that this queue is memory-resident, and increasing the queue size too much can result in out-of-memory issues. You can get more info on the same here.
Increase the number of nodes and replicas - Remember that each node has its own search thread pool and queue. Also, a search can be served by a primary shard or a replica.

Answer2:

Maybe it sounds strange, but you need to lower the number of parallel searches. With that exception, Elasticsearch tells you that you are overloading it. There are some limits (at the thread count level) that are set in Elasticsearch and, most of the time, the defaults for these limits are the best option. So, if you are testing your cluster to see how much load it can hold, this would be an indicator that some limits have been reached.

Alternatively, if you really want to change the default you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue size, the more pressure you put on your cluster that, in the end, will cause instability.
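Lowering the number of parallel searches can be done on the client side by capping how many requests are in flight at once. The sketch below uses only the Python standard library; `run_throttled` is a hypothetical helper, and `FakeSearch` is a stand-in for a real search call so that the example runs without a cluster.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(search_fn, queries, max_parallel):
    """Run search_fn over queries with at most max_parallel calls in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as executor:
        return list(executor.map(search_fn, queries))


class FakeSearch:
    """Stand-in for a real search call; records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0
        self.peak = 0

    def __call__(self, query):
        with self._lock:
            self._current += 1
            self.peak = max(self.peak, self._current)
        time.sleep(0.01)  # pretend to wait on the cluster
        with self._lock:
            self._current -= 1
        return f"result for {query}"


fake = FakeSearch()
results = run_throttled(fake, [f"q{i}" for i in range(20)], max_parallel=4)
print(len(results), "results, peak concurrency", fake.peak)
```

The cap belongs on the client because the cluster-side queue only delays the overload; it does not remove it.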

ES Thread Pool

A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

There are several thread pools, but the important ones include:

generic: For generic operations (e.g., background node discovery). Thread pool type is scaling.
index: For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.
search: For count/search/suggest operations. Thread pool type is fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
get: For get operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.
bulk: For bulk operations. Thread pool type is fixed with a size of # of available processors, queue_size of 50. The maximum size for this pool is 1 + # of available processors.
percolate: For percolate operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.
snapshot: For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).
warmer: For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).
refresh: For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(10, (# of available processors)/2).
listener: Mainly for java client executing of action when listener threaded is set to true. Thread pool type is scaling with a default max of min(10, (# of available processors)/2).
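The sizing formulas in this list can be evaluated for a given processor count. The sketch below simply transcribes them; `default_pool_sizes` is a hypothetical helper, and treating "/2" as integer division is an assumption, since the list does not say how the result is rounded.

```python
def default_pool_sizes(processors):
    """Evaluate the default sizes listed above for a given processor count."""
    half = processors // 2  # assumption: "/2" is integer division
    return {
        "index": processors,
        "search": int((processors * 3) / 2) + 1,
        "get": processors,
        "bulk": processors,
        "percolate": processors,
        "snapshot_max": min(5, half),
        "warmer_max": min(5, half),
        "refresh_max": min(10, half),
        "listener_max": min(10, half),
    }


for name, size in default_pool_sizes(24).items():
    print(name, size)
```

For 24 processors the search formula gives 37 threads, which would be consistent with the "pool size = 37" reported in the rejection log earlier in these notes.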

Changing a specific thread pool can be done by setting its type-specific parameters; for example, changing the index thread pool to have more threads:

thread_pool:
    index:
        size: 30

Recovering ES indices after a power failure

1. Translog exceptions

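Pay particular attention to the index.translog.interval setting as well. It is the interval at which the three conditions above are checked, and an unreasonable value can keep the settings above from working as expected. The default is 5s.
In testing with index.engine.force_new_translog: true set, translog exceptions no longer left shards unassignable, and the data lost was verified to be only the most recent data that had not yet been flushed.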

This kind of exception is caused by corruption of non-segment files and can be repaired with the CheckIndex tool in lucene-core-5.3.1.jar (ES_HOME/lib/lucene-core-5.3.1.jar). The procedure is as follows:
java -cp lucene-core-5.3.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/disk4/data/LOCALCLUSTER/SERVICE-ELASTICSEARCH-9e6f0b06c3f54797a313ab45734c3b1a/SERVICE-ELASTICSEARCH-9e6f0b06c3f54797a313ab45734c3b1a/nodes/0/indices/blacklist_alarm_info-2016071400-2016072100/0/index/ -exorcise
The recovered index will contain data; any data that was lost has to be identified by verification and then restored.
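Because -exorcise permanently drops segments that cannot be read, it is safer to run CheckIndex once without it, which only reports problems, and to back up the shard directory first. The sketch below only builds the argument list for both runs; `check_index_command` is a hypothetical helper, and the jar and index paths are placeholders to be replaced with the real ones.

```python
def check_index_command(lucene_jar, index_dir, exorcise=False):
    """Build the CheckIndex command line; with exorcise=False it only reports."""
    cmd = [
        "java", "-cp", lucene_jar,
        "-ea:org.apache.lucene...",
        "org.apache.lucene.index.CheckIndex",
        index_dir,
    ]
    if exorcise:
        cmd.append("-exorcise")  # destructive: removes unreadable segments
    return cmd


# Placeholder paths for illustration only.
jar = "lucene-core-5.3.1.jar"
index_dir = "/path/to/nodes/0/indices/my_index/0/index/"

print(" ".join(check_index_command(jar, index_dir)))                 # report only
print(" ".join(check_index_command(jar, index_dir, exorcise=True)))  # repair
# To execute, pass the list to subprocess.run(cmd, check=True) with the node stopped.
```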

3. Segment file exceptions

https://github.com/elastic/elasticsearch/pull/17663