Scaling Elasticsearch Horizontally: Adding Nodes

ELK

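ELK is something most ops engineers run into sooner or later: a distributed log collection platform. Logstash collects data and writes it to Elasticsearch, and Kibana reads it back from Elasticsearch. Data in Elasticsearch is searchable through a range of query APIs, aggregations and so on. It is powerful, and I won't go into more detail here.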

Environment

Current state:

os: centos6.6

elasticsearch: 2.4

cpu: 8

mem: 48

disk: 8T

index: 4 # 4 new indices per day

primary shards: 12

replica shards: 1

Master-eligible node: 4 # can be voted in as master

Data node: 2 # data nodes; cannot become master

Client node: 0 # neither master-eligible nor a data node; a "smart node" that forwards requests and data within the cluster

Tribe node: 0 # a federated client across multiple clusters, e.g. for hot/cold or read/write separation between clusters

The official documentation explains these node types clearly.

Below is a picture of the cluster after the nodes were added.

[Figure: the cluster after the nodes were added]

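The Cause

The Elasticsearch cluster originally had only 6 nodes and ran fine for a long time, but recently it kept running into GC trouble. Ordinary GC is acceptable: it is just the JVM's mechanism for freeing memory, typically a young-generation collection that clears out objects no longer in use. Lately, though, we were getting full GCs, which work on the old generation, where long-lived objects reside, and have to compact memory. We give Elasticsearch a 32G heap, and a full GC over 32G is very slow: it took tens of minutes and at best brought usage down to about 31.5G. So after tens of minutes of collecting, the node behaved normally for a few minutes, and as soon as someone ran a query in Kibana it went into full GC again. At that point the node was effectively out of service. Once one node drops out, the load on the others rises, and the other five nodes went into full GC one after another. The cluster may well still report green in this situation, but Logstash could no longer write data and Kibana could no longer query it.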

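Attempted Fix

The previous few full GCs mostly happened at night, when nobody was querying and the write volume was lower. The fix was simple and crude: restart the nodes one after another.

\# Disable shard allocation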

curl $ESIP:9200/_cluster/settings -d'{"transient" : {"cluster.routing.allocation.enable" : "none"}}' -XPUT

\# Restart the node

/etc/init.d/elasticsearch stop

/etc/init.d/elasticsearch start

\# Re-enable shard allocation

curl $ESIP:9200/_cluster/settings -d'{"transient" : {"cluster.routing.allocation.enable" : "all"}}' -XPUT

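Wait for the cluster status to turn green before moving on to the next machine, and repeat until every node in the cluster has been restarted. Because Elasticsearch is distributed, many settings, shard allocation among them, can be changed on a live cluster.

(I am not sure whether it is also safe to restart the next node while the cluster is yellow; I have not tried it.)

The first few full GCs were handled this way, and recovery was relatively quick.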
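
The per-node procedure above can be sketched as a few shell helpers. This is only a sketch: it assumes `$ESIP` points at a reachable node as in the commands above, the function names are made up for illustration, and the `sed` expression is a quick way to read the `status` field of `_cluster/health` without extra tools.

```shell
#!/usr/bin/env bash
# Illustrative helpers for a rolling restart; function names are hypothetical.

# Toggle cluster-wide shard allocation: pass "none" or "all".
set_allocation() {
  curl -s -XPUT "$ESIP:9200/_cluster/settings" \
    -d "{\"transient\": {\"cluster.routing.allocation.enable\": \"$1\"}}"
}

# Extract the "status" value from a _cluster/health JSON response on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Block until the cluster reports green, polling every 10 seconds.
# An unreachable node yields an empty status, so the loop simply keeps waiting.
wait_for_green() {
  until [ "$(curl -s "$ESIP:9200/_cluster/health" | health_status)" = "green" ]; do
    sleep 10
  done
}
```

On each node in turn, the sequence would then be `set_allocation none`, the stop/start of the init script, `set_allocation all`, and finally `wait_for_green` before touching the next machine.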

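The Painful Part

Then one workday afternoon a full GC hit out of the blue. I followed the procedure above, but the cluster had only just turned green, with developers running queries, when a node I had just restarted went back into full GC within two minutes. It was time to either tune the cluster or scale out horizontally. Several predecessors had already tuned it before me (respect to them).

So when does GC kick in? Here is the explanation.

When heap_used_percent goes above 75%, ES starts to GC. In other words, if a node stays above 75% for long periods, it is short of memory and GC is likely to be slow. Going further, at 85% or 95% a single GC can probably take more than 10s, and the node may even run into an OOM.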

curl -s -XGET 'http://xxxxx:9200/_nodes/stats' | python -m json.tool | grep heap_used_percent
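
As a variation on the command above, the `_cat/nodes` API can print one line per node, which makes it easy to flag the nodes that are over the 75% mark. A minimal sketch, assuming the host is passed as an argument; `hot_nodes` and `check_heap` are hypothetical names:

```shell
# Read "name heap.percent" lines on stdin and print those above the limit in $1.
# The percentage is taken from the last field because node names may contain spaces.
hot_nodes() {
  awk -v limit="$1" '$NF + 0 > limit + 0'
}

# List every node of the cluster reachable at host $1 whose heap usage is above 75%.
check_heap() {
  curl -s "http://$1:9200/_cat/nodes?h=name,heap.percent" | hot_nodes 75
}
```

Running `check_heap` from cron and alerting on non-empty output would give an early warning before a node reaches the full-GC spiral described earlier.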

Clear the caches on a schedule at night:

curl -XPOST 'http://xxxxxx:9200/*/_cache/clear'

Adding Nodes

The new nodes use the same configuration as the existing ones:

# Cluster

cluster.name: elastic_xxxx

cluster.routing.allocation.node_concurrent_recoveries: 50

cluster.routing.allocation.node_initial_primaries_recoveries: 20

# Node

node.name: xxxxxxx-03

node.rack: xxxx-03

node.master: true

node.data: true

# Index

index.number_of_shards: 12

index.number_of_replicas: 1

#index.cache.field.max_size: 50000

index.cache.field.expire: 60m

#index.cache.field.type: soft

index.refresh_interval: 5s

index.translog.flush_threshold_size: 512mb

index.translog.flush_threshold_period: 30m

index.translog.interval: 5s

# Paths

path.data: /data/es/data

path.work: /data/es/work

path.logs: /data/es/logs

path.config: /opt/es/config

path.plugins: /opt/es/plugins

# Memory

bootstrap.memory_lock: true

indices.cache.filter.size: 20%

indices.fielddata.cache.size: 40%

# Network

network.host: 172.16.3.86

# Discovery

discovery.zen.ping.unicast.hosts: ["xxxx-01:9300", "xxxx-02:9300", "xxxxx-03:9300", "xxxxxxxxx-04:9300", "xxxxxxx-05:9300", "xxxxxxxx-06:9300"]

discovery.zen.minimum_master_nodes: 4 # (master-eligible nodes / 2) + 1

discovery.zen.ping.timeout: 20s

# Gateway

#gateway.type: local

gateway.recover_after_nodes: 5

gateway.recover_after_time: 5m

gateway.expected_nodes: 6

# Various

node.max_local_storage_nodes: 1

action.destructive_requires_name: true

# Recovery

indices.recovery.max_bytes_per_sec: 800mb

indices.recovery.concurrent_streams: 50

# Slow log

index.indexing.slowlog.level: info

index.indexing.slowlog.source: 1000

index.search.slowlog.threshold.query.warn: 10s

index.search.slowlog.threshold.query.info: 5s

index.search.slowlog.threshold.query.debug: 2s

index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s

index.search.slowlog.threshold.fetch.info: 800ms

index.search.slowlog.threshold.fetch.debug: 500ms

index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s

index.indexing.slowlog.threshold.index.info: 5s

index.indexing.slowlog.threshold.index.debug: 2s

index.indexing.slowlog.threshold.index.trace: 500ms

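After changing the names in the config and installing the plugins, and before starting the new nodes, a number of questions came to mind.

The new nodes have much more free disk space. Will that trigger a rebalance away from the existing nodes, and how do I turn it off?

Disable allocation on the existing indices. This does not conflict with having the cluster-wide cluster.routing.allocation.enable turned on.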

curl -XPUT 'xxxxx:9200/logstash_*/_settings' -d '{"index.routing.allocation.disable_allocation":"true"}'

The two new nodes are master-eligible. Does the number of votes needed to elect a master have to change?

curl -XPUT 'http://172.16.3.87:9200/_cluster/settings' -d '{"persistent": {"discovery.zen.minimum_master_nodes": 4}}'
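
The value 4 follows from the "(master-eligible nodes / 2) + 1" rule noted in the config. A small sketch of the arithmetic, assuming the 4 master-eligible nodes from the environment list plus the two new ones; `quorum` is a hypothetical helper name:

```shell
# Majority quorum for N master-eligible nodes: floor(N / 2) + 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 4   # before the change: prints 3
quorum 6   # after adding two master-eligible nodes: prints 4
```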

Take disk space out of the allocation decision

cluster.routing.allocation.disk.threshold_enabled: false


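Spread newly created indices evenly across the nodes

Modify the template and add index.routing.allocation.total_shards_per_node, so that each newly created index can place at most 4 shards on any single node.

Update the template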
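
To see why a limit of 4 works, it helps to compute the smallest limit that still lets every shard of one index be allocated. The sketch below assumes the figures from the environment list (12 primaries, 1 replica) and 8 data-holding nodes (the 6 original nodes plus the 2 new ones, which is my reading of the setup); `min_shards_per_node` is a hypothetical helper:

```shell
# Smallest per-node limit that still fits all shards of one index:
# ceil(primaries * (1 + replicas) / nodes)
min_shards_per_node() {
  local primaries=$1 replicas=$2 nodes=$3
  local total=$(( primaries * (1 + replicas) ))
  echo $(( (total + nodes - 1) / nodes ))
}

min_shards_per_node 12 1 8   # all 8 nodes up: prints 3
min_shards_per_node 12 1 6   # two nodes down: prints 4
```

With 24 shards per index, a limit of 4 forces an index onto at least 6 nodes while still leaving room for two nodes to be offline; a limit of 3 would leave shards unassigned as soon as a single node went down.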


curl -XPUT http://xxxxx:9200/_template/logstash_cdn -d @xxxxx_index_template_cdn.txt

Miscellaneous

Viewing shards
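
The template file itself is not shown in this post. As a rough idea of the part that matters here, the following sketch prints a minimal ES 2.x template body carrying only the shard limit. The function name and the index pattern passed to it are assumptions for illustration, not the contents of the real file.

```shell
# Print a minimal ES 2.x index template body that caps shards per node.
# $1: index pattern the template applies to (hypothetical)
# $2: maximum number of shards of one index per node
template_body() {
  cat <<EOF
{
  "template": "$1",
  "settings": {
    "index.routing.allocation.total_shards_per_node": $2
  }
}
EOF
}

template_body 'logstash_cdn-*' 4
```

Keep in mind that a PUT to `_template/<name>` replaces the whole template, so a real file would also need to carry the existing mappings and settings. Template changes only affect indices created afterwards, which fits the goal of steering new indices onto the new nodes.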

curl -s -XGET 'http://xxxx:9200/_cat/shards/logstash_jdev-2015.10.18' | sort -k 8
