【Question title】: Aerospike Cluster Nodes intermittently going down and coming back up
【Posted】: 2020-09-15 21:08:19
【Question】:

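I have an Aerospike cluster with 15 nodes. The cluster performs fairly well under a normal load of 10k TPS. Today I ran some tests at a higher load, raising it to around 130k-150k TPS. I observed that some nodes intermittently went down and came back up automatically within a few seconds. Because these nodes go down, we get heartbeat timeouts and, as a result, read timeouts.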

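Configuration of one cluster node: 8 cores, 120 GB RAM. I store the data in memory. All nodes have plenty of free space: of the 1.2 TB (15*120) total cluster space, only 275 GB is used. The network is not flaky at all either. All these machines are in a data center and have high-bandwidth connections.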

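Some observations from monitoring AMC:

  1. Some nodes (around 5-6) show as inactive for a few seconds.
  2. The nodes that go down show a very large number of client connections. For example, all the other nodes have 6000-7000 client connections, while one of these nodes has an abnormal 25000 client connections.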
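
A node like the one in the second observation could be flagged automatically by comparing each node's client-connection count against the cluster median. The following is a minimal sketch, not part of the original setup: the node names and the factor of 2 are hypothetical, and the counts only mirror the ranges reported above.

```python
from statistics import median


def find_connection_outliers(conns_by_node, factor=2.0):
    """Return nodes whose client-connection count exceeds factor x the cluster median."""
    mid = median(conns_by_node.values())
    return {node: n for node, n in conns_by_node.items() if n > factor * mid}


# Hypothetical node names; the counts mirror the ranges seen in AMC.
sample = {"node-a": 6000, "node-b": 6500, "node-c": 7000, "node-d": 25000}
outliers = find_connection_outliers(sample)
print(outliers)
```

The median is used instead of the mean so that a single overloaded node does not pull the baseline up with it.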

Some error logs from the cluster nodes:

Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:43 GMT: WARNING (socket): (socket.c:808) (repeated:5) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting
Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:808) (repeated:3) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
Sep 15 2020 17:01:03 GMT: WARNING (hb): (hb.c:4864) (repeated:1) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:808) (repeated:1) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:740) (repeated:2) Timeout while connecting
Sep 15 2020 17:01:13 GMT: WARNING (hb): (hb.c:4864) (repeated:2) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:808) (repeated:2) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:740) Timeout while connecting
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 10.33.54.144:2057
Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}
Sep 15 2020 17:02:53 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
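
To see which peers are hardest to reach, warnings like these can be tallied per target address. This is a rough sketch under stated assumptions: the regular expression is written against the line format shown above only, the function name is made up for illustration, and the `(repeated:N)` counter is ignored, so each log line counts once.

```python
import re
from collections import Counter

# Matches the heartbeat warning format shown in the excerpt above.
HB_FAIL = re.compile(
    r"could not create heartbeat connection to node \{(\d+\.\d+\.\d+\.\d+):(\d+)\}"
)


def failed_heartbeat_peers(log_lines):
    """Count heartbeat-connection warnings per peer IP address."""
    counts = Counter()
    for line in log_lines:
        match = HB_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt above.
sample_log = [
    "Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}",
    "Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting",
    "Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}",
]
counts = failed_heartbeat_peers(sample_log)
print(counts.most_common())
```

Since the function accepts any iterable of lines, an open file handle on the server log can be passed in place of `sample_log`.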

We also see some of these error logs on the nodes that are going down:

Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9280f220a0102 on fd 4155 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b676220a0102 on fd 4149 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9fbd6200a0102 on fd 42 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb96d3d220a0102 on fd 4444 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb99036210a0102 on fd 4278 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9f102220a0102 on fd 4143 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb91822210a0102 on fd 4515 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9e5ff200a0102 on fd 4173 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb93f65200a0102 on fd 38 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9132f220a0102 on fd 4414 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb939be210a0102 on fd 4567 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b19a220a0102 on fd 4165 failed : Broken pipe

The aerospike.conf file is attached here:

service {
    user root
    group root
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 50000
    migrate-threads 1
    pidfile /var/run/aerospike/asd.pid
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
        context migrate debug
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 2057

        mesh-seed-address-port 10.34.154.177 2057
        mesh-seed-address-port 10.34.15.40 2057
        mesh-seed-address-port 10.32.255.229 2057
        mesh-seed-address-port 10.33.54.144 2057
        mesh-seed-address-port 10.32.190.157 2057
        mesh-seed-address-port 10.32.101.63 2057
        mesh-seed-address-port 10.34.2.241 2057
        mesh-seed-address-port 10.32.214.251 2057
        mesh-seed-address-port 10.34.30.114 2057
        mesh-seed-address-port 10.33.162.134 2057
        mesh-seed-address-port 10.33.190.57 2057
        mesh-seed-address-port 10.34.61.109 2057
        mesh-seed-address-port 10.34.47.19 2057
        mesh-seed-address-port 10.33.34.24 2057
        mesh-seed-address-port 10.34.118.182 2057
        
        interval 150
        timeout 20
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace PS1 {
    replication-factor 2
    memory-size 70G
    single-bin false
    data-in-index false
    storage-engine memory   
    stop-writes-pct 90
    high-water-memory-pct 75    
}

namespace LS1 {
    replication-factor 2
    memory-size 30G
    single-bin false
    data-in-index false
    storage-engine memory   
    stop-writes-pct 90
    high-water-memory-pct 75
}

Is there any possible explanation for this?

【Comments】:

  • 1. What is your server version? Check with $ asd --version. 2. Try $ asadm -e "show config diff" to see whether some nodes show configuration differences in the heartbeat sub-context.

Tags: aerospike


【Answer 1】:

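It seems the nodes have network connectivity problems at such high throughput. This can have different root causes, from simple network-related bottlenecks (bandwidth, packets per second) to something on the node itself that gets in the way of proper network interfacing (a surge in soft interrupts, improper network queue distribution, CPU thrashing). This would prevent heartbeat connections and messages from going through, causing the node to leave the cluster until it recovers. If you are running in a cloud or virtualized environment, some hosts may have noisier neighbors than others, and so on.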
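
To put a number on how long heartbeats must be blocked before a node is dropped, the heartbeat settings from the posted configuration can be multiplied out. This sketch assumes the usual reading of those two parameters: `interval` is the time in milliseconds between heartbeats, and `timeout` is the number of missed intervals after which a peer is declared gone. The function name is made up for illustration.

```python
def heartbeat_window_ms(interval_ms, timeout_count):
    """Time without heartbeats after which a peer is considered to have left."""
    return interval_ms * timeout_count


# Values from the heartbeat stanza of the posted aerospike.conf.
window = heartbeat_window_ms(interval_ms=150, timeout_count=20)
print(f"peer dropped after about {window / 1000:.1f} s without heartbeats")
```

With `interval 150` and `timeout 20` this gives a window of 3000 ms, so a stall of a few seconds on a node's network path would be enough for its peers to drop it.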

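The increase in the number of connections is a symptom: any slowdown on a node causes the clients to compensate by increasing throughput (which increases the number of connections and can also lead to a downward spiral effect).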

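Finally, a single node leaving or joining the cluster should not have much impact on read transactions. Check your policies and make sure socketTimeout, totalTimeout, maxRetries, and so on are set correctly so that a read can quickly retry against a different replica.

This article may help with that last point: https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852/3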
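
As a rough illustration of how those settings interact, the sketch below checks whether a read policy leaves room for all of its retries inside the total timeout. The numbers are illustrative assumptions, not recommendations, and the helper function is hypothetical. The dictionary keys follow the snake_case spelling used by the Aerospike Python client for socketTimeout, totalTimeout and maxRetries; the client itself is not imported, so the sketch runs standalone.

```python
def worst_case_read_ms(socket_timeout, max_retries, sleep_between_retries):
    """Upper bound on time spent if every attempt hits the socket timeout."""
    attempts = 1 + max_retries
    return attempts * socket_timeout + max_retries * sleep_between_retries


# Illustrative values in milliseconds; tune them against your own latency targets.
read_policy = {
    "socket_timeout": 50,        # per-attempt limit
    "total_timeout": 200,        # limit for the whole transaction
    "max_retries": 2,            # retries after the first attempt
    "sleep_between_retries": 0,  # retry immediately
}

worst_case = worst_case_read_ms(
    read_policy["socket_timeout"],
    read_policy["max_retries"],
    read_policy["sleep_between_retries"],
)
fits = worst_case <= read_policy["total_timeout"]
print(worst_case, fits)
```

If the worst case exceeds the total timeout, the last retries never get a chance to run, which defeats the purpose of retrying against another replica. With the Aerospike Python client, a dictionary like this could be passed as the `policy` argument of `client.get`.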
