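【Posted】: 2021-09-02 13:21:28
【Question】:
After a Reaper repair run failed on an 18-node Cassandra cluster, I ran a full repair on every node to clear up the failed repairs. After the full repairs, Reaper ran successfully, but a few days later the Reaper run failed again. The following errors appear in system.log: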
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name            Active   Pending
ReadStage                 0         0
Repair#18                 3        90
ValidationExecutor        3         3
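For anyone checking this across all 18 nodes, the pools with a backlog can be picked out of that output with a small script. This is only a minimal sketch: the function name `pools_with_pending` and the `SAMPLE` text are made up for illustration, and it assumes the trimmed three-column layout pasted above (real nodetool tpstats output has additional columns after Pending, which this parser simply ignores).

```python
def pools_with_pending(tpstats_text):
    """Return {pool_name: (active, pending)} for pools whose pending count is > 0.

    Assumes each data row looks like: <pool name> <active> <pending> [...]
    """
    result = {}
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that does not have two numeric columns.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        active, pending = int(fields[1]), int(fields[2])
        if pending > 0:
            result[fields[0]] = (active, pending)
    return result


# Hypothetical input: the rows quoted in this question.
SAMPLE = """Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3"""

if __name__ == "__main__":
    for name, (active, pending) in pools_with_pending(SAMPLE).items():
        print(f"{name}: active={active} pending={pending}")
```

On the rows quoted here it reports Repair#18 and ValidationExecutor, which are the two pools that still have queued work.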
nodetool compactionstats also shows 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
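My question is: why does Reaper still fail even after a full repair, and what is the root cause of the pending repairs?
PS: The Reaper version is 2.2.3. I don't know whether this is a Reaper bug.
【Discussion】: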