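【Posted】:2018-12-27 17:53:14
【Question】:
I am running the reassign-partitions tool to spread the partitions across 5 brokers instead of the 3 that originally held the topic. Kafka 2.1, on Docker.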
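For reference, the tool consumes a JSON plan listing the desired replicas for each partition. The following is a minimal sketch of how such a plan could be generated, not the plan actually used here: `build_plan` is a hypothetical helper (not part of Kafka), the round-robin placement is only one possible strategy, and the broker IDs 1, 2, 3, 4, 6 are an assumption based on the IDs that show up in the logs.

```python
import json


def build_plan(topic, num_partitions, brokers, replication_factor):
    """Build a reassignment plan with simple round-robin placement.

    Hypothetical helper for illustration only; it is not a Kafka API.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Pick consecutive brokers, wrapping around the broker list.
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Assumed values: 128 partitions, replication factor 3, five broker IDs.
plan = build_plan("events", 128, [1, 2, 3, 4, 6], 3)
print(json.dumps(plan["partitions"][:2], indent=2))
```

The resulting JSON would be written to a file and passed to `kafka-reassign-partitions.sh` with `--reassignment-json-file` and `--execute`; running the same file with `--verify` reports whether each partition has finished moving.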
At some point one node started behaving erratically. The other (healthy) nodes began logging these messages:
[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
After 15 minutes, the healthy servers showed the following message:
[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)
Later on we could see a lot of these messages:
[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)
And another kind, even more frequent:
[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)
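The reassigned topic had 128 partitions across 3 servers. All in all, each server holds around 2000 partitions.
The reassignment has now been stuck for 6 hours, with 41% of the partitions reported as under-replicated. The topic had replication factor 3, yet it now shows 5 replicas. I suppose this is part of how the rebalancing works underneath: it first adds the new replicas and then drops the ones that are no longer wanted?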
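If that guess is right, the replica list during the move would be the union of the old and new assignments, which would explain seeing 5 replicas on a topic created with 3. A minimal Python sketch of that reasoning follows. The original assignment `[1, 2, 3]` is an assumption for illustration; the target `1,4,6` is taken from the events-53 warning above, and the `replicas`/`isr` pair is copied from the LeaderAndIsrRequest the controller logged for events-125. Both function names are hypothetical, not Kafka APIs.

```python
def in_flight_replicas(original, target):
    """Replica set while a partition is being moved: old and new combined."""
    return sorted(set(original) | set(target))


def is_under_replicated(replicas, isr):
    """True when at least one assigned replica is missing from the ISR."""
    return not set(replicas) <= set(isr)


# Assumed original assignment [1, 2, 3]; target 1,4,6 as seen for events-53.
moving = in_flight_replicas(original=[1, 2, 3], target=[1, 4, 6])
print(moving)  # [1, 2, 3, 4, 6]

# Values from the controller log for events-125: broker 6 is not in the ISR.
print(is_under_replicated(replicas=[1, 6, 2, 3], isr=[3, 1, 2]))  # True
```

Under this model every partition that is still in flight counts as under-replicated until its new replicas catch up, so a broker that stops serving fetches would leave that percentage frozen.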
Node 3, however, shows the following message:
[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)
So something is wrong with node 3. How can I find out what happened to it?
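One way to confirm that the errors really concentrate on a single leader is to tally the ReplicaFetcher messages per `leaderId` across the broker logs. The sketch below is only an illustration under assumptions: the regular expression is written against the log line layout shown in this post, `tally_fetch_errors` is a hypothetical helper, and the sample lines are copied from the excerpts above.

```python
import re
from collections import Counter

# Matches lines such as:
# [ts] LEVEL [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] message
FETCHER_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+) "
    r"\[ReplicaFetcher replicaId=(?P<replica>\d+), "
    r"leaderId=(?P<leader>\d+), fetcherId=\d+\] (?P<msg>.*)$"
)


def tally_fetch_errors(lines):
    """Count ReplicaFetcher messages mentioning an error, keyed by leader ID."""
    counts = Counter()
    for line in lines:
        match = FETCHER_RE.match(line)
        if match and "error" in match.group("msg").lower():
            counts[int(match.group("leader"))] += 1
    return counts


sample = [
    "[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] "
    "Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: "
    "java.io.IOException: Connection to 3 was disconnected before the response was read. "
    "(org.apache.kafka.clients.FetchSessionHandler)",
    "[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] "
    "Retrying leaderEpoch request for partition events-111 as the leader reported an error: "
    "UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)",
    "[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, "
    "the partition events-116 hasn't been created. (kafka.server.ReplicaManager)",
]

counts = tally_fetch_errors(sample)
print(dict(counts))
```

Running this over the full `server.log` of each healthy broker would show whether every failing fetcher points at leader 3, which would narrow the search to that broker's own logs around the time of the first disconnect.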
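This has happened twice, each time while reassigning partitions in one of two topics with the same partition size. In the previous case, we brought up another machine as a new broker with the same ID (restarting the container did not help) and the cluster recovered. But how can we prevent this from happening?
What could be the root cause?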
【Discussion】:
-
Did you solve it?
-
Hi @NehaM, I have added an answer. HTH
Tags: apache-kafka