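【Posted】:2019-01-08 07:40:40
【Question】:
I am using Kafka 0.11.0.3.
I have one Kafka broker and a remote Zookeeper cluster. I started the Kafka server, it registered its id in Zookeeper successfully, and I can even list topics with the kafka-topics.sh command. The problem is that I repeatedly see the following lines in the Kafka log: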
[2019-01-08 10:51:09,138] WARN Attempting to send response via channel for which there is no open connection, connection id 192.168.0.201:9092-192.168.0.201:58292 (kafka.network.Processor)
[2019-01-08 10:51:09,198] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,226] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,306] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,327] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,382] WARN Attempting to send response via channel for which there is no open connection, connection id 192.168.0.201:9092-192.168.0.201:58296 (kafka.network.Processor)
[2019-01-08 10:51:09,408] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,446] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,559] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,602] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
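To put a number on how often the broker recreates the /controller znode, here is a minimal Python sketch that measures the gaps between successive "Creating /controller" lines. The sample lines are copied from the log above; `controller_creation_gaps_ms` is a hypothetical helper name, and the timestamp format is assumed to match the lines shown.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the broker log above.
LOG = """\
[2019-01-08 10:51:09,198] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,226] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,306] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,408] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2019-01-08 10:51:09,559] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
"""

# Matches the leading "[YYYY-MM-DD HH:MM:SS,mmm]" timestamp.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def controller_creation_gaps_ms(text):
    """Return the gaps (in ms) between successive 'Creating /controller' lines."""
    times = []
    for line in text.splitlines():
        if "Creating /controller" not in line:
            continue
        match = STAMP.match(line)
        if match:
            # %f accepts the 3-digit millisecond field and scales it correctly.
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(times, times[1:])]


print(controller_creation_gaps_ms(LOG))
```

On the lines shown, the znode is created again roughly every 100 to 150 ms, which suggests it does not survive between attempts.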
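The broker tries to connect to port 58292 on the same machine (the one the Kafka server is running on), but the connection cannot be established. I also checked the controller directory in Zookeeper, and it is empty. Stranger still, when I list the TCP connections on the Kafka server node, I see this many TIME_WAIT connections: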
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 192.168.0.201:55572 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56290 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55442 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55512 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56074 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56286 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55460 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55904 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55488 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56308 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55502 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56326 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55960 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55930 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56300 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56004 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55470 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55474 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55432 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55412 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56304 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55858 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55860 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56324 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55388 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56168 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55898 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55820 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55676 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56202 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55756 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56278 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55658 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55628 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56038 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56108 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55988 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55894 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55428 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55424 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56128 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56146 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55884 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56280 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55798 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56120 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55888 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55708 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55696 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56298 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55646 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56150 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55376 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55980 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55556 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56208 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55752 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55982 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55864 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55760 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56056 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56002 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55536 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55576 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55392 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55726 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55426 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55710 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56042 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56264 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55606 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55972 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56176 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55780 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56342 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55534 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55438 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56114 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56068 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55880 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56350 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55970 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55404 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55672 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55454 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55946 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56126 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55538 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56124 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55712 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56084 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55992 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56302 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55984 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55394 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55550 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56094 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55936 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55530 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55868 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56294 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:55876 192.168.0.201:9092 TIME_WAIT -
tcp 0 31 192.168.0.201:57552 192.168.0.204:2181 ESTABLISHED 1015/java
The only successfully established connection is the one to Zookeeper (the last line). I also checked port 9092 from a remote node, and it is open:
Starting Nmap 7.01 ( https://nmap.org ) at 2019-01-08 11:32 +0330
Nmap scan report for (192.168.0.201)
Host is up (0.0027s latency).
PORT STATE SERVICE
9092/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.08 seconds
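A few points:
- The broker ran fine for about 2 months, then the error appeared suddenly.
- The Zookeeper cluster works fine, since other components such as HDFS use it without errors.
- The OS is CentOS 7 and no firewall is enabled.
Here is the Kafka server configuration: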
broker.id=100
listeners=PLAINTEXT://192.168.0.201:9092
num.partitions=24
delete.topic.enable=true
log.dirs=/data/esb
zookeeper.connect=co1:2181,co2:2181
log.retention.hours=168
zookeeper.session.timeout.ms=40000
What could be the cause of the TIME_WAIT connections?
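This does not answer the question, but the long connection listing is easier to reason about when it is summarized by peer and state. Below is a minimal Python sketch, assuming the listing has the seven-column layout shown above (proto, Recv-Q, Send-Q, local address, foreign address, state, PID/program). The sample lines are copied from the listing; `tally_states` is a hypothetical helper name.

```python
from collections import Counter

# A few lines copied from the connection listing above.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 192.168.0.201:55572 192.168.0.201:9092 TIME_WAIT -
tcp 0 0 192.168.0.201:56290 192.168.0.201:9092 TIME_WAIT -
tcp 0 31 192.168.0.201:57552 192.168.0.204:2181 ESTABLISHED 1015/java
"""


def tally_states(netstat_text):
    """Count connections per (foreign address, state) pair."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a TCP socket row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        counts[(foreign, state)] += 1
    return counts


for (foreign, state), count in tally_states(SAMPLE).most_common():
    print(f"{count:4d}  {state:<12} -> {foreign}")
```

In TCP, the endpoint that closes a connection first is the one that enters TIME_WAIT. Every TIME_WAIT row here has an ephemeral local port and 9092 as the foreign port, which suggests the connecting side closed these connections, though the listing alone does not say which process that was.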
【Comments】:
-
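Not sure how many connections are established between the broker and Zookeeper, or how quickly. I think it is one, and the others are waiting for the final ACK to close the connection. Maybe network flooding or resource starvation? You can gather more information by taking the log files under /var/log/nmon and feeding them to the nmon visualizer tool (nmonvisualizer.github.io/nmonvisualizer); also check the Kafka/Zookeeper GC logs for slowdowns caused by waiting on resources.
-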
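I think that when, for example, 192.168.0.201:55388 establishes a connection to 192.168.0.201:9092, port 9092 should also respond and establish a connection back to 55388 (I observe this on a working Kafka broker), but that does not happen, and the connection from 55388 to 9092 ends up in TIME_WAIT.
-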
By the way, you should use at least 3 Zookeepers. Definitely not two.
-
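@cricket_007 Using a 3-node Zookeeper ensemble, or an odd number of Zookeeper nodes in general, is a recommendation, not a requirement! The reason is that a Zookeeper cluster is fault tolerant as long as (n/2) + 1 nodes of the cluster are up and working.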
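The (n/2) + 1 rule can be worked through for small ensembles. This is a minimal Python sketch, assuming the division is integer division (a strict majority); `majority` and `tolerated_failures` are hypothetical helper names, not part of any Zookeeper API.

```python
def majority(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can be down while a majority is still up."""
    return n - majority(n)


print("nodes  majority  tolerated failures")
for n in range(1, 6):
    print(f"{n:5d}  {majority(n):8d}  {tolerated_failures(n):18d}")
```

Under this rule a 2-node ensemble needs both nodes up, so it tolerates no failures, while a 3-node ensemble tolerates one. A 4-node ensemble also tolerates only one, which is why odd sizes are usually recommended.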
-
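Well, you had better make sure you can't lose either of them. An unexpected power failure, or even a network outage. That's all I'm saying.
Tags: tcp apache-kafka centos7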