【问题标题】:Storm worker died after running for 2 hours风暴工人跑了2小时后死亡
【发布时间】:2015-11-27 02:21:19
【问题描述】:

我编写了一个拓扑来从 kKafka 读取主题,然后进行一些聚合,然后将结果存储在数据库中。拓扑在几个小时内运行良好,但随后工作人员死亡,最终主管也死亡。每次运行几个小时后都会出现此问题。

我在 3 个节点上运行 Storm 0.9.5(1 个用于 nimbus,2 个用于工作人员)。

这是我在一个工人的日志中得到的错误:

2015-08-12T04:10:38.395+0000 b.s.m.n.Client [ERROR] connection attempt 101 to Netty-Client-/10.28.18.213:6700 failed: java.lang.RuntimeException: Returned channel was actually not established
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.404+0000 STDIO [ERROR] Aug 12, 2015 4:10:38 AM org.apache.storm.guava.util.concurrent.ExecutionList executeListener
SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@632ef20f with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@1f15e9a8
java.lang.RuntimeException: Failed to connect to Netty-Client-/10.28.18.213:6700
        at backtype.storm.messaging.netty.Client.connect(Client.java:308)
        at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
        at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
        at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
        at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
        at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
        at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
        at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:380)
        at java.util.concurrent.FutureTask.set(FutureTask.java:229)
        at java.util.concurrent.FutureTask.run(FutureTask.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-/10.28.18.213:6700 after 102 failed attempts
        at backtype.storm.messaging.netty.Client.connect(Client.java:303)
        ... 19 more

这是我对每个工作节点的配置:

storm.zookeeper.servers:
- "10.28.19.230"
- "10.28.19.224"
- "10.28.19.223"
storm.zookeeper.port: 2181
nimbus.host: "10.28.18.211"
storm.local.dir: "/mnt/storm/storm-data"
storm.local.hostname: "10.28.18.213"
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 300
storm.messaging.netty.max_wait_ms: 4000
storm.messaging.netty.min_wait_ms: 100
supervisor.slots.ports:
- 6700

supervisor.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346
#worker.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=1%ID%"
#supervisor.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346"
worker.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID% -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx10240m -Xms10240m -XX:MaxNewSize=6144m

【问题讨论】:

  • 我目前收到与2015-08-06 16:20:56 b.s.m.n.Client [INFO] failed to send requests to <host name removed>/<ip address removed>:6700 非常相似的错误。到目前为止,我所了解到的只是 netty 用于 Storm 中的工作间通信,我认为当拓扑负载高时会发生此错误......如果我找到更多信息,我会告诉你。
  • 您好,请看下面我的回答,看看是否对您有帮助。

标签: apache-storm


【解决方案1】:

我想我找到了这个帖子的根本原因:https://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3C20140214170209.GC55319@animetrics.com%3E

netty 客户端错误只是症状,但“根本原因是工作人员的 GC 日志记录已打开而没有 指定输出文件。结果,GC 日志没有被重定向到 logback,而是进入标准输出。最终缓冲区被填满,JVM 将挂起(从而停止心跳)并被杀死。 Worker 的持续时间取决于内存压力和分配的堆大小(显然,事后看来,GC 越多,GC 日志记录越多,缓冲区填充得越快)。"

希望对大家有所帮助。

【讨论】:

  • 感谢您用答案更新您的问题,因此如果其他人遇到此问题,他们知道出了什么问题。
猜你喜欢
  • 2017-01-24
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2018-02-22
  • 2019-09-17
  • 1970-01-01
相关资源
最近更新 更多