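A real-time Flink job of ours was upgraded to a new version last week. When I checked on it this morning, I found that it had restarted at around 4 a.m. The exception log shown in the Flink UI was as follows.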

Analyzing the Cause of a Flink Job Restart

Exception log

2020-08-10 04:07:23
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
    at java.lang.Thread.run(Thread.java:748)

Key information

2020-08-10 04:07:23

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
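
When a job has many task managers, it can help to pull the remote address out of this message automatically instead of reading it by eye. The following is only a rough sketch: the function name `remote_task_manager` and the regular expression are my own, written against the exact message wording shown above, and other Flink versions may phrase the message differently.

```python
import re
from typing import Optional, Tuple

# Matches the quoted address in the RemoteTransportException message above,
# e.g. '/9.150.12.175:39365' (an optional host name may precede the slash).
_REMOTE_TM = re.compile(
    r"remote task manager '[^/']*/(?P<host>[^:']+):(?P<port>\d+)'"
)


def remote_task_manager(message: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) of the remote task manager named in the message.

    Returns None when the message does not contain such an address.
    """
    match = _REMOTE_TM.search(message)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


if __name__ == "__main__":
    line = (
        "org.apache.flink.runtime.io.network.netty.exception."
        "RemoteTransportException: Connection unexpectedly closed by remote "
        "task manager '/9.150.12.175:39365'. This might indicate that the "
        "remote task manager was lost."
    )
    print(remote_task_manager(line))
```

The host returned here is the machine to look at next; the port only identifies the data connection of the lost task manager.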

 

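My first guess was that something had gone wrong on the machine 9.150.12.175.

A look at the YARN resource management UI further pointed to a machine-level problem.

The usual causes are insufficient memory, insufficient disk space, or some other fault on the host.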

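I logged in to the problem machine and listed the Java processes with jps. Only the YARN NodeManager was still running, and its start time was long ago, so it had not been restarted. All the other tasks on the machine had been killed.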
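
The same check can be scripted if there are several nodes to inspect. This is a minimal sketch that assumes the default `jps` output of one `<pid> <name>` pair per line. The helper names and the sample PIDs are made up for illustration, and `YarnTaskExecutorRunner` is my assumption for how a Flink task manager container shows up in `jps` on YARN; the name may differ between Flink versions.

```python
from typing import Dict, Iterable, List


def parse_jps(output: str) -> Dict[int, str]:
    """Map each PID in `jps` output to the name printed next to it."""
    processes: Dict[int, str] = {}
    for line in output.splitlines():
        parts = line.split(maxsplit=1)
        if not parts or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not "<pid> ..."
        # jps may print a bare PID when it cannot determine the main class.
        name = parts[1].strip() if len(parts) > 1 else ""
        processes[int(parts[0])] = name
    return processes


def missing_processes(output: str, expected: Iterable[str]) -> List[str]:
    """Return the expected process names that do not appear in the output."""
    running = set(parse_jps(output).values())
    return [name for name in expected if name not in running]


if __name__ == "__main__":
    # Hypothetical output resembling what I saw: only the NodeManager is left.
    sample = "12345 NodeManager\n23456 Jps\n"
    print(missing_processes(sample, ["NodeManager", "YarnTaskExecutorRunner"]))
```

On a real node the text would come from running `jps` itself, for example through `subprocess.run(["jps"], capture_output=True, text=True).stdout`.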

The YARN NodeManager log reported that disk usage had exceeded 90%.

I then checked the current disk usage on the machine.
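
For reference, the 90% figure matches what I understand to be the default of the NodeManager setting `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`; treat that as an assumption and confirm it against your own cluster configuration. The sketch below shows one way to compare a mount point with such a threshold from Python. The function names are mine, and the path `/` is only a placeholder for whichever disk holds the NodeManager local and log directories.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent.

    Computed as used / total, so it can differ slightly from the Use% column
    of `df`, which leaves blocks reserved for root out of the denominator.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(used_percent: float, threshold_percent: float = 90.0) -> bool:
    """True when usage is strictly above the threshold (assumed 90% here)."""
    return used_percent > threshold_percent


if __name__ == "__main__":
    path = "/"  # placeholder: use the mount point of the YARN local/log dirs
    used = disk_usage_percent(path)
    state = "over" if over_threshold(used) else "under"
    print(f"{path}: {used:.1f}% used, {state} the 90% threshold")
```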

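The result matched the YARN log: disk usage was above the threshold configured in YARN. Looking through the log directories turned up large log files left over from earlier runs. After I cleaned out the expired logs and restarted, tasks were scheduled onto the problem machine again and everything returned to normal. I also asked our operations colleagues to add disk monitoring to every node in the cluster, with an alert when usage reaches 85%.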
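
Hunting for the stale log files by hand is tedious, so here is a small sketch of how the candidates could be listed before deleting anything. Everything in it is illustrative: the function name, the 1 GiB size cutoff, the 7-day age cutoff and the directory `/var/log/hadoop-yarn` are assumptions of mine, not values from our cluster. It only prints candidates and never removes files.

```python
import os
import time
from typing import List, Optional, Tuple


def large_old_files(
    root: str,
    min_bytes: int,
    min_age_days: float,
    now: Optional[float] = None,
) -> List[Tuple[int, str]]:
    """List (size, path) for files under `root` that are big and stale.

    A file qualifies when it is at least `min_bytes` large and was last
    modified at least `min_age_days` ago. Results are sorted largest first.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    found: List[Tuple[int, str]] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_size >= min_bytes and info.st_mtime <= cutoff:
                found.append((info.st_size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Hypothetical log directory and cutoffs; adjust to the real layout.
    for size, path in large_old_files("/var/log/hadoop-yarn", 1 << 30, 7):
        print(f"{size / (1 << 30):.1f} GiB  {path}")
```

Reviewing the list before deleting is deliberate: a large file that is still being written to by a running container should be rotated rather than removed.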
