【问题标题】:Spring-XD: Jobs automatic undeployment on zookeeper time-out in xd-singlenode modeSpring-XD:在xd-singlenode模式下zookeeper超时时作业自动取消部署
【发布时间】:2015-01-27 12:46:42
【问题描述】:

我在单节点模式下运行 spring-xd。我的工作有可能运行超过 10 分钟的工作模块。当我启动作业时,看到以下超时,然后取消部署我的作业和模块。 我需要更改任何 Zookeeper 设置吗?请指教-

第一个错误

2015-01-26 20:30:22,514 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn  - Client session timed out, have not heard from server in 40465ms for sessionid 0x14b28fc8fa80000, closing socket connection and attempting reconnect
2015-01-26 20:30:22,516 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] WARN org.apache.zookeeper.server.NIOServerCnxn  - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14b28fc8fa80000, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Unknown Source)
2015-01-26 20:30:22,518 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection for client /127.0.0.1:48599 which had sessionid 0x14b28fc8fa80000
2015-01-26 20:30:22,615 [main-EventThread] [] INFO org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
2015-01-26 20:30:22,616 [DeploymentSupervisorCacheListener-0] [] INFO org.springframework.xd.dirt.server.InitialDeploymentListener  - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:22,616 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor  - Admin admin:default,admin,singlenode:9393 connection suspended
2015-01-26 20:30:22,628 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar  - ZooKeeper connection suspended: 9adc3b5e-1b19-4d64-9b52-a5643dc42acb
2015-01-26 20:30:22,649 [LeaderSelector-0] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor  - Leadership canceled due to thread interrupt
2015-01-26 20:30:22,650 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:22,678 [DeploymentSupervisorCacheListener-0] [] INFO org.springframework.xd.dirt.server.InitialDeploymentListener  - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:23,712 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn  - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-01-26 20:30:23,713 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn  - Socket connection established to localhost/127.0.0.1:2181, initiating session
2015-01-26 20:30:23,713 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.NIOServerCnxnFactory  - Accepted socket connection from /127.0.0.1:49181
2015-01-26 20:30:23,715 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.ZooKeeperServer  - Client attempting to renew session 0x14b28fc8fa80000 at /127.0.0.1:49181
2015-01-26 20:30:23,715 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.ZooKeeperServer  - Established session 0x14b28fc8fa80000 with negotiated timeout 60000 for client /127.0.0.1:49181
2015-01-26 20:30:23,716 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn  - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14b28fc8fa80000, negotiated timeout = 60000
2015-01-26 20:30:59,587 [main-EventThread] [] ERROR org.apache.curator.ConnectionState  - Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (36877)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
        at org.springframework.xd.dirt.server.DeploymentListener$JobModuleWatcher.process(DeploymentListener.java:527)
        at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

随后取消部署工作

2015-01-26 20:31:35,462 [LeaderSelector-1] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor  - Leader Admin admin:default,admin,singlenode:9393 is watching for stream/job deployment requests.
2015-01-26 20:31:35,463 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar  - Waiting for supervisor to clean up prior deployments (elapsed time 0 seconds)...
2015-01-26 20:31:35,463 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar  - Waiting for supervisor to clean up prior deployments (elapsed time 0 seconds)...
2015-01-26 20:31:35,478 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c1_Job.job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,478 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Undeploying module [ModuleDescriptor@593edefc moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c1_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c1', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,516 [main-EventThread] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Undeploying module [ModuleDescriptor@4a83d881 moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c2_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['fetchSize' -> '100000', 'table' -> 'counts', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,562 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c2_job.job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,562 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Undeploying module [ModuleDescriptor@6d07b3ed moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c2_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c2', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,577 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c3_job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,578 [main-EventThread] [] INFO org.springframework.xd.dirt.server.DeploymentListener  - Undeploying module [ModuleDescriptor@13097137 moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c3_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c3', 'mode' -> 'initial'], children = list[[empty]]]

【问题讨论】:

    标签: spring-batch spring-xd


    【解决方案1】:

    Spring XD 单节点模式在与 ZooKeeper“客户端”相同的 JVM 中运行嵌入式 ZooKeeper 服务器 - 即托管作业模块的应用程序上下文。如果心跳之间有 40 秒的间隔,这可能表明 JVM 正在经历繁重的 GC 和/或主机已用完物理内存并正在交换到磁盘。

    为了验证这个理论,我建议启用详细 gc。这可以通过修改xd-singlenode 脚本或在启动脚本之前设置环境变量export JAVA_OPTS=-verbose:gc 来完成。

    要修改会话和连接超时,您可以设置以下 JVM 系统属性:

    • curator-default-session-timeout - 会话超时
    • curator-default-connection-timeout - 连接超时

    【讨论】:

    • Full GC 是超时的原因。感谢您的帮助!
    • 我尝试按照上面的建议增加会话超时,但仍然在日志中看到以下条目 - org.apache.zookeeper.ZooKeeper - 启动客户端连接,connectString=localhost:2181 sessionTimeout=60000 以下是我的系统属性在 xd-singlenode 脚本中 - DEFAULT_JVM_OPTS="-verbose:gc -Dcurator-default-session-timeout=300000 -Dcurator-default-connection-timeout=300000"
    • 关于增加超时的任何建议。我在这里做错了吗?
    • 我快速尝试了一下,发现会话超时没有生效。当我发现发生了什么时,我会进一步研究并更新这篇文章。
    • 您运行的是哪个版本的 Spring XD?如果是 1.1.0(包括 RC1),您应该修改 servers.yml 中的 sessionTimeout 设置 - 请参阅 docs.spring.io/spring-xd/docs/1.1.0.RC1/reference/html/…。在以前的版本中,设置系统属性对我有用。
    【解决方案2】:

    由于这个问题让我们很头疼,我想在这里分享一下我们为解决这个问题所做的工作。我想你们中的一些人可能有类似的设置,我希望能为您节省一些时间来跟踪您的具体问题。

    尤其是当您看到诸如

    之类的日志时
    09:48:21,467 1.1.1.RELEASE  WARN SyncThread:0 persistence.FileTxnLog - fsync-ing the write ahead log in SyncThread:0 took 1123ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
    

    您可能和我们遇到同样的麻烦(只是有根据的猜测导致我们认为系统处于非常糟糕的状态)

    实际上,GC 暂停不仅仅是由我们的应用程序引起的。我们在单节点模式下运行 XD(在我们需要扩展之前作为入门选项)和服务器上的 MySQL 数据库。我们主要使用 XD 调度作业,这些作业是由外部事件触发进入系统(因此在某种程度上是不可预测的)。

    最后我们的问题是

    • 昂贵的数据库查询
    • MySQL JDBC 驱动程序将数据库结果加载到内存中
    • Java 垃圾回收

    GC 暂停是由于整体系统负载(数据库导致系统负载异常高)+ JVM 试图尽可能避免 OutOfMemory 情况的事实(有趣的附带事实 - 从那以后我很少看到 OutOfMemory 异常我们正在使用 Java 8 - 所以那里可能已经发生了变化......)+ XD 认为(并且是正确的)系统行为不稳定的事实,所以最好将其拆除,希望其他人会选择它是模块(在这种情况下不会发生)。

    追踪这件事并不容易。首先,我们必须通过在系统上触发大量事件来以某种方式给系统带来麻烦。我们使用 htop 并看到内存正常并且 CPU 正常 - 但它向我们表明系统异常繁忙。所以留下了 IO - 我们使用 iotop 来追踪 MySQL 是否是坏人。我们使用 MySQL 的“SHOW FULL PROCESSLIST”命令跟踪查询。接下来是限制 XD 容器可用的堆,以便内存转储(使用 jmap 创建)适合我们的开发人员机器并使用分析器(在我们的示例中为 YourKit)来跟踪内存问题。在我们的例子中,它是 MySQL - 我们通过切换到受影响查询的游标获取模式来解决问题,参见例如http://codewut.de/content/mysql-large-result-sets-and-outofmemory-related-headaches

    【讨论】:

      【解决方案3】:

      通过在xd-singlenode.sh 中添加以下系统属性,可以轻松修改 Zookeeper 超时设置。我在Spring-XD-1.1.1.Release 版本上验证了这一点。

      -Dzk.namespace=xd -Dzk.client.sessionTimeout=300000 
      -Dzk.client.connectionTimeout=150000 -DzkInitialRetryWait=1000 
      -Dzk.client.retryMaxAttempts=3 
      

      【讨论】:

        猜你喜欢
        • 2014-10-21
        • 1970-01-01
        • 2016-03-30
        • 2023-03-16
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多