【Question title】: Oozie job stuck at START action in PREP state
【Posted】: 2015-06-22 11:08:28
【Question】:

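I have an Oozie job that I start from a Java client. It is stuck at the START action: the job reports RUNNING, but the START node is in PREP state.

Why is this happening, and how can I fix it?

The Oozie workflow contains only a single java action. The cluster runs Hadoop 2.4.0 and Oozie 4.0.0.

Here is the workflow.xml: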

<workflow-app xmlns='uri:oozie:workflow:0.2' name='java-filecopy-wf'>
    <start to='java1'/>
    <action name='java1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>testingoozieclient.Client</main-class>
            <capture-output/>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>

Here is the Java client:

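    import java.util.Properties;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.OozieClientException;

    public class Client {
        public static void main(String[] args) {
            // args: <oozie-url> <app-path> <nameNode> <jobTracker>
            OozieClient oozieClient = new OozieClient(args[0]);

            // Job configuration: the workflow location plus the two
            // parameters referenced in workflow.xml.
            Properties conf = oozieClient.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, args[1]);
            conf.setProperty("nameNode", args[2]);
            conf.setProperty("jobTracker", args[3]);

            String jobId = null;
            try {
                // Submit and start the workflow; returns the new job ID.
                jobId = oozieClient.run(conf);
            } catch (OozieClientException ex) {
                Logger.getLogger(Client.class.getName())
                      .log(Level.SEVERE, "Workflow submission failed", ex);
            }
        }
    }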

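Since I tried several times, there are now 5 or 6 workflows with RUNNING status, but in the web UI I can see that all of them are stuck at the START node in PREP state.


After some of the submitted workflows were killed, I was able to start another one. This time the workflow moved from start to the java action, but it got stuck there in a similar way: the java action stays in PREP state.

This is what the log looks like: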

2015-06-22 17:54:37,366  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426  INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020
2015-06-22 17:54:38,501  WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020


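This morning I found the job in SUSPENDED state. The start node is OK, but the java node is in START_RETRY with the error JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

I should probably stress that Oozie runs on the same machine as the ResourceManager, so it is odd that it tries to start the workflow on that same machine and still reports a connection failure.

Here is the job log from Oozie: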

2015-06-22 17:54:37,366  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426  INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676  INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 17:54:38,501  WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640  INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 20:05:33,340  WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Error starting action [java1]. ErrorType [TRANSIENT], ErrorCode [  JA006], Message [  JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
org.apache.oozie.action.ActionExecutorException:   JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:412)
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:837)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:988)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:215)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:60)
    at org.apache.oozie.command.XCommand.call(XCommand.java:280)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1414)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy42.getDelegationToken(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:282)
    at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy43.getDelegationToken(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:452)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:166)
    at org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:220)
    at org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:400)
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1203)
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1200)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
    at org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1199)
    at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:377)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1031)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:786)
    ... 10 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    ... 33 more

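【Comments】:

  • Can you add the error logs from Oozie?
  • I will, but the strangest thing is that I am not getting any errors. When I look at the Job Log in the web UI it is completely empty. Where should I look for errors?
  • Go to the job log and get the job logs, or try the history server. I think Oozie submitted the job to Hadoop successfully. Please check the YARN logs in the history server.
  • I have edited my question, please take a look.

Tags: hadoop oozie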


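【Solution 1】:

Please check the ports in job.properties. This is usually a problem with the namenode and jobtracker ports. Make sure the jobtracker port in the job.properties file is correct.

【Comments】:

    【Solution 2】:

    The main reason an Oozie job gets stuck in PREP state (and eventually moves to START_MANUAL) is misconfigured Hadoop service ports: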

    nameNode=hdfs://localhost:9000
    jobTracker=10.71.71.15:8032
    

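    If you are running YARN, the jobTracker value must point to the ResourceManager address, whose default port is 8032.

    Also fix any other port problems, such as the jobhistoryserver's port (as mentioned in the Oozie error message).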
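One way to cross-check this is to compare the port configured in jobTracker with the endpoint named in the Oozie error. The following is only a minimal sketch: the class `EndpointCheck` and its helper methods are hypothetical names made up for illustration, the property values are the ones shown above, and the error text is the JA006 message from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Oozie or Hadoop.
public class EndpointCheck {
    // Matches the "to host:port failed on connection exception" part of a Hadoop IPC error.
    private static final Pattern REFUSED =
            Pattern.compile("to\\s+([\\w.-]+)\\s*:\\s*(\\d+)\\s+failed on connection exception");

    /** Returns "host:port" taken from the error message, or null if the pattern is not found. */
    static String endpointFromError(String message) {
        Matcher m = REFUSED.matcher(message);
        return m.find() ? m.group(1) + ":" + m.group(2) : null;
    }

    /** Extracts the port from "host:port" or "scheme://host:port"; returns -1 if there is none. */
    static int portOf(String value) {
        String v = value.trim();
        // Bare host:port values get a dummy scheme so that URI can parse them.
        URI uri = URI.create(v.contains("://") ? v : "dummy://" + v);
        return uri.getPort();
    }

    public static void main(String[] args) throws IOException {
        // Values from the answer above; replace them with your own job.properties content.
        Properties props = new Properties();
        props.load(new StringReader(
                "nameNode=hdfs://localhost:9000\njobTracker=10.71.71.15:8032\n"));

        // Error text as reported in the question.
        String error = "JA006: Call From master02.novalocal/192.168.111.52 to "
                + "master02.novalocal:8032 failed on connection exception: "
                + "java.net.ConnectException: Connection refused";

        String jobTracker = props.getProperty("jobTracker");
        String refused = endpointFromError(error);

        System.out.println("nameNode port:    " + portOf(props.getProperty("nameNode")));
        System.out.println("jobTracker port:  " + portOf(jobTracker));
        System.out.println("refused endpoint: " + refused);

        if (refused != null && portOf(refused) == portOf(jobTracker)) {
            System.out.println("Oozie is calling the configured port; "
                    + "check that the service really listens there.");
        } else {
            System.out.println("The refused endpoint differs from the configured jobTracker port.");
        }
    }
}
```

If both ports match and the connection is still refused, the configuration is at least consistent, and the next thing to check is whether the ResourceManager is actually listening on that port.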

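    【Comments】:

      【Solution 3】:

      I bet your map-reduce cluster has run out of slots. Check how many map slots are configured.

      Also check whether a service is actually up on port 8032. You can use the command sudo netstat -netulp | grep 8032. If it returns no output, the service is down. You can also use nmap or telnet to check connectivity.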
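The same connectivity check can be done from Java, for example from the machine that runs the Oozie client. This is a minimal sketch using only the standard library; the class name `PortProbe`, the default host `localhost`, and the 2-second timeout are illustrative choices, not anything Oozie requires.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Oozie or Hadoop.
public class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolvable host: treat as not reachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions: probe the ResourceManager default port on this machine.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8032;

        boolean up = isListening(host, port, 2000);
        System.out.println(host + ":" + port
                + (up ? " is accepting connections" : " is not reachable"));
    }
}
```

Run it as `java PortProbe master02.novalocal 8032`, using the host and port from the error message. A "not reachable" result matches the "Connection refused" in the log and means either the service is down or it listens on a different port or interface.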

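      【Comments】:

      • Thanks for your answer, but is that possible if I am not running any MR jobs? I am only starting a simple Java class that prints some text, just to check whether my client works.
      • Oozie runs its jobs on the Map-Reduce cluster, so first make sure your Map-Reduce cluster is up and running with enough map slots (at least two to run one job).
      • Try to sort out whether a service is running on port 8032. You can use the command sudo netstat -netulp | grep 8032. If it returns no output, the service is down. You can also use nmap or telnet to check connectivity.
      • Thanks for your answer. It seems a non-standard port (8050) is in use. I guess the problem was that too many jobs had been started (as you suggested), which caused the choking. Now I am getting a ClassNotFound exception, but the workflow seems to be running, so I am accepting your answer based on the answer itself and the last comment. :)
      【Solution 4】:

      Command: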

      netstat -ntpl | grep 8032
      

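      Use it to check whether the port is open. If it is not, start the service that should be listening on that port.

      【Comments】:

        【Solution 5】:

        In my case the problem was that YARN had been stopped. I started it and the workflow made progress.

        【Comments】: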
