【Question title】: Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
【Posted】: 2017-02-10 13:51:37
【Question】:

When I run the Spark 1.3.0 Pi example on YARN (Hadoop 2.6.0.2.2.0.0-2041) with the following script:

# Run on a YARN cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar \
1000

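it fails with the message "Application failed 2 times due to AM Container" (see below). As far as I understand, this launch script supplies everything needed to run a Spark application in YARN mode. What else has to be configured to run on YARN? What is missing? Are there other reasons the launch on YARN could fail?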

[test@etl-hdp-mgmt pi]$ ./run-pi.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath

15/04/01 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/01 12:59:58 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/01 12:59:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/01 12:59:58 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/01 12:59:58 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/01 12:59:58 INFO yarn.Client: Setting up container launch context for our AM
15/04/01 12:59:58 INFO yarn.Client: Preparing resources for our AM container
15/04/01 12:59:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/01 12:59:59 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-assembly-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:01 INFO yarn.Client: Uploading resource file:/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:02 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/01 13:00:03 INFO spark.SecurityManager: Changing view acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: Changing modify acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/01 13:00:03 INFO yarn.Client: Submitting application 10 to ResourceManager
15/04/01 13:00:03 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0010
15/04/01 13:00:04 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:04 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1427893202566
     final status: UNDEFINED
     tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/
     user: test
15/04/01 13:00:05 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:06 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:07 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:08 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:09 INFO yarn.Client: Application report for application_1427875242006_0010 (state: FAILED)
15/04/01 13:00:09 INFO yarn.Client: 
     client token: N/A
     diagnostics: Application application_1427875242006_0010 failed 2 times due to AM Container for appattempt_1427875242006_0010_000002 exited with  exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0010_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution

Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1427893202566
     final status: FAILED
     tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0010
     user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

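【Comments】:

  • Check the tracking URL and try to find the logs from the container.
  • The node log says: Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
  • Your classpath is missing the jar that contains this class; try launching the job with a fat jar.
  • org.apache.spark.deploy.yarn.ApplicationMaster should be in the jar that spark-submit uses to set up the Spark environment. I don't think the application jar is supposed to contain this class.
  • Did you build with YARN support? The error suggests you did not.

Tags: hadoop apache-spark hadoop-yarn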


【Solution 1】:

Run

yarn logs -applicationId application_1427875242006_0010 > /tmp/application_1427875242006_0010

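The log written there should show why the attempts failed.

"Failed 2 times" appears because in yarn-cluster mode the driver runs inside the AM, and by default the AM is attempted 2 times.

So your driver was attempted twice.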
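The aggregated log can be long, so it may help to pull out only the lines that look like failures. A minimal sketch, assuming the log was saved to a file as in the command above; `show_failures` is a hypothetical helper name, and the search patterns are guesses based on the messages quoted in this thread, not an exhaustive list:

```shell
# Hypothetical helper: print the first matching lines (with line numbers)
# from a saved YARN log file. $1 = log file, $2 = max lines (default 20).
show_failures() {
  grep -nE 'Error:|Exception|bad substitution' "$1" | head -n "${2:-20}"
}
```

For the application in the question this would be called as show_failures /tmp/application_1427875242006_0010 after the yarn logs command has written that file.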

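【Comments】:

    【Solution 2】:

    I completely agree with @SeanOwen. Follow the Building Spark documentation.

    You need to build Spark for YARN with the configuration that matches your Hadoop cluster (Hadoop version, Hive support, and so on).

    Then the problem should go away.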
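One way to check whether the assembly that spark-submit uploads was built with YARN support is to look for the ApplicationMaster class in the jar listing. A minimal sketch, assuming `unzip` is installed; `has_yarn_am` is a hypothetical helper name, and the jar path in the commented example is the one shown in the question's upload log:

```shell
# Hypothetical helper: reads a jar listing on stdin and succeeds only if the
# YARN ApplicationMaster class appears in it.
has_yarn_am() {
  grep -q 'org/apache/spark/deploy/yarn/ApplicationMaster\.class'
}

# Example (not run here): list the assembly jar and check it.
# unzip -l /var/home2/test/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar \
#   | has_yarn_am && echo "YARN support present" || echo "ApplicationMaster class missing"
```

If the class is missing, rebuilding as described above is the likely fix; if it is present, the cause is probably elsewhere.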

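    【Comments】:

      【Solution 3】:

      This is a problem with Spark communicating with the Application Master.

      The RM and NM communicate with each other over RPC, so the problem may be that launch_container.cmd (launch_container.sh on the Linux cluster in the question) is not running correctly. Check whether the NM is communicating with the RM when the job is submitted.

      Try adding this to your yarn-site.xml: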

      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>1200</value>
      </property>
      

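      This ensures that the launch_container.cmd named in the NM error is not deleted (it is kept for about 20 minutes; raise 1200 if you need longer). You can then run that launch_container.cmd script by hand from the container directory and see where it exits.

      Hope this helps.

      【Comments】: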
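The NM error in the question already hints at where the launch script stops: line 27 of launch_container.sh contains the placeholder ${hdp.version}, and bash reports "bad substitution" for a parameter name that contains a dot. A minimal sketch for scanning a preserved launch script for such placeholders; the function names are hypothetical, and reading ${hdp.version} as the cause is an inference from the log, not something the answers here confirm:

```shell
# Hypothetical helper: list ${...} placeholders whose name contains a dot,
# with the line number they occur on. $1 = path to the launch script.
find_dotted_placeholders() {
  grep -noE '\$\{[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z0-9_.]*\}' "$1"
}

# Reproduces the failure in isolation: bash cannot expand a dotted name,
# so this command exits with a non-zero status.
reproduce_bad_substitution() {
  bash -c 'echo "${hdp.version}"'
}
```

Running find_dotted_placeholders on the preserved script in the container directory should point at the same line the NM error reports.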

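        【Solution 4】:

        I ran into a similar problem. In fact, when you run a self-contained application on the cluster, you do not need to specify --master yarn-cluster.

        This was resolved on the Cloudera forum; see https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Issue-running-spark-application-in-Yarn-cluster-mode/td-p/44570

        【Comments】: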
