Running Spark on m4 instead of m3 on AWS

【Question title】: Running Spark on m4 instead of m3 on AWS
【Posted】: 2016-12-28 18:27:42
【Question】:

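I have a small script that I use to submit jobs through AWS. I changed the instance type from m3.xlarge to m4.xlarge, and now I suddenly get an error message and the cluster terminates without completing all of its steps. The script is: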

aws emr create-cluster --name “XXXXXX”  --ami-version 3.7 --applications Name=Hive --use-default-roles --ec2-attributes KeyName=gattami,SubnetId=subnet-xxxxxxx \
--instance-type=m4.xlarge --instance-count 3 \
--log-uri s3://pythonpicode/ --bootstrap-actions Path=s3://eu-central-1.support.elasticmapreduce/spark/install-spark,Name=Spark,Args=[-x] --steps Name=“PythonPi”,Jar=s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,s3://pythonpicode/,s3://pythonpicode/PythonPi.py],ActionOnFailure=CONTINUE --auto-terminate

The error message I get is:

Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-cores, , --files, s3://pythonpicode/PythonPi.py, --primary-py-file, PythonPi.py, --class, org.apache.spark.deploy.PythonRunner)

Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
  --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster
                           mode)
  --class CLASS_NAME       Name of your application's main class (required)
  --primary-py-file        A main Python file
  --arg ARG                Argument to be passed to your application's main class.
                           Multiple invocations are possible, each will be passed in order.
  --num-executors NUM      Number of executors to start (Default: 2)
  --executor-cores NUM     Number of cores per executor (Default: 1).
  --driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb)
  --driver-cores NUM       Number of cores used by the driver (Default: 1).
  --executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G)
  --name NAME              The name of your application (Default: Spark)
  --queue QUEUE            The hadoop queue to use for allocation requests (Default:
                           'default')
  --addJars jars           Comma separated list of local jars that want SparkContext.addJar
                           to work with.
  --py-files PY_FILES      Comma-separated list of .zip, .egg, or .py files to
                           place on the PYTHONPATH for Python apps.
  --files files            Comma separated list of files to be distributed with the job.
  --archives archives      Comma separated list of archives to be distributed with the job.

    at org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:228)
    at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:56)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:646)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Command exiting with ret ‘1'

I have also tried the following alternative:

aws emr create-cluster --name "XXXXXXX"  --release-label emr-4.7.2 --applications Name=Spark --ec2-attributes KeyName=xxxxxxx,SubnetId=subnet-xxxxxxxx \
--instance-type=m4.xlarge  --instance-count 3 \
--log-uri s3://pythonpicode/ --steps Type=CUSTOM_JAR,Name="PythonPi",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-submit,--master,yarn,--deploy-mode,cluster,s3://pythonpicode/PythonPi.py] --use-default-roles --auto-terminate

The (partial) error message I get from these steps is as follows:

16/08/24 11:57:39 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:40 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:41 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:42 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:43 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:44 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:45 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:46 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:47 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:48 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:49 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:50 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:51 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:52 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:53 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:54 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:55 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:56 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:57 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:58 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:57:59 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:58:00 INFO Client: Application report for application_1472039667248_0001 (state: RUNNING)
16/08/24 11:58:01 INFO Client: Application report for application_1472039667248_0001 (state: FAILED)
16/08/24 11:58:01 INFO Client: 
     client token: N/A
     diagnostics: Application application_1472039667248_0001 failed 2 times due to AM Container for appattempt_1472039667248_0001_000002 exited with  exitCode: -104
For more detailed output, check application tracking page:http://ip-172-31-21-32.eu-central-1.compute.internal:8088/cluster/app/application_1472039667248_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=5713,containerID=container_1472039667248_0001_02_000001] is running beyond physical memory limits. Current usage: 2.0 GB of 1.4 GB physical memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1472039667248_0001_02_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 5748 5721 5713 5713 (python) 301 29 1343983616 246463 python PythonPi.py 
    |- 5721 5713 5713 5713 (java) 1594 93 2031308800 265175 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1472039667248_0001/container_1472039667248_0001_02_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1472039667248_0001/container_1472039667248_0001_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file PythonPi.py --executor-memory 5120m --executor-cores 4 --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1472039667248_0001/container_1472039667248_0001_02_000001/__spark_conf__/__spark_conf__.properties 
    |- 5713 5711 5713 5713 (bash) 0 0 115810304 715 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1472039667248_0001/container_1472039667248_0001_02_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1472039667248_0001/container_1472039667248_0001_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file PythonPi.py --executor-memory 5120m --executor-cores 4 --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1472039667248_0001/container_1472039667248_0001_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1472039667248_0001/container_1472039667248_0001_02_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1472039667248_0001/container_1472039667248_0001_02_000001/stderr 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1472039815698
     final status: FAILED
     tracking URL: http://ip-172-31-21-32.eu-central-1.compute.internal:8088/cluster/app/application_1472039667248_0001
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1472039667248_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/08/24 11:58:01 INFO ShutdownHookManager: Shutdown hook called
16/08/24 11:58:01 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7adbbd9f-2f68-49e3-85e6-9fdf960af87e
Command exiting with ret '1'
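
A rough reading of the memory diagnostics above, offered as a sketch rather than a confirmed diagnosis: the 1.4 GB limit is what Spark 1.6 on YARN would request for the application master in cluster mode if the driver memory is the default 1024 MB (the `-Xmx1024m` in the process tree) plus the default overhead of max(384 MB, 10% of driver memory). The 2.0 GB usage matches the summed RSS of the python, java, and bash processes in the dump, assuming 4 KiB memory pages. The function name and the page size are assumptions made for illustration.

```python
# Assumption: the node uses 4 KiB memory pages (typical for x86 Linux).
PAGE_SIZE_BYTES = 4096

# RSSMEM_USAGE(PAGES) values copied from the process-tree dump in the log.
RSS_PAGES = {"python": 246463, "java": 265175, "bash": 715}


def container_limit_mb(driver_memory_mb, overhead_factor=0.10, overhead_min_mb=384):
    """Driver container size as Spark 1.6 on YARN computes it by default:
    driver memory plus max(overhead_min_mb, overhead_factor * driver memory)."""
    overhead_mb = max(int(driver_memory_mb * overhead_factor), overhead_min_mb)
    return driver_memory_mb + overhead_mb


limit_mb = container_limit_mb(1024)  # 1024 MB heap, as in -Xmx1024m
used_gib = sum(RSS_PAGES.values()) * PAGE_SIZE_BYTES / 1024 ** 3

print(f"container limit: {limit_mb} MB = {limit_mb / 1024:.1f} GB")
print(f"resident memory: {used_gib:.1f} GB")
```

If this reading is right, the Python process running next to the driver JVM is not bounded by the JVM heap setting, so the container would need more headroom, for example through `--driver-memory` or `spark.yarn.driver.memoryOverhead`. That has not been tested against this cluster.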

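【Comments】:

There is no --executor-cores in your steps definition, yet that is what your error message complains about. Did you take it out when you posted the question?
What do you think is missing? I used the same script with m3.xlarge and it ran fine.
Are you sure about your “ quotes? They look wrong.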

Tags: amazon-web-services amazon-s3 amazon-ec2 pyspark

【Solution 1】:

You need to check your Spark version. Most likely you have an older version installed (e.g. 1.5) that does not support these parameters:

(--executor-cores, , --files, s3://pythonpicode/PythonPi.py, --primary-py-file, PythonPi.py, --class, org.apache.spark.deploy.PythonRunner)

I suggest you try the stable AMI 4.7.2, which provides Spark 1.6 as a standard application.
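
One detail in the parameter list quoted above may be worth checking: the slot right after `--executor-cores` appears to be empty. The sketch below is a hypothetical helper, not part of Spark or the AWS CLI. It splits the list from the error message and reports flags that have no value after them. It is only a heuristic, since a flag that legitimately takes no value would also be reported.

```python
# Parameter list copied from the "Unknown/unsupported param List(...)" error.
RAW_PARAMS = (
    "--executor-cores, , --files, s3://pythonpicode/PythonPi.py, "
    "--primary-py-file, PythonPi.py, --class, org.apache.spark.deploy.PythonRunner"
)


def flags_without_value(raw_params):
    """Return flags whose next token is empty or is itself another flag."""
    tokens = [token.strip() for token in raw_params.split(",")]
    missing = []
    # Walk over (token, next token) pairs and inspect what follows each flag.
    for flag, value in zip(tokens, tokens[1:]):
        if flag.startswith("--") and (value == "" or value.startswith("--")):
            missing.append(flag)
    return missing


print(flags_without_value(RAW_PARAMS))
```

For the list in the error message this reports `--executor-cores`, which would be consistent with the core count not being filled in for the new instance type. The source does not confirm that this is the cause.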

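【Comments】:

Hi Shuaiyuan, thanks for the quick reply. Are you sure there is an AMI version 4.7.2? 3.11 is the latest version I can find. When I set the AMI version to 4.7.2, I get the following error message: "An error occurred (ValidationException) when calling the RunJobFlow operation: The supplied ami version is invalid."
That is really strange. Check this
Well, there is an EMR release 4.7.2, which is probably what you meant. However, it does not support Spark. Here are the script and the error message I get: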
aws emr create-cluster --name "XXXXX" --release-label emr-4.7.2 --applications Name=Hive --use-default-roles --ec2-attributes KeyName=gattami ,SubnetId=subnet-xxxxxxx \ etc...
An error occurred (ValidationException) when calling the RunJobFlow operation: The supplied bootstrap action: 'emr-4.7.2' does not support 'Spark'.