【发布时间】:2022-02-13 02:07:05
【问题描述】:
我是 Spark 的新手。现在我遇到了一个问题:当我在独立的 spark 集群中启动程序时,命令行:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
它会抛出以下错误:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
集群主输出以下日志:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
以及对应的启动驱动程序输出的worker:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
我的spark-env.sh 是:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
而我的spark-defaults.conf 是:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
但是,当我使用 client 模式使用以下命令启动程序时,它可以正常工作。
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar
【问题讨论】:
-
主集群输出告诉您用于通信其中一个节点的 tcp 连接以某种方式断开。应该有一个让司机下降的工人的日志。检查那里的特定错误。
-
上面贴了相应的工人日志,我不知道为什么,你能更具体一点吗? @suiterdev
-
对不起,我错过了。这不是很有帮助。您是否确认了每个节点在 /etc/hosts 中映射的常规内容、iptables 和主机?有时 DNS 无法解决问题,您可以尝试将 IP 地址插入节点的 spark-env.sh。除此之外,一些搜索表明许多驱动程序解除关联错误,但出于各种原因。有时是错误,有时是环境变量。要进行测试,请关闭节点上的 iptables 并将 selinux 设置为 permissive 或关闭(如果没有)。抱歉,我无法为工作人员日志失败提供更多帮助。
-
@Neal 尝试将您的 spark 版本更新到 1.3.1。由于 spark 1.3.1 支持独立集群模式。
标签: apache-spark