【问题标题】:Can anyone explain my Apache Spark Error SparkException: Job aborted due to stage failure谁能解释我的 Apache Spark 错误 SparkException:作业因阶段故障而中止
【发布时间】:2015-07-12 20:13:23
【问题描述】:

我有一个简单的 Apache Spark 应用程序,我从 hdfs 读取文件,然后将其通过管道传输到外部进程。当我读取大量数据(在我的情况下文件大约有 241MB)并且我没有指定最小分区数或将最小数指定为 4 时,我收到以下错误:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, ip-172-31-36-43.us-west-2.compute.internal): ExecutorLostFailure (executor 6 lost)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

当我将最小分区数指定为 10 或以上时,我没有收到此错误。谁能告诉我出了什么问题并避免它?我没有收到子进程以错误代码退出的错误,所以我认为这是 Spark 配置的问题。

来自工人的标准错误:

15/05/03 10:41:29 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/05/03 10:41:30 INFO spark.SecurityManager: Changing view acls to: root
15/05/03 10:41:30 INFO spark.SecurityManager: Changing modify acls to: root
15/05/03 10:41:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/03 10:41:30 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/03 10:41:30 INFO Remoting: Starting remoting
15/05/03 10:41:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@ip-172-31-36-43.us-west-2.compute.internal:46832]
15/05/03 10:41:31 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 46832.
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/05/03 10:41:31 INFO spark.SecurityManager: Changing view acls to: root
15/05/03 10:41:31 INFO spark.SecurityManager: Changing modify acls to: root
15/05/03 10:41:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/05/03 10:41:31 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/03 10:41:31 INFO Remoting: Starting remoting
15/05/03 10:41:31 INFO util.Utils: Successfully started service 'sparkExecutor' on port 37039.
15/05/03 10:41:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@ip-172-31-36-43.us-west-2.compute.internal:37039]
15/05/03 10:41:31 INFO util.AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@ip-172-31-35-111.us-west-2.compute.internal:48730/user/MapOutputTracker
15/05/03 10:41:31 INFO util.AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@ip-172-31-35-111.us-west-2.compute.internal:48730/user/BlockManagerMaster
15/05/03 10:41:31 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-e99f55c6-5bcb-4d1b-b014-aaec94fe6cc5/blockmgr-cda1922d-ea50-4630-a834-bfb637ecdaa0
15/05/03 10:41:31 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-0c6c912f-3aa1-4c54-9970-7a75d22899e8/spark-71d64ae7-36bc-49e0-958e-e7e2c1432027/spark-56d9e077-4585-4fd7-8a48-5227943d9004/blockmgr-29c5d068-f19d-4f41-85fc-11960c77a8a3
15/05/03 10:41:31 INFO storage.MemoryStore: MemoryStore started with capacity 445.4 MB
15/05/03 10:41:32 INFO util.AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@ip-172-31-35-111.us-west-2.compute.internal:48730/user/OutputCommitCoordinator
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@ip-172-31-35-111.us-west-2.compute.internal:48730/user/CoarseGrainedScheduler
15/05/03 10:41:32 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@ip-172-31-36-43.us-west-2.compute.internal:54983/user/Worker
15/05/03 10:41:32 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@ip-172-31-36-43.us-west-2.compute.internal:54983/user/Worker
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
15/05/03 10:41:32 INFO executor.Executor: Starting executor ID 6 on host ip-172-31-36-43.us-west-2.compute.internal
15/05/03 10:41:32 INFO netty.NettyBlockTransferService: Server created on 33000
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Registered BlockManager
15/05/03 10:41:32 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ip-172-31-35-111.us-west-2.compute.internal:48730/user/HeartbeatReceiver
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
15/05/03 10:41:32 INFO executor.Executor: Running task 1.3 in stage 0.0 (TID 6)
15/05/03 10:41:32 INFO executor.Executor: Fetching http://172.31.35.111:34347/jars/proteinsApacheSpark-0.0.1.jar with timestamp 1430649374764
15/05/03 10:41:32 INFO util.Utils: Fetching http://172.31.35.111:34347/jars/proteinsApacheSpark-0.0.1.jar to /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-08b3b4ce-960f-488f-99ea-bd66b3277207/fetchFileTemp3079113313084659984.tmp
15/05/03 10:41:32 INFO util.Utils: Copying /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-08b3b4ce-960f-488f-99ea-bd66b3277207/9655652641430649374764_cache to /root/spark/work/app-20150503103615-0002/6/./proteinsApacheSpark-0.0.1.jar
15/05/03 10:41:32 INFO executor.Executor: Adding file:/root/spark/work/app-20150503103615-0002/6/./proteinsApacheSpark-0.0.1.jar to class loader
15/05/03 10:41:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
15/05/03 10:41:32 INFO storage.MemoryStore: ensureFreeSpace(17223) called with curMem=0, maxMem=467081625
15/05/03 10:41:32 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 16.8 KB, free 445.4 MB)
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/03 10:41:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 274 ms
15/05/03 10:41:32 INFO storage.MemoryStore: ensureFreeSpace(22384) called with curMem=17223, maxMem=467081625
15/05/03 10:41:32 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 21.9 KB, free 445.4 MB)
15/05/03 10:41:33 INFO spark.CacheManager: Partition rdd_0_1 not found, computing it
15/05/03 10:41:33 INFO rdd.WholeTextFileRDD: Input split: Paths:/user/root/pepnovo3/largeinputfile2/largeinputfile2_45.mgf:0+2106005,/user/root/pepnovo3/largeinputfile2/largeinputfile2_46.mgf:0+2105954,/user/root/pepnovo3/largeinputfile2/largeinputfile2_47.mgf:0+2106590,/user/root/pepnovo3/largeinputfile2/largeinputfile2_48.mgf:0+2105696,/user/root/pepnovo3/largeinputfile2/largeinputfile2_49.mgf:0+2105891,/user/root/pepnovo3/largeinputfile2/largeinputfile2_5.mgf:0+2106283,/user/root/pepnovo3/largeinputfile2/largeinputfile2_50.mgf:0+2105559,/user/root/pepnovo3/largeinputfile2/largeinputfile2_51.mgf:0+2106403,/user/root/pepnovo3/largeinputfile2/largeinputfile2_52.mgf:0+2105535,/user/root/pepnovo3/largeinputfile2/largeinputfile2_53.mgf:0+2105615,/user/root/pepnovo3/largeinputfile2/largeinputfile2_54.mgf:0+2105861,/user/root/pepnovo3/largeinputfile2/largeinputfile2_55.mgf:0+2106100,/user/root/pepnovo3/largeinputfile2/largeinputfile2_56.mgf:0+2106265,/user/root/pepnovo3/largeinputfile2/largeinputfile2_57.mgf:0+2105768,/user/root/pepnovo3/largeinputfile2/largeinputfile2_58.mgf:0+2106180,/user/root/pepnovo3/largeinputfile2/largeinputfile2_59.mgf:0+2105751,/user/root/pepnovo3/largeinputfile2/largeinputfile2_6.mgf:0+2106247,/user/root/pepnovo3/largeinputfile2/largeinputfile2_60.mgf:0+2106133,/user/root/pepnovo3/largeinputfile2/largeinputfile2_61.mgf:0+2106224,/user/root/pepnovo3/largeinputfile2/largeinputfile2_62.mgf:0+2106415,/user/root/pepnovo3/largeinputfile2/largeinputfile2_63.mgf:0+2106408,/user/root/pepnovo3/largeinputfile2/largeinputfile2_64.mgf:0+2105702,/user/root/pepnovo3/largeinputfile2/largeinputfile2_65.mgf:0+2106268,/user/root/pepnovo3/largeinputfile2/largeinputfile2_66.mgf:0+2106149,/user/root/pepnovo3/largeinputfile2/largeinputfile2_67.mgf:0+2105846,/user/root/pepnovo3/largeinputfile2/largeinputfile2_68.mgf:0+2105408,/user/root/pepnovo3/largeinputfile2/largeinputfile2_69.mgf:0+2106172,/user/root/pepnovo3/largeinputfile2/largeinputfile2_7.mgf:0+2105517,/user/root/pepnovo3/largeinputfile2/largeinputfile2_70.mgf:0+2105980,/user/root/pepnovo3/largeinputfile2/largeinputfile2_71.mgf:0+2105651,/user/root/pepnovo3/largeinputfile2/largeinputfile2_72.mgf:0+2105936,/user/root/pepnovo3/largeinputfile2/largeinputfile2_73.mgf:0+2105966,/user/root/pepnovo3/largeinputfile2/largeinputfile2_74.mgf:0+2105456,/user/root/pepnovo3/largeinputfile2/largeinputfile2_75.mgf:0+2105786,/user/root/pepnovo3/largeinputfile2/largeinputfile2_76.mgf:0+2106151,/user/root/pepnovo3/largeinputfile2/largeinputfile2_77.mgf:0+2106284,/user/root/pepnovo3/largeinputfile2/largeinputfile2_78.mgf:0+2106163,/user/root/pepnovo3/largeinputfile2/largeinputfile2_79.mgf:0+2106233,/user/root/pepnovo3/largeinputfile2/largeinputfile2_8.mgf:0+2105885,/user/root/pepnovo3/largeinputfile2/largeinputfile2_80.mgf:0+2105979,/user/root/pepnovo3/largeinputfile2/largeinputfile2_81.mgf:0+2105888,/user/root/pepnovo3/largeinputfile2/largeinputfile2_82.mgf:0+2106546,/user/root/pepnovo3/largeinputfile2/largeinputfile2_83.mgf:0+2106322,/user/root/pepnovo3/largeinputfile2/largeinputfile2_84.mgf:0+2106017,/user/root/pepnovo3/largeinputfile2/largeinputfile2_85.mgf:0+2106242,/user/root/pepnovo3/largeinputfile2/largeinputfile2_86.mgf:0+2105543,/user/root/pepnovo3/largeinputfile2/largeinputfile2_87.mgf:0+2106556,/user/root/pepnovo3/largeinputfile2/largeinputfile2_88.mgf:0+2105637,/user/root/pepnovo3/largeinputfile2/largeinputfile2_89.mgf:0+2106130,/user/root/pepnovo3/largeinputfile2/largeinputfile2_9.mgf:0+2105634,/user/root/pepnovo3/largeinputfile2/largeinputfile2_90.mgf:0+2105731,/user/root/pepnovo3/largeinputfile2/largeinputfile2_91.mgf:0+2106401,/user/root/pepnovo3/largeinputfile2/largeinputfile2_92.mgf:0+2105736,/user/root/pepnovo3/largeinputfile2/largeinputfile2_93.mgf:0+2105688,/user/root/pepnovo3/largeinputfile2/largeinputfile2_94.mgf:0+2106436,/user/root/pepnovo3/largeinputfile2/largeinputfile2_95.mgf:0+2105609,/user/root/pepnovo3/largeinputfile2/largeinputfile2_96.mgf:0+2105525,/user/root/pepnovo3/largeinputfile2/largeinputfile2_97.mgf:0+2105603,/user/root/pepnovo3/largeinputfile2/largeinputfile2_98.mgf:0+2106211,/user/root/pepnovo3/largeinputfile2/largeinputfile2_99.mgf:0+2105928
15/05/03 10:41:33 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
15/05/03 10:41:33 INFO storage.MemoryStore: ensureFreeSpace(6906) called with curMem=39607, maxMem=467081625
15/05/03 10:41:33 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.7 KB, free 445.4 MB)
15/05/03 10:41:33 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/03 10:41:33 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 15 ms
15/05/03 10:41:33 INFO storage.MemoryStore: ensureFreeSpace(53787) called with curMem=46513, maxMem=467081625
15/05/03 10:41:33 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 52.5 KB, free 445.3 MB)
15/05/03 10:41:33 WARN snappy.LoadSnappy: Snappy native library is available
15/05/03 10:41:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/05/03 10:41:33 INFO snappy.LoadSnappy: Snappy native library loaded
15/05/03 10:41:36 INFO storage.MemoryStore: ensureFreeSpace(252731448) called with curMem=100300, maxMem=467081625
15/05/03 10:41:36 INFO storage.MemoryStore: Block rdd_0_1 stored as values in memory (estimated size 241.0 MB, free 204.3 MB)
15/05/03 10:41:36 INFO storage.BlockManagerMaster: Updated info of block rdd_0_1

【问题讨论】:

    标签: java hadoop amazon-ec2 apache-spark


    【解决方案1】:

    答案很可能在executor log中,与worker log不同。很可能它会耗尽内存并启动 GC 抖动或死于 OOM。如果这是一个选项,您可以尝试为每个执行程序运行更多内存。

    【讨论】:

    • 在哪里可以找到 Amazon EC2 实例上具有默认配置的 exucotors 日志?
    • 它们是从 UI 链接的。
    【解决方案2】:

    检查您的系统硬盘空间、网络和内存。在 $SPARK_HOME/work.spark 中写入文件。一段时间硬盘空间已满,没有可用内存或网络问题。 如果您可以在 your_machine:4040 上看到任何异常

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2019-06-19
      • 2014-09-10
      • 2019-12-25
      • 2020-10-04
      • 2019-09-24
      • 2020-11-07
      • 1970-01-01
      • 2018-03-18
      相关资源
      最近更新 更多