【Question title】: Spark: executor heartbeat timed out
【Posted】: 2020-07-11 16:05:15
【Question】:

I am working on a Databricks cluster with 240 GB of memory and 64 cores. These are the settings I defined.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as fs
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.functions import count
from pyspark.sql.functions import col, countDistinct
from geospark.utils import GeoSparkKryoRegistrator, KryoSerializer
from geospark.register import upload_jars
from geospark.register import GeoSparkRegistrator
spark.conf.set("spark.sql.shuffle.partitions", 1000)
#Recommended settings for using GeoSpark
spark.conf.set("spark.driver.memory", "20g")
spark.conf.set("spark.network.timeout", "1000s")
spark.conf.set("spark.driver.maxResultSize", "10g")
spark.conf.set("spark.serializer", KryoSerializer.getName)
spark.conf.set("spark.kryo.registrator", GeoSparkKryoRegistrator.getName)
upload_jars()
SparkContext.setSystemProperty("geospark.global.charset","utf8")

I am processing a large dataset, and this is the error I get after the job has been running for several hours.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 6054, 10.17.21.12, executor 7): 

ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170684 ms

【Comments】:

    Tags: apache-spark pyspark azure-databricks


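    【Answer 1】:

    Leave the heartbeat interval at its default (10s) and try raising the network timeout (default 120s) to 300s (300000 ms). Give the value an explicit unit: a bare number for spark.network.timeout is read as seconds, not milliseconds. Use set and get to apply and verify it.

        spark.conf.set("spark.sql.<name-of-property>", <value>)
        spark.conf.set("spark.network.timeout", "300s")
        spark.conf.get("spark.network.timeout")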
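The unit handling is easy to get wrong, so here is a minimal pure-Python sketch of it. It is not Spark's own parser: `to_ms`, `check_timeouts`, and the `min_ratio=10` threshold are hypothetical names and values chosen for illustration. It assumes that a bare number for spark.network.timeout is read as seconds, that a bare number for spark.executor.heartbeatInterval is read as milliseconds, and that the heartbeat interval should stay well below the network timeout.

```python
import re

# Suffixes in the style of Spark duration strings, mapped to milliseconds.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "min": 60_000,
            "h": 3_600_000, "d": 86_400_000}


def to_ms(value, default_unit):
    """Convert a duration such as '300s' or a bare 300000 to milliseconds.

    `default_unit` is the unit assumed when the value has no suffix.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", str(value).lower())
    if not match:
        raise ValueError(f"not a duration: {value!r}")
    number = int(match.group(1))
    unit = match.group(2) or default_unit
    if unit not in _UNIT_MS:
        raise ValueError(f"unknown unit: {unit!r}")
    return number * _UNIT_MS[unit]


def check_timeouts(network_timeout, heartbeat_interval, min_ratio=10):
    """Return (timeout_ms, heartbeat_ms, ok).

    `ok` is True when the timeout is at least `min_ratio` times the
    heartbeat interval. The ratio of 10 is an assumption, not a Spark rule.
    """
    timeout_ms = to_ms(network_timeout, default_unit="s")
    heartbeat_ms = to_ms(heartbeat_interval, default_unit="ms")
    return timeout_ms, heartbeat_ms, timeout_ms >= min_ratio * heartbeat_ms


print(check_timeouts("300s", "10s"))  # (300000, 10000, True)
print(to_ms(300000, "s"))             # 300000000, i.e. about 83 hours
```

For reference, the 170684 ms in the error message is far below the 1000s set in the question, so it may be worth confirming with spark.conf.get, or in the Environment tab of the Spark UI, which value is actually in effect.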
    

    Or run this script in a notebook.

        %scala
        dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
        |#!/bin/bash
        |
        |cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf 
        |[driver] {
        |  "spark.network.timeout" = "300s"
        |}
        |EOF
        """.stripMargin, true)
          
    

    【Comments】:

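      【Answer 2】:

      The error tells you that an executor timed out because it took too long to respond. There is probably a bottleneck somewhere behind the scenes. Check the Spark UI for executor 7, task 3, and stage 10. You should also review the query you have been running.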
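One common bottleneck of this kind is data skew, where a few partitions hold most of the rows and their tasks run far longer than the rest. The following is a minimal sketch of a skew check, not something taken from the answer: `skew_ratio` and `partition_row_counts` are hypothetical helper names, and `df` stands for whatever DataFrame feeds the slow stage.

```python
from statistics import median


def skew_ratio(rows_per_partition):
    """Ratio of the largest partition to the median one (1.0 = balanced)."""
    sizes = [n for n in rows_per_partition if n > 0]
    if not sizes:
        return 0.0
    return max(sizes) / median(sizes)


def partition_row_counts(df):
    """Count rows per non-empty partition of a PySpark DataFrame.

    This triggers a Spark job, so it can be slow on a large dataset.
    """
    from pyspark.sql.functions import spark_partition_id  # lazy import

    rows = df.groupBy(spark_partition_id().alias("pid")).count().collect()
    return [row["count"] for row in rows]


# Made-up sizes: three even partitions and one very large one.
print(skew_ratio([10, 10, 10, 1000]))  # 100.0
```

On the cluster, `skew_ratio(partition_row_counts(df))` gives a single number to compare before and after repartitioning. A ratio far above 1 suggests that a handful of tasks are doing most of the work.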

      You may also want to review these settings for a better configuration:

      spark.conf.set("spark.databricks.io.cache.enabled", True)  # enable the Delta (disk) cache
      spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)  # adaptive query execution: split skewed join partitions
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # -1 disables automatic broadcast joins
      spark.conf.set("spark.databricks.optimizer.rangeJoin.binSize", 20)  # bin size for the range join optimization
      

      Feel free to share more information from the Spark UI so we can better help you find the problem. Also, what query are you running?

      【Comments】:

        【Answer 3】:

        Please try the following options:

        • Repartition the DataFrame you are working with into more partitions, e.g. df.repartition(1000)
        • --conf spark.network.timeout=10000000
        • --conf spark.executor.heartbeatInterval=10000000
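The 1000 in df.repartition(1000) is only an example, so a rough way to pick a partition count may help. The sketch below assumes a target of about 128 MB per partition, which is a common rule of thumb rather than a Spark requirement, and rounds up to a multiple of the 64 cores mentioned in the question. `suggest_partitions` is a hypothetical helper and the 100 GiB input is made up.

```python
import math


def suggest_partitions(total_bytes, target_bytes=128 * 1024**2, cores=64):
    """Suggest a partition count for a dataset of `total_bytes`.

    Aims for roughly `target_bytes` per partition, then rounds up to a
    multiple of `cores` so every core gets work in each wave of tasks.
    """
    wanted = max(1, math.ceil(total_bytes / target_bytes))
    return math.ceil(wanted / cores) * cores


# Hypothetical 100 GiB dataset on the 64-core cluster from the question.
print(suggest_partitions(100 * 1024**3))  # 832
```

The result can be passed straight to the first option, for example `df.repartition(suggest_partitions(estimated_bytes))`, where `estimated_bytes` is your own estimate of the data size.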

        【Comments】:
