【Question Title】: How to fix the exception: java.lang.OutOfMemoryError: GC overhead limit exceeded even though enough memory is given in the spark-submit? [duplicate]
【Posted】: 2018-09-23 05:16:38
【Question】:

I am trying to read a table from Postgres and insert the resulting DataFrame into a Hive table on HDFS, as follows:

def prepareFinalDF(splitColumns:List[String], textList: ListBuffer[String], allColumns:String, dataMapper:Map[String, String], partition_columns:Array[String], spark:SparkSession): DataFrame = {
  val execQuery = s"select ${allColumns}, 0 as ${flagCol} from analytics.xx_gl_forecast where period_year='2017'"
  val yearDF    = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017").option("user", devUserName).option("password", devPassword).option("numPartitions",20).load()
  val totalCols:List[String] = splitColumns ++ textList
  val cdt                    = new ChangeDataTypes(totalCols, dataMapper)
  hiveDataTypes              = cdt.gpDetails()
  prepareHiveTableSchema(hiveDataTypes, partition_columns)
  val allColsOrdered         = yearDF.columns.diff(partition_columns) ++ partition_columns
  val allCols                = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
  val resultDF               = yearDF.select(allCols:_*)
  val stringColumns          = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
  val finalDF                = stringColumns.foldLeft(resultDF) {
    (tempDF, colName) => tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+"," "))
  }
  finalDF
}

    val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
    dataDF.createOrReplaceTempView("preparedDF")
    spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("set hive.exec.dynamic.partition=true")
    spark.sql(s"INSERT OVERWRITE TABLE default.xx_gl_forecast PARTITION(${prtn_String_columns}) select * from preparedDF")

The spark-submit command I am using:

SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/username/jars/postgresql-42.1.4.jar  --jars /home/username/jars/postgresql-42.1.4.jar --num-executors 40 --executor-cores 10 --executor-memory 30g --driver-memory 20g --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/username/usr.keytab --principal usr@DEV.COM --files /username/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/username/jars/postgresql-42.1.4.jar

I have the following resources:

number of cores:51
max container memory:471040 MB
Number of executors per LLAP Daemon:39 

Even after doubling the memory, I still see these exceptions in the logs:

Container exited with a non-zero exit code 143.
Killed by external signal
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toCharArray(String.java:2899)
at java.util.zip.ZipCoder.getBytes(ZipCoder.java:78)
at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
18/09/23 04:57:20 INFO JDBCRDD: closed connection

Is there a mistake in the code that makes the job crash? Could anyone point out what I am doing wrong so that I can fix it?

【Comments】:

    Tags: apache-spark


    【Solution 1】:

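    This exception tells you that the JVM is spending an excessive share of its time in garbage collection while reclaiming very little memory. The first thing to do is check the Spark UI while the job is running (or the history server afterwards) to see which stages are doing a lot of GC. It should be quite obvious from the UI.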
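
If the UI alone is not conclusive, GC logging on the executors gives a per-collection record. The following is a minimal sketch, assuming a Java 8 HotSpot JVM (newer JVMs use -Xlog:gc* instead) and a job launched through spark-submit. The object name GcLoggingSketch is hypothetical and is not part of the asker's code.

```scala
import org.apache.spark.sql.SparkSession

object GcLoggingSketch {
  def main(args: Array[String]): Unit = {
    // Java 8 HotSpot flags that print one log entry per collection.
    val gcFlags = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

    // The master and deploy mode are expected to come from spark-submit.
    val spark = SparkSession.builder()
      .appName("Splinter")
      // Executor JVM options must be in place before the executors launch,
      // so they are set on the builder rather than on a running session.
      .config("spark.executor.extraJavaOptions", gcFlags)
      .enableHiveSupport()
      .getOrCreate()

    // ... run the job here; GC entries land in each executor's stdout log ...

    spark.stop()
  }
}
```

The same setting can be passed on the command line as --conf spark.executor.extraJavaOptions="..." placed before the application jar. The output appears in the executors' stdout logs on the worker nodes, not in the driver log.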

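    My guess is that it will be a shuffle. The questions to ask then are:

    • Given the size of your data, do you have enough partitions?
    • If not, try raising the default shuffle parallelism with spark.sql.shuffle.partitions.
    • If the partitions are already well sized, what is filling up your heap? You may want to take a heap dump while the job is running and explore it with a dump analysis tool.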
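
The checks above can be turned into concrete settings. The sketch below is one hedged way to do that, not the answerer's method. PartitionSizing, suggestedPartitions and tuningConf are hypothetical names, and the 128 MiB target per partition is an assumed rule of thumb. For the heap dump it swaps in the JVM's automatic dump on OutOfMemoryError instead of an on-demand dump taken while the job runs.

```scala
object PartitionSizing {
  // Assumed rule of thumb, not from the answer: about 128 MiB per partition.
  val DefaultTargetBytes: Long = 128L * 1024 * 1024

  /** Smallest partition count that keeps each partition at or below targetBytes. */
  def suggestedPartitions(totalBytes: Long, targetBytes: Long = DefaultTargetBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    val count = (totalBytes + targetBytes - 1) / targetBytes // ceiling division
    math.max(1L, count).toInt
  }

  /** Settings to pass as --conf key=value or SparkSession.builder().config(key, value). */
  def tuningConf(totalBytes: Long, heapDumpDir: String): Map[String, String] = Map(
    "spark.sql.shuffle.partitions" -> suggestedPartitions(totalBytes).toString,
    // A single string: append any other executor JVM flags to this value.
    "spark.executor.extraJavaOptions" ->
      s"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$heapDumpDir"
  )
}
```

For example, PartitionSizing.suggestedPartitions(10L * 1024 * 1024 * 1024) yields 80 for 10 GiB of shuffled data, compared with the Spark default of 200 for spark.sql.shuffle.partitions. The heap dump directory must exist on every executor host, and the resulting .hprof file can be opened in any heap analysis tool.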

    【Comments】:
