【发布时间】:2022-01-02 04:06:33
【问题描述】:
我真的需要一些帮助:
我们正在使用 Spark3.1.2 和独立集群。 自从我们开始使用 s3a 目录提交器后,我们的 Spark 作业稳定性和性能显着提高!
然而,最近几天我们对解决这个 s3a 目录提交程序问题感到完全困惑,想知道您是否知道发生了什么?
我们的 spark 作业因 Java OOM(或者更确切地说是进程限制)错误而失败:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
at java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:714)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.$anonfun$new$1(MessageLoop.scala:174)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.$anonfun$new$1$adapted(MessageLoop.scala:173)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.<init>(MessageLoop.scala:173)
at org.apache.spark.rpc.netty.Dispatcher.liftedTree1$1(Dispatcher.scala:75)
at org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:72)
at org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:136)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:231)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:394)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Spark Thread Dump 在 spark 驱动程序上显示超过 5000 个提交者线程! 这是一个例子:
Thread ID Thread Name Thread State Thread Locks
1047 s3-committer-pool-0 WAITING
1449 s3-committer-pool-0 WAITING
1468 s3-committer-pool-0 WAITING
1485 s3-committer-pool-0 WAITING
1505 s3-committer-pool-0 WAITING
1524 s3-committer-pool-0 WAITING
1529 s3-committer-pool-0 WAITING
1544 s3-committer-pool-0 WAITING
1549 s3-committer-pool-0 WAITING
1809 s3-committer-pool-0 WAITING
1972 s3-committer-pool-0 WAITING
1998 s3-committer-pool-0 WAITING
2022 s3-committer-pool-0 WAITING
2043 s3-committer-pool-0 WAITING
2416 s3-committer-pool-0 WAITING
2453 s3-committer-pool-0 WAITING
2470 s3-committer-pool-0 WAITING
2517 s3-committer-pool-0 WAITING
2534 s3-committer-pool-0 WAITING
2551 s3-committer-pool-0 WAITING
2580 s3-committer-pool-0 WAITING
2597 s3-committer-pool-0 WAITING
2614 s3-committer-pool-0 WAITING
2631 s3-committer-pool-0 WAITING
2726 s3-committer-pool-0 WAITING
2743 s3-committer-pool-0 WAITING
2763 s3-committer-pool-0 WAITING
2780 s3-committer-pool-0 WAITING
2819 s3-committer-pool-0 WAITING
2841 s3-committer-pool-0 WAITING
2858 s3-committer-pool-0 WAITING
2875 s3-committer-pool-0 WAITING
2925 s3-committer-pool-0 WAITING
2942 s3-committer-pool-0 WAITING
2963 s3-committer-pool-0 WAITING
2980 s3-committer-pool-0 WAITING
3020 s3-committer-pool-0 WAITING
3037 s3-committer-pool-0 WAITING
3055 s3-committer-pool-0 WAITING
3072 s3-committer-pool-0 WAITING
3127 s3-committer-pool-0 WAITING
3144 s3-committer-pool-0 WAITING
3163 s3-committer-pool-0 WAITING
3180 s3-committer-pool-0 WAITING
3222 s3-committer-pool-0 WAITING
3242 s3-committer-pool-0 WAITING
3259 s3-committer-pool-0 WAITING
3278 s3-committer-pool-0 WAITING
3418 s3-committer-pool-0 WAITING
3435 s3-committer-pool-0 WAITING
3452 s3-committer-pool-0 WAITING
3469 s3-committer-pool-0 WAITING
3486 s3-committer-pool-0 WAITING
3491 s3-committer-pool-0 WAITING
3501 s3-committer-pool-0 WAITING
3508 s3-committer-pool-0 WAITING
4029 s3-committer-pool-0 WAITING
4093 s3-committer-pool-0 WAITING
4658 s3-committer-pool-0 WAITING
4666 s3-committer-pool-0 WAITING
4907 s3-committer-pool-0 WAITING
5102 s3-committer-pool-0 WAITING
5119 s3-committer-pool-0 WAITING
5158 s3-committer-pool-0 WAITING
5175 s3-committer-pool-0 WAITING
5192 s3-committer-pool-0 WAITING
5209 s3-committer-pool-0 WAITING
5226 s3-committer-pool-0 WAITING
5395 s3-committer-pool-0 WAITING
5634 s3-committer-pool-0 WAITING
5651 s3-committer-pool-0 WAITING
5668 s3-committer-pool-0 WAITING
5685 s3-committer-pool-0 WAITING
5702 s3-committer-pool-0 WAITING
5722 s3-committer-pool-0 WAITING
5739 s3-committer-pool-0 WAITING
6144 s3-committer-pool-0 WAITING
6167 s3-committer-pool-0 WAITING
6289 s3-committer-pool-0 WAITING
6588 s3-committer-pool-0 WAITING
6628 s3-committer-pool-0 WAITING
6645 s3-committer-pool-0 WAITING
6662 s3-committer-pool-0 WAITING
6675 s3-committer-pool-0 WAITING
6692 s3-committer-pool-0 WAITING
6709 s3-committer-pool-0 WAITING
7049 s3-committer-pool-0 WAITING
这是考虑到我们的设置不允许超过 100 个线程... 或者我们有什么不明白的……
这是我们的配置和设置:
fs.s3a.threads.max 100
fs.s3a.connection.maximum 1000
fs.s3a.committer.threads 16
fs.s3a.max.total.tasks 5
fs.s3a.committer.name directory
fs.s3a.fast.upload.buffer disk
io.file.buffer.size 1048576
mapreduce.outputcommitter.factory.scheme.s3a - org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
我们尝试了不同版本的 spark Hadoop 云库,但问题始终相同。
如果您能指出我们正确的方向,我们将不胜感激????
感谢您的宝贵时间!
【问题讨论】:
-
你有多少空间:${fs.s3a.committer.staging.tmp.path}/${user}?
-
您能否在集群中使用的所有机器上添加:“ulimit -Hn”的输出。 docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/…
-
ulimit -Hn 1048576 ${fs.s3a.committer.staging.tmp.path} 是 3TB 的 HDFS 文件系统 - 为空。
-
你能添加你的配置吗:fs.s3a.buffer.dir 是否和 fs.s3a.committer.staging.tmp.path 一样?
-
尝试减少:线程数 fs.s3a.threads.max 20 问题有变化吗?
标签: java apache-spark hadoop amazon-s3 pyspark