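【Posted】: 2019-07-11 13:48:09
【Question】:
I am trying to run multiple YARN applications on EMR Spark, but I cannot run more than 5 applications at a time.
I am using the following configuration for the Spark cluster: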
Master = r5.2xlarge
Worker = r5.12xlarge (384 GB memory, 48 vCores)
Deploy mode = cluster
JSON
{
"Classification":"spark-defaults",
"ConfigurationProperties":{
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.scheduler.mode":"FIFO",
"spark.eventLog.enabled":"true",
"spark.serializer":"org.apache.spark.serializer.KryoSerializer",
"spark.dynamicAllocation.enabled":"false",
"spark.executor.heartbeatInterval":"60s",
"spark.network.timeout": "800s",
"spark.executor.cores": "5",
"spark.driver.cores": "5",
"spark.executor.memory": "37000M",
"spark.driver.memory": "37000M",
"spark.yarn.executor.memoryOverhead": "5000M",
"spark.yarn.driver.memoryOverhead": "5000M",
"spark.executor.instances": "17",
"spark.default.parallelism": "170",
"spark.yarn.scheduler.reporterThread.maxFailures": "5",
"spark.storage.level": "MEMORY_AND_DISK_SER",
"spark.rdd.compress": "true",
"spark.shuffle.compress": "true",
"spark.shuffle.spill.compress": "true"
}
}
How can I increase the number of YARN applications that run in parallel on EMR Spark?
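As a rough check of the configuration above, the sketch below adds up what a single application requests from YARN: 17 executor containers plus one driver container (in cluster mode the driver runs inside the YARN application master). The property values are copied from the `spark-defaults` block; the cluster totals in the usage example are hypothetical, since the question does not state the number of worker nodes. It also ignores YARN rounding each container up to its allocation increment, so treat the result as a lower bound on the request, not a diagnosis.

```python
# Values copied from the spark-defaults block in the question (MB / cores).
EXECUTOR_MEMORY_MB = 37000
EXECUTOR_OVERHEAD_MB = 5000
DRIVER_MEMORY_MB = 37000
DRIVER_OVERHEAD_MB = 5000
EXECUTOR_INSTANCES = 17
EXECUTOR_CORES = 5
DRIVER_CORES = 5


def per_application_request():
    """Return (memory_mb, vcores) that one application asks YARN for.

    Each container is sized as heap memory plus memory overhead.
    YARN's rounding to its allocation increment is ignored here.
    """
    executor_container_mb = EXECUTOR_MEMORY_MB + EXECUTOR_OVERHEAD_MB
    driver_container_mb = DRIVER_MEMORY_MB + DRIVER_OVERHEAD_MB
    memory_mb = EXECUTOR_INSTANCES * executor_container_mb + driver_container_mb
    vcores = EXECUTOR_INSTANCES * EXECUTOR_CORES + DRIVER_CORES
    return memory_mb, vcores


def max_fully_provisioned_apps(cluster_memory_mb, cluster_vcores):
    """How many applications could hold all of their containers at once."""
    memory_mb, vcores = per_application_request()
    return min(cluster_memory_mb // memory_mb, cluster_vcores // vcores)


if __name__ == "__main__":
    memory_mb, vcores = per_application_request()
    print(f"one application requests {memory_mb} MB and {vcores} vcores")
    # Hypothetical cluster totals, NOT taken from the question.
    print(max_fully_provisioned_apps(cluster_memory_mb=4_000_000, cluster_vcores=480))
```

With these settings one application requests 756000 MB and 90 vcores, so the real totals reported by YARN are needed to tell whether memory, vcores, or a scheduler limit is what caps concurrency.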
【Comments】:
-
How many vcores and how much memory are available? Can you add the YARN information?
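One way to collect that YARN information is the ResourceManager's cluster metrics REST endpoint. The sketch below is a minimal example, assuming the ResourceManager web service is reachable on port 8088 of the master node (the usual default, but check your cluster); the URL constant and the function names are illustrative, not something from the question.

```python
import json
from urllib.request import urlopen

# Assumption: run on the master node, with the ResourceManager on port 8088.
RM_METRICS_URL = "http://localhost:8088/ws/v1/cluster/metrics"

# Capacity-related fields of the clusterMetrics object.
CAPACITY_KEYS = (
    "totalMB",
    "availableMB",
    "totalVirtualCores",
    "availableVirtualCores",
    "appsRunning",
    "appsPending",
)


def summarize_metrics(payload):
    """Pick the capacity-related fields out of a cluster metrics response."""
    metrics = payload["clusterMetrics"]
    return {key: metrics[key] for key in CAPACITY_KEYS}


def fetch_metrics(url=RM_METRICS_URL, timeout=10):
    """Query the ResourceManager and return the summarized metrics."""
    with urlopen(url, timeout=timeout) as response:
        return summarize_metrics(json.load(response))


if __name__ == "__main__":
    for name, value in fetch_metrics().items():
        print(f"{name}: {value}")
```

Running it while five applications are active and a sixth is waiting shows whether `availableMB` or `availableVirtualCores` has run out, or whether resources are still free and something else is holding the application back.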
-
I think it depends on the resources YARN has left; if there are not enough, the containers will be destroyed.
-
I changed the instance type to increase the resources, but it still does not go beyond 5 parallel jobs.
-
Could you please post how you are submitting the jobs? Are you setting any parameters when you submit them?
-
I submit the job as: "spark-submit --deploy-mode cluster --master yarn --py-files s3://bucket_name/spark_standardization.zip s3://bucket_name/preprocess_driver.py". No other parameters are set in the command.
Tags: pyspark hadoop-yarn amazon-emr