AWS Glue - can't set spark.yarn.executor.memoryOverhead

【Question title】: AWS Glue - can't set spark.yarn.executor.memoryOverhead
【Posted】: 2019-01-29 21:27:11
【Question】:

When running a Python job in AWS Glue, I get this error:

Reason: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead

When I run this at the beginning of the script:

# sc is the SparkContext that the Glue script has already created.
print('--- Before Conf --')
print('spark.yarn.driver.memory', sc._conf.get('spark.yarn.driver.memory'))
print('spark.yarn.driver.cores', sc._conf.get('spark.yarn.driver.cores'))
print('spark.yarn.executor.memory', sc._conf.get('spark.yarn.executor.memory'))
print('spark.yarn.executor.cores', sc._conf.get('spark.yarn.executor.cores'))
print('spark.yarn.executor.memoryOverhead', sc._conf.get('spark.yarn.executor.memoryOverhead'))

print('--- Conf --')
sc._conf.setAll([
    ('spark.yarn.executor.memory', '15G'),
    ('spark.yarn.executor.memoryOverhead', '10G'),
    ('spark.yarn.driver.cores', '5'),
    ('spark.yarn.executor.cores', '5'),
    ('spark.yarn.cores.max', '5'),
    ('spark.yarn.driver.memory', '15G'),
])

# Read back the same spark.yarn.* keys that were set above, matching the output shown.
print('--- After Conf ---')
print('spark.yarn.driver.memory', sc._conf.get('spark.yarn.driver.memory'))
print('spark.yarn.driver.cores', sc._conf.get('spark.yarn.driver.cores'))
print('spark.yarn.executor.memory', sc._conf.get('spark.yarn.executor.memory'))
print('spark.yarn.executor.cores', sc._conf.get('spark.yarn.executor.cores'))
print('spark.yarn.executor.memoryOverhead', sc._conf.get('spark.yarn.executor.memoryOverhead'))

I get the following output:

--- Before Conf --
spark.yarn.driver.memory None
spark.yarn.driver.cores None
spark.yarn.executor.memory None
spark.yarn.executor.cores None
spark.yarn.executor.memoryOverhead None
--- Conf --
--- After Conf ---
spark.yarn.driver.memory 15G
spark.yarn.driver.cores 5
spark.yarn.executor.memory 15G
spark.yarn.executor.cores 5
spark.yarn.executor.memoryOverhead 10G

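It looks like spark.yarn.executor.memoryOverhead is set, so why isn't it recognized? I still get the same error.

I have seen other posts about problems setting spark.yarn.executor.memoryOverhead, but none where the value appears to be set and still has no effect.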

【Comments】:

Tags: apache-spark pyspark aws-glue

【Answer 1】:

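Unfortunately, the current version of Glue doesn't support this. You cannot set any parameters other than those exposed in the UI. In your case, you could use the AWS EMR service instead of AWS Glue.

When I had a similar problem, I tried to reduce the number of shuffles and the amount of data shuffled, and to increase the number of DPUs. While working on it, I relied on the following articles. I hope you find them useful.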

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/

https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html
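
One way to cut the amount of shuffled data, as suggested above, is to aggregate inside each partition before the shuffle. Spark's reduceByKey does this, whereas groupByKey sends every record across. The plain-Python sketch below only simulates that idea on in-memory lists. It does not use Spark, and the sample partitions and function names are made up for illustration.

```python
from collections import Counter


def records_shuffled_without_combine(partitions):
    """Without pre-aggregation, every (key, value) record crosses the shuffle."""
    return sum(len(partition) for partition in partitions)


def records_shuffled_with_combine(partitions):
    """With pre-aggregation, each partition sends one record per distinct key."""
    return sum(len(Counter(key for key, _ in partition)) for partition in partitions)


def totals_per_key(partitions):
    """The final result is the same either way: the sum of values per key."""
    totals = Counter()
    for partition in partitions:
        for key, value in partition:
            totals[key] += value
    return dict(totals)


# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
]

print(records_shuffled_without_combine(partitions))  # 7
print(records_shuffled_with_combine(partitions))     # 4
print(totals_per_key(partitions))
```

In this toy input the combine step reduces the shuffled records from 7 to 4. The saving in a real job depends on how often keys repeat within a partition.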

Update: 2019-01-13

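Amazon recently added a new section to the AWS Glue documentation that describes how to monitor and optimize Glue jobs. I think it is very useful for understanding where memory-related problems come from and how to avoid them.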

https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html

【Comments】:

【Answer 2】:

Open Glue > Jobs > Edit your job > Script libraries and job parameters (optional) > Job parameters near the bottom
Set the following > key: --conf value: spark.yarn.executor.memoryOverhead=1024
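
To see why a value such as 1024 can help, it may be useful to work through the container arithmetic. The sketch below relies on Spark's documented default overhead of max(384 MiB, 10% of executor memory) and on a bare number for this property being read as MiB. It assumes a 5 GiB executor heap, which is consistent with the 5.5 GB limit in the error message but is not confirmed anywhere in the thread. The function names are hypothetical.

```python
OVERHEAD_PERCENT = 10    # Spark default: 10% of executor memory
MIN_OVERHEAD_MIB = 384   # Spark default minimum overhead


def default_overhead_mib(executor_memory_mib):
    """Overhead Spark requests when memoryOverhead is not set explicitly."""
    return max(MIN_OVERHEAD_MIB, executor_memory_mib * OVERHEAD_PERCENT // 100)


def container_limit_mib(executor_memory_mib, overhead_mib=None):
    """YARN kills the container once usage exceeds heap plus overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


def glue_conf_parameter(overhead_mib):
    """Job parameter as entered in the console: key '--conf', value 'name=value'."""
    return {"--conf": "spark.yarn.executor.memoryOverhead={}".format(overhead_mib)}


# Assumption: 5 GiB executor heap (inferred from the error, not confirmed).
executor_mib = 5 * 1024

print(container_limit_mib(executor_mib) / 1024.0)        # 5.5
print(container_limit_mib(executor_mib, 1024) / 1024.0)  # 6.0
print(glue_conf_parameter(1024))
```

Under these assumptions the default limit is 5.5 GiB, and setting the overhead to 1024 raises it to 6 GiB, which is only slightly above the 5.6 GB reported in the error.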

【Comments】:

This seemed to help in my case, but I wanted to point out the following for upcoming Spark versions: WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
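
Based on that warning, a small helper could pick the property name from the Spark version. This is a minimal sketch with a hypothetical function name. Whether Glue honors the newer key through the --conf job parameter is not confirmed in this thread.

```python
def memory_overhead_key(spark_version):
    """Return the memory-overhead property name suited to a Spark version string."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (2, 3):
        # Name recommended by the deprecation warning for Spark 2.3 and later.
        return "spark.executor.memoryOverhead"
    # Older, YARN-specific name used before Spark 2.3.
    return "spark.yarn.executor.memoryOverhead"


print(memory_overhead_key("2.2.1"))  # spark.yarn.executor.memoryOverhead
print(memory_overhead_key("2.4.3"))  # spark.executor.memoryOverhead
```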