AWS Glue executors dying: answers

【Question title】: AWS Glue executors dying
【Posted】: 2019-07-27 00:35:45
【Question】:

I am using an AWS Glue DynamicFrame to read Parquet files from S3 like this:

sources = glue_context\
    .create_dynamic_frame\
    .from_options(connection_type="s3",
        connection_options={'paths': source_paths, 'recurse': True,
                            'groupFiles': 'inPartition'},
        format="parquet",
        transformation_ctx="source")

After this step I convert the DynamicFrame to a Spark DataFrame so that I can apply Spark-specific functions. Finally, I wrap the result in a DynamicFrame again and use it to write to Redshift.

What happens is that the executors keep dying with this error:

WARN TaskSetManager: Lost task in stage ExecutorLostFailure (executor exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead.

The same behavior also shows up in the AWS Glue metrics.

I have read this article, but it uses a JDBC source (mine is S3) and recommends using Glue DynamicFrames everywhere. Unfortunately, I really do need Spark DataFrames for some specific data transformations.

How can I stop the executors from dying because of memoryOverhead? Is this a Spark issue or a Glue issue?


Tags: python amazon-web-services apache-spark pyspark aws-glue

【Answer 1】:

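Tuning the Spark configuration may help.

I had assumed that with a managed tool like AWS Glue the platform would handle Spark parameters and I would not need to tune them. Unfortunately, that is not the case.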
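To make the error message concrete, here is a rough sketch of the arithmetic behind the YARN limit. On YARN, the container limit is the executor heap plus the memory overhead, and in Spark 2.x the overhead defaults to 10% of executor memory with a 384 MiB minimum. The 5 GiB executor heap is an assumption: it was picked because it reproduces the 5.5 GB limit in the log, and it is not stated in the question. The function names are illustrative only.

```python
MIB_PER_GIB = 1024
MIN_OVERHEAD_MIB = 384   # Spark 2.x floor for the YARN memory overhead
OVERHEAD_PERCENT = 10    # Spark 2.x default: 10% of executor memory


def default_overhead_mib(executor_memory_mib):
    """Default overhead: 10% of executor memory, but at least 384 MiB."""
    return max(MIN_OVERHEAD_MIB, executor_memory_mib * OVERHEAD_PERCENT // 100)


def container_limit_mib(executor_memory_mib, overhead_mib=None):
    """Physical memory YARN allows the container: heap plus overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


# Assumption: a 5 GiB executor heap (not given in the question).
executor_mib = 5 * MIB_PER_GIB

default_limit = container_limit_mib(executor_mib)
raised_limit = container_limit_mib(executor_mib, overhead_mib=2 * MIB_PER_GIB)

print(default_limit / MIB_PER_GIB)  # limit with the default overhead, in GiB
print(raised_limit / MIB_PER_GIB)   # limit with memoryOverhead=2g, in GiB
```

Under that assumption, raising the overhead to 2g gives each container about 1.5 GiB more off-heap room before YARN kills it.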

I pass parameters to the Glue job using this syntax:

Key: --conf
Value: spark.yarn.executor.memoryOverhead=2g
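If the job is created or started from code rather than from the console, the same key/value pair can be written as a plain mapping. This is a minimal sketch: the helper name `glue_spark_conf_argument` is hypothetical, and the boto3 usage in the comments is one possible way to submit it, not something the answer itself shows.

```python
def glue_spark_conf_argument(conf_key, conf_value):
    """Build a Glue job-parameter mapping that forwards one Spark setting."""
    return {"--conf": f"{conf_key}={conf_value}"}


job_arguments = glue_spark_conf_argument(
    "spark.yarn.executor.memoryOverhead", "2g"
)
print(job_arguments)

# With boto3 (not imported here), the mapping could be supplied as
# `DefaultArguments` to glue.create_job(...) or as `Arguments` to
# glue.start_job_run(JobName=..., Arguments=job_arguments).
```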

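In addition, introducing some caching logic and repartitioning helped keep the executors busy. The only problem with caching was OOM, which I solved by passing the spark.yarn.executor.memoryOverhead=2g parameter.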
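The answer does not show its caching or repartitioning code, so the following is only a sketch of what such a step might look like. The helper name and the partition count are hypothetical. It relies only on the `repartition` and `cache` methods of a Spark DataFrame.

```python
def repartition_and_cache(df, num_partitions):
    """Redistribute rows across `num_partitions` partitions, then mark the
    result for caching so later actions can reuse it.

    `df` is expected to be a pyspark.sql.DataFrame; only its
    `repartition` and `cache` methods are called.
    """
    return df.repartition(num_partitions).cache()


# Possible use inside the Glue job from the question (names from the
# question; 200 is an arbitrary, hypothetical partition count):
#
#   df = sources.toDF()                    # DynamicFrame -> Spark DataFrame
#   df = repartition_and_cache(df, 200)
#   ... Spark-specific transformations on df ...
#   result = DynamicFrame.fromDF(df, glue_context, "result")
```

Cached partitions take up executor memory, which is consistent with the answer's remark that caching caused OOM until the overhead was raised.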
