Spark handling small files (coalesce vs CombineFileInputFormat): answers

【Question title】: Spark handling small files (coalesce vs CombineFileInputFormat)
【Posted】: 2017-01-25 23:15:14
【Question】:

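I have a use case with millions of small files in S3 that need to be processed by Spark. I have two options for reducing the number of tasks: (1) use coalesce, or (2) extend CombineFileInputFormat.

But I'm not clear on the performance implications of the two, or on when to use one over the other.

Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, and I'm not sure how to pass a configurable maxSplitSize.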
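On the last point, one possible approach is sketched below in PySpark. It assumes the concrete `CombineTextInputFormat` subclass that ships with recent Hadoop versions is on the classpath, and that the split cap is supplied through the Hadoop property `mapreduce.input.fileinputformat.split.maxsize` rather than through a custom subclass. The helper names (`combine_conf`, `read_combined`, `main`), the bucket path, and the 128 MB cap are illustrative assumptions, not part of the original question.

```python
# Sketch only: running main() requires PySpark and the Hadoop input format
# classes on the classpath. The helpers and the S3 path are hypothetical.

SPLIT_MAXSIZE_KEY = "mapreduce.input.fileinputformat.split.maxsize"


def combine_conf(max_split_bytes):
    """Build the Hadoop configuration entries that cap each combined split."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    # Hadoop configuration values are passed as strings.
    return {SPLIT_MAXSIZE_KEY: str(max_split_bytes)}


def read_combined(sc, path, max_split_bytes):
    """Read text files under `path`, packing many small files into each split."""
    pairs = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=combine_conf(max_split_bytes),
    )
    # Keys are byte offsets; keep only the line contents.
    return pairs.values()


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="combine-small-files-sketch")
    try:
        # Hypothetical bucket and prefix; 128 MB is an assumed split cap.
        lines = read_combined(sc, "s3a://my-bucket/small-files/", 128 * 1024 * 1024)
        print(lines.getNumPartitions())
    finally:
        sc.stop()
```

Calling `main()` from a driver script would print the resulting partition count. Because the cap travels in the configuration, it can be changed per job without writing a new input format class.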

【Comments on the question】:

Tags: hadoop apache-spark emr amazon-emr

【Answer 1】:

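Another good option to consider in this case is SparkContext.wholeTextFiles(), which creates one record per file, with the file name as the key and the file content as the value; see the Spark documentation.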
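A minimal PySpark sketch of this approach follows. The helper names (`count_lines`, `lines_per_file`, `main`) and the S3 path are assumptions made for illustration. Note that each file is read into memory as a single value, so this is only suitable when individual files are small.

```python
# Sketch only: running main() requires PySpark. Helper names and the S3 path
# are hypothetical and not taken from the original answer.


def count_lines(record):
    """Map one (path, content) pair to (path, number of lines in the file)."""
    path, content = record
    return path, len(content.splitlines())


def lines_per_file(sc, path, min_partitions=None):
    """Return an RDD of (file path, line count), one record per input file."""
    # wholeTextFiles yields one (file path, whole file content) pair per file.
    return sc.wholeTextFiles(path, minPartitions=min_partitions).map(count_lines)


def main():
    # Imported here so count_lines can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="whole-text-files-sketch")
    try:
        for path, n in lines_per_file(sc, "s3a://my-bucket/small-files/").take(5):
            print(path, n)
    finally:
        sc.stop()
```

The `minPartitions` argument is only a hint for the minimum number of partitions; Spark decides how many files end up in each one.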

【Comments】: