Spark: RDD.saveAsTextFile when using a pair of (K, Collection[V]) - Answers

【Question title】: Spark: RDD.saveAsTextFile when using a pair of (K, Collection[V])
【Posted】: 2014-07-30 13:17:46
【Question】:

I have a dataset of employees and their leave records. Every record (of type EmployeeRecord) contains an EmpID (of type String) and other fields. I read the records from a file and then group them by EmpID:

val empRecords = sc.textFile(args(0))
....

val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)

At this point, `empsGroupedByEmpID` is of type RDD[(String, Iterable[EmployeeRecord])]. I wrap it in PairRDDFunctions:

val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)

Then I process the records according to the application's logic. In the end, I get an RDD of type RDD[Iterable[EmployeeRecord]]:

val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>

When I try to write the contents of this RDD to a text file using the available API:

finalRecords.saveAsTextFile("./path/to/save")

I find that every record in the file begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord per line. Is that not possible? Am I missing something?

【Comments】:

Tags: scala apache-spark

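【Solution 1】:

I have found the missing API. It is, well... flatMap! :-)

By using flatMap with identity, I can get rid of the Iterable wrapper and "unpack" its contents, like so: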

finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")

This solves the problem I had been running into.
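To see why this works: saveAsTextFile writes one line per RDD element, using the element's string representation. When each element is an Iterable, that representation is the whole collection, hence the ArrayBuffer(...) prefix. The sketch below imitates this with plain Scala collections instead of Spark, so it runs without a cluster. The `EmployeeRecord` fields, the sample data and the `FlattenSketch` object are assumptions made up for illustration; they are not from the original question.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the asker's EmployeeRecord type.
case class EmployeeRecord(empId: String, leaveDays: Int)

object FlattenSketch {
  // Local stand-in for RDD[Iterable[EmployeeRecord]]: one collection per employee.
  val grouped: Seq[Iterable[EmployeeRecord]] = Seq(
    ArrayBuffer(EmployeeRecord("E1", 3), EmployeeRecord("E1", 5)),
    ArrayBuffer(EmployeeRecord("E2", 1))
  )

  // Mimics writing each element via toString: one line per collection.
  val linesBefore: Seq[String] = grouped.map(_.toString)

  // flatMap(identity) unpacks each collection, so every record becomes its own element.
  val linesAfter: Seq[String] = grouped.flatMap(identity).map(_.toString)

  def main(args: Array[String]): Unit = {
    println("Before flattening:")
    linesBefore.foreach(println)
    println("After flattening:")
    linesAfter.foreach(println)
  }
}
```

Here the two grouped collections turn into three lines after flattening, one per record. If a different line format is needed, overriding `toString` on the record type (or mapping each record to a String before saving) would be one way to control it.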

I also found this post, which suggests the same thing. I wish I had seen it sooner.

【Comments】: