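【Posted】: 2015-09-02 05:17:49
【Question】:
My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of the columns without hard-coding the column numbers, as in .map(x => (x(0),x(3),x(6)...))?
This is what I have tried so far (it works):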
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
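However, I need far more than a couple of columns and would like to avoid hard-coding them.
This is what I have tried so far (I expected these to be better, but they fail):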
// Attempt 1
val colIndices = List(0,3,6,10,13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found the following question and tried its approach, but it does not work because I am dealing with an RDD rather than an array: How can I select a non-sequential subset elements from an array using Scala and Spark?
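For reference, the attempts above fail for two different reasons: Attempt 1 passes the whole index list to a row's apply, which expects a single Int, and Attempts 2 to 4 apply the indices to peopleTups (the RDD itself) rather than to each row. A minimal sketch of doing the lookup per row follows. It assumes every row has at least as many fields as the largest index, uses a plain Seq[String] as a stand-in for the RDD so that it runs without Spark, and the names SelectColumns and selectColumns are hypothetical, not part of any library.

```scala
object SelectColumns {
  // Column indices to keep, taken from the question.
  val colIndices: List[Int] = List(0, 3, 6, 10, 13)

  // Hypothetical helper: look up each requested index in one already-split row.
  // Throws ArrayIndexOutOfBoundsException if the row is shorter than an index.
  def selectColumns(row: Array[String], indices: Seq[Int]): Seq[String] =
    indices.map(i => row(i))

  def main(args: Array[String]): Unit = {
    // Stand-in for the RDD of raw lines (assumption): 14 comma-separated fields each.
    val people: Seq[String] = Seq(
      (0 to 13).map(i => s"a$i").mkString(","),
      (0 to 13).map(i => s"b$i").mkString(",")
    )

    // Split each line, then select the columns inside the per-row function.
    val selected = people.map(_.split(",")).map(row => selectColumns(row, colIndices))

    selected.foreach(cols => println(cols.mkString(",")))
  }
}
```

With a real RDD, the same per-row function should be usable as people.map(_.split(",")).map(row => selectColumns(row, colIndices)), which would give an RDD[Seq[String]] rather than an RDD of tuples. A sequence is likely the more practical result type here, since Scala tuples are limited to 22 elements.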
【Comments】:
Tags: scala apache-spark rdd