这是一个有点令人困惑的问题。如果您的数据已经在Array[(String, Int)] 集合中(大概在驱动程序的collect() 之后),那么您不需要使用任何RDD 转换。事实上,您可以使用 fold*() 运行一个绝妙的技巧来获取集合的平均值:
val average = arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } / arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }
有点啰嗦,但它基本上汇总了分子中的字符总数和分母中的单词数。运行您的示例,我看到以下内容:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val average = ...
average: Double = 3.111111111111111
如果您的 (String, Int) 元组分布在 RDD[(String, Int)] 中,则可以使用 accumulators 轻松解决此问题:
val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
chars += count; words += count / word.length
}
val average = chars.value / words.value
在上面的示例上运行时(放置在RDD 中),我看到以下内容:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0
scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0
scala> wordsRDD.foreach { case (word: String, count: Int) =>
| chars += count; words += count / word.length
| }
...
scala> val average = chars.value / words.value
average: Double = 3.111111111111111