[Question title]: How to perform Standard Deviation and Mean operations on a Java Spark RDD?
[Posted]: 2016-09-27 08:19:43
[Question description]:

I have a JavaRDD that looks like this:

[
[A,8]
[B,3]
[C,5]
[A,2]
[B,8]
...
...
]

I want the result to be the mean of the values for each key:

[
[A,5]
[B,5.5]
[C,5]
]

How can I do this using only Java RDDs? P.S.: I want to avoid a groupBy operation, which is why I am not using DataFrames.

[Question comments]:

Tags: java apache-spark rdd bigdata


[Solution 1]:

Here you go:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.StatCounter;
import scala.Tuple2;
import scala.Tuple3;

import java.util.Arrays;
import java.util.List;

public class AggregateByKeyStatCounter {

  public static void main(String[] args) {

    SparkConf conf = new SparkConf().setAppName("AggregateByKeyStatCounter").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Tuple2<String, Integer>> myList = Arrays.asList(new Tuple2<>("A", 8), new Tuple2<>("B", 3), new Tuple2<>("C", 5),
            new Tuple2<>("A", 2), new Tuple2<>("B", 8));

    JavaRDD<Tuple2<String, Integer>> data = sc.parallelize(myList);
    JavaPairRDD<String, Integer> pairs = JavaPairRDD.fromJavaRDD(data);

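    /* aggregateByKey folds each key's values into a StatCounter, so count, sum,
       min, max and variance are also available besides stdev and mean.
       Each output tuple is (key, stdev, mean); stdev() is the population
       standard deviation (use sampleStdev() for the sample version). */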
    JavaRDD<Tuple3<String, Double, Double>> output = pairs
                         .aggregateByKey(
                          new StatCounter(), 
                          StatCounter::merge, 
                          StatCounter::merge)
                         .map(x -> new Tuple3<String, Double, Double>(x._1(), x._2().stdev(), x._2().mean()));

    output.collect().forEach(System.out::println);
    sc.close(); // release the local Spark context once the results are printed
  }

}
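
To show why a mergeable counter works with aggregateByKey, here is a minimal plain-Java sketch of a running mean and variance accumulator. It is not Spark's StatCounter source; the class name RunningStats and its methods are hypothetical, and it assumes the population standard deviation is wanted. The add method plays the role of the per-value function and merge the role of the function that combines partial results from different partitions.

```java
/** Hypothetical stand-in for a mergeable statistics accumulator; not Spark's StatCounter. */
final class RunningStats {
    private long n = 0;      // number of values seen so far
    private double mean = 0; // running mean
    private double m2 = 0;   // running sum of squared deviations from the mean

    /** Folds one value in (the role of the per-value function in aggregateByKey). */
    RunningStats add(double value) {
        double delta = value - mean;
        n++;
        mean += delta / n;
        m2 += delta * (value - mean);
        return this;
    }

    /** Combines two partial results (the role of the combiner in aggregateByKey). */
    RunningStats merge(RunningStats other) {
        if (other.n == 0) {
            return this;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
        return this;
    }

    long count() { return n; }

    double mean() { return mean; }

    /** Population standard deviation: divides by n, not n - 1. */
    double stdev() { return n == 0 ? Double.NaN : Math.sqrt(m2 / n); }

    public static void main(String[] args) {
        // Key "B" from the question (values 3 and 8), split across two pretend partitions.
        RunningStats left = new RunningStats().add(3);
        RunningStats right = new RunningStats().add(8);
        RunningStats b = left.merge(right);
        System.out.println("B: mean=" + b.mean() + ", stdev=" + b.stdev());
    }
}
```

Because merge gives the same answer regardless of how the values were split, each partition can build its own partial result and only those small objects need to be combined, which is what avoids a groupBy-style shuffle of all values.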

[Comments]:

    [Solution 2]:

    You can use reduceByKey to compute the sum and the count for each key, then divide the sum by the count, as shown below.

    import org.apache.spark.rdd.RDD

    // rdd is assumed to be an RDD[(String, Int)] of (key, value) pairs
    val means: RDD[(String, Double)] = rdd
     .map(x => (x._1, (x._2, 1))) // pair each value with a count of 1
     .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // per key: (sum, count)
     .map { case (k, v) => (k, v._1.toDouble / v._2) } // mean = sum / count, in floating point
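
Since the question asks for Java, the same (sum, count) idea can be checked without a cluster. The sketch below is plain Java (Java 9+ for List.of and Map.entry), not Spark code: Map.merge stands in for reduceByKey, and the class and method names are hypothetical. On a JavaPairRDD the same shape would presumably be expressed with mapValues, reduceByKey and a final mapValues.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java illustration of the (sum, count) reduction; no Spark involved. */
final class MeanByKeySketch {

    /** Returns the mean per key, using a {sum, count} pair as the intermediate value. */
    static Map<String, Double> meanByKey(List<Map.Entry<String, Integer>> rows) {
        Map<String, long[]> sumAndCount = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            // Map each value to {value, 1}, then add pairs component-wise per key.
            sumAndCount.merge(
                    row.getKey(),
                    new long[] {row.getValue().longValue(), 1L},
                    (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]});
        }
        Map<String, Double> means = new LinkedHashMap<>();
        // Cast to double before dividing so that 11 / 2 gives 5.5 rather than 5.
        sumAndCount.forEach((key, sc) -> means.put(key, (double) sc[0] / sc[1]));
        return means;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = List.of(
                Map.entry("A", 8), Map.entry("B", 3), Map.entry("C", 5),
                Map.entry("A", 2), Map.entry("B", 8));
        System.out.println(meanByKey(rows));
    }
}
```

The cast to double is the detail that matters: with integer operands the division would truncate, and key B would come out as 5 instead of the 5.5 the question expects.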
    

    [Comments]:
