【Posted】: 2017-07-04 02:10:43
【Question】:
I have the following POJO:
public class MyPojo {
    Date startDate;
    Double usageAmount;
    // ... other fields and accessors omitted
}
So I have a list of MyPojo objects that is passed as a parameter to a function:
public Map<Date, Double> getWeeklyCost(@NotNull List<MyPojo> reports) {
    JavaRDD<MyPojo> rdd = context.parallelize(reports);
    JavaPairRDD<Date, Double> result = rdd.mapToPair(
            (PairFunction<MyPojo, Date, Double>) x ->
                    new Tuple2<>(x.getStartDate(), x.getUsageAmount()))
            .reduceByKey((Function2<Double, Double, Double>) (x, y) -> x + y);
    return result.collectAsMap();
}
However, what I get back looks like this:
"2017-06-28T22:00:00.000+0000": 0.02916666,
"2017-06-29T16:00:00.000+0000": 0.02916666,
"2017-06-27T13:00:00.000+0000": 0.03888888,
"2017-06-26T05:00:00.000+0000": 0.05833332000000001,
"2017-06-28T21:00:00.000+0000": 0.03888888,
"2017-06-27T02:00:00.000+0000": 0.03888888,
"2017-06-28T03:00:00.000+0000": 0.07777776000000002,
"2017-06-28T20:00:00.000+0000": 0.01944444,
"2017-06-30T04:00:00.000+0000": 0.00972222,
"2017-06-28T02:00:00.000+0000": 0.05833332000000001,
"2017-06-29T21:00:00.000+0000": 0.03888888,
"2017-06-29T23:00:00.000+0000": 0.06805554000000001,
"2017-06-27T00:00:00.000+0000": 0.05833332000000001,
"2017-06-26T06:00:00.000+0000": 0.03888888,
"2017-06-28T01:00:00.000+0000": 0.09722220000000002,
"2017-06-29T22:00:00.000+0000": 0.01944444,
"2017-06-28T00:00:00.000+0000": 0.11666664000000003,
"2017-06-27T12:00:00.000+0000": 0.01944444,
"2017-06-26T11:00:00.000+0000": 0.01944444,
"2017-06-29T03:00:00.000+0000": 0.01944444,
"2017-06-26T04:00:00.000+0000": 0.07777776000000002,
"2017-06-27T19:00:00.000+0000": 0.01944444,
"2017-06-29T20:00:00.000+0000": 0.048611100000000004,
"2017-06-29T02:00:00.000+0000": 0.02916666,
"2017-06-29T15:00:00.000+0000": 0.01944444,
"2017-06-27T17:00:00.000+0000": 0.01944444,
"2017-06-29T14:00:00.000+0000": 0.02916666,
"2017-06-30T01:00:00.000+0000": 0.02916666,
"2017-06-29T00:00:00.000+0000": 0.01944444,
"2017-06-27T18:00:00.000+0000": 0.03888888,
"2017-06-26T03:00:00.000+0000": 0.07777776000000002,
"2017-06-28T05:00:00.000+0000": 0.05833332000000001,
"2017-06-29T13:00:00.000+0000": 0.01944444,
"2017-06-30T03:00:00.000+0000": 0.00972222,
"2017-06-27T11:00:00.000+0000": 0.01944444,
"2017-06-28T04:00:00.000+0000": 0.05833332000000001,
"2017-06-29T12:00:00.000+0000": 0.00972222,
"2017-06-30T02:00:00.000+0000": 0.06805554000000001,
"2017-06-27T23:00:00.000+0000": 0.09722220000000002,
"2017-06-27T16:00:00.000+0000": 0.01944444,
"2017-06-26T15:00:00.000+0000": 0.01944444,
"2017-06-29T06:00:00.000+0000": 0.00972222,
"2017-06-30T07:00:00.000+0000": 0.00138889,
"2017-06-30T00:00:00.000+0000": 0.01944444,
"2017-06-27T21:00:00.000+0000": 0.01944444,
"2017-06-26T02:00:00.000+0000": 0.07777776000000002,
"2017-06-29T19:00:00.000+0000": 0.00972222,
"2017-06-27T03:00:00.000+0000": 0.03888888,
"2017-06-27T20:00:00.000+0000": 0.01944444,
"2017-06-30T05:00:00.000+0000": 74.1458333,
"2017-06-29T18:00:00.000+0000": 0.00972222,
"2017-06-29T17:00:00.000+0000": 0.01944444,
"2017-06-28T23:00:00.000+0000": 0.00972222,
"2017-06-27T01:00:00.000+0000": 0.01944444,
"2017-06-27T22:00:00.000+0000": 0.05833332000000001
I want the result aggregated by day and sorted by date in descending order. For example:
"2017-06-28T03:00:00.000+0000": 0.07777776000000002,
"2017-06-28T20:00:00.000+0000": 0.01944444,
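fall on the same day, so their values (usageAmount) should be added together. I only care about the day, not the hour. How can I reduce or aggregate my RDD to get the desired result?
**Update** The answer must be a Spark RDD solution...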
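One possible direction, sketched here in plain JDK code so it can run without a cluster: the hourly keys arise because each distinct startDate is its own key, so truncating the key to the start of its day before summing would collapse all hours of a day into one entry. The names DailyCostSketch, truncateToDay, and sumByDay are hypothetical, and the day boundary is assumed to be midnight UTC (the sample output is printed with a +0000 offset). In the Spark pipeline from the question, the equivalent change would presumably be to build the pair as new Tuple2<>(truncateToDay(x.getStartDate()), x.getUsageAmount()) inside mapToPair and leave reduceByKey unchanged. Because collectAsMap returns an unordered map, the descending order is applied after collecting, here with a TreeMap.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class DailyCostSketch {

    // Hypothetical helper: drops the time of day, keeping midnight UTC of the same day.
    // Assumption: "day" means a UTC calendar day.
    static Date truncateToDay(Date date) {
        Instant midnightUtc = date.toInstant().truncatedTo(ChronoUnit.DAYS);
        return Date.from(midnightUtc);
    }

    // Rolls hourly totals up to daily totals, newest day first.
    // merge(...) with Double::sum plays the role that reduceByKey((x, y) -> x + y)
    // plays in the Spark version once the key has been truncated to the day.
    static Map<Date, Double> sumByDay(Map<Date, Double> hourlyTotals) {
        Map<Date, Double> daily = new TreeMap<Date, Double>(Comparator.<Date>reverseOrder());
        for (Map.Entry<Date, Double> entry : hourlyTotals.entrySet()) {
            daily.merge(truncateToDay(entry.getKey()), entry.getValue(), Double::sum);
        }
        return daily;
    }

    public static void main(String[] args) {
        // Three entries copied from the sample output in the question.
        Map<Date, Double> hourly = new HashMap<>();
        hourly.put(Date.from(Instant.parse("2017-06-28T03:00:00Z")), 0.07777776000000002);
        hourly.put(Date.from(Instant.parse("2017-06-28T20:00:00Z")), 0.01944444);
        hourly.put(Date.from(Instant.parse("2017-06-29T03:00:00Z")), 0.01944444);

        // Prints one line per day, 2017-06-29 before 2017-06-28.
        for (Map.Entry<Date, Double> entry : sumByDay(hourly).entrySet()) {
            System.out.println(entry.getKey().toInstant() + ": " + entry.getValue());
        }
    }
}
```

If the ordering should happen inside Spark instead, JavaPairRDD.sortByKey(false) followed by collect() would return an ordered list of tuples; this sketch sorts after collecting only because a map was the requested return type.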
【Comments】:
-
Can you use Spark SQL's DataFrames? That would be much easier to write and to understand later.
-
@JacekLaskowski The data comes from MongoDB....
-
No accepted answer?
Tags: java apache-spark apache-spark-sql