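【Question Title】: Reducing a List of Case Classes to a Count of the Case Classes
【Posted】: 2016-10-11 23:07:01
【Question】:

I currently have a grouped RDD of the form ((id, code), (list of events with keys id and code)). In the example below, the ID is 000406106-01, the code is 496, and each event is a Diagnostic case class. What I would like to get is an RDD of the form ((id, code), count of events). Essentially, I want to collapse the CompactBuffer of Diagnostic events into a count of the events. Any suggestions?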

    ID         CODE               EVENT1                                                     EVENT2
((000406106-01,496),CompactBuffer(Diagnostic(000406106-01,Sun Apr 16 02:24:00 UTC 2006,496), Diagnostic(000406106-01,Fri Jul 20 15:30:00 UTC 2012,496), Diagnostic(000406106-01,Tue Dec 23 17:00:00 UTC 2014,496), Diagnostic(000406106-01,Wed Jan 06 20:45:00 UTC 2010,496), Diagnostic(000406106-01,Fri Mar 04 16:30:00 UTC 2011,496), Diagnostic(000406106-01,Sun Aug 04 04:51:00 UTC 2013,496), Diagnostic(000406106-01,Fri Mar 11 16:00:00 UTC 2011,496), Diagnostic(000406106-01,Tue Jul 10 13:45:00 UTC 2012,496), Diagnostic(000406106-01,Wed Jun 15 20:00:00 UTC 2005,496), Diagnostic(000406106-01,Tue Dec 29 13:30:00 UTC 2009,496), Diagnostic(000406106-01,Fri Jul 13 13:30:00 UTC 2012,496), Diagnostic(000406106-01,Thu Jul 26 03:40:00 UTC 2007,496), Diagnostic(000406106-01,Mon Jun 13 14:45:00 UTC 2005,496), Diagnostic(000406106-01,Wed Dec 24 18:00:00 UTC 2014,496), Diagnostic(000406106-01,Thu Mar 03 15:45:00 UTC 2011,496), Diagnostic(000406106-01,Wed Dec 31 15:00:00 UTC 2014,496), Diagnostic(000406106-01,Sat Jul 26 04:39:00 UTC 2008,496), Diagnostic(000406106-01,Thu Dec 31 20:30:00 UTC 2009,496)))

What I'm looking for:

     ID        CODE COUNT
((000406106-01,496), 20)

Edit: For clarity, here is how the RDD above was generated:

val grpDiag = diagnostic.groupBy(diag => (diag.id, diag.code))

where diagnostic is the ungrouped RDD of the data above.

【Question Discussion】:

    Tags: scala apache-spark


    【Solution 1】:

    If the second element of the tuple is a CompactBuffer and you only need its length, mapValues with _.size should give you the desired result:

    rdd.mapValues(_.size)
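
To make the one-liner above concrete, here is a minimal local-mode sketch of the groupBy-then-mapValues approach. The field types of Diagnostic (all String here), the object and method names, the sample rows, and the local[2] master are assumptions for illustration; the question does not show the real case class definition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of the case class; the real field types are not shown in the question.
case class Diagnostic(id: String, date: String, code: String)

object GroupedCounts {
  // Replace each group's Iterable (a CompactBuffer at runtime) with its size.
  def countGrouped(
      grouped: RDD[((String, String), Iterable[Diagnostic])]
  ): RDD[((String, String), Int)] =
    grouped.mapValues(_.size)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("grouped-counts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows; the second id is hypothetical.
      val diagnostic = sc.parallelize(Seq(
        Diagnostic("000406106-01", "2006-04-16", "496"),
        Diagnostic("000406106-01", "2012-07-20", "496"),
        Diagnostic("000406106-02", "2014-12-23", "250")
      ))
      // Same grouping as in the question.
      val grpDiag = diagnostic.groupBy(diag => (diag.id, diag.code))
      countGrouped(grpDiag).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

mapValues leaves the keys untouched, so the result keeps the ((id, code), count) shape asked for in the question, with the count typed as Int.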
    

    In general, though, you should avoid groupBy when all you need is a count, and use reduceByKey as an alternative:

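    import org.apache.spark.rdd.RDD
    val diagnostics: RDD[Diagnostic] = ???  // the ungrouped RDD of events
    // Emit ((id, code), 1L) for every event, then sum the ones per key.
    diagnostics.map(d => ((d.id, d.code), 1L)).reduceByKey(_ + _)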
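
The reason reduceByKey is usually preferred for counting is that it sums partial counts inside each partition before the shuffle, whereas groupBy moves every event across the network only to measure the size of each group. Below is a minimal local-mode sketch of that variant. The Diagnostic field types, the object and method names, the sample rows, and the local[2] master are assumptions for illustration, not taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of the case class; the real field types are not shown in the question.
case class Diagnostic(id: String, date: String, code: String)

object ReducedCounts {
  // Count events per (id, code) without ever materialising the groups.
  def countPerIdAndCode(diagnostics: RDD[Diagnostic]): RDD[((String, String), Long)] =
    diagnostics
      .map(d => ((d.id, d.code), 1L)) // one marker per event
      .reduceByKey(_ + _)             // partial sums per partition, then merged per key

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("reduced-counts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows; the second id is hypothetical.
      val diagnostics = sc.parallelize(Seq(
        Diagnostic("000406106-01", "2006-04-16", "496"),
        Diagnostic("000406106-01", "2012-07-20", "496"),
        Diagnostic("000406106-02", "2014-12-23", "250")
      ))
      countPerIdAndCode(diagnostics).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that this version starts from the ungrouped RDD, so the groupBy step from the question is not needed at all, and the counts come back as Long rather than Int.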
    

    【Discussion】:
