【发布时间】:2014-01-23 09:09:29
【问题描述】:
目前正在处理来自 Mixpanel API 的大量 Json 数据。使用一个小数据集,这很容易,下面的代码运行得很好。但是,处理大型数据集需要相当长的时间,因此我们开始看到超时。
我的 Scala 优化技能相当差,所以我希望有人能展示一种更快的方法来处理大型数据集的以下内容。请务必解释为什么,因为这将有助于我自己对 Scala 的理解。
val people = parse[mp.data.Segmentation](o)
val list = people.data.values.map(b =>
b._2.map(p =>
Map(
"id" -> p._1,
"activity" -> p._2.foldLeft(0)(_+_._2)
)
)
)
.flatten
.filter{ behavior => behavior("activity") != 0 }
.groupBy(o => o("id"))
.map{ case (k,v) => Map("id" -> k, "activity" -> v.map( o => o("activity").asInstanceOf[Int]).sum) }
还有Segmentation类:
case class Segmentation(
val legend_size: Int,
val data: Data
)
case class Data(
val series: List[String],
val values: Map[String, Map[String, Map[String, Int]]]
)
感谢您的帮助!
编辑:按要求提供样本数据
{"legend_size": 4, "data": {"series": ["2013-12-17", "2013-12-18", "2013-12-19", "2013-12-20", "2013-12-21", "2013-12-22", "2013-12-23", "2013-12-24", "2013-12-25", "2013-12-26", "2013-12-27", "2013-12-28", "2013-12-29", "2013-12-30", "2013-12-31", "2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"], "values": {"afef4ac12a21d5c4ef679c6507fe65cd": {"id:twitter.com:194436690": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 1, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:330103796": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 1, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:216664121": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 1, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:414117608": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 1, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}}}}}
为了回答 Millhouse 的问题,其目的是总结每个日期,以提供一个数字来描述每个 ID 的“活动”总量。 “ID”的格式为id:twitter.com:923842。
【问题讨论】:
-
您能否提供一些测试 JSON 数据的 sn-p,并解释其意图是什么(看起来它是对每个
id的activity值求和?)跨度> -
在大数据的情况下,可用堆空间不足会导致显着减速。使用标准的 Java 内存分析工具来找出答案。
-
@millhouse 示例数据已按要求添加
标签: performance scala optimization scala-collections