Dataframe Aggregation: Answers

【Question title】: Dataframe Aggregation
【Posted】: 2018-09-23 11:37:45
【Question description】:

I have a dataframe DF with the following structure:

ID, DateTime, Latitude, Longitude, otherArgs

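I would like to group my data by ID and by time window while keeping information about the location (for example, the mean of the grouped latitudes and the mean of the grouped longitudes).

I managed to get a new dataframe with the data grouped by ID and time using: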

DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))

However, doing this loses my location data.

What I am looking for is something like this, for example:

DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))

It should return only one row for each ID and time window.

Edit:

Sample input: DF : ID, DateTime, Latitude, Longitude, otherArgs

0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings

Sample output: DF : ID, window(DateTime), Latitude, Longitude

0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5

【Question discussion】:

Tags: scala apache-spark apache-spark-sql aggregation

【Solution 1】:

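Here is one way to do it: group by ID and the time window, then compute the means inside agg.

import org.apache.spark.sql.functions.{mean, window}
import org.apache.spark.sql.types.TimestampType
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
  // Convert the String column to a timestamp. The snippet as first posted cast to
  // DateType, which drops the time of day and places every row at midnight.
  .withColumn("DateTime", $"DateTime".cast(TimestampType))

// One row per (ID, 2-minute window); the means are computed within each group.
df.groupBy($"id", window($"DateTime", "2 minutes"))
  .agg(
    mean("Latitude").as("lat"),
    mean("Longitude").as("long")
  )
  .show(false)

Output as originally posted (produced with the DateType cast, hence the windows around midnight; with the TimestampType cast the window bounds follow the actual timestamps and depend on the session time zone):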

+---+---------------------------------------------+----+----+
|id |window                                       |lat |long|
+---+---------------------------------------------+----+----+
|1  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
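To make the grouping easier to reason about, here is a minimal sketch in plain Scala collections (no Spark) of what grouping by ID and a 2-minute tumbling window amounts to. The names WindowSketch, Fix and windowStart are made up for this illustration, the timestamps are read as UTC, and windows are assumed to be aligned to the Unix epoch, which is how Spark's window is documented to behave by default. It is not Spark's implementation, only the same idea on the sample rows.

```scala
import java.time.{Duration, LocalDateTime, ZoneOffset}

object WindowSketch {
  // Hypothetical row type mirroring the question's columns (otherArgs omitted).
  final case class Fix(id: Int, dateTime: String, latitude: Double, longitude: Double)

  // Hypothetical helper: start of the fixed-size window that contains `ts`.
  // Assumption: windows are aligned to the Unix epoch and `ts` is in UTC.
  def windowStart(ts: LocalDateTime, size: Duration): LocalDateTime = {
    val epochSeconds = ts.toEpochSecond(ZoneOffset.UTC)
    val startSeconds = epochSeconds - Math.floorMod(epochSeconds, size.getSeconds)
    LocalDateTime.ofEpochSecond(startSeconds, 0, ZoneOffset.UTC)
  }

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // The sample input from the question.
  val rows: Seq[Fix] = Seq(
    Fix(0, "2018-01-07T04:04:00", 25.0, 55.0),
    Fix(0, "2018-01-07T04:05:00", 26.0, 56.0),
    Fix(1, "2018-01-07T04:04:00", 26.0, 50.0),
    Fix(1, "2018-01-07T04:05:00", 27.0, 51.0)
  )

  val size: Duration = Duration.ofMinutes(2)

  // The grouping key is (ID, window start); the means are computed per group,
  // which is why they belong in agg(...) and not in groupBy(...).
  val aggregated: Map[(Int, LocalDateTime), (Double, Double)] =
    rows
      .groupBy(r => (r.id, windowStart(LocalDateTime.parse(r.dateTime), size)))
      .map { case (key, group) =>
        (key, (mean(group.map(_.latitude)), mean(group.map(_.longitude))))
      }
}
```

Under those assumptions both rows of each ID fall into the window starting at 2018-01-07T04:04, so the map holds one entry per ID with the mean latitude and longitude, matching the sample output in the question.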

【Discussion】:

【Solution 2】:

You should do the aggregation with the .agg() method; groupBy only takes the grouping keys.

Perhaps this is what you meant?

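// Needs org.apache.spark.sql.functions.{mean, window} and spark.implicits._ in scope.
// $"..." replaces the original 'ID symbol literals, which newer Scala versions deprecate.
DF
  .groupBy(
    $"ID",
    window($"DateTime", "2 minutes")
  )
  .agg(
    mean("latitude").as("latitudeMean"),
    mean("longitude").as("longitudeMean")
  )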

【Discussion】: