【Posted】: 2020-06-08 17:39:56
【Question】:
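I have a PySpark DataFrame like the sample below (the raw data has 1.5 records per day). It holds user data with start-time and end-time columns plus several demographic variables (id, age_group, county, etc.). Many consecutive records are only 1 second apart: one record's end_time is 1 second before the next record's start_time.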
+----+----------+-----+-------------------+-------------------+--------+
|id  |date      |group|start_time         |end_time           |duration|
+----+----------+-----+-------------------+-------------------+--------+
|78aa|2020-04-14|3    |2020-04-14 19:00:00|2020-04-14 19:23:59|24      |
|78aa|2020-04-14|3    |2020-04-14 19:24:00|2020-04-14 19:26:59|4       |
|78aa|2020-04-14|3    |2020-04-14 19:27:00|2020-04-14 19:35:59|8       |
|78aa|2020-04-14|3    |2020-04-14 19:36:00|2020-04-14 19:55:00|19      |
|25aa|2020-04-15|7    |2020-04-15 08:00:00|2020-04-15 08:02:59|3       |
|25aa|2020-04-15|7    |2020-04-15 11:03:00|2020-04-15 11:11:59|9       |
|25aa|2020-04-15|7    |2020-04-15 11:12:00|2020-04-15 11:45:59|34      |
|25aa|2020-04-15|7    |2020-04-15 11:46:00|2020-04-15 11:47:00|1       |
+----+----------+-----+-------------------+-------------------+--------+
What I tried: aggregating the data over the whole day
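from pyspark.sql.functions import first, sum as sum_  # alias avoids shadowing the built-in sum
# Aggregate per user and day: keep the first group, sum the durations.
df = df.groupBy("date", "id").agg(first("group"), sum_("duration"))\
    .toDF("date", "id", "group", "duration")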
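I also need to transform the DataFrame at the user level within each day, so that consecutive records (those only 1 second apart) are merged into a single row. How can I do this with PySpark? I do not want to convert the data to a pandas DataFrame, because pandas loads everything into the driver's memory and I would run into memory problems. This is the desired output: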
+----+----------+-----+-------------------+-------------------+--------+
|id  |date      |group|start_time         |end_time           |duration|
+----+----------+-----+-------------------+-------------------+--------+
|78aa|2020-04-14|3    |2020-04-14 19:00:00|2020-04-14 19:55:00|55      |
|25aa|2020-04-15|7    |2020-04-15 08:00:00|2020-04-15 08:02:59|3       |
|25aa|2020-04-15|7    |2020-04-15 11:00:00|2020-04-15 11:47:00|44      |
+----+----------+-----+-------------------+-------------------+--------+
【Discussion】:
Tags: apache-spark pyspark apache-spark-sql databricks