对于 Pandas 用户,几乎有一种常见的编程模式,使用 shift() + cumsum() 来设置组标签来识别与某些特定模式/条件匹配的连续行。使用 pyspark,我们可以使用窗口函数lag() + sum() 来做同样的事情并找到这个组标签(以下代码中的d2):
数据设置:
from pyspark.sql import functions as F, Window
>>> df.orderBy('timestamp').show()
+-------+----------+-----+
|trip-id| timestamp|speed|
+-------+----------+-----+
| 001|1538204192|44.55|
| 001|1538204193|47.20|
| 001|1538204194|42.14|
| 001|1538204195|39.20|
| 001|1538204196|35.30|
| 001|1538204197|32.22|
| 001|1538204198|34.80|
| 001|1538204199|37.10|
| 001|1538204221|55.30|
| 001|1538204222|57.20|
| 001|1538204223|54.60|
| 001|1538204224|52.15|
| 001|1538204225|49.27|
| 001|1538204226|47.89|
| 001|1538204227|50.57|
| 001|1538204228|53.72|
+-------+----------+-----+
>>> df.printSchema()
root
|-- trip-id: string (nullable = true)
|-- unix_timestamp: integer (nullable = true)
|-- speed: double (nullable = true)
设置两个Window Spec (w1, w2):
# Window spec used to find previous speed F.lag('speed').over(w1) and also do the cumsum() to find flag `d2`
w1 = Window.partitionBy('trip-id').orderBy('timestamp')
# Window spec used to find the minimal value of flag `d1` over the partition(`trip-id`,`d2`)
w2 = Window.partitionBy('trip-id', 'd2').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
三个标志(d1、d2、d3):
-
d1 :标识前一个速度是否大于当前速度的标志,如果为真则d1 = 0,否则为d1 = 1
-
d2 :标记连续行以使用相同的唯一编号进行速降
-
d3 : 标识分区上d1的最小值的标志('trip-id', 'd2'),只有d3 == 0的行才能属于speed-的组降低。这将用于过滤掉不相关的行
df_1 = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2))
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204192|44.55| 1| 1| 1|
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204198|34.80| 1| 3| 1|
| 001|1538204199|37.10| 1| 4| 1|
| 001|1538204221|55.30| 1| 5| 1|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
| 001|1538204227|50.57| 1| 7| 1|
| 001|1538204228|53.72| 1| 8| 1|
+-------+----------+-----+---+---+---+
删除不关心的行:
df_1 = df_1.where('d3 == 0')
>>> df_1.orderBy('timestamp').show()
+-------+----------+-----+---+---+---+
|trip-id| timestamp|speed| d1| d2| d3|
+-------+----------+-----+---+---+---+
| 001|1538204193|47.20| 1| 2| 0|
| 001|1538204194|42.14| 0| 2| 0|
| 001|1538204195|39.20| 0| 2| 0|
| 001|1538204196|35.30| 0| 2| 0|
| 001|1538204197|32.22| 0| 2| 0|
| 001|1538204222|57.20| 1| 6| 0|
| 001|1538204223|54.60| 0| 6| 0|
| 001|1538204224|52.15| 0| 6| 0|
| 001|1538204225|49.27| 0| 6| 0|
| 001|1538204226|47.89| 0| 6| 0|
+-------+----------+-----+---+---+---+
最后一步:
现在对于df_1,按trip-id和d2分组,找到F.struct('timestamp', 'speed')的最小值和最大值,这将返回组中的第一条和最后一条记录,从struct中选择相应的字段得到最终结果:
df_new = df_1.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
).select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)
>>> df_new.show()
+-------+---------------+-------------+-----------+---------+
|trip-id|start timestamp|end timestamp|start speed|end speed|
+-------+---------------+-------------+-----------+---------+
| 001| 1538204193| 1538204197| 47.20| 32.22|
| 001| 1538204222| 1538204226| 57.20| 47.89|
+-------+---------------+-------------+-----------+---------+
注意:去掉中间数据帧df_1,我们可以得到如下:
df_new = df.withColumn('d1', F.when(F.lag('speed').over(w1) > F.col('speed'), 0).otherwise(1))\
.withColumn('d2', F.sum('d1').over(w1)) \
.withColumn('d3', F.min('d1').over(w2)) \
.where('d3 == 0') \
.groupby('trip-id', 'd2').agg(
F.min(F.struct('timestamp', 'speed')).alias('start')
, F.max(F.struct('timestamp', 'speed')).alias('end')
)\
.select(
'trip-id'
, F.col('start.timestamp').alias('start timestamp')
, F.col('end.timestamp').alias('end timestamp')
, F.col('start.speed').alias('start speed')
, F.col('end.speed').alias('end speed')
)