Create a list by aggregation using conditions in pyspark (answers)

【Question title】: Create a list by aggregation using conditions in pyspark
【Posted】: 2020-12-21 14:05:53
【Question】:

Hi, I'm new to pyspark.

My DataFrame looks like this:

+--------------------+----------+--------------+--------------------+----------------------+-------+
|            cookieId|sessionSeq|sessionUserSeq|              time  |             keyword  |  code |
+--------------------+----------+--------------+--------------------+----------------------+-------+
|03bdc154-3261-0a9...|         4|             3|  2020-12-12 04:51  |          X-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             4|  2020-12-12 04:52  |          X-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             4|  2020-12-12 04:53  |                null  |   5027|
|03bdc154-3261-0a9...|         4|             7|  2020-12-12 04:54  |          x-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             9|  2020-12-12 04:55  |                bulb  |   null|
|017224a2-2d65-23e...|         8|             2|  2020-12-11 05:04  |          X-mas tree  |   null|
|017224a2-2d65-23e...|         8|             3|  2020-12-11 05:05  |    X-mas decoration  |   null|
|017224a2-2d65-23e...|         8|             3|  2020-12-11 05:06  |                null  |   5028|
|017224a2-2d65-23e...|         8|             8|  2020-12-11 05:07  |    X-mas decoration  |   null|
+--------------------+----------+--------------+--------------------+----------------------+-------+

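I want to build a list of keywords by grouping the DataFrame by "cookieId" and "code". The important point is that when a row has a value in the "code" column, "keyword_list" should contain only the keywords from rows with an earlier time than that row.

Expected output: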

+------------+-------------------------+-----------------------------------+
|      code  |              cookieId   |                     keyword_list  |
+------------+-------------------------+-----------------------------------+
|      5027  |   03bdc154-3261-0a9...  |         [X-mas tree, X-mas tree]  |
|      5028  |   017224a2-2d65-23e...  |   [X-mas tree, X-mas decoration]  |
+------------+-------------------------+-----------------------------------+

I've tried a lot of things but couldn't get the desired result. Please help!

【Comments on the question】:

Tags: apache-spark pyspark apache-spark-sql aggregate

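【Solution 1】:

Use first with ignorenulls=True over a window to propagate each code back onto the rows that precede it, then aggregate with collect_list.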

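from pyspark.sql import functions as F, Window

# For each row, look from the current row to the end of its cookieId
# partition (ordered by time) and take the first non-null code.
# Rows at or before a code row receive that code; rows after the
# last code in the partition stay null.
w = (
    Window.partitionBy('cookieId')
          .orderBy('time')
          .rowsBetween(Window.currentRow, Window.unboundedFollowing)
)

df2 = df.withColumn(
    'code',
    F.first('code', ignorenulls=True).over(w)
).dropna().groupBy(   # dropna() removes rows with a null code or keyword
    'code', 'cookieId'
).agg(
    F.collect_list('keyword').alias('keyword_list')
)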

df2.show(truncate=False)
+----+--------------------+------------------------------+
|code|cookieId            |keyword_list                  |
+----+--------------------+------------------------------+
|5027|03bdc154-3261-0a9...|[X-mas tree, X-mas tree]      |
|5028|017224a2-2d65-23e...|[X-mas tree, X-mas decoration]|
+----+--------------------+------------------------------+
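
To make it easier to see what the window step is doing, here is a rough plain-Python sketch of the same logic run on the sample rows from the question. It is not Spark code and not part of the original answer. The helper name keyword_lists is hypothetical, and the null filter only checks keyword and code, which matches dropna() here only because the other columns in the sample contain no nulls.

```python
from collections import defaultdict

# Sample rows from the question: (cookieId, time, keyword, code).
rows = [
    ("03bdc154-3261-0a9...", "2020-12-12 04:51", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:52", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:53", None, 5027),
    ("03bdc154-3261-0a9...", "2020-12-12 04:54", "x-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:55", "bulb", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:04", "X-mas tree", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:05", "X-mas decoration", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:06", None, 5028),
    ("017224a2-2d65-23e...", "2020-12-11 05:07", "X-mas decoration", None),
]


def keyword_lists(rows):
    """Hypothetical helper that imitates the window + groupBy steps."""
    # Equivalent of Window.partitionBy('cookieId')
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    result = defaultdict(list)
    for cookie_id, part in partitions.items():
        # Equivalent of .orderBy('time'); the time strings sort chronologically
        part.sort(key=lambda r: r[1])

        # Walk backwards so every row sees the first non-null code
        # at or after its own position (current row .. unbounded following).
        next_code = None
        filled = []
        for _, time, keyword, code in reversed(part):
            if code is not None:
                next_code = code
            filled.append((time, keyword, next_code))

        # Equivalent of dropna() + groupBy('code', 'cookieId') + collect_list
        for time, keyword, code in reversed(filled):
            if keyword is not None and code is not None:
                result[(code, cookie_id)].append(keyword)
    return dict(result)


for (code, cookie_id), keywords in keyword_lists(rows).items():
    print(code, cookie_id, keywords)
```

Rows that come after the code row ("x-mas tree" and "bulb" for the first cookie) never receive a code, so they are filtered out, which is the "only earlier times" condition from the question.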

【Comments】: