[Posted]: 2020-12-21 14:05:53
[Question]:
Hi, I'm new to PySpark.
My DataFrame looks like this:
+--------------------+----------+--------------+--------------------+----------------------+-------+
| cookieId|sessionSeq|sessionUserSeq| time | keyword | code |
+--------------------+----------+--------------+--------------------+----------------------+-------+
|03bdc154-3261-0a9...| 4| 3| 2020-12-12 04:51 | X-mas tree | null|
|03bdc154-3261-0a9...| 4| 4| 2020-12-12 04:52 | X-mas tree | null|
|03bdc154-3261-0a9...| 4| 4| 2020-12-12 04:53 | null | 5027|
|03bdc154-3261-0a9...| 4| 7| 2020-12-12 04:54 | x-mas tree | null|
|03bdc154-3261-0a9...| 4| 9| 2020-12-12 04:55 | bulb | null|
|017224a2-2d65-23e...| 8| 2| 2020-12-11 05:04 | X-mas tree | null|
|017224a2-2d65-23e...| 8| 3| 2020-12-11 05:05 | X-mas decoration | null|
|017224a2-2d65-23e...| 8| 3| 2020-12-11 05:06 | null | 5028|
|017224a2-2d65-23e...| 8| 8| 2020-12-11 05:07 | X-mas decoration | null|
+--------------------+----------+--------------+--------------------+----------------------+-------+
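I want to build a keyword list by grouping the DataFrame on "cookieId" and "code". The important point is that, when a row has a value in the "code" column, "keyword_list" should contain only the keywords from rows with the same "cookieId" whose "time" is earlier than that row's "time".
Expected output: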
+------+----------------------+--------------------------------+
| code | cookieId             | keyword_list                   |
+------+----------------------+--------------------------------+
| 5027 | 03bdc154-3261-0a9... | [X-mas tree, X-mas tree]       |
| 5028 | 017224a2-2d65-23e... | [X-mas tree, X-mas decoration] |
+------+----------------------+--------------------------------+
I have tried many approaches but could not get the desired result. Please help!
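To pin down the rule before reaching for Spark, here is a minimal plain-Python sketch of the intended semantics. It assumes that "earlier" means strictly earlier, that null keywords are skipped, and that duplicates are kept. The function name `keyword_lists` and the reduced row layout (cookieId, time, keyword, code) are illustrative only; the truncated cookie IDs are used verbatim as placeholder strings.

```python
from collections import defaultdict

# Sample rows copied from the table above, reduced to the four columns
# that matter here: (cookieId, time, keyword, code).
ROWS = [
    ("03bdc154-3261-0a9...", "2020-12-12 04:51", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:52", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:53", None, 5027),
    ("03bdc154-3261-0a9...", "2020-12-12 04:54", "x-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:55", "bulb", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:04", "X-mas tree", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:05", "X-mas decoration", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:06", None, 5028),
    ("017224a2-2d65-23e...", "2020-12-11 05:07", "X-mas decoration", None),
]


def keyword_lists(rows):
    """Map (code, cookieId) to the keywords seen strictly earlier for that cookieId."""
    by_cookie = defaultdict(list)
    for cookie_id, time, keyword, code in rows:
        by_cookie[cookie_id].append((time, keyword, code))

    result = {}
    for cookie_id, events in by_cookie.items():
        # "YYYY-MM-DD HH:MM" strings sort chronologically, so plain
        # string comparison is enough for this sample data.
        events.sort(key=lambda event: event[0])
        for code_time, _, code in events:
            if code is None:
                continue
            result[(code, cookie_id)] = [
                keyword
                for time, keyword, _ in events
                if keyword is not None and time < code_time
            ]
    return result


if __name__ == "__main__":
    for key, keywords in keyword_lists(ROWS).items():
        print(key, keywords)
```

Keywords that appear after the code row ("x-mas tree" and "bulb" for the first cookie) are excluded, which is what the expected output above asks for.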
[Discussion]:
Tags: apache-spark pyspark apache-spark-sql aggregate