【Question title】: Create a list by aggregation using conditions in pyspark
【Posted】: 2020-12-21 14:05:53
【Question】:

Hi, I'm new to pyspark.

My dataframe looks like this:

+--------------------+----------+--------------+--------------------+----------------------+-------+
|            cookieId|sessionSeq|sessionUserSeq|              time  |             keyword  |  code |
+--------------------+----------+--------------+--------------------+----------------------+-------+
|03bdc154-3261-0a9...|         4|             3|  2020-12-12 04:51  |          X-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             4|  2020-12-12 04:52  |          X-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             4|  2020-12-12 04:53  |                null  |   5027|
|03bdc154-3261-0a9...|         4|             7|  2020-12-12 04:54  |          x-mas tree  |   null|
|03bdc154-3261-0a9...|         4|             9|  2020-12-12 04:55  |                bulb  |   null|
|017224a2-2d65-23e...|         8|             2|  2020-12-11 05:04  |          X-mas tree  |   null|
|017224a2-2d65-23e...|         8|             3|  2020-12-11 05:05  |    X-mas decoration  |   null|
|017224a2-2d65-23e...|         8|             3|  2020-12-11 05:06  |                null  |   5028|
|017224a2-2d65-23e...|         8|             8|  2020-12-11 05:07  |    X-mas decoration  |   null|
+--------------------+----------+--------------+--------------------+----------------------+-------+

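I want to build a list of keywords by grouping the dataframe by "cookieId" and "code". The important point is that, when a row has a value in the "code" column, "keyword_list" should only contain keywords from rows of the same cookieId whose time is earlier than that row's time.

Expected output: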

+------+----------------------+--------------------------------+
| code | cookieId             | keyword_list                   |
+------+----------------------+--------------------------------+
| 5027 | 03bdc154-3261-0a9... | [X-mas tree, X-mas tree]       |
| 5028 | 017224a2-2d65-23e... | [X-mas tree, X-mas decoration] |
+------+----------------------+--------------------------------+

I have tried a lot of things but could not get the desired result. Please help me...!

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql aggregate


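    【Solution 1】:

    Use first with ignorenulls=True over a forward-looking window to give each row the next code for its cookieId, then aggregate the keywords with collect_list.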

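    from pyspark.sql import functions as F, Window
    
    # Frame: from the current row to the end of the cookie's time-ordered partition.
    w = (
        Window.partitionBy('cookieId')
              .orderBy('time')
              .rowsBetween(Window.currentRow, Window.unboundedFollowing)
    )
    
    df2 = (
        # Fill 'code' with the first non-null code at or after each row.
        df.withColumn('code', F.first('code', ignorenulls=True).over(w))
          # Drops the code row itself (null keyword) and rows after the last code (null code).
          .dropna()
          .groupBy('code', 'cookieId')
          .agg(F.collect_list('keyword').alias('keyword_list'))
    )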
    
    df2.show(truncate=False)
    +----+--------------------+------------------------------+
    |code|cookieId            |keyword_list                  |
    +----+--------------------+------------------------------+
    |5027|03bdc154-3261-0a9...|[X-mas tree, X-mas tree]      |
    |5028|017224a2-2d65-23e...|[X-mas tree, X-mas decoration]|
    +----+--------------------+------------------------------+
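
To make the window logic above easier to follow, here is a minimal plain-Python sketch of the same idea, with no Spark involved. It is only an illustration, not part of the original answer: the helper name keyword_lists and the tuple layout (cookieId, time, keyword, code) are assumptions made for this example, and the rows are copied from the sample data in the question. Walking each cookie's rows backwards in time plays the role of first(..., ignorenulls=True) over the frame from the current row to the end of the partition.

```python
from collections import defaultdict

# Sample rows copied from the question: (cookieId, time, keyword, code).
rows = [
    ("03bdc154-3261-0a9...", "2020-12-12 04:51", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:52", "X-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:53", None, 5027),
    ("03bdc154-3261-0a9...", "2020-12-12 04:54", "x-mas tree", None),
    ("03bdc154-3261-0a9...", "2020-12-12 04:55", "bulb", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:04", "X-mas tree", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:05", "X-mas decoration", None),
    ("017224a2-2d65-23e...", "2020-12-11 05:06", None, 5028),
    ("017224a2-2d65-23e...", "2020-12-11 05:07", "X-mas decoration", None),
]


def keyword_lists(rows):
    """Hypothetical helper: collect keywords under the next code of the same cookie."""
    # Equivalent of partitionBy('cookieId').
    by_cookie = defaultdict(list)
    for cookie_id, time, keyword, code in rows:
        by_cookie[cookie_id].append((time, keyword, code))

    result = defaultdict(list)
    for cookie_id, events in by_cookie.items():
        # Equivalent of orderBy('time'); this timestamp format sorts correctly as text.
        events.sort(key=lambda e: e[0])

        # Walk backwards so each row receives the first non-null code
        # found at or after its own position.
        next_code = None
        filled = []
        for time, keyword, code in reversed(events):
            if code is not None:
                next_code = code
            filled.append((keyword, next_code))

        # Back in time order: skip rows with a null keyword or null code
        # (the dropna step), then group by (code, cookieId).
        for keyword, code in reversed(filled):
            if keyword is not None and code is not None:
                result[(code, cookie_id)].append(keyword)
    return dict(result)


result = keyword_lists(rows)
for (code, cookie_id), keywords in result.items():
    print(code, cookie_id, keywords)
```

Rows that come after the last code of a cookie (here "x-mas tree" and "bulb" for the first cookie) never receive a code, which is why they do not appear in any list.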
    
    

    【Comments】:
