【问题标题】:Cumulative Sum by Group Using DataFrame - Pyspark使用 DataFrame 分组的累积和 - Pyspark
【发布时间】:2020-03-10 01:51:39
【问题描述】:

我的代码:

df=temp_df.groupBy('date','id').count()
windowval = (Window.partitionBy('date','id').orderBy('date','id').rangeBetween(Window.unboundedPreceding, 0))
final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date','id').show()

请更正我的代码,我认为使用 Window(rangeBetween) 有问题。

谢谢,

DF:
+-------------------+------------------+-----+
|               date|                id|count|
+-------------------+------------------+-----+
|2007-11-04 00:00:00|                 5|    4|
|2007-11-05 00:00:00|                 5|    7|
|2007-11-06 00:00:00|                 5|    3|
|2007-11-06 00:00:00|                 8|    3|
|2007-11-07 00:00:00|                 5|    7|
|2007-11-08 00:00:00|                 5|    2|
|2007-11-08 00:00:00|                 8|    4|
+-------------------+------------------+-----+

Expected output:

+-------------------+------------------+-----+-------+
|               date|                id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00|                 5|    4|      4|
|2007-11-05 00:00:00|                 5|    7|     11|
|2007-11-06 00:00:00|                 5|    3|     14|
|2007-11-06 00:00:00|                 8|    3|      3|
|2007-11-07 00:00:00|                 5|    7|     21|
|2007-11-08 00:00:00|                 5|    2|     23|
|2007-11-08 00:00:00|                 8|    4|      7|
+-------------------+------------------+-----+-------+

My Output:

+-------------------+------------------+-----+-------+
|               date|                id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00|                 5|    4|      4|
|2007-11-05 00:00:00|                 5|    7|      7|
|2007-11-06 00:00:00|                 5|    3|      3|
|2007-11-06 00:00:00|                 8|    3|      3|
|2007-11-07 00:00:00|                 5|    7|      7|
|2007-11-08 00:00:00|                 5|    2|      2|
|2007-11-08 00:00:00|                 8|    4|      4|
+-------------------+------------------+-----+-------+


【问题讨论】:

    标签: apache-spark pyspark apache-spark-sql


    【解决方案1】:

    只需将您当前的代码更改为:

    df = temp_df.groupBy('date', 'id').count()
    
    windowval = Window.partitionBy('id').orderBy('date').rangeBetween(Window.unboundedPreceding, 0)
    
    final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date', 'id').show()
    

    当您按 id 和日期进行分区时,每个 (id, date) 组合都是唯一的。您需要按idorderBy date 进行分区

    【讨论】:

    • 谢谢@pissall,我在发布问题后也尝试了同样的方法。有效。感谢您的帮助
    猜你喜欢
    • 2018-02-07
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2021-06-01
    • 1970-01-01
    • 2022-10-21
    • 2019-05-04
    • 1970-01-01
    相关资源
    最近更新 更多