【Question title】: PySpark: not cumulative sum over partition
【Posted】: 2021-06-01 08:45:21
【Question description】:
I want to sum over a partition. I do not want a cumulative sum, but the total for each partition.
From:
| Category A | Category B | Value |
| 1 | 2 | 100 |
| 1 | 2 | 150 |
| 2 | 1 | 110 |
| 2 | 2 | 200 |
I want:
| Category A | Category B | Value | Sum |
| 1 | 2 | 100 | 250 |
| 1 | 2 | 150 | 250 |
| 2 | 1 | 110 | 110 |
| 2 | 2 | 200 | 200 |
With:
from pyspark.sql.functions import sum
from pyspark.sql.window import Window
windowSpec = Window.partitionBy(["Category A","Category B"])
df = df.withColumn('sum', sum(df.Value).over(windowSpec))
I do not get the result I want. I get the cumulative sum instead:
| Category A | Category B | Value | Sum |
| 1 | 2 | 100 | 100 |
| 1 | 2 | 150 | 250 |
| 2 | 1 | 110 | 110 |
| 2 | 2 | 200 | 200 |
How should I proceed? Thanks.
Tags: pyspark, sum, window, partition
【Solution 1】:
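When you define a window, you can also specify its range, that is, the frame of rows the aggregate covers.
Specifying the range (Window.unboundedPreceding, Window.unboundedFollowing) sums all rows within each partition, regardless of the order of the rows: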
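from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame spans the whole partition, so every row receives the partition total
windowSpec = Window.partitionBy(["Category A", "Category B"])\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('sum', F.sum(df.Value).over(windowSpec))\
    .orderBy("Category A", "Category B").show()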
This prints:
+----------+----------+-----+-----+
|Category A|Category B|Value|  sum|
+----------+----------+-----+-----+
|         1|         2|  100|250.0|
|         1|         2|  150|250.0|
|         2|         1|  110|110.0|
|         2|         2|  200|200.0|
+----------+----------+-----+-----+
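To make the role of the frame concrete, here is a minimal plain-Python model of the two behaviours. It is not Spark code. The helper names (partition_rows, partition_total, running_total) are hypothetical and exist only for this illustration. In Spark, a window that has an orderBy but no explicit frame defaults to a frame running from the start of the partition to the current row. That default would explain the cumulative result in the question if the original window was also ordered, which is an assumption because the posted code shows no orderBy.

```python
from collections import defaultdict
from itertools import accumulate

# Sample data from the question: (Category A, Category B, Value)
rows = [
    (1, 2, 100),
    (1, 2, 150),
    (2, 1, 110),
    (2, 2, 200),
]


def partition_rows(rows):
    """Group values by (Category A, Category B), similar to Window.partitionBy."""
    parts = defaultdict(list)
    for a, b, value in rows:
        parts[(a, b)].append(value)
    return parts


def partition_total(rows):
    """Frame = whole partition: every row gets the total of its partition."""
    out = []
    for (a, b), values in sorted(partition_rows(rows).items()):
        total = sum(values)
        out.extend((a, b, v, total) for v in values)
    return out


def running_total(rows):
    """Frame = partition start up to the current row, with rows sorted by Value.

    Simplification: rows with equal Value are accumulated one at a time here;
    this model does not try to reproduce how Spark treats tied rows.
    """
    out = []
    for (a, b), values in sorted(partition_rows(rows).items()):
        ordered = sorted(values)
        out.extend((a, b, v, s) for v, s in zip(ordered, accumulate(ordered)))
    return out


print(partition_total(rows))
print(running_total(rows))
```

On the sample data, partition_total gives 250 for both rows of partition (1, 2), matching the desired table, while running_total gives 100 and then 250, matching the cumulative table in the question.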