如何用 0 填充 null 并使用 spark `pivot` 累积`count`？答案

【问题标题】：How to fill null with 0 and accumulate `count` with spark `pivot`?如何用 0 填充 null 并使用 spark `pivot` 累积`count`？
【发布时间】：2021-05-22 01:14:55
【问题描述】：

我有一些产品销售数据如下：

product	date	money
A	2020-01	60
B	2020-03	80
A	2020-05	41
B	2020-08	50
B	2020-12	76
A	2020-11	76

我想按date 对数据进行分组，并按product 进行数据透视我的代码如下

     df.groupBy("date").pivot("product").agg(
      sum("money").as("month-sum"),
      sum(sum("money")).over(Window.orderBy("date").partitionBy("product")).as("cur-cumulative")
    ).orderBy("date").show()

结果是

|   date|A_month-sum|A_cur-cumulative|B_month-sum|B_cur-cumulative|
+-------+-----------+----------------+-----------+----------------+
|2020-01|         60|              60|       null|            null|
|2020-03|       null|            null|         80|             140|
|2020-05|         41|             181|       null|            null|
|2020-08|       null|            null|         50|             231|
|2020-11|         76|             307|       null|            null|
|2020-12|       null|            null|         76|             383|

我的期望是month-sum的null可以填0，cur-cumulative的null可以填最后一行的值，就像这样：

|   date|A_month-sum|A_cur-cumulative|B_month-sum|B_cur-cumulative|
+-------+-----------+----------------+-----------+----------------+
|2020-01|         60|              60|          0|               0|
|2020-03|          0|              60|         80|              80|
|2020-05|         41|             101|          0|              80|
|2020-08|          0|             101|         50|             130|
|2020-11|         76|             177|          0|             130|
|2020-12|          0|             177|         76|             206|
+-------+-----------+----------------+-----------+----------------+

有什么建议吗？提前致谢！

【问题讨论】：

标签： sql scala apache-spark apache-spark-sql pivot

【解决方案1】：

你可以在做累计和之前做一个.na.fill(0)：

import org.apache.spark.sql.expressions.Window

val df2 = df
    .groupBy("date")
    .pivot("product")
    .agg(sum("money"))

val df3 = df2
    .na.fill(0)
    .select(
        col("date") +: 
        df2.columns.tail.flatMap(x => 
            Seq(
                col(x).as(x + "_month-sum"),
                sum(x).over(Window.orderBy("date")).as(x + "_cur-cumulative")
            )
        ): _*
    )
    .orderBy("date")

df3.show
+-------+-----------+----------------+-----------+----------------+
|   date|A_month-sum|A_cur-cumulative|B_month-sum|B_cur-cumulative|
+-------+-----------+----------------+-----------+----------------+
|2020-01|       60.0|            60.0|        0.0|             0.0|
|2020-03|        0.0|            60.0|       80.0|            80.0|
|2020-05|       41.0|           101.0|        0.0|            80.0|
|2020-08|        0.0|           101.0|       50.0|           130.0|
|2020-11|       76.0|           177.0|        0.0|           130.0|
|2020-12|        0.0|           177.0|       76.0|           206.0|
+-------+-----------+----------------+-----------+----------------+

【讨论】：