Difference from the first value in a pivot in Spark SQL (answers)

【Question title】: Difference from first value in pivot Spark SQL
【Posted】: 2020-04-14 14:44:14
【Question description】:

I have the following data:

val df = Seq(
  ("Central", "Copy Paper", "Benjamin Ross", "$15.58", "$3.91", "126"),
  ("East", "Copy Paper", "Catherine Rose", "$12.21", "$0.08", "412"),
  ("West", "Copy Paper", "Patrick O'Brill", "$2,756.66", "$1,629.98", "490"),
  ("Central", "Business Envelopes", "John Britto", "$212.74", "$109.66", "745"),
  ("East", "Business Envelopes", "xyz", "$621", "$721", "812")
).toDF("Region", "Product", "Customer", "Sales", "Cost", "Autonumber")

// Register the DataFrame so the SQL query below can refer to it as test1.
df.createOrReplaceTempView("test1")

df.show()

+-------+------------------+---------------+---------+---------+----------+
| Region|           Product|       Customer|    Sales|     Cost|Autonumber|
+-------+------------------+---------------+---------+---------+----------+
|Central|        Copy Paper|  Benjamin Ross|   $15.58|    $3.91|       126|
|   East|        Copy Paper| Catherine Rose|   $12.21|    $0.08|       412|
|   West|        Copy Paper|Patrick O'Brill|$2,756.66|$1,629.98|       490|
|Central|Business Envelopes|    John Britto|  $212.74|  $109.66|       745|
|   East|Business Envelopes|            xyz|     $621|     $721|       812|
+-------+------------------+---------------+---------+---------+----------+

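As you can see, there is no West row for the product Business Envelopes. If there were, the result would not be null. Because pivoting on Region produces a null for the missing data, I would like that cell to be treated as 0, so that first(sum(Autonumber)) can be subtracted from it and a value comes out. Right now it returns null instead. If I could somehow get the Central value for each group inside the query, things would be much simpler.

I tried the following query: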

spark.sql("SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) - first(sum(Autonumber)) over ( partition by product order by product , region) as new  from test1 group by r , p order by p,r) test1 pivot (sum(new) for r in ('Central' Central , 'East' East, 'West' West))").show

This is the result I get:

+------------------+-------+-----+-----+
|                 p|Central| East| West|
+------------------+-------+-----+-----+
|Business Envelopes|    0.0| 67.0| null|
|        Copy Paper|    0.0|286.0|364.0|
+------------------+-------+-----+-----+

I would like it to look like this:

+------------------+-------+-----+------+
|                 p|Central| East|  West|
+------------------+-------+-----+------+
|Business Envelopes|       | 67.0|-745.0|
|        Copy Paper|       |286.0| 364.0|
+------------------+-------+-----+------+

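This is simply a pivot on Region using sum(Autonumber), with the first value (Central) then subtracted from each column.

Any suggestions on how to get -745 instead of null?

【Comments】:

Tags: apache-spark apache-spark-sql pivot pivot-table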

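【Solution 1】:

I don't think it can be done that way. Instead, I pivoted the dataset first and then subtracted the first value (Central) from the other columns, using coalesce to turn the missing cells into 0.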

spark.sql("select p , coalesce(Central , 0) - null as Central , coalesce(East,0) - coalesce(central,0) as East , coalesce(West , 0) - coalesce(central,0) as West from (SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) as new  from test group by r , p order by p) test pivot (sum(new) for r in ('Central' Central ,'East' East, 'West' West)))").show
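
For readers who prefer the DataFrame API, the same pivot-then-subtract idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: it assumes a local SparkSession, keeps just the three columns the query actually uses, and the names pivoted, central and result are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, sum}

// Assumption: a local session; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("pivot-diff").getOrCreate()
import spark.implicits._

// Only the columns needed for the calculation (same values as in the question).
val df = Seq(
  ("Central", "Copy Paper", "126"),
  ("East", "Copy Paper", "412"),
  ("West", "Copy Paper", "490"),
  ("Central", "Business Envelopes", "745"),
  ("East", "Business Envelopes", "812")
).toDF("Region", "Product", "Autonumber")

// Step 1: pivot first. Missing Region/Product combinations become null here.
val pivoted = df
  .groupBy($"Product")
  .pivot("Region", Seq("Central", "East", "West"))
  .agg(sum($"Autonumber".cast("double")))

// Step 2: replace nulls with 0 and subtract the Central value afterwards.
val central = coalesce($"Central", lit(0.0))
val result = pivoted.select(
  $"Product".as("p"),
  lit(null).cast("double").as("Central"), // left empty, as in the desired output
  (coalesce($"East", lit(0.0)) - central).as("East"),
  (coalesce($"West", lit(0.0)) - central).as("West")
).orderBy("p")

result.show()
```

With the sample data, East and West should come out as 67.0 and -745.0 for Business Envelopes and as 286.0 and 364.0 for Copy Paper, with Central left null as in the SQL version. Subtracting after the pivot is what makes this work: at that point the missing West cell exists as a null that coalesce can replace, whereas in the original query the West row for Business Envelopes never existed in the grouped data at all.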

【Comments】: