[Posted]: 2020-04-14 14:44:14
[Question]:
I have the following data:
val df = Seq(
  ("Central", "Copy Paper", "Benjamin Ross", "$15.58", "$3.91", "126"),
  ("East", "Copy Paper", "Catherine Rose", "$12.21", "$0.08", "412"),
  ("West", "Copy Paper", "Patrick O'Brill", "$2,756.66", "$1,629.98", "490"),
  ("Central", "Business Envelopes", "John Britto", "$212.74", "$109.66", "745"),
  ("East", "Business Envelopes", "xyz", "$621", "$721", "812")
).toDF("Region", "Product", "Customer", "Sales", "Cost", "Autonumber")
df.show()
+-------+------------------+---------------+---------+---------+----------+
| Region|           Product|       Customer|    Sales|     Cost|Autonumber|
+-------+------------------+---------------+---------+---------+----------+
|Central|        Copy Paper|  Benjamin Ross|   $15.58|    $3.91|       126|
|   East|        Copy Paper| Catherine Rose|   $12.21|    $0.08|       412|
|   West|        Copy Paper|Patrick O'Brill|$2,756.66|$1,629.98|       490|
|Central|Business Envelopes|    John Britto|  $212.74|  $109.66|       745|
|   East|Business Envelopes|            xyz|     $621|     $721|       812|
+-------+------------------+---------------+---------+---------+----------+
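As you can see, there is no West data for the Business Envelopes product. If there were, the result would not be null. Because the pivot has no row for that region, the cell comes back as null. I want it to be treated as 0 instead, so that first(sum(Autonumber)) can be subtracted from it and yield a value. If I could somehow get the Central value for each group inside the query, this would be much simpler.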
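The arithmetic behind this can be sketched in plain Scala, without Spark. This is only an illustration of the reasoning: the names `totals`, `cell`, `diffNullable`, and `diffZeroFilled` are hypothetical, the per-region totals are copied from the sample data, and `Option` stands in for a nullable pivot cell.

```scala
// Per-(product, region) totals of Autonumber, copied from the sample data.
// There is deliberately no entry for ("Business Envelopes", "West").
val totals: Map[(String, String), Double] = Map(
  ("Copy Paper", "Central") -> 126.0,
  ("Copy Paper", "East") -> 412.0,
  ("Copy Paper", "West") -> 490.0,
  ("Business Envelopes", "Central") -> 745.0,
  ("Business Envelopes", "East") -> 812.0
)

// A missing combination yields None, which models a null pivot cell.
def cell(product: String, region: String): Option[Double] =
  totals.get((product, region))

// Difference from the Central total; stays empty when the cell is missing.
def diffNullable(product: String, region: String): Option[Double] =
  for {
    value <- cell(product, region)
    base  <- cell(product, "Central")
  } yield value - base

// Same difference, but a missing cell is treated as 0 before subtracting.
def diffZeroFilled(product: String, region: String): Double =
  cell(product, region).getOrElse(0.0) - cell(product, "Central").getOrElse(0.0)

println(diffNullable("Business Envelopes", "West"))   // empty: no West row exists
println(diffZeroFilled("Business Envelopes", "West")) // 0 - 745
```

The point of the sketch is that the subtraction is never evaluated for a combination that has no row, so the zero has to be filled in before the difference is taken.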
I tried the following query:
spark.sql("""
  SELECT * FROM (
    SELECT region r, product AS p,
           SUM(Autonumber) - first(sum(Autonumber)) OVER (PARTITION BY product ORDER BY product, region) AS new
    FROM test1 GROUP BY r, p ORDER BY p, r
  ) test1
  PIVOT (sum(new) FOR r IN ('Central' Central, 'East' East, 'West' West))
""").show
This is the result I get:
+------------------+-------+-----+-----+
|                 p|Central| East| West|
+------------------+-------+-----+-----+
|Business Envelopes|    0.0| 67.0| null|
|        Copy Paper|    0.0|286.0|364.0|
+------------------+-------+-----+-----+
I would like it to look like this:
+------------------+-------+-----+------+
|                 p|Central| East|  West|
+------------------+-------+-----+------+
|Business Envelopes|       | 67.0|-745.0|
|        Copy Paper|       |286.0| 364.0|
+------------------+-------+-----+------+
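This is nothing more than pivoting sum(Autonumber) on Region and then subtracting the first (Central) value from each region's total.
Any suggestions on how to get -745 instead of null?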
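One possible direction, offered as a sketch rather than a confirmed answer, is to pivot the raw sums first and subtract afterwards, wrapping each pivoted column in `coalesce(..., 0)` so that a missing region counts as 0. The snippet assumes Spark 2.4 or later (for the SQL `PIVOT` clause), a local `SparkSession`, and that the data is registered as the temp view `test1`. The alias `src`, the column alias `a`, and the explicit cast of `Autonumber` to `DOUBLE` are choices made here for illustration. Only the columns needed for the calculation are kept.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing session is reused.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pivot-null-sketch")
  .getOrCreate()
import spark.implicits._

// Reduced copy of the sample data (Region, Product, Autonumber only).
val df = Seq(
  ("Central", "Copy Paper", "126"),
  ("East", "Copy Paper", "412"),
  ("West", "Copy Paper", "490"),
  ("Central", "Business Envelopes", "745"),
  ("East", "Business Envelopes", "812")
).toDF("Region", "Product", "Autonumber")
df.createOrReplaceTempView("test1")

// Pivot the per-region sums first, then subtract the Central total.
// coalesce turns a missing pivot cell (null) into 0 before the subtraction.
val result = spark.sql("""
  SELECT p,
         coalesce(East, 0) - coalesce(Central, 0) AS East,
         coalesce(West, 0) - coalesce(Central, 0) AS West
  FROM (
    SELECT Region AS r, Product AS p, CAST(Autonumber AS DOUBLE) AS a
    FROM test1
  ) src
  PIVOT (sum(a) FOR r IN ('Central' Central, 'East' East, 'West' West))
  ORDER BY p
""")

result.show()
```

Under these assumptions, Business Envelopes should come out as 67.0 for East and -745.0 for West, and Copy Paper as 286.0 and 364.0. The Central column is left out because its difference from itself is always 0. The subtraction moves outside the pivot because the window function in the original query only sees rows that exist, and there is no West row for Business Envelopes to operate on.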
[Comments]:
Tags: apache-spark apache-spark-sql pivot pivot-table