【Question Title】: Difference from first value in pivot Spark SQL
【Posted】: 2020-04-14 14:44:14
【Question Description】:

I have the following data:

val df = Seq(("Central", "Copy Paper", "Benjamin Ross", "$15.58", "$3.91", "126"),
             ("East", "Copy Paper", "Catherine Rose", "$12.21", "$0.08", "412"),
             ("West", "Copy Paper", "Patrick O'Brill", "$2,756.66", "$1,629.98", "490"),
             ("Central", "Business Envelopes", "John Britto", "$212.74", "$109.66", "745"),
             ("East", "Business Envelopes", "xyz", "$621", "$721", "812")).toDF("Region", "Product", "Customer", "Sales", "Cost", "Autonumber")

// Register the DataFrame under the name used by the SQL query below
df.createOrReplaceTempView("test1")

df.show()

+-------+------------------+---------------+---------+---------+----------+
| Region|           Product|       Customer|    Sales|     Cost|Autonumber|
+-------+------------------+---------------+---------+---------+----------+
|Central|        Copy Paper|  Benjamin Ross|   $15.58|    $3.91|       126|
|   East|        Copy Paper| Catherine Rose|   $12.21|    $0.08|       412|
|   West|        Copy Paper|Patrick O'Brill|$2,756.66|$1,629.98|       490|
|Central|Business Envelopes|    John Britto|  $212.74|  $109.66|       745|
|   East|Business Envelopes|            xyz|     $621|     $721|       812|
+-------+------------------+---------------+---------+---------+----------+

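As you can see, there is no data for West for the Business Envelopes product. If there were data for West, the result would not be null. Because the missing region produces a null when pivoting, I want that cell to be treated as 0, so that the subtraction against first(sum(Autonumber)) still yields a value. At the moment it returns null. If I could somehow get the data for Central within the group by query, things would be much simpler.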
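To illustrate why treating the missing cell as 0 matters, here is a minimal sketch of how Spark SQL handles a subtraction with a NULL operand, with and without `coalesce`. It assumes a local `SparkSession`; the column aliases `raw` and `fixed` are made up for this illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("null-arithmetic").getOrCreate()

// A NULL operand makes the whole subtraction NULL; coalesce substitutes 0 first.
val row = spark.sql(
  """SELECT CAST(NULL AS DOUBLE) - CAST(745 AS DOUBLE) AS raw,
    |       coalesce(CAST(NULL AS DOUBLE), 0) - CAST(745 AS DOUBLE) AS fixed""".stripMargin
).head()

val rawIsNull = row.isNullAt(0) // the unprotected subtraction
val fixed = row.getDouble(1)    // the coalesced subtraction
println(s"raw is null: $rawIsNull, fixed: $fixed")
```

The second expression is the behaviour being asked for: a missing West value counts as 0, so subtracting Central's 745 gives a number rather than null.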

I tried the following query:

spark.sql("SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) - first(sum(Autonumber)) over ( partition by product order by product , region) as new  from test1 group by r , p order by p,r) test1 pivot (sum(new) for r in ('Central' Central , 'East' East, 'West' West))").show

This is the result I get:

+------------------+-------+-----+-----+
|                 p|Central| East| West|
+------------------+-------+-----+-----+
|Business Envelopes|    0.0| 67.0| null|
|        Copy Paper|    0.0|286.0|364.0|
+------------------+-------+-----+-----+
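One way to see where the null comes from is to look at the grouped rows before they are pivoted. The sketch below is an illustration under assumptions, not the query from the question: it uses a reduced copy of the data with `Autonumber` stored as an integer, and the names `sample`, `sample_view`, and `total` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("missing-group").getOrCreate()
import spark.implicits._

// Reduced, hypothetical copy of the question's data (Autonumber as Int).
val sample = Seq(
  ("Central", "Copy Paper", 126),
  ("East", "Copy Paper", 412),
  ("West", "Copy Paper", 490),
  ("Central", "Business Envelopes", 745),
  ("East", "Business Envelopes", 812)
).toDF("Region", "Product", "Autonumber")
sample.createOrReplaceTempView("sample_view")

// GROUP BY only produces groups that actually occur in the data.
val grouped = spark.sql(
  "SELECT Region, Product, SUM(Autonumber) AS total FROM sample_view GROUP BY Region, Product")

// Count the rows for the combination that shows up as null after the pivot.
val westEnvelopes = grouped
  .filter($"Region" === "West" && $"Product" === "Business Envelopes")
  .count()

grouped.show()
println(s"Rows for (West, Business Envelopes): $westEnvelopes")
```

Since no (West, Business Envelopes) row exists at this stage, a subtraction applied before the pivot never runs for that cell, and the pivot fills it with null.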

I would like it to look like this:

+------------------+-------+-----+------+
|                 p|Central| East|  West|
+------------------+-------+-----+------+
|Business Envelopes|       | 67.0|-745.0|
|        Copy Paper|       |286.0| 364.0|
+------------------+-------+-----+------+

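This is simply a pivot on region using sum(Autonumber), followed by subtracting the first value (Central) from each region.

Any suggestions on how to get -745 instead of null?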

【Comments】:

    Tags: apache-spark apache-spark-sql pivot pivot-table


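    【Solution 1】:

    I don't think it is possible this way. Instead, I pivoted the dataset first and then subtracted the first value (Central) from each region.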

    spark.sql("select p , coalesce(Central , 0) - null as Central , coalesce(East,0) - coalesce(central,0) as East , coalesce(West , 0) - coalesce(central,0) as West from (SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) as new  from test1 group by r , p order by p) test pivot (sum(new) for r in ('Central' Central ,'East' East, 'West' West)))").show
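    The same pivot-then-subtract idea can also be written with the DataFrame API. This is a minimal sketch rather than the answer's exact query: it assumes a reduced copy of the data with `Autonumber` stored as an integer (so the sums are longs, not doubles), it omits the Central column instead of returning it as null, and the names `sample`, `pivoted`, and `result` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

// Assumption: a local session; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("pivot-diff").getOrCreate()
import spark.implicits._

// Reduced, hypothetical copy of the question's data (Autonumber as Int).
val sample = Seq(
  ("Central", "Copy Paper", 126),
  ("East", "Copy Paper", 412),
  ("West", "Copy Paper", 490),
  ("Central", "Business Envelopes", 745),
  ("East", "Business Envelopes", 812)
).toDF("Region", "Product", "Autonumber")

// Pivot first: one column per region, null where a product has no rows for it.
val pivoted = sample
  .groupBy("Product")
  .pivot("Region", Seq("Central", "East", "West"))
  .agg(sum("Autonumber"))

// Then subtract Central, replacing missing cells with 0 before the arithmetic.
val central = coalesce(col("Central"), lit(0))
val result = pivoted.select(
  col("Product"),
  (coalesce(col("East"), lit(0)) - central).as("East"),
  (coalesce(col("West"), lit(0)) - central).as("West")
).orderBy("Product")

result.show()
```

    Listing the pivot values explicitly guarantees that a West column exists even when no product has West data, so the `coalesce` always has a column to act on.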
    

    【Discussion】:
