【Posted】: 2018-11-06 17:16:30
【Question】:
I need to check a condition: if ReasonCode is "YES", use ProcessDate as one of the PARTITION BY columns; otherwise, leave it out.
The equivalent SQL query is:
SELECT PNum, SUM(SIAmt) OVER (PARTITION BY PNum,
ReasonCode ,
CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END
ORDER BY ProcessDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) SumAmt
from TABLE1
So far I have tried the query below, but I could not work out how to express the condition
"CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END" with Spark DataFrames:
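val df = inputDF
  .withColumn("SumAmt", sum("SIAmt").over(Window.partitionBy("PNum", "ReasonCode").orderBy("ProcessDate")))
  // select only after the window column exists; selecting "PNum" first would drop SIAmt, ReasonCode and ProcessDate
  .select("PNum", "SumAmt")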
Input data:
---------------------------------------
Pnum ReasonCode ProcessDate SIAmt
---------------------------------------
1 No 1/01/2016 200
1 No 2/01/2016 300
1 Yes 3/01/2016 -200
1 Yes 4/01/2016 200
---------------------------------------
Expected output:
---------------------------------------------
Pnum ReasonCode ProcessDate SIAmt SumAmt
---------------------------------------------
1 No 1/01/2016 200 200
1 No 2/01/2016 300 500
1 Yes 3/01/2016 -200 -200
1 Yes 4/01/2016 200 200
---------------------------------------------
Any suggestions or help on doing this with the Spark DataFrame API rather than a spark-sql query?
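For reference, one way the CASE WHEN partition key might be expressed in the DataFrame API is to pass a `when(...)` column to `Window.partitionBy`, since `when` without `otherwise` yields NULL for non-matching rows. The following is only a sketch under some assumptions: the object name `ConditionalPartitionSketch` and the helper `withConditionalSum` are hypothetical, the comparison is made case-insensitive with `upper` because the sample data contains "Yes" while the SQL compares against 'YES', and ProcessDate is kept as a string, which sorts correctly for this sample but would likely need converting to a date type for real data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, upper, when}

object ConditionalPartitionSketch {

  // Hypothetical helper: adds SumAmt as a running sum per
  // (PNum, ReasonCode, ProcessDate-if-ReasonCode-is-YES) partition.
  def withConditionalSum(inputDF: DataFrame): DataFrame = {
    // Mirrors CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END.
    // when(...) with no otherwise(...) returns NULL for rows that do not match.
    // Assumption: upper(...) is used so that "Yes" in the sample matches 'YES'.
    val conditionalDate =
      when(upper(col("ReasonCode")) === "YES", col("ProcessDate"))

    // With an orderBy and no explicit frame, the frame defaults to
    // RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, as in the SQL query.
    val w = Window
      .partitionBy(col("PNum"), col("ReasonCode"), conditionalDate)
      .orderBy(col("ProcessDate"))

    inputDF.withColumn("SumAmt", sum(col("SIAmt")).over(w))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's input table.
    val inputDF = Seq(
      (1, "No", "1/01/2016", 200),
      (1, "No", "2/01/2016", 300),
      (1, "Yes", "3/01/2016", -200),
      (1, "Yes", "4/01/2016", 200)
    ).toDF("PNum", "ReasonCode", "ProcessDate", "SIAmt")

    withConditionalSum(inputDF).orderBy("ProcessDate").show()
    spark.stop()
  }
}
```

On the sample rows, the "No" rows share one partition (NULL as the third key) and accumulate, while each "Yes" row falls into its own partition, which lines up with the expected SumAmt column in the question.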
【Comments】:
Tags: scala apache-spark apache-spark-sql