【发布时间】:2019-04-03 17:27:53
【问题描述】:
我有一个如下所示的数据框,并希望通过组合相邻的 rowa 来减少它们,即 previous.close = current.open
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-28"),
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-09-20")
)
所需的输出如下:
(("Ray","2018-09-01","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-20"))
到目前为止,我可以使用下面的 DF() 解决方案来压缩相邻的行。
df.alias("t1").join(df.alias("t2"),$"t1.name" === $"t2.name" and $"t1.close"=== $"t2.open" )
.select("t1.name","t1.open","t2.close")
.distinct.show(false)
|name |open |close |
+-----+----------+----------+
|Scott|2018-09-24|2018-09-30|
|Scott|2018-10-11|2018-09-20|
|Ray |2018-09-01|2018-09-15|
|Ray |2018-09-21|2018-09-30|
+-----+----------+----------+
我正在尝试使用类似的样式来获得单行,方法是给出 $"t1.close"=!= $"t2.open" 然后合并两者以获得最终结果。但是我得到了不需要的行,我无法正确过滤。如何做到这一点?
这篇文章与Spark SQL window function with complex condition 不同,后者将额外的日期列计算为新列。
【问题讨论】:
-
不完全是.. 如果您看到,日期在每个名称键中按排序顺序排列。当先前的收盘价等于当前的开盘价时,只需将它们合并即可。否则,独立行应包含在输出中。
标签: scala apache-spark apache-spark-sql