【Question Title】: Partition functions in Spark Scala
【Posted】: 2018-07-05 00:54:08
【Question Description】:

DF:

ID col1 . .....coln....  Date
1                        1991-01-11 11:03:46.0
1                        1991-01-11 11:03:46.0
1                        1991-02-22 12:05:58.0
1                        1991-02-22 12:05:58.0
1                        1991-02-22 12:05:58.0

I am creating a new column "identify" that numbers each (ID, Date) partition, so that I can order by "identify" and pick the topmost combination.

Expected DF:

ID col1 . .....coln....  Date                     identify
1                        1991-01-11 11:03:46.0    1
1                        1991-01-11 11:03:46.0    1
1                        1991-02-22 12:05:58.0    2
1                        1991-02-22 12:05:58.0    2
1                        1991-02-22 12:05:58.0    2

Code attempt 1:

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID col1 . .....coln....  Date                     identify
1                        1991-01-11 11:03:46.0    1
1                        1991-01-11 11:03:46.0    2
1                        1991-02-22 12:05:58.0    3
1                        1991-02-22 12:05:58.0    4
1                        1991-02-22 12:05:58.0    5

Code attempt 2:

 var window = Window.partitionBy("ID","DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID col1 . .....coln....  Date                     identify
1                        1991-01-11 11:03:46.0    1
1                        1991-01-11 11:03:46.0    2
1                        1991-02-22 12:05:58.0    1
1                        1991-02-22 12:05:58.0    2
1                        1991-02-22 12:05:58.0    3

Any suggestions on how to tweak the code to get the desired output would be helpful.

【Comments】:

    Tags: scala apache-spark apache-spark-sql


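    【Solution 1】:
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.dense_rank
    val window = Window.partitionBy("ID").orderBy("DATE")
    df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))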
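
To illustrate why dense_rank gives the expected numbering where row_number does not, here is a minimal sketch that computes row_number, rank, and dense_rank over the same window. It is not the answerer's code: the object name DenseRankSketch, the helper addIdentify, the local SparkSession settings, and the choice to store DATE as a string are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, rank, row_number}

// Hypothetical wrapper object, named only for this sketch.
object DenseRankSketch {

  // Hypothetical helper: adds the three ranking columns over one window.
  def addIdentify(df: DataFrame): DataFrame = {
    // One partition per ID, rows ordered by DATE inside each partition.
    val window = Window.partitionBy("ID").orderBy("DATE")
    df
      // row_number: always 1, 2, 3, ... even when DATE values tie.
      .withColumn("row_number", row_number().over(window))
      // rank: ties share a number, but the next number skips ahead.
      .withColumn("rank", rank().over(window))
      // dense_rank: ties share a number and the next number has no gap.
      .withColumn("identify", dense_rank().over(window))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("dense-rank-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the question. DATE is kept as a string here
    // (an assumption); this fixed-width format sorts chronologically.
    val df = Seq(
      (1, "1991-01-11 11:03:46.0"),
      (1, "1991-01-11 11:03:46.0"),
      (1, "1991-02-22 12:05:58.0"),
      (1, "1991-02-22 12:05:58.0"),
      (1, "1991-02-22 12:05:58.0")
    ).toDF("ID", "DATE")

    addIdentify(df).orderBy("ID", "row_number").show(false)

    spark.stop()
  }
}
```

With the five sample rows, row_number runs from 1 to 5, rank is 1, 1, 3, 3, 3, and identify (dense_rank) is 1, 1, 2, 2, 2, which matches the expected DF. The numbering comes from the window's own orderBy, so sorting the DataFrame beforehand does not change the identify values.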
    

    【Discussion】:
