【Question title】: Group contiguous blocks for aggregation in SQL (Redshift)
【Posted】: 2023-04-02 23:38:01
【Question description】:

I have a table like this:

    id time activity
 1:  1    1        a
 2:  1    2        a
 3:  1    3        b
 4:  1    4        b
 5:  1    5        a
 6:  2    1        a
 7:  2    2        b
 8:  2    3        b
 9:  2    4        b
10:  2    5        a
11:  2    6        a
12:  2    7        c
13:  2    8        c
14:  2    9        c

Within each id, I want to aggregate over contiguous blocks of activity. So basically I want a grouping column like this:

    id time activity grouping
 1:  1    1        a        1
 2:  1    2        a        1
 3:  1    3        b        2
 4:  1    4        b        2
 5:  1    5        a        3
 6:  2    1        a        1
 7:  2    2        b        2
 8:  2    3        b        2
 9:  2    4        b        2
10:  2    5        a        3
11:  2    6        a        3
12:  2    7        c        4
13:  2    8        c        4
14:  2    9        c        4

That way I can use aggregate functions and get something like this:

select id
, min(time) as min_time
, max(time) as max_time
, count(*) as n_activity
from A
group by id, grouping

   id min_time max_time n_activity
1:  1        1        2          2
2:  1        3        4          2
3:  1        5        5          1
4:  2        1        1          1
5:  2        2        4          3
6:  2        5        6          2
7:  2        7        9          3

How can I create the grouping column? My table is large, so I would like to avoid cursor functions if at all possible.


Some sample data:

create table A (id int, time int, activity varchar);
insert into A (id, time, activity)
values
(1,1,'a'),(1,2,'a'),(1,3,'b'),(1,4,'b'),(1,5,'a'),(2,1,'a'),
(2,2,'b'),(2,3,'b'),(2,4,'b'),(2,5,'a'),(2,6,'a'),(2,7,'c'),
(2,8,'c'),(2,9,'c');

【Question discussion】:

    Tags: sql amazon-redshift


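    【Solution 1】:

    Use lag to check whether the previous row has the same activity as the current row. Whenever it does not, add 1 to a running sum; the running sum is the group number.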

    select t.*,
           sum(case when prev_activity = activity then 0 else 1 end)
             over (partition by id order by time) as grp
    from (
      select t.*,
             lag(activity) over (partition by id order by time) as prev_activity
      from tbl t
    ) t
    

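    【Discussion】:

    • Works great, thank you! I had been using lag to pull the first row of each block, which got me halfway there, but I couldn't make the jump to the running sum.
    • This works on Postgres, but I can't seem to translate it to Redshift. Redshift requires a frame clause for the windowed sum, and I can't find one that works. ROWS UNBOUNDED PRECEDING puts the last row of one group together with the next group, and ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING leaves the first row of each group null. Any ideas?
    • ROWS UNBOUNDED PRECEDING looks to me like it should work fine.
    • Ah, yes. My problem was one of data quality (the same time associated with different activities). Thanks!
    【Solution 2】: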
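
As the comments above note, Redshift wants an explicit frame clause on the windowed sum. The following is a minimal sketch of Solution 1 with ROWS UNBOUNDED PRECEDING, applied to the question's table A and followed by the aggregation. The aliases src, t and g are illustrative, and the sketch assumes time is unique within each id; with duplicate times the row order inside the frame is not determined, which is the data-quality issue mentioned in the last comment.

```sql
-- Sketch only: lag + running sum with an explicit frame, then aggregate per block.
-- Assumption: (id, time) is unique in A.
select id
     , min(time) as min_time
     , max(time) as max_time
     , count(*)  as n_activity
from (
    select t.*
           -- add 1 whenever the activity changes (prev_activity is null on the
           -- first row of each id, so the comparison falls through to "else 1")
         , sum(case when prev_activity = activity then 0 else 1 end)
               over (partition by id
                     order by time
                     rows unbounded preceding) as grp
    from (
        select src.*
             , lag(activity) over (partition by id order by time) as prev_activity
        from A src
    ) t
) g
group by id, grp
order by id, min_time;
```

On the sample data this should return the seven rows shown in the question, one per contiguous block.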

    You should be able to do this using just the time value and a secondary sequence of numbers from ROW_NUMBER().

    SELECT
      *,
      time - ROW_NUMBER() OVER (PARTITION BY id, activity
                                    ORDER BY time        )   AS rownum
    FROM
      yourTable
    

    The (id, activity, rownum) fields give you a composite key for your groups.

    Then, if you really need a single-field identifier, you can wrap that up with DENSE_RANK() OVER (PARTITION BY id ORDER BY rownum, activity DESC).

        id time activity   rownum  (time-rownum) (composite key) (dense_rank)
    
     1:  1    1        a    1                  0         (1,a,0)       1
     2:  1    2        a    2                  0         (1,a,0)       1
     3:  1    3        b      1                2         (1,b,2)       2
     4:  1    4        b      2                2         (1,b,2)       2
     5:  1    5        a    3                  2         (1,a,2)       3
    
     6:  2    1        a    1                  0         (2,a,0)       1
     7:  2    2        b      1                1         (2,b,1)       2
     8:  2    3        b      2                1         (2,b,1)       2
     9:  2    4        b      3                1         (2,b,1)       2
    10:  2    5        a    2                  3         (2,a,3)       3
    11:  2    6        a    3                  3         (2,a,3)       3
    12:  2    7        c        1              6         (2,c,6)       4
    13:  2    8        c        2              6         (2,c,6)       4
    14:  2    9        c        3              6         (2,c,6)       4
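
One caveat about the DENSE_RANK wrapper: ordering by rownum, activity DESC always yields a distinct number per block, but the tie-break on activity is not guaranteed to follow time order. For a sequence such as b, a, b the second b block would be numbered before the a block. If chronological numbering matters, one option is to rank blocks by their first time value instead. This is a sketch, not part of the original answer; block_start and the aliases y, d and s are illustrative names, and it assumes time is unique within each id.

```sql
-- Sketch only: number the blocks in time order.
select id, time, activity
     , dense_rank() over (partition by id order by block_start) as grp
from (
    select d.*
           -- earliest time in each block, used as the ordering key
         , min(time) over (partition by id, activity, rownum) as block_start
    from (
        select y.*
             , time - row_number() over (partition by id, activity
                                             order by time) as rownum
        from yourTable y
    ) d
) s
order by id, time;
```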
    

    Applying the composite key to the aggregation example...

    SELECT
        id
      , min(time) as min_time
      , max(time) as max_time
      , count(*) as n_activity
    FROM
    (
      SELECT
        *,
        time - ROW_NUMBER() OVER (PARTITION BY id, activity
                                      ORDER BY time        )   AS rownum
      FROM
        yourTable
    )
      partitioned
    GROUP BY
      id, activity, rownum
    

    If time is ordered but not always consecutive, that becomes...

    SELECT
        id
      , min(time) as min_time
      , max(time) as max_time
      , count(*) as n_activity
    FROM
    (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id
                               ORDER BY time        )
        -
        ROW_NUMBER() OVER (PARTITION BY id, activity
                               ORDER BY time        )   AS rownum
      FROM
        yourTable
    )
      partitioned
    GROUP BY
      id, activity, rownum
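
To see why the difference of the two row numbers works, it can help to look at the intermediate columns. The sketch below only exposes them; rn_all and rn_activity are illustrative names that do not appear in the original answer.

```sql
-- Sketch only: show both counters and their difference for each row.
select id, time, activity
     , row_number() over (partition by id order by time)           as rn_all
     , row_number() over (partition by id, activity order by time) as rn_activity
     , row_number() over (partition by id order by time)
       - row_number() over (partition by id, activity order by time) as rownum
from yourTable
order by id, time;
-- For id = 1 in the sample data: rn_all is 1,2,3,4,5, rn_activity is 1,2,1,2,3,
-- so rownum is 0,0,2,2,2 and the keys are (a,0), (b,2), (a,2).
```

Inside a contiguous block both counters advance by one per row, so their difference stays constant. When a different activity intervenes, rn_all keeps advancing while rn_activity for the interrupted activity does not, so a later block of the same activity gets a larger difference.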
    

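    【Discussion】:

    • This use of the difference between two sequences is known as a solution to the "gaps and islands" problem, and it comes in handy in a lot of problem spaces!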