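【Question Title】:Coalesce overlapping time ranges in PostgreSQL
【Posted】:2016-07-14 21:05:50
【Question】:

I have a PostgreSQL (9.4) table that contains timestamp ranges and user IDs, and I need to collapse all overlapping ranges (with the same user ID) into a single record.

I tried a complicated set of CTEs to accomplish this, but our real table (40,000+ rows) has some edge cases that complicate things. I've concluded that I probably need a recursive CTE, but I haven't had any luck writing one.

Here is some code that creates a test table and populates it with data. This isn't the exact layout of our table, but it's close enough for an example.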

CREATE TABLE public.test
(
  id serial,
  sessionrange tstzrange,
  fk_user_id integer
);

insert into test (sessionrange, fk_user_id)
values 
('[2016-01-14 11:57:01-05,2016-01-14 12:06:59-05]', 1)
,('[2016-01-14 12:06:53-05,2016-01-14 12:17:28-05]', 1)
,('[2016-01-14 12:17:24-05,2016-01-14 12:21:56-05]', 1)
,('[2016-01-14 18:18:00-05,2016-01-14 18:42:09-05]', 2)
,('[2016-01-14 18:18:08-05,2016-01-14 18:18:15-05]', 1)
,('[2016-01-14 18:38:12-05,2016-01-14 18:48:20-05]', 1)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 1)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 1)
,('[2016-01-14 18:18:12-05,2016-01-14 18:18:20-05]', 3)
,('[2016-01-14 19:32:12-05,2016-01-14 23:18:20-05]', 3)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 4)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 2);

I found that I can do this to sort the sessions by start time:

select * from test order by fk_user_id, sessionrange

I can use that ordering with a window function to determine whether an individual record overlaps the previous one:

SELECT *, sessionrange && lag(sessionrange) OVER (PARTITION BY fk_user_id ORDER BY sessionrange)
FROM test
ORDER BY fk_user_id, sessionrange

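But this only detects whether the single previous record overlaps the current one (see the record where id = 6). I need to detect overlaps all the way back to the beginning of the partition.

After that, I need to group all the overlapping records together to find the start of the earliest session and the end of the last session to terminate.

I'm sure there's a way to do this that I'm overlooking. How can I collapse these overlapping records?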

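【Comments】:

    Tags: postgresql range recursive-query date-range recursive-cte


    【Answer 1】:

    It is relatively easy to merge overlapping ranges when they are elements of an array. For simplicity, the following function returns setof tstzrange. It compares each element only with the range accumulated so far, so the array must be sorted by range start: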

    create or replace function merge_ranges(tstzrange[])
    returns setof tstzrange language plpgsql as $$
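    declare
        t tstzrange;  -- current input range
        r tstzrange;  -- merged range accumulated so far
    begin
        -- the input array must be sorted by range start for this to work
        foreach t in array $1 loop
            if r && t then
                -- t overlaps the accumulated range: extend it
                r:= r + t;
            else
                -- gap found: emit the accumulated range (if any) and start a new one
                if r notnull then return next r;
                end if;
                r:= t;
            end if;
        end loop;
        -- emit the last accumulated range
        if r notnull then return next r;
        end if;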
    end $$;
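
    A quick way to see the function at work is to call it on a small literal array. This is only an illustrative sketch: the three ranges below are made-up values, not rows from the test table, and they are already listed in ascending order.

    ```sql
    -- Hypothetical input: the first two ranges overlap, the third stands alone.
    select merge_ranges(array[
        '[2016-01-14 10:00:00+00,2016-01-14 11:00:00+00]',
        '[2016-01-14 10:30:00+00,2016-01-14 12:00:00+00]',
        '[2016-01-14 13:00:00+00,2016-01-14 14:00:00+00]'
    ]::tstzrange[]);
    ```

    The call should return two rows: one range covering 10:00 to 12:00 and the untouched 13:00 to 14:00 range, displayed in the session's time zone.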
    

    Just aggregate the ranges per user, in sorted order, and apply the function:

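    -- order by inside array_agg guarantees the sorted input merge_ranges relies on
    select fk_user_id, merge_ranges(array_agg(sessionrange order by sessionrange))
    from test 
    group by 1
    order by 1, 2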
    
     fk_user_id |                    merge_ranges                     
    ------------+-----------------------------------------------------
              1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"]
              1 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"]
              1 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"]
              1 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"]
              2 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"]
              3 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"]
              3 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"]
              4 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"]
    (8 rows)    
    

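    Alternatively, the algorithm can be applied to the whole table in a single function loop. I'm not sure, but for larger data sets this method should be faster.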

    create or replace function merge_ranges_in_test()
    returns setof test language plpgsql as $$
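    declare
        curr test;  -- row being read
        prev test;  -- row holding the range merged so far
    begin
        -- read rows per user in range order, so overlapping ranges are adjacent
        for curr in
            select * 
            from test
            order by fk_user_id, sessionrange
        loop
            -- test prev.id rather than the whole row: a row is "not null"
            -- only when every one of its columns is non-null
            if prev.id is not null and prev.fk_user_id <> curr.fk_user_id then
                -- a new user starts: emit the range accumulated for the previous one
                return next prev;
                prev:= null;
            end if;
            if prev.sessionrange && curr.sessionrange then 
                -- overlap: extend the accumulated range
                prev.sessionrange:= prev.sessionrange + curr.sessionrange;
            else
                -- no overlap, or nothing accumulated yet: emit and start over
                if prev.id is not null then 
                    return next prev;
                end if;
                prev:= curr;
            end if;
        end loop;
        -- emit the last accumulated range, unless the table was empty
        if prev.id is not null then
            return next prev;
        end if;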
    end $$;
    

    Result:

    select *
    from merge_ranges_in_test();
    
     id |                    sessionrange                     | fk_user_id 
    ----+-----------------------------------------------------+------------
      1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] |          1
      5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] |          1
      7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] |          1
      6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] |          1
      4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] |          2
      9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] |          3
     10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] |          3
     11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] |          4
    (8 rows)
    

    The question is interesting. I tried to find a recursive solution, but the procedural approach seems to be the most natural and efficient.
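
    For comparison, the same merge can also be sketched without a function or recursion, using window functions (the "gaps and islands" technique). This is not one of the approaches above but a different technique, and it rests on assumptions: every range is non-empty, bounded, and has inclusive bounds like the test data. The names marked, grouped, starts_new_group, grp and merged_range are made up for the illustration.

    ```sql
    with marked as (
        select
            fk_user_id,
            sessionrange,
            -- true when this range starts after every earlier range of the
            -- same user has ended; null for the user's first row
            lower(sessionrange) > max(upper(sessionrange)) over (
                partition by fk_user_id
                order by sessionrange
                rows between unbounded preceding and 1 preceding
            ) as starts_new_group
        from test
    ),
    grouped as (
        select
            fk_user_id,
            sessionrange,
            -- running count of group starts gives each island its own number
            sum(case when starts_new_group then 1 else 0 end) over (
                partition by fk_user_id
                order by sessionrange
            ) as grp
        from marked
    )
    select
        fk_user_id,
        -- rebuild one closed range per island (assumes inclusive bounds)
        tstzrange(min(lower(sessionrange)), max(upper(sessionrange)), '[]') as merged_range
    from grouped
    group by fk_user_id, grp
    order by fk_user_id, merged_range;
    ```

    Unlike lag(), the running max(upper(...)) looks back over all earlier rows of the partition, so a short range nested inside a long earlier one does not start a new group. On the test data this should give the same eight merged ranges as the functions above, without the id column.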


    I finally found a recursive solution. The query deletes the overlapping rows and inserts their merged equivalents:

    with recursive cte (user_id, ids, range) as (
        select t1.fk_user_id, array[t1.id, t2.id], t1.sessionrange + t2.sessionrange
        from test t1
        join test t2
            on t1.fk_user_id = t2.fk_user_id 
            and t1.id < t2.id
            and t1.sessionrange && t2.sessionrange
    union all
        select user_id, ids || t.id, range + sessionrange
        from cte
        join test t
            on user_id = t.fk_user_id 
            and ids[cardinality(ids)] < t.id
            and range && t.sessionrange
        ),
    list as (
        select distinct on(id) id, range, user_id
        from cte, unnest(ids) id
        order by id, upper(range)- lower(range) desc
        ),
    deleted as (
        delete from test
        where id in (select id from list)
        )
    insert into test
    select distinct on (range) id, range, user_id
    from list
    order by range, id;
    

    Result:

    select *
    from test
    order by 3, 2;
    
     id |                    sessionrange                     | fk_user_id 
    ----+-----------------------------------------------------+------------
      1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] |          1
      5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] |          1
      7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] |          1
      6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] |          1
      4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] |          2
      9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] |          3
     10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] |          3
     11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] |          4
    (8 rows)
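
    Since this query rewrites the table in place, it may be worth checking afterwards that no overlaps remain. The self-join below is a minimal sketch of such a check; the aliases a, b, first_id and second_id are made up for the illustration.

    ```sql
    -- Each returned row is a pair of same-user ranges that still overlap.
    select a.fk_user_id, a.id as first_id, b.id as second_id
    from test a
    join test b
        on a.fk_user_id = b.fk_user_id
        and a.id < b.id
        and a.sessionrange && b.sessionrange;
    ```

    An empty result means every remaining range of a user is disjoint from that user's other ranges. Running the merge and this check inside a transaction (begin; ... rollback;) makes it possible to try both without changing the table permanently.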
    

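    【Comments】:

    • I ended up going with the first solution, since it needed no adaptation at all to my real schema. It's really easy to use and appears to be correct. I need to do some additional testing to be sure, but I think I'll be back later today to accept your answer. Thanks!
    • Managed to do some testing, and it looks like this does combine everything the way I wanted. Thanks!
    • Your question was a challenge for me. I couldn't stand not being able to do this without a function ;)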