【Question title】: Can the window function LAG reference the column whose value is being calculated?
【Posted】: 2015-12-17 16:01:45
【Question】:

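I need to calculate the value of some column X based on some other columns of the current record and on the value of X for the previous record (using some partition and order). Basically, I need to implement a query of the form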

SELECT <some fields>, 
  <some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...)> AS X
FROM <table>

This is not possible, because only existing columns can be used in window functions, so I am looking for a way to work around it.

Here is an example. I have a table of events. Each event has a type and a time_stamp.

create table event (id serial, type integer, time_stamp integer);

I want to find "duplicate" events (in order to skip them). By duplicate I mean the following. Order all events of a given type by time_stamp ascending. Then

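  1. the first event is not a duplicate
  2. all events that follow a non-duplicate and fall within some time frame after it (that is, their time_stamp is not greater than the time_stamp of the previous non-duplicate plus some constant TIMEFRAME) are duplicates
  3. the next event whose time_stamp exceeds that of the previous non-duplicate by more than TIMEFRAME is not a duplicate
  4. and so on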

For this data

insert into event (type, time_stamp) 
 values 
  (1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10), 
  (1,15), (1, 21), (2,13), 
  (1, 40);

and TIMEFRAME=10 the result should be

time_stamp | type | duplicate
-----------------------------
        1  |    1 | false
        2  |    1 | true     
        3  |    1 | true 
       10  |    1 | true 
       15  |    1 | false 
       21  |    1 | true
       40  |    1 | false
        2  |    2 | false
       10  |    2 | true
       13  |    2 | false
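
The rule above amounts to a single ordered pass per type that carries one piece of state: the time_stamp of the most recent non-duplicate. The following is a minimal Python sketch of that reading, not part of the original question. The function name mark_duplicates and the tuple layout are illustrative assumptions, and the boundary follows the prose rule (an event exactly TIMEFRAME after the last non-duplicate still counts as a duplicate).

```python
from collections import defaultdict

TIMEFRAME = 10  # the constant from the question


def mark_duplicates(events, timeframe=TIMEFRAME):
    """events: iterable of (type, time_stamp) pairs.

    Returns a list of (time_stamp, type, duplicate) ordered by type, time_stamp.
    """
    by_type = defaultdict(list)
    for type_, ts in events:
        by_type[type_].append(ts)

    result = []
    for type_ in sorted(by_type):
        last_kept = None  # time_stamp of the most recent non-duplicate
        for ts in sorted(by_type[type_]):
            # Rules 1 and 3: the first event, or one that is more than
            # `timeframe` past the last non-duplicate, is not a duplicate.
            if last_kept is None or ts > last_kept + timeframe:
                last_kept = ts
                result.append((ts, type_, False))
            else:
                # Rule 2: still inside the time frame of last_kept.
                result.append((ts, type_, True))
    return result


# Sample rows from the question's INSERT statement.
events = [(1, 1), (1, 2), (2, 2), (1, 3), (1, 10), (2, 10),
          (1, 15), (1, 21), (2, 13), (1, 40)]

for row in mark_duplicates(events):
    print(row)
```

The dependence of each row on last_kept, which is itself decided by earlier rows, is what a plain LAG over existing columns cannot express.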

I could calculate the value of the duplicate field from the current time_stamp and the time_stamp of the previous non-duplicate event, like this:

WITH evt AS (
  SELECT 
    time_stamp, 
    CASE WHEN 
      time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
    THEN 
      time_stamp
    ELSE
      LAG(current_non_dupl_time_stamp) OVER w
    END AS current_non_dupl_time_stamp
  FROM event
  WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
FROM evt;

But this does not work, because the field being calculated cannot be referenced inside LAG:

ERROR:  column "current_non_dupl_time_stamp" does not exist.

So the question is: can I rewrite this query to get the effect I need?

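【Question comments】:

  • I could not understand the time frame part, especially this bit: "the next event which time_stamp if greater than previous non duplicate by more than TIMEFRAME is not duplicate". Is the time frame a constant, a field, or a computed value?
  • TIMEFRAME is some constant. The rationale is that I want to skip an event if it happens within the given time frame after the previous event that was not skipped.
  • Your desired output contains time stamp 40, but your sample data set does not? Can you clarify?
  • You are right, that was a mistake.

Tags: postgresql gaps-and-islands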


【Solution 1】:

Naive recursive chain knitter:


        -- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
        SELECT e.type,e.time_stamp
        , ROW_NUMBER() OVER www as rn                   -- number the records
        , FIRST_VALUE(e.time_stamp) OVER www as fst     -- the "group leader"
        , EXISTS (SELECT * FROM event x
                WHERE x.type = e.type
                AND x.time_stamp < e.time_stamp) AS is_dup
        FROM event e
        WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
        ;

WITH RECURSIVE ttt AS (
        SELECT d0.*
        FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
    UNION ALL
        SELECT d1.type, d1.time_stamp, d1.rn
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
                 ELSE ttt.fst END AS fst   -- new "group leader"
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
                 ELSE True END AS is_dup
        FROM drag d1
        JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
        )
SELECT * FROM ttt
ORDER BY type, time_stamp
        ;

Result (note that the query above uses a time frame of 20 rather than the question's 10):


CREATE TABLE
INSERT 0 10
CREATE VIEW
 type | time_stamp | rn | fst | is_dup 
------+------------+----+-----+--------
    1 |          1 |  1 |   1 | f
    1 |          2 |  2 |   1 | t
    1 |          3 |  3 |   1 | t
    1 |         10 |  4 |   1 | t
    1 |         15 |  5 |   1 | t
    1 |         21 |  6 |   1 | t
    1 |         40 |  7 |  40 | f
    2 |          2 |  1 |   2 | f
    2 |         10 |  2 |   2 | t
    2 |         13 |  3 |   2 | t
(10 rows)

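【Comments】:

    【Solution 2】:

    An alternative to the recursive approach is a custom aggregate. Once you have mastered the technique of writing your own aggregates, creating the transition and final functions is easy and logical.

    State transition function: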

    create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
    returns int[] language plpgsql as $$
    begin
        if st is null or st[1] + timeframe <= time_stamp
        then 
            st[1] := time_stamp;
        end if;
        st[2] := time_stamp;
        return st;
    end $$;
    

    Final function:

    create or replace function is_duplicate_final(st int[])
    returns boolean language sql as $$
        select st[1] <> st[2];
    $$;
    

    Aggregate:

    create aggregate is_duplicate_agg(time_stamp int, timeframe int)
    (
        sfunc = is_duplicate,
        stype = int[],
        finalfunc = is_duplicate_final
    );
    

    Query:

    select *, is_duplicate_agg(time_stamp, 10) over w
    from event
    window w as (partition by type order by time_stamp asc)
    order by type, time_stamp;
    
     id | type | time_stamp | is_duplicate_agg 
    ----+------+------------+------------------
      1 |    1 |          1 | f
      2 |    1 |          2 | t
      4 |    1 |          3 | t
      5 |    1 |         10 | t
      7 |    1 |         15 | f
      8 |    1 |         21 | t
     10 |    1 |         40 | f
      3 |    2 |          2 | f
      6 |    2 |         10 | t
      9 |    2 |         13 | f
    (10 rows)   
    

    Read more in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
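
To see why the aggregate yields a per-row flag when it is used over an ordered window, it can help to replay the transition and final functions by hand. Below is a rough Python sketch that mirrors the two functions above. The helper running_is_duplicate is a hypothetical name, and the loop only approximates how PostgreSQL feeds an ordered window frame to the aggregate (it assumes no two rows in a partition share a time_stamp).

```python
def is_duplicate(st, time_stamp, timeframe):
    """Mirror of the transition function.

    st is None or a pair (last_non_duplicate, current_time_stamp).
    """
    if st is None or st[0] + timeframe <= time_stamp:
        # Start a new group: this row becomes the last non-duplicate.
        return (time_stamp, time_stamp)
    # Keep the group leader, remember the current row's time_stamp.
    return (st[0], time_stamp)


def is_duplicate_final(st):
    """Mirror of the final function: duplicate if the row is not its own leader."""
    return st[0] != st[1]


def running_is_duplicate(time_stamps, timeframe):
    """Apply the transition function row by row and finalize after each row."""
    st = None
    flags = []
    for ts in time_stamps:
        st = is_duplicate(st, ts, timeframe)
        flags.append(is_duplicate_final(st))
    return flags


print(running_is_duplicate([1, 2, 3, 10, 15, 21, 40], 10))
print(running_is_duplicate([2, 10, 13], 10))
```

One boundary detail follows from the `<=` in the transition function: an event exactly timeframe after the last non-duplicate starts a new group here, whereas the question's prose rule would still call it a duplicate. The sample data contains no such case, so both readings give the same result for it.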

    【Comments】:

      【Solution 3】:

      This feels more like a recursive problem than a window function problem. The following query produced the desired results:

      WITH RECURSIVE base(type, time_stamp) AS (
      
        -- 3. base of recursive query
        SELECT x.type, x.time_stamp, y.next_time_stamp
          FROM 
               -- 1. start with the initial records of each type   
               ( SELECT type, min(time_stamp) AS time_stamp
                   FROM event
                   GROUP BY type
               ) x
               LEFT JOIN LATERAL
               -- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
               ( SELECT MIN(time_stamp) next_time_stamp
                   FROM event
                   WHERE type = x.type
                     AND time_stamp > (x.time_stamp + 10)
               ) y ON true
      
        UNION ALL
      
        -- 4. recursive join, same logic as base
        SELECT e.type, e.time_stamp, z.next_time_stamp
          FROM event e
          JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
          LEFT JOIN LATERAL
          ( SELECT MIN(time_stamp) next_time_stamp
             FROM event
             WHERE type = e.type
               AND time_stamp > (e.time_stamp + 10)
          ) z ON true
      
      )
      
      -- The actual query:
      
      -- 5a. All records from base are not duplicates
      SELECT time_stamp, type, false
        FROM base
      
      UNION
      
      -- 5b. All records from event that are not in base are duplicates
      SELECT time_stamp, type, true
        FROM event
        WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base) 
      
      ORDER BY type, time_stamp
      

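      There are plenty of caveats here. It assumes that a given type has no repeated time_stamp values. Really, the joins should be based on a unique id rather than on type and time_stamp. I have not tested this much, but it may at least suggest an approach.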

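      This is my first attempt at a LATERAL join, so there may be a way to simplify it further. What I really wanted was a recursive CTE whose recursive part used MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in a CTE in that manner. It does seem that a lateral join can be used in the CTE, though.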
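
The idea behind the recursive query, jumping from each non-duplicate straight to the first event more than the time frame later, can be sketched outside SQL as well. The following Python snippet is an illustrative assumption rather than a translation of the query: non_duplicates is a hypothetical helper that works on one type's time stamps, already sorted ascending, and it assumes a non-negative time frame.

```python
from bisect import bisect_right


def non_duplicates(sorted_ts, timeframe):
    """Return the non-duplicate time stamps of one type.

    sorted_ts must be sorted ascending; timeframe must be >= 0.
    """
    kept = []
    i = 0
    while i < len(sorted_ts):
        kept.append(sorted_ts[i])
        # Index of the first time stamp strictly greater than
        # current + timeframe, the analogue of next_time_stamp.
        i = bisect_right(sorted_ts, sorted_ts[i] + timeframe)
    return kept


print(non_duplicates([1, 2, 3, 10, 15, 21, 40], 10))
print(non_duplicates([2, 10, 13], 10))
```

Every event of that type that is not in the returned list is a duplicate, which corresponds to step 5b of the query.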

      【Comments】:
