【Question title】: Aggregating a list of dates to start and end date
【Posted】: 2010-06-08 05:23:40
【Question】:

I have a list of dates and IDs, and I want to aggregate them into runs of consecutive dates within each ID.

For a table named "data" with the columns "testid" and "pulldate":

| A79 | 2010-06-02 |
| A79 | 2010-06-03 |
| A79 | 2010-06-04 |
| B72 | 2010-04-22 |
| B72 | 2010-06-03 |
| B72 | 2010-06-04 |
| C94 | 2010-04-09 |
| C94 | 2010-04-10 |
| C94 | 2010-04-11 |
| C94 | 2010-04-12 |
| C94 | 2010-04-13 |
| C94 | 2010-04-14 |
| C94 | 2010-06-02 |
| C94 | 2010-06-03 |
| C94 | 2010-06-04 |

I want to produce a table with the columns "testid", "group", "start_date", "end_date":

| A79 | 1 | 2010-06-02 | 2010-06-04 |
| B72 | 2 | 2010-04-22 | 2010-04-22 |
| B72 | 3 | 2010-06-03 | 2010-06-04 |
| C94 | 4 | 2010-04-09 | 2010-04-14 |
| C94 | 5 | 2010-06-02 | 2010-06-04 |

Here is the code I came up with:

-- "group" and "check" are reserved words in PostgreSQL, so they are quoted
SELECT t2.testid,
  t2."group",
  MIN(t2.pulldate) AS start_date,
  MAX(t2.pulldate) AS end_date
FROM(SELECT t1.pulldate,
  t1.testid,
  SUM(t1."check") OVER (ORDER BY t1.testid,t1.pulldate) AS "group"
FROM(SELECT data.pulldate,
  data.testid,
  CASE
  WHEN data.testid=LAG(data.testid,1) 
    OVER (ORDER BY data.testid,data.pulldate)
  AND data.pulldate=date (LAG(data.pulldate,1) 
    OVER (PARTITION BY data.testid 
    ORDER BY data.pulldate)) + integer '1'
  THEN 0
  ELSE 1
  END AS "check"
FROM data 
ORDER BY data.testid, data.pulldate) AS t1) AS t2
GROUP BY t2.testid,t2."group"
ORDER BY t2."group";

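I use the LAG window function to compare each row with the previous one, emitting 1 where a new group has to start and 0 otherwise. I then take a running sum of that column and finally aggregate by the combination of "group" and "testid".

Is there a better way to achieve this, or does this operation have a name?

I am using PostgreSQL 8.4.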
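
The flag-then-running-sum idea described above can be restated outside the database. The following is only a minimal sketch in Python, not the query itself: the function name `group_runs_by_flag` and the `SAMPLE_ROWS` list are hypothetical, and the rows are assumed to be `(testid, pulldate)` tuples with no duplicates.

```python
from datetime import date, timedelta
from itertools import groupby

# Hypothetical sample: a subset of the rows shown in the question.
SAMPLE_ROWS = [
    ("A79", date(2010, 6, 2)), ("A79", date(2010, 6, 3)), ("A79", date(2010, 6, 4)),
    ("B72", date(2010, 4, 22)), ("B72", date(2010, 6, 3)), ("B72", date(2010, 6, 4)),
]

def group_runs_by_flag(rows):
    """Return (testid, group, start_date, end_date) tuples, one per run."""
    labelled = []
    group = 0
    prev = None
    for testid, pulldate in sorted(rows):
        # The flag is 0 when this row continues the previous row's run
        # (same testid, exactly one day later) and 1 otherwise.
        continues = (
            prev is not None
            and prev[0] == testid
            and pulldate == prev[1] + timedelta(days=1)
        )
        group += 0 if continues else 1  # running sum of the flag
        labelled.append((testid, group, pulldate))
        prev = (testid, pulldate)

    # Aggregate each (testid, group) pair to its earliest and latest date.
    result = []
    for (testid, grp), items in groupby(labelled, key=lambda r: (r[0], r[1])):
        dates = [r[2] for r in items]
        result.append((testid, grp, min(dates), max(dates)))
    return result
```

Calling `group_runs_by_flag(SAMPLE_ROWS)` yields one tuple per run, numbered in the same order as the "group" column in the desired output.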

    Tags: postgresql


    【Solution 1】:

    Here is another approach:

    WITH TEMP_TAB AS (
    SELECT testid, pulldate,
           (pulldate + (row_number || ' days')::interval)::date AS dummydate
     FROM ( SELECT *, row_number() OVER () FROM
        ( SELECT * FROM data ORDER BY testid,pulldate DESC
        ) AS tab1 
     ) AS tab2 
    )
    SELECT * FROM (
      SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate 
        FROM TEMP_TAB GROUP BY testid,dummydate 
      )  AS tab3 
    ORDER BY testid, mindate
    

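    Warning: this strategy breaks if there are duplicate (testid, pulldate) pairs. In that case, apply a DISTINCT to those fields first.

    Explanation: the intermediate table has a dummydate column, obtained by adding to pulldate a number of days equal to the row number (in the ordered select). Its only meaning is that rows with the same dummydate belong to the same run of consecutive dates. For example, the intermediate results: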

    test=#  SELECT *, row_number() OVER  () FROM
    test-#   ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1;
     testid |  pulldate  | row_number
    --------+------------+------------
     A79    | 2010-06-04 |          1
     A79    | 2010-06-03 |          2
     A79    | 2010-06-02 |          3
     B72    | 2010-06-04 |          4
     B72    | 2010-06-03 |          5
     B72    | 2010-04-22 |          6
     C94    | 2010-06-04 |          7
     C94    | 2010-06-03 |          8
     C94    | 2010-06-02 |          9
     C94    | 2010-04-14 |         10
     C94    | 2010-04-13 |         11
     C94    | 2010-04-12 |         12
     C94    | 2010-04-11 |         13
     C94    | 2010-04-10 |         14
     C94    | 2010-04-09 |         15
    
    
    
    test=# SELECT
    test-#  testid,pulldate,(pulldate + (row_number || 'days')::interval)::date AS dummydate
    test-#  FROM ( SELECT *, row_number() OVER  () FROM
    test(#   ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )
    test-#  AS tab2;
     testid |  pulldate  | dummydate
    --------+------------+------------
     A79    | 2010-06-04 | 2010-06-05
     A79    | 2010-06-03 | 2010-06-05
     A79    | 2010-06-02 | 2010-06-05
     B72    | 2010-06-04 | 2010-06-08
     B72    | 2010-06-03 | 2010-06-08
     B72    | 2010-04-22 | 2010-04-28
     C94    | 2010-06-04 | 2010-06-11
     C94    | 2010-06-03 | 2010-06-11
     C94    | 2010-06-02 | 2010-06-11
     C94    | 2010-04-14 | 2010-04-24
     C94    | 2010-04-13 | 2010-04-24
     C94    | 2010-04-12 | 2010-04-24
     C94    | 2010-04-11 | 2010-04-24
     C94    | 2010-04-10 | 2010-04-24
     C94    | 2010-04-09 | 2010-04-24
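
The same trick can be checked outside the database. The following is only a minimal Python sketch under stated assumptions: `group_runs_by_offset` and `ROWS` are hypothetical names, rows are `(testid, pulldate)` tuples, and duplicates are removed first, mirroring the DISTINCT suggested in the warning.

```python
from datetime import date, timedelta

# Hypothetical sample, including one duplicate pair on purpose.
ROWS = [
    ("A79", date(2010, 6, 2)), ("A79", date(2010, 6, 3)), ("A79", date(2010, 6, 3)),
    ("A79", date(2010, 6, 4)),
    ("B72", date(2010, 4, 22)), ("B72", date(2010, 6, 3)), ("B72", date(2010, 6, 4)),
]

def group_runs_by_offset(rows):
    """Return (testid, mindate, maxdate) tuples, one per run of consecutive dates."""
    # Drop duplicates, then order by testid ascending and pulldate descending.
    ordered = sorted(set(rows), key=lambda r: r[1], reverse=True)
    ordered.sort(key=lambda r: r[0])  # stable sort keeps the date order per testid

    extents = {}
    for row_number, (testid, pulldate) in enumerate(ordered, start=1):
        # Within a run the date drops by one day as the row number rises by one,
        # so pulldate + row_number stays constant for the whole run.
        key = (testid, pulldate + timedelta(days=row_number))
        lo, hi = extents.get(key, (pulldate, pulldate))
        extents[key] = (min(lo, pulldate), max(hi, pulldate))

    return sorted((testid, lo, hi) for (testid, _), (lo, hi) in extents.items())
```

Without the `set(...)` step, the duplicate row would shift every later row number by one and split the A79 run, which is the failure the warning describes.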
    

    Edit: the WITH is not needed here (though I like it anyway); this is equivalent:

    SELECT * FROM (
      SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate 
      FROM (
        SELECT
          testid,pulldate,
          (pulldate + (row_number || ' days')::interval)::date AS dummydate
        FROM ( SELECT *, row_number() OVER  () FROM
          ( 
           SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )  
           AS tab2 
        ) as temp_tab
      GROUP BY testid,dummydate 
    )  AS tab3
    ORDER BY testid, mindate
    

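    【Comments】:

    • I think the ORDER BY is in the wrong place in the first intermediate query, i.e. it should be: SELECT *, row_number() OVER (ORDER BY testid,pulldate DESC) FROM data
    【Solution 2】:

    I don't know of any established name for this technique. I tried writing it myself and came up with essentially the same thing as yours; the only difference is that it uses one fewer WindowAgg.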

    select testid, group_num as "group",  -- quoted: group is a reserved word
           min(pulldate) as start_date,
           max(pulldate) as end_date
    from (select testid,
                 pulldate,
                 sum(case when projected_pulldate is null or pulldate <> projected_pulldate
                          then 1 else 0 end) over (order by testid, pulldate) as group_num
          from (select testid, pulldate,
                       (lag(pulldate, 1) over (partition by testid order by pulldate)
                       ) + 1 as projected_pulldate
                from data) x
         ) grouped
    group by testid, group_num
    order by 1, 2
    

    This is not pretty, and I wonder whether this is simply a case where plpgsql or something similar would be more appropriate.

    create or replace function data_extents()
     returns table(testid char(3), "group" int, start_date date, end_date date)
     language plpgsql
     stable as $$
    declare
      rec data%rowtype;
    begin
      "group" := 1;
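      -- columns are qualified so they are not confused with the output parameters of the same name
      for rec in select * from data order by data.testid, data.pulldate loop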
        if testid is null then
          -- first row
          testid := rec.testid;
          start_date := rec.pulldate;
          end_date := rec.pulldate;
        elsif rec.testid <> testid or rec.pulldate <> (end_date + 1) then
          -- discontinuity
          return next;
          testid := rec.testid;
          start_date := rec.pulldate;
          end_date := rec.pulldate;
          "group" := "group" + 1;
        else
          end_date := end_date + 1;
        end if;
      end loop;
      if testid is not null then
        return next;
      end if;
    end;
    $$;
    

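    That is not very pretty either... although it at least feels better, since in principle it derives the output from a single scan without several separate aggregation steps. On a small dataset the two take the same time; on a larger dataset? I haven't tried.

    Since neither of our solutions allows a predicate such as "testid = XXX" to be pushed down into the scan of data (AFAICT), a function may be the only way to filter efficiently?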
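
The single-scan idea can also be sketched outside the database. The following Python generator is only an illustration of the same control flow, not a replacement for the function above: it reuses the name `data_extents` for readability, `ROWS` is a hypothetical sample, and rows are assumed to be duplicate-free `(testid, pulldate)` tuples.

```python
from datetime import date, timedelta

# Hypothetical sample: a subset of the rows shown in the question.
ROWS = [
    ("B72", date(2010, 6, 4)), ("B72", date(2010, 4, 22)), ("B72", date(2010, 6, 3)),
    ("A79", date(2010, 6, 2)), ("A79", date(2010, 6, 3)), ("A79", date(2010, 6, 4)),
]

def data_extents(rows):
    """Yield (testid, group, start_date, end_date) from one pass over sorted rows."""
    group = 0
    current = None  # [testid, start_date, end_date] of the run being built
    for testid, pulldate in sorted(rows):
        if (
            current is not None
            and current[0] == testid
            and pulldate == current[2] + timedelta(days=1)
        ):
            current[2] = pulldate  # the row extends the current run by one day
            continue
        if current is not None:
            # Discontinuity: emit the finished run before starting a new one.
            yield (current[0], group, current[1], current[2])
        group += 1
        current = [testid, pulldate, pulldate]
    if current is not None:
        yield (current[0], group, current[1], current[2])
```

Because the generator takes any iterable, a filter can be applied before the pass, for example `list(data_extents(r for r in ROWS if r[0] == "B72"))`, in which case the group numbers restart at 1 for the filtered rows.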
