[Question title]: SQL Server Compressing Adjacent Date Ranges
[Posted]: 2017-05-04 14:44:33
[Question]:

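I have a table containing person IDs and date ranges (a start date and a stop date). Each person may have multiple rows, each with its own start and end date.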

create table #DateRanges (
   tableID   int not null,
   personID  int not null,
   startDate date,
   endDate   date
);
insert #DateRanges (tableID, personID, startDate, endDate)
values (1, 100, '2011-01-01', '2011-01-31') -- Just January
     , (2, 100, '2011-02-01', '2011-02-28') -- Just February
     , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
     , (4, 100, '2011-05-01', '2011-05-31') -- May
     , (5, 100, '2011-06-01', '2011-12-31'); -- June through December

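I need a way to collapse adjacent date ranges, where the end date of one row is exactly one day before the start date of the next. It has to merge every run of contiguous ranges and split only where the end-to-start gap is more than one day. The data above should compress to: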

+-----------+----------+--------------+------------+
| SomeNewID | PersonID | NewStartDate | NewEndDate |
+-----------+----------+--------------+------------+
|        1  |     100  |   2011-01-01 | 2011-02-28 |
+-----------+----------+--------------+------------+
|        2  |     100  |   2011-04-01 | 2011-12-31 |
+-----------+----------+--------------+------------+

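There are only two rows because the only missing range is March. If all of March were present, whether as one row or several, the compression would yield a single row. But if only two days in mid-March were present, we would get a third row holding those March dates.

I have been working with the LEAD and LAG functions in SQL 2016 to try to do this as a set-based operation, but so far I have come up empty. I would like to do it without a loop and RBAR (row-by-agonizing-row) processing, but I am not seeing a solution.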

[Comments]:

    Tags: sql-server compression range lag lead


    [Answer 1]:

    You can use lag to get the right buckets, then group as follows:

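    ;with cte1 as (
        -- Days between consecutive start dates per person.
        -- Heuristic: this looks at startDate only and ignores endDate.
        select *,
               dtdiff = datediff(day, lag(startDate) over (partition by personID order by startDate), startDate)
        from #DateRanges
    ),
    cte2 as (
        -- Running count of "new bucket" flags; a jump of more than 50 days opens a new bucket.
        -- Partitioned by personID so one person's rows cannot shift another person's buckets.
        select *,
               grp = sum(case when dtdiff is null or dtdiff > 50 then 1 else 0 end)
                     over (partition by personID order by startDate)
        from cte1
    )
    -- Min/max per bucket gives the collapsed range.
    select SomeNewId    = row_number() over (order by personID, min(startDate)),
           Personid     = personID,
           NewStartDate = min(startDate),
           NewEndDate   = max(endDate)
    from cte2
    group by personID, grp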
    

    Your output:

    +-----------+----------+--------------+------------+
    | SomeNewId | Personid | NewStartDate | NewEndDate |
    +-----------+----------+--------------+------------+
    |         1 |      100 | 2011-01-01   | 2011-02-28 |
    |         2 |      100 | 2011-04-01   | 2011-12-31 |
    +-----------+----------+--------------+------------+
    

    My test input:

    insert #DateRanges (tableID, personID, startDate, endDate)
    values (1, 100, '2011-01-01', '2011-01-31') -- Just January
         , (2, 100, '2011-02-01', '2011-02-28') -- Just February
         , (3, 100, '2011-04-01', '2011-04-30') -- April - Skipped March
         , (4, 100, '2011-05-01', '2011-05-31') -- May
         , (5, 100, '2011-06-01', '2011-06-30') -- More gaps
         , (6, 100, '2011-07-01', '2011-07-31') -- More gaps
         , (7, 100, '2011-08-01', '2011-08-31') -- More gaps
         , (8, 100, '2011-10-01', '2011-10-31') -- More gaps
         , (9, 100, '2011-11-01', '2011-11-30') -- More gaps
    

    Output for the test data:

    +-----------+----------+--------------+------------+
    | SomeNewId | Personid | NewStartDate | NewEndDate |
    +-----------+----------+--------------+------------+
    |         1 |      100 | 2011-01-01   | 2011-02-28 |
    |         2 |      100 | 2011-04-01   | 2011-08-31 |
    |         3 |      100 | 2011-10-01   | 2011-11-30 |
    +-----------+----------+--------------+------------+
    

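    [Comments]:

    • Thanks for the reply, but it falls down because of dtdiff>50. It seems to work for month-long ranges, but I need to handle day-level resolution.
    • 50 just assumes more than one month... can you provide sample data that fails?
    • Sure - add two more rows to #DateRanges: insert #DateRanges (tableID, personID, startDate, endDate) values (10, 100, '2011-03-01', '2011-03-05'), (11, 100, '2011-03-06', '2011-03-10'). Now March is included, but not completely (it ends on the 10th), yet your CTE produces one row as if there were no gap.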
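
    To make the objection in the last comment concrete, here is a small diagnostic sketch. It loads the question's rows plus the two partial-March rows and prints, for each row, the start-to-start difference that the answer's bucket test uses next to the end-to-start gap the question cares about. The aliases startToStart and realGap are illustrative names, not part of the answer.

    ```sql
    -- Diagnostic sketch; startToStart and realGap are made-up aliases.
    create table #DateRanges (
       tableID   int not null,
       personID  int not null,
       startDate date,
       endDate   date
    );

    insert #DateRanges (tableID, personID, startDate, endDate)
    values (1,  100, '2011-01-01', '2011-01-31')
         , (2,  100, '2011-02-01', '2011-02-28')
         , (10, 100, '2011-03-01', '2011-03-05') -- partial March
         , (11, 100, '2011-03-06', '2011-03-10') -- partial March, ends on the 10th
         , (3,  100, '2011-04-01', '2011-04-30')
         , (4,  100, '2011-05-01', '2011-05-31')
         , (5,  100, '2011-06-01', '2011-12-31');

    select tableID,
           startDate,
           endDate,
           -- What the dtdiff > 50 test looks at: previous startDate to this startDate.
           datediff(day, lag(startDate) over (partition by personID order by startDate), startDate) as startToStart,
           -- Uncovered days between the previous endDate and this startDate.
           datediff(day, lag(endDate) over (partition by personID order by startDate), startDate) - 1 as realGap
    from #DateRanges
    order by personID, startDate;
    ```

    On this input startToStart never exceeds 31, so the 50-day threshold never opens a new bucket, while realGap is 21 on the April row (March 11 through March 31 is uncovered) and 0 on every other row that has a predecessor.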
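    [Answer 2]:

    After struggling with this for a few days, I thought I would share a solution in case anyone else needs something similar. I used a few CTEs to find the lead, lag, and gap, reduced the rows to only the significant start and stop dates, then used more lead and lag to find the collapsed start and stop dates. There may be a simpler way, but I think this handles day-level resolution well.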

    with LeadAndLagAndGap as (
       select
          tableid,
          personID,
          startDate,
          endDate,
          lag(endDate) over (partition by personID order by startDate) as previousEnd,
          lead(startDate) over (partition by personID order by startDate) as nextStart,
          coalesce(datediff(day,endDate,lead(startDate) over (partition by personID order by startDate))-1,0) as gap
       from
          #DateRanges
    ), OnlyStartAndEndRows as (
       select
          tableid,
          personID,
          startDate,
          endDate,
          previousEnd,
          nextStart,
          gap
       from
          LeadAndLagAndGap
       where
          previousEnd is null  -- Definitely FIRST record in a range
          or nextStart is null -- Definitely LAST record in a range
          or gap > 0           -- Definitely an end of a range, nextStart is definitely the start of a range.
    ), PreCollapseReaggregate as (
       select
          tableid,
          personID,
          startDate,
          endDate,
          previousEnd,
          nextStart,
          gap,
          case
             when previousEnd is null then startDate
             when gap > 0 then nextStart
          end as DefiniteStart,
          case
             when nextStart is null then endDate
             when gap > 0 then endDate
          end as DefiniteEnd
       from
          OnlyStartAndEndRows
    ), Collapsed as (
       select
          tableid,
          personID,
          DefiniteStart as startDate,
          case
             when definiteEnd is null or gap > 0 then lead(definiteEnd) over (partition by personid order by startdate)
             when definiteStart is not null and DefiniteEnd is not null then definiteEnd
          end as endDate
         from PreCollapseReaggregate
    )
    select * from Collapsed
    where enddate is not null
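
    One case worth testing against the query above is a range made of a single row with a gap right after it. Tracing the CTE chain by hand on such input, that row has gap > 0, so Collapsed takes lead(definiteEnd) from the next boundary row, and the two ranges appear to be merged. As a cross-check, here is a minimal sketch of the common flag-and-running-sum ("gaps and islands") pattern. It is not the answer's method; Flagged, Grouped and isNewRange are illustrative names, and it assumes a person's ranges never overlap.

    ```sql
    -- Sketch only; assumes non-overlapping ranges per person.
    create table #DateRanges (
       tableID   int not null,
       personID  int not null,
       startDate date,
       endDate   date
    );

    insert #DateRanges (tableID, personID, startDate, endDate)
    values (1, 100, '2011-01-01', '2011-01-31') -- single-row range, gap follows
         , (2, 100, '2011-03-01', '2011-03-31') -- single-row range, gaps on both sides
         , (3, 100, '2011-05-01', '2011-05-31')
         , (4, 100, '2011-06-01', '2011-06-30'); -- adjacent to May

    with Flagged as (
       select personID, startDate, endDate,
              -- 0 when this row starts exactly one day after the previous row ends, else 1.
              -- The first row per person has no predecessor (lag is null), so it gets 1.
              case when startDate = dateadd(day, 1, lag(endDate) over (partition by personID order by startDate))
                   then 0 else 1
              end as isNewRange
       from #DateRanges
    ), Grouped as (
       select personID, startDate, endDate,
              -- The running total of the flags numbers the contiguous ranges per person.
              sum(isNewRange) over (partition by personID order by startDate
                                    rows unbounded preceding) as grp
       from Flagged
    )
    select row_number() over (order by personID, min(startDate)) as SomeNewID,
           personID,
           min(startDate) as NewStartDate,
           max(endDate)   as NewEndDate
    from Grouped
    group by personID, grp
    order by SomeNewID;
    ```

    For this input the sketch returns three rows: 2011-01-01 to 2011-01-31, 2011-03-01 to 2011-03-31, and 2011-05-01 to 2011-06-30. Because the flag compares against the previous endDate rather than the previous startDate, it works at day resolution and needs no threshold.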
    

    [Comments]:
