【Question title】: Find first occurrence of consecutive start / end columns
【Posted】: 2018-10-22 22:34:58
【Question】:

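I have a table with employment dates. Every time something changes, a new row is added - a salary change is the most common case. The new row's FROM date therefore equals that person's last TO date plus one (1) day. If my salary changes on 2014-04-01, my previous row ends with a TO date of 2014-03-31 and my new row starts with a FROM date of 2014-04-01.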

I only want the employment periods, not the dates that exist merely because something changed. Look at this table:

SSN         FROM        TO
----------------------------------
0987654321  2011-01-01  2011-12-31
0987654321  2012-01-01  2012-12-31
1234567890  2012-01-01  2012-12-31
0987654321  2013-01-01  2013-12-31
1234567890  2013-01-01  2013-06-30
0987654321  2014-01-01  2014-08-31
1234567890  2016-01-01  2016-12-31
1234567890  2017-01-01  2017-12-31
1234567890  2018-01-01  null

My desired output:

SSN         FROM        TO
----------------------------------
0987654321  2011-01-01  2014-08-31
1234567890  2012-01-01  2013-06-30
1234567890  2016-01-01  null

I thought I could create a field that is one day later than TO:

SELECT 
    SSN, [TO], [FROM], DATEADD(DAY, 1, [TO]) AS NEW 
FROM 
    [table]

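But I don't know how to go on and match NEW and TO across different rows. Maybe WHERE NOT EXISTS or something? I can't get it to work.

Then I thought I could use LAG, but the previous row in the table is not by default related to the next row, and I can't use ORDER BY in a subquery. It isn't allowed, and I don't know why (T-SQL?).

FYI, I can't CREATE TABLE, INSERT INTO TABLE, etc., nor can I declare variables. We will get a module that allows all of that, but for now I don't have those permissions.

Update: The first answer was actually correct, but I noticed another field that interferes with it. One SSN can have multiple IDs, so the rows have to be split by ID as well. Here is actual data from my table.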

CREATE TABLE Samples
    (
     SSN varchar(10), 
     ID varchar(4),
     FromDate Date, 
     ToDate Date
    );

INSERT INTO Samples
(SSN, ID, FromDate, ToDate)
VALUES
( '6612140000', '1000', '2005-01-01', '2005-03-31' ),
( '6612140000', '1000', '2005-04-01', '2005-09-30' ),
( '6612140000', '1000', '2005-10-01', '2006-03-31' ), 
( '6612140000', '2000', '2005-10-01', '2006-04-30' ),
( '6612140000', '1000', '2006-04-01', '2007-03-31' ),
( '6612140000', '1000', '2007-04-01', '2008-03-31' ),
( '6612140000', '1000', '2008-04-01', '2009-03-31' ),
( '6612140000', '1000', '2009-04-01', '2010-03-31' ),
( '6612140000', '1000', '2010-04-01', '2010-11-30' ),
( '6612140000', '1000', '2010-12-01', '2011-03-31' ),
( '6612140000', '1000', '2011-04-01', '2011-08-21' ),
( '6612140000', '1000', '2011-08-22', '2011-11-13' ),
( '6612140000', '1000', '2011-11-14', '2011-11-30' ),
( '6612140000', '1000', '2011-12-01', '2012-01-31' ),
( '6612140000', '1000', '2016-07-01', '2017-03-31' ),
( '6612140000', '1000', '2017-04-01', '2017-11-30' ),
( '6612140000', '1000', '2017-12-01', '2018-03-31' ),
( '6612140000', '1000', '2018-04-01', null ),
( '7605140000', '1000', '2013-11-01', '2013-11-30' ),
( '7605140000', '1000', '2013-12-01', '2013-12-31' ),
( '7605140000', '1000', '2014-01-01', '2014-03-31' ),
( '7605140000', '1000', '2014-04-01', '2014-12-31' ),
( '7605140000', '1000', '2015-05-01', '2015-05-31' ),
( '7605140000', '1000', '2015-06-01', '2015-09-30' ),
( '7605140000', '1000', '2015-10-01', '2015-10-31' ),
( '7605140000', '1000', '2016-01-25', '2016-07-24' ),
( '7605140000', '1000', '2016-07-25', '2016-08-31' ),
( '7605140000', '1000', '2016-09-01', '2017-03-31' ),
( '7605140000', '1000', '2017-04-01', '2017-11-30' ),
( '7605140000', '1000', '2017-12-01', null );

And here is the code from the answer, to which I tried to add the ID field, without luck:

with

  FromDates as (
    -- All of the   FromDates   for each   SSN   for which there is not
    --   a contiguous preceding period.
    select SO.SSN, SO.ID, SO.FromDate, SO.ToDate,
      Row_Number() over ( partition by SO.SSN order by SO.FromDate ) as RN
      from Samples as SO
      where not exists (
        select 42 from Samples as SI where SI.SSN = SO.SSN and SI.ID = SO.ID and
          SI.ToDate = DateAdd( day, -1, SO.FromDate ) ) ),

  ToDates as (
    -- All of the   ToDates   for each   SSN   for which there is not
    --   a contiguous following period.
    select SSN, ID, FromDate, ToDate, Row_Number() over ( partition by SSN order by FromDate ) as RN
      from Samples as SO
      where not exists (
        select 42 from Samples as SI where SI.SSN = SO.SSN and SI.ID = SO.ID and
          SI.FromDate = DateAdd( day, 1, SO.ToDate ) ) ),

  Ranges as (
    -- Pair the   FromDate   and   ToDate   entries for each   SSN .
    select F.SSN, F.ID, F.FromDate, T.ToDate
      from FromDates as F inner join
        ToDates as T on T.SSN = F.SSN and T.ID = F.ID and T.RN = F.RN ) 

-- Use any ONE of the following   select   statements to see what is going on:
-- select * from FromDates
--  select * from ToDates
  select * from Ranges 
  -- where SSN = '6612140000'
  order by SSN, ID, FromDate

It returns:

SSN         ID      FromDate    ToDate
6612140000  1000    2016-07-01  (null)
7605140000  1000    2013-11-01  2014-12-31
7605140000  1000    2014-03-01  2014-12-31
7605140000  1000    2015-05-01  2015-10-31
7605140000  1000    2015-05-01  2015-10-31
7605140000  1000    2016-01-25  (null)

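【Comments】:

  • Your sample data contains duplicate and overlapping ranges, e.g. ( '7605140000', '1000', '2014-03-01', '2014-12-31' ) and ( '7605140000', '1000', '2014-04-01', '2014-12-31' ). Is this correct, or is your actual data clean (which simplifies the query)?
  • @dnoeth Well spotted. I made a boo-boo. I have now removed the erroneous row for that SSN.

Tags: mysql sql-server tsql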


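【Solution 1】:

This is a gaps-and-islands problem. The standard solution is based on nested analytic functions:

#1: Compare each row with the previous row and flag it with 1 when a new group starts.

#2: Compute the cumulative sum of the flags to assign a number to each group of rows.

#3: Now you can do whatever you want with those groups.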

-- data must be correct, i.e. a Slowly Changing Dimension without gaps or overlapping periods
with calcFlag as
 (
   select SSN, Id, FromDate, ToDate,
      -- new group starts when the previous end date
      -- is not the current start date -1
      case when lag(ToDate)
                over (partition by SSN, Id
                      order by FromDate ) = DateAdd( day, -1, FromDate )
           then 0
           else 1
      end as flag
   from samples
 ),
calcGroup as 
 (
   select SSN, Id, FromDate, ToDate, flag,
      -- Cumulative Sum to dynamically assign group number
      sum(flag)
      over ( partition by SSN, Id 
             order by FromDate 
             rows unbounded preceding ) as grp#
   from calcFlag
 )
select SSN, Id, 
   min(FromDate) as FromDate, 
   -- latest end date of the group, or NULL if the group has an open-ended row
   nullif(max(coalesce(ToDate, '9999-12-31')), '9999-12-31') as ToDate
from calcGroup
group by SSN, Id, grp# -- include dynamically calculated group number
order by SSN, Id, min(FromDate)
;
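
To see steps #1 and #2 at work, here is a minimal sketch that runs them on the nine rows from the original question (the version without Id). It assumes SQL Server 2012 or later, since it relies on lag(). The inline values constructor and the column name grp are choices made for this illustration only, picked so that nothing has to be created or declared.

```sql
-- Illustration only: the question's nine sample rows, built inline with a
-- VALUES table constructor (no CREATE TABLE, no DECLARE needed).
with Samples as
 (
   select SSN, cast(FromDate as date) as FromDate, cast(ToDate as date) as ToDate
   from ( values
      ( '0987654321', '2011-01-01', '2011-12-31' ),
      ( '0987654321', '2012-01-01', '2012-12-31' ),
      ( '1234567890', '2012-01-01', '2012-12-31' ),
      ( '0987654321', '2013-01-01', '2013-12-31' ),
      ( '1234567890', '2013-01-01', '2013-06-30' ),
      ( '0987654321', '2014-01-01', '2014-08-31' ),
      ( '1234567890', '2016-01-01', '2016-12-31' ),
      ( '1234567890', '2017-01-01', '2017-12-31' ),
      ( '1234567890', '2018-01-01', null )
   ) as v ( SSN, FromDate, ToDate )
 ),
calcFlag as
 (
   select SSN, FromDate, ToDate,
      -- step #1: flag = 1 when the previous row does not end the day before
      -- this row starts (this also covers the first row, where lag() is NULL)
      case when lag(ToDate) over (partition by SSN order by FromDate)
                = dateadd(day, -1, FromDate)
           then 0
           else 1
      end as flag
   from Samples
 )
select SSN, FromDate, ToDate, flag,
   -- step #2: the running total of the flags numbers the islands per SSN
   sum(flag) over (partition by SSN
                   order by FromDate
                   rows unbounded preceding) as grp
from calcFlag
order by SSN, FromDate;
```

On this data the four rows of SSN 0987654321 get flags 1, 0, 0, 0 and all land in group 1. For SSN 1234567890 the flags are 1, 0, 1, 0, 0, so the 2012 and 2013 rows form group 1 and the rows from 2016 onward form group 2, which is exactly the split the desired output asks for.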

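【Comments】:

    【Solution 2】:

    The following example assembles the islands from your data. By changing which of the final select statements is enabled and which are commented out, you can see the intermediate results of the process.

    Update: Changed the date comparisons in the CTEs so that they can benefit from indexes on SSN, FromDate and SSN, ToDate.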

    -- Sample data.
    declare @Samples table ( SSN VarChar(10), FromDate Date, ToDate Date );
    insert into @Samples ( SSN, FromDate, ToDate ) values
      ( '0987654321', '2011-01-01', '2011-12-31' ),
      ( '0987654321', '2012-01-01', '2012-12-31' ),
      ( '1234567890', '2012-01-01', '2012-12-31' ),
      ( '0987654321', '2013-01-01', '2013-12-31' ),
      ( '1234567890', '2013-01-01', '2013-06-30' ),
      ( '0987654321', '2014-01-01', '2014-08-31' ),
      ( '1234567890', '2016-01-01', '2016-12-31' ),
      ( '1234567890', '2017-01-01', '2017-12-31' ),
      ( '1234567890', '2018-01-01', null );
    select *
      from @Samples;
    
    -- Sample data made a little easier to read.
    select *,
      case when exists (
        select 42 from @Samples as SI where SI.SSN = S.SSN and
          DateDiff( day, S.ToDate, SI.FromDate ) = 1 ) then 1 else 0 end as Continued
      from @Samples as S
      order by SSN, FromDate;
    
    -- Process the data.
    with
      FromDates as (
        -- All of the   FromDates   for each   SSN   for which there is not
        --   a contiguous preceding period.
        select SO.SSN, SO.FromDate, SO.ToDate,
          Row_Number() over ( partition by SO.SSN order by SO.FromDate ) as RN
          from @Samples as SO
          where not exists (
            select 42 from @Samples as SI where SI.SSN = SO.SSN and
              SI.ToDate = DateAdd( day, -1, SO.FromDate ) ) ),
      ToDates as (
        -- All of the   ToDates   for each   SSN   for which there is not
        --   a contiguous following period.
        select SSN, FromDate, ToDate, Row_Number() over ( partition by SSN order by FromDate ) as RN
          from @Samples as SO
          where not exists (
            select 42 from @Samples as SI where SI.SSN = SO.SSN and
              SI.FromDate = DateAdd( day, 1, SO.ToDate ) ) ),
      Ranges as (
        -- Pair the   FromDate   and   ToDate   entries for each   SSN .
        select F.SSN, F.FromDate, T.ToDate
          from FromDates as F inner join
            ToDates as T on T.SSN = F.SSN and T.RN = F.RN )
      -- Use any ONE of the following   select   statements to see what is going on:
    --  select * from FromDates order by SSN, FromDate;
    --  select * from ToDates order by SSN, FromDate;
      select * from Ranges order by SSN, FromDate;
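
The update note about index-friendly comparisons can be sketched as follows. Both queries are meant to return the same rows (the periods with no contiguous following period); they differ only in how the date test is written. The sketch assumes the Samples table from the question's update, and the index in the comment is hypothetical, since the asker cannot create objects. Whether the optimizer really picks an index seek depends on the actual indexes and statistics, so treat this as an illustration of the idea rather than a guarantee.

```sql
-- Hypothetical supporting index (shown as a comment only):
--   create index IX_Samples_SSN_FromDate on Samples ( SSN, FromDate );

-- Form 1: SI.FromDate is buried inside DateDiff(), so an index on
-- ( SSN, FromDate ) can typically be used for the SSN match only.
select SO.SSN, SO.FromDate, SO.ToDate
  from Samples as SO
  where not exists (
    select 42 from Samples as SI where SI.SSN = SO.SSN and
      DateDiff( day, SO.ToDate, SI.FromDate ) = 1 );

-- Form 2: SI.FromDate stands alone on one side of the comparison, so the
-- same index can be searched on both SSN and FromDate.
select SO.SSN, SO.FromDate, SO.ToDate
  from Samples as SO
  where not exists (
    select 42 from Samples as SI where SI.SSN = SO.SSN and
      SI.FromDate = DateAdd( day, 1, SO.ToDate ) );
```

For rows with a NULL ToDate both predicates evaluate to unknown, so neither form finds a following period and the open-ended row is kept as the end of an island in both cases.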
    

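    Of course, if there are actually separate Id values within each SSN that need to be processed independently, then the answer becomes something like this: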

    -- Sample data.
    declare @Samples as Table ( SSN VarChar(10), Id VarChar(4), FromDate Date, ToDate Date );
    insert into @Samples ( SSN, ID, FromDate, ToDate ) values
        ( '6612140000', '1000', '2005-01-01', '2005-03-31' ),
        ( '6612140000', '1000', '2005-04-01', '2005-09-30' ),
        ( '6612140000', '1000', '2005-10-01', '2006-03-31' ), 
        ( '6612140000', '2000', '2005-10-01', '2006-04-30' ),
        ( '6612140000', '1000', '2006-04-01', '2007-03-31' ),
        ( '6612140000', '1000', '2007-04-01', '2008-03-31' ),
        ( '6612140000', '1000', '2008-04-01', '2009-03-31' ),
        ( '6612140000', '1000', '2009-04-01', '2010-03-31' ),
        ( '6612140000', '1000', '2010-04-01', '2010-11-30' ),
        ( '6612140000', '1000', '2010-12-01', '2011-03-31' ),
        ( '6612140000', '1000', '2011-04-01', '2011-08-21' ),
        ( '6612140000', '1000', '2011-08-22', '2011-11-13' ),
        ( '6612140000', '1000', '2011-11-14', '2011-11-30' ),
        ( '6612140000', '1000', '2011-12-01', '2012-01-31' ),
        ( '6612140000', '1000', '2016-07-01', '2017-03-31' ),
        ( '6612140000', '1000', '2017-04-01', '2017-11-30' ),
        ( '6612140000', '1000', '2017-12-01', '2018-03-31' ),
        ( '6612140000', '1000', '2018-04-01', null ),
        ( '7605140000', '1000', '2013-11-01', '2013-11-30' ),
        ( '7605140000', '1000', '2013-12-01', '2013-12-31' ),
        ( '7605140000', '1000', '2014-01-01', '2014-03-31' ),
    --  ( '7605140000', '1000', '2014-03-01', '2014-12-31' ), -- Overlaps the next row; the asker removed it from the question's data.
        ( '7605140000', '1000', '2014-04-01', '2014-12-31' ),
        ( '7605140000', '1000', '2015-05-01', '2015-05-31' ),
    --  ( '7605140000', '1000', '2015-05-01', '2015-05-31' ), -- Duplicate row?!
        ( '7605140000', '1000', '2015-06-01', '2015-09-30' ),
    --  ( '7605140000', '1000', '2015-06-01', '2015-09-30' ), -- Duplicate row?!
        ( '7605140000', '1000', '2015-10-01', '2015-10-31' ),
    --  ( '7605140000', '1000', '2015-10-01', '2015-10-31' ), -- Duplicate row?!
        ( '7605140000', '1000', '2016-01-25', '2016-07-24' ),
        ( '7605140000', '1000', '2016-07-25', '2016-08-31' ),
        ( '7605140000', '1000', '2016-09-01', '2017-03-31' ),
        ( '7605140000', '1000', '2017-04-01', '2017-11-30' ),
        ( '7605140000', '1000', '2017-12-01', null );
    select *
      from @Samples;
    
    -- Sample data made a little easier to read.
    select *,
      case when exists (
        select 42 from @Samples as SI where SI.SSN = S.SSN and SI.Id = S.Id and
          DateDiff( day, S.ToDate, SI.FromDate ) = 1 ) then 1 else 0 end as Continued
      from @Samples as S
      order by SSN, Id, FromDate;
    
    -- Process the data.
    with
      FromDates as (
        -- All of the   FromDates   for each   SSN   for which there is not
        --   a contiguous preceding period.
        select SO.SSN, SO.Id, SO.FromDate, SO.ToDate,
          Row_Number() over ( partition by SO.SSN, SO.Id order by SO.FromDate ) as RN
          from @Samples as SO
          where not exists (
            select 42 from @Samples as SI where SI.SSN = SO.SSN and SI.Id = SO.Id and
              SI.ToDate = DateAdd( day, -1, SO.FromDate ) ) ),
      ToDates as (
        -- All of the   ToDates   for each   SSN   for which there is not
        --   a contiguous following period.
        select SO.SSN, SO.Id, SO.FromDate, SO.ToDate,
          Row_Number() over ( partition by SSN, SO.Id order by FromDate ) as RN
          from @Samples as SO
          where not exists (
            select 42 from @Samples as SI where SI.SSN = SO.SSN and SI.Id = SO.Id and
              SI.FromDate = DateAdd( day, 1, SO.ToDate ) ) ),
      Ranges as (
        -- Pair the   FromDate   and   ToDate   entries for each   SSN .
        select F.SSN, F.Id, F.FromDate, T.ToDate
          from FromDates as F inner join
            ToDates as T on T.SSN = F.SSN and T.Id = F.Id and T.RN = F.RN )
      -- Use any ONE of the following   select   statements to see what is going on:
    --  select * from FromDates order by SSN, Id, FromDate;
    --  select * from ToDates order by SSN, Id, FromDate;
      select * from Ranges order by SSN, Id, FromDate;
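
Compared with the attempt in the question, the visible difference is the partitioning of Row_Number(): the question's version numbers rows per SSN only, but then joins on SSN, Id and RN. A minimal sketch of why that matters, using just the three island start rows of SSN 6612140000 (the inline values list and the RN_ column names are made up for this illustration):

```sql
-- Illustration only: the three island starts of SSN 6612140000,
-- numbered both ways side by side.
select SSN, Id, FromDate,
  -- numbering used in the question's attempt
  Row_Number() over ( partition by SSN order by FromDate ) as RN_BySSN,
  -- numbering used in the answer above
  Row_Number() over ( partition by SSN, Id order by FromDate ) as RN_BySSNAndId
  from ( values
    ( '6612140000', '1000', cast( '2005-01-01' as date ) ),
    ( '6612140000', '2000', cast( '2005-10-01' as date ) ),
    ( '6612140000', '1000', cast( '2016-07-01' as date ) )
  ) as Starts ( SSN, Id, FromDate )
  order by SSN, FromDate;
```

RN_BySSN comes out as 1, 2, 3, while RN_BySSNAndId comes out as 1, 1, 2. With the per-SSN numbering, the row for Id 2000 shifts the numbers of the Id 1000 rows, and the shift is not the same in FromDates and ToDates. The join on SSN, Id and RN then drops most pairs, which would explain why the question's output shows only one range for that SSN.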
    

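    【Comments】:

    • Very nice. It seems to work, but when looking at SSN 1234567890, I get 3 rows in FromDates, 4 rows in ToDates, and 3 rows in Ranges. So the ToDate makes no sense when using select * from Ranges. Any ideas?
    • @TAKL I get three rows back from each part of the CTE: one row for SSN "0987654321" and two rows for "1234567890". Are you using the same set of sample data? (I worked with the posted code and found a missing parenthesis in the update. It caused a syntax error, not incorrect results. Fixed. I also tested a few more edge cases, e.g. an SSN with only one row that has a NULL ToDate.)
    • You are of course correct. I mixed up the ToDates/FromDates selects. It did work. But I also messed up my original table, because it turns out there is an ID field that interferes with your answer. One SSN can have multiple IDs, so I can't get the correct result. I hope you can help me with my latest edit as well, even though your answer to the original question was already correct.
    • See the second part of the answer for an answer to the question you meant to ask, rather than the one you asked. (Huh?) I took the liberty of ignoring the duplicate rows in the new and improved sample data, or my eyes are crossed because it is too late here.
    • Perfect. Thank you! You are a lifesaver/lightsaber.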