【问题标题】:Snowflake server, split duration by hour?雪花服务器,按小时拆分持续时间?
【发布时间】:2019-12-09 04:25:21
【问题描述】:

我在一个具有以下结构的表中有数据,其中所有id 都是唯一的,并且有许多不同的开始和结束组合,开始和结束之间的差异从不到一分钟到数百分钟不等分钟。我只需要拆分大于 60 分钟的开始/结束差异的持续时间。

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 |

我想创建一个表格,按小时拆分持续时间,并附上原始条目的 ID,如下所示:

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
| 0abc23 | 2019-06-29 01:08:00.000 | 2019-06-29 02:07:59.000 |
| 0abc23 | 2019-06-29 02:08:00.000 | 2019-06-29 03:07:59.000 |
| 0abc23 | 2019-06-29 03:08:00.000 | 2019-06-29 04:07:59.000 |
| 0abc23 | 2019-06-29 04:08:00.000 | 2019-06-29 05:18:00.000 |

我从similar question for a different database engine 尝试了以下代码:

with table1 as (
      select id
          , duration_start as fromdate
          , duration_end as todate
      from test_data
),
tab1 as (
      select id
        , fromdate
        , dateadd(second, 60*60 - 1, fromdate) as todate1 // add an hour minus a second
        , todate
      from table1

      union all

      select id
        , dateadd(minutes, 60, todate1) as fromdate // add an hour to todate1
        , dateadd(second, 2*60*60-1, todate1) as todate1 // add 1 hours to todate1
        , todate
      from tab1 where date_trunc(hour, todate1) < date_trunc(hour, todate)
),
tab2 as (
      select id
        , fromdate
        , case when todate1 > todate then todate else todate1 end as todate
      from tab1

但这给了我

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
| 0abc23 | 2019-06-29 02:07:59.000 | 2019-06-29 03:07:58.000 |
| 0abc23 | 2019-06-29 04:07:58.000 | 2019-06-29 05:07:57.000 |
| 0abc23 | 2019-06-29 06:07:57.000 | 2019-06-29 07:07:56.000 |
| 0abc23 | 2019-06-29 08:07:56.000 | 2019-06-29 09:07:55.000 |

我尝试调整添加的时间量,但无法在开始和结束之间获得 1 小时的偏移量。我也尝试使用connect bylevel(建议使用similar question but on oracle),但无法编译。我刚刚收到 LEVEL 的无效标识符错误。

select dateadd(hour,24,duration_start)
      , greatest(duration_start, date_trunc(hour,dateadd(hour,((level-1)::int/24)::int,duration_start)))
      , least(duration_start, date_trunc(hour,dateadd(hour,((level)::int/24)::int,duration_start)))
from test_data
connect by prior LEVEL = floor(datediff(hour, duration_start, duration_end)::int*24)+1;

即使只是尝试使用 connect by 也会给我一个错误:

select *
from test_data
connect by prior LEVEL = floor(datediff(hour, duration_start, duration_end)::int*24)+1;

错误: SQL compilation error: error line 0 at position -1 invalid identifier 'HOUR'

如果有人能告诉我如何调整我的雪花方法,将不胜感激,谢谢!

【问题讨论】:

    标签: sql snowflake-cloud-data-platform


    【解决方案1】:

    对于这种类型的项目,使用单独的日期和间隔表会有所帮助。

    这是实现您需要的脚本。

    根据您需要的任何日期范围创建日历表。

    我从 '1970-01-01' 这个 unix 纪元开始,因为我保留了一个日历表以备不时之需。

    create or replace table calendar(calendar_date DATE)
    
    insert into calendar(calendar_date)
    select
    dateadd(d,rid2,to_timestamp_ntz('1970-01-01')) as calendar_date
    from
    (
    select 0 as rid2 union
    select row_number() over (order by null) as rid2
    from table (generator(rowcount => 100000))
    ) t
    where dateadd(d,rid2,to_timestamp_ntz('1970-01-01')) < '2030-01-01'
    order by 1 asc;
    

    接下来创建一个时间/间隔表。 区间表。在您的示例中,您的间隔持续时间为 1 小时。因此创建了以下内容。

    create or replace table interval(id integer, interval_start time);
    
    insert into interval (id,interval_start)
    select
    id,
    to_time(dateadd(hour,id,to_timestamp_ntz('1970-01-01')))
    from
    (
    select 0 as id union
    select row_number() over (order by null) as id
    from table (generator(rowcount => 23))
    ) t;
    

    接下来,我使用您的示例数据以及其他几个值创建了一个表格,以便可以在不同的场景中验证计算。

    create or replace table example1(id varchar(10), DURATION_START datetime, DURATION_END datetime);
    -- drop table example1
    truncate table example1;
    --
    insert into  example1 values('0abc23','2019-06-29 00:08:00.000','2019-06-29 09:18:00.000');
    insert into  example1 values('0abc24','2019-06-28 11:07:45.000','2019-06-28  12:08:45.000');
    insert into  example1 values('0abc25','2019-06-28 01:00:00.000','2019-06-29 02:15:00.000');
    insert into  example1 values('0abc26','2019-06-28 00:08:00.000','2019-06-29 15:18:00.000');
    

    鉴于一切都已设置好,下面的查询将为您提供所需的结果。

    select 
          f.id
         ,f.DURATION_START
         ,f.DURATION_END
         ,f.start_time_HOUR_START
         ,f.end_time_HOUR_START
         ,q.CALENDAR_DATE
         ,q.HOUR_START
         ,q.HOUR_END
         ,case
        -- when starts during interval and ends after interval
               when f.DURATION_START >= q.HOUR_START and f.DURATION_END >= dateadd(hour, 1, q.HOUR_START)
                   then datediff(s, f.DURATION_START, dateadd(hour, 1, q.HOUR_START))
        -- when starts during interval and ends during interval
               when f.DURATION_START >= q.HOUR_START and f.DURATION_END <= dateadd(hour, 1, q.HOUR_START)
                   then datediff(s, f.DURATION_START, f.DURATION_END)
        -- when starts before interval and ends during interval
               when f.DURATION_START <= q.HOUR_START and f.DURATION_END <= dateadd(hour, 1, q.HOUR_START)
                   then datediff(s, q.HOUR_START, f.DURATION_END)
        -- entire interval , starts before, and ends after
               when
                   f.DURATION_START <= q.HOUR_START and f.DURATION_END >= dateadd(hour, 1, q.HOUR_START)
                   then datediff(s, q.HOUR_START, dateadd(hour, 1, q.HOUR_START))
               else 0 end as seconds_elapsed
    from (
             select *
                  , to_timestamp(
                     dateadd(s, datediff(s, '1970-01-01', DURATION_START) - (datediff(s, '1970-01-01', DURATION_START) % 3600),
                             '1970-01-01')) as start_time_HOUR_START
                  , to_timestamp(
                     dateadd(s, datediff(s, '1970-01-01', DURATION_END) - (datediff(s, '1970-01-01', DURATION_END) % 3600),
                             '1970-01-01')) as end_time_HOUR_START
             from example1
         ) f
             inner join
    
         (
             select
                    distinct
                    q1.calendar_date
                  -- , t2.rid2
                  , dateadd(hour, t2.id, to_timestamp(q1.calendar_date)) as HOUR_START
                  , dateadd(hour, t2.id + 1, to_timestamp(q1.calendar_date)) as HOUR_END
             from (
                      select calendar_date
                      from calendar
                      where calendar_date between (select to_date(min(DURATION_START)) from example1) and (select to_date(max(DURATION_END)) from example1)
                  ) q1
                      cross join
                  interval as t2
            -- order by HOUR_START
         ) q on q.HOUR_START between f.start_time_HOUR_START and f.end_time_HOUR_START
    
    ORDER BY f.id
           , f.DURATION_START
           , f.DURATION_END
           , q.CALENDAR_DATE
           , q.HOUR_START
    ;
    

    下面的示例输出。运行脚本得到最终结果:

    +--------+-------------------------+-------------------------+-------------------------+-------------------------+---------------+-------------------------+-------------------------+-----------------+
    | ID     | DURATION_START          | DURATION_END            | START_TIME_HOUR_START   | END_TIME_HOUR_START     | CALENDAR_DATE | HOUR_START              | HOUR_END                | SECONDS_ELAPSED |
    |--------+-------------------------+-------------------------+-------------------------+-------------------------+---------------+-------------------------+-------------------------+-----------------|
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 00:00:00.000 | 2019-06-29 01:00:00.000 |            3120 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 01:00:00.000 | 2019-06-29 02:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 02:00:00.000 | 2019-06-29 03:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 03:00:00.000 | 2019-06-29 04:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 04:00:00.000 | 2019-06-29 05:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 05:00:00.000 | 2019-06-29 06:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 06:00:00.000 | 2019-06-29 07:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 07:00:00.000 | 2019-06-29 08:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 08:00:00.000 | 2019-06-29 09:00:00.000 |            3600 |
    | 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 09:00:00.000 | 2019-06-29 10:00:00.000 |            1080 |
    | 0abc24 | 2019-06-28 11:07:45.000 | 2019-06-28 12:08:45.000 | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 | 2019-06-28    | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 |            3135 |
    | 0abc24 | 2019-06-28 11:07:45.000 | 2019-06-28 12:08:45.000 | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 | 2019-06-28    | 2019-06-28 12:00:00.000 | 2019-06-28 13:00:00.000 |             525 |
    

    获取代码click的链接

    【讨论】:

      【解决方案2】:

      可以通过各种技巧在纯 SQL 中执行此操作,但 IMO 迄今为止最简单的方法是为此创建一个简短的 JavaScript UDTF

      下面有一些额外的测试行的完整示例。请注意,可能需要额外覆盖一些极端情况,例如如果您的结束时间早于开始时间,或者其中之一为空,该怎么办。但这应该会有所帮助。

      create or replace table x(
        id varchar, 
        duration_start timestamp_ntz,
        duration_end timestamp_ntz
      ) as 
      select * from values
        ('0abc23', '2019-06-29 00:08:00.000', '2019-06-29 09:18:00.000'),
        ('id_2__', '2002-02-02 02:00:00.000', '2002-02-02 02:00:00.000'),
        ('id_3__', '2003-03-03 03:00:00.000', '2003-03-03 04:00:00.000'),
        ('id_4__', '2004-04-04 04:00:00.000', '2004-04-04 04:59:59.000');
      
      create or replace function magic(
        id varchar, 
        duration_start timestamp_ntz,
        duration_end timestamp_ntz
      ) 
      returns table (
        hour_start timestamp_ntz, 
        hour_end timestamp_ntz
      ) language javascript as 
      $$
      {
        processRow: function(row, rowWriter, context) {
          let msSecond = 1000;
          let msHour = 60 * 60 * msSecond;
      
          let msStart = row.DURATION_START.getTime();
          let msEnd = row.DURATION_END.getTime();
      
          while (msStart <= msEnd) {
            let curEnd = Math.min(msEnd, msStart + msHour - msSecond);
            rowWriter.writeRow({
              HOUR_START: new Date(msStart),
              HOUR_END: new Date(curEnd)
            });
            msStart = curEnd + msSecond;
          }
        }
      }
      $$;
      
      select * from x, table(magic(id, duration_start, duration_end)) ;
      --------+-------------------------+-------------------------+-------------------------+-------------------------+
         ID   |     DURATION_START      |      DURATION_END       |       HOUR_START        |        HOUR_END         |
      --------+-------------------------+-------------------------+-------------------------+-------------------------+
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 01:08:00.000 | 2019-06-29 02:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 02:08:00.000 | 2019-06-29 03:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 03:08:00.000 | 2019-06-29 04:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 04:08:00.000 | 2019-06-29 05:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 05:08:00.000 | 2019-06-29 06:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 06:08:00.000 | 2019-06-29 07:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 07:08:00.000 | 2019-06-29 08:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 08:08:00.000 | 2019-06-29 09:07:59.000 |
       0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 09:08:00.000 | 2019-06-29 09:18:00.000 |
       id_2__ | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 |
       id_3__ | 2003-03-03 03:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 03:00:00.000 | 2003-03-03 03:59:59.000 |
       id_3__ | 2003-03-03 03:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 04:00:00.000 |
       id_4__ | 2004-04-04 04:00:00.000 | 2004-04-04 04:59:59.000 | 2004-04-04 04:00:00.000 | 2004-04-04 04:59:59.000 |
      --------+-------------------------+-------------------------+-------------------------+-------------------------+
      

      请注意,结果包括原始开始/结束,但不是必须的。

      【讨论】:

      • 这个 vs sql 使用 JS UDTF 效率如何?
      • 视情况而定。对于简单的操作,它比较慢。但是,如果您的 SQL 中的逻辑很难或很棘手,它通常会更快(并且更容易实现)。在这种情况下,由于我不确定如何在纯 SQL 中执行此操作,因此很难确定。
      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2020-10-23
      • 1970-01-01
      相关资源
      最近更新 更多