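【Question title】: Calculate alarm flood in Snowflake
【Posted】: 2020-07-04 17:05:12
【Question】:

I am trying to do an alarm flood calculation in Snowflake. I created the dataset below using a Snowflake window function. An alarm flood starts when the value is greater than or equal to 3 and ends at the next 0 value. In the example below, the first flood started at 9:51 and ended at 9:54, so it lasted 3 minutes. The next flood started at 9:57 and ended at 10:02, i.e. 5 minutes. Note that the value at 9:59 is also 3, but since a flood has already started we don't have to consider it. The next flood starts at 10:03, but there is no later 0 value, so we have to use the edge value 10:06. The total time in flood is therefore 3 + 5 + 4 = 12 minutes.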

   DateTime    Value
3/10/2020 9:50  1
3/10/2020 9:51  3
3/10/2020 9:52  1
3/10/2020 9:53  2
3/10/2020 9:54  0
3/10/2020 9:55  0
3/10/2020 9:56  1
3/10/2020 9:57  3
3/10/2020 9:58  2
3/10/2020 9:59  3
3/10/2020 10:00 2
3/10/2020 10:01 2
3/10/2020 10:02 0
3/10/2020 10:03 3
3/10/2020 10:04 1
3/10/2020 10:05 1
3/10/2020 10:06 1

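So, in short, I am expecting the output below.

I tried the SQL below, but it does not give me the correct output: it fails on the second flood (because the value 3 occurs again before the next 0).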

select t.*,
       (case when value >= 3
             then datediff(minute,
                           datetime,
                           min(case when value = 0 then datetime end) over (order by datetime desc)
                          )
        end) as diff_minutes
from t;
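
To make the failure concrete, here is a minimal JavaScript sketch (not Snowflake code) that mimics what the query above computes for each row. It assumes one row per minute and uses the sample data from the question; `nextZeroDiff` and `attemptRows` are hypothetical names used only for illustration.

```javascript
// Sample values from the question, one per minute starting at 9:50.
const attemptValues = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];
const attemptRows = attemptValues.map((value, i) => ({ minute: i, value }));

// Mimics the attempted query: every row with value >= 3 is paired with the
// next row whose value is 0, whether or not a flood is already running.
function nextZeroDiff(rows) {
  return rows.map((row, i) => {
    if (row.value < 3) return null;
    const zero = rows.slice(i).find((r) => r.value === 0);
    return zero ? zero.minute - row.minute : null;
  });
}

const attemptDiffs = nextZeroDiff(attemptRows);
// Index 1 is 9:51, index 7 is 9:57, index 9 is 9:59, index 13 is 10:03.
console.log(attemptDiffs[1], attemptDiffs[7], attemptDiffs[9], attemptDiffs[13]);
```

Under these assumptions the 9:59 row gets its own duration even though the flood started at 9:57, and the 10:03 row gets no duration at all because no zero follows it.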

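【Comments】:

  • What exactly are you looking for? Are you struggling with the SQL statement to accomplish this? If so, what SQL have you tried? You might also want to add a generic SQL tag, since I don't think the solution will be Snowflake-specific.

Tags: sql snowflake-cloud-data-platform snowflake-schema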


【Solution 1】:

A JavaScript UDF version:

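-- Number each flood with the UDTF, then report the row count of each flood
-- on its first row (one row per minute, so rows = minutes).
-- Assumes the UDTF receives the rows in d order, which is not guaranteed
-- unless the call specifies an ordering.
select d, v, iff(3<=v and 1=row_number() over (partition by N order by d),
    count(*) over (partition by N), null) trig_duration
from t, lateral flood_count(t.v::float)
order by d;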

where flood_count() is defined as:

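create or replace function flood_count(V float)
returns table (N float)
language javascript AS
$${

  // Runs once per partition: no flood in progress, no floods counted yet.
  initialize: function() {
    this.n = 0;
    this.flood = false;
  },

  // Emits the current flood number for each row, or null outside a flood.
  processRow: function(row, rowWriter) {
    if (3 <= row.V && !this.flood) {
        // A value >= 3 outside a flood starts a new one.
        this.flood = true;
        this.n++;
    }
    else if (0 == row.V) this.flood = false;  // a zero ends the flood
    rowWriter.writeRow({ N: this.flood ? this.n : null });
  },

}$$;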
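
For anyone who wants to check the logic without a Snowflake session, the state machine inside flood_count() can be sketched in plain JavaScript. This is only a simulation under the assumption of one row per minute, already sorted by time; `floodGroups` and `floodSample` are hypothetical names and not part of any Snowflake API.

```javascript
// Plain-JavaScript simulation of the flood_count() state machine.
function floodGroups(values) {
  let n = 0;          // number of floods started so far
  let flood = false;  // whether a flood is currently running
  return values.map((v) => {
    if (v >= 3 && !flood) {
      flood = true;
      n++;
    } else if (v === 0) {
      flood = false;
    }
    return flood ? n : null;
  });
}

// Sample values from the question, one per minute starting at 9:50.
const floodSample = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];
const floodNumbers = floodGroups(floodSample);

// Count rows per flood, as count(*) over (partition by N) does in the query.
const floodRowCounts = {};
for (const g of floodNumbers) {
  if (g !== null) floodRowCounts[g] = (floodRowCounts[g] || 0) + 1;
}
console.log(floodRowCounts); // { '1': 3, '2': 5, '3': 4 }
```

Because the duration is a row count, it equals minutes only while the data has exactly one row per minute; with gaps in the data a time difference would be needed instead.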

Assuming this input:

create or replace table t as
select to_timestamp(d, 'mm/dd/yyyy hh:mi') d, v 
from values
    ('3/10/2020 9:50',  1),
    ('3/10/2020 9:51',  3),
    ('3/10/2020 9:52',  1),
    ('3/10/2020 9:53',  2),
    ('3/10/2020 9:54',  0),
    ('3/10/2020 9:55',  0),
    ('3/10/2020 9:56',  1),
    ('3/10/2020 9:57',  3),
    ('3/10/2020 9:58',  2),
    ('3/10/2020 9:59',  3),
    ('3/10/2020 10:00', 2),
    ('3/10/2020 10:01', 2),
    ('3/10/2020 10:02', 0),
    ('3/10/2020 10:03', 3),
    ('3/10/2020 10:04', 1),
    ('3/10/2020 10:05', 1),
    ('3/10/2020 10:06', 1)
    t(d,v)
;

【Comments】:

  • "So if the value is greater or equal to 3": you might want to add some > to your code.
  • Accepting this solution as it is specific to the Snowflake database.

【Solution 2】:

WITH data as (
  select time::timestamp as time, value from values
    ('2020-03-10 9:50', 1 ),
    ('2020-03-10 9:51', 3 ),
    ('2020-03-10 9:52', 1 ),
    ('2020-03-10 9:53', 2 ),
    ('2020-03-10 9:54', 0 ),
    ('2020-03-10 9:55', 0 ),
    ('2020-03-10 9:56', 1 ),
    ('2020-03-10 9:57', 3 ),
    ('2020-03-10 9:58', 2 ),
    ('2020-03-10 9:59', 3 ),
    ('2020-03-10 10:00', 2 ),
    ('2020-03-10 10:01', 2 ),
    ('2020-03-10 10:02', 0 ),
    ('2020-03-10 10:03', 3 ),
    ('2020-03-10 10:04', 1 ),
    ('2020-03-10 10:05', 1 ),
    ('2020-03-10 10:06', 1 )
     s( time, value)
) 
select 
    a.time
    ,a.value
    ,min(trig_time)over(partition by reset_time_group order by time) as first_trigger_time
    ,iff(a.time=first_trigger_time, datediff('minute', first_trigger_time, reset_time_group), null) as trig_duration
from (
select d.time
   ,d.value 
   ,iff(d.value>=3,d.time,null) as trig_time
   ,iff(d.value=0,d.time,null) as reset_time
   ,max(time)over(order by time ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as max_time
   ,coalesce(lead(reset_time)ignore nulls over(order by d.time), max_time) as lead_reset_time
   ,coalesce(reset_time,lead_reset_time) as reset_time_group
from data as d
) as a
order by time;

This gives the results you seem to expect/describe:

TIME                     VALUE  FIRST_TRIGGER_TIME         TRIG_DURATION
2020-03-10 09:50:00.000    1        
2020-03-10 09:51:00.000    3    2020-03-10 09:51:00.000    3
2020-03-10 09:52:00.000    1    2020-03-10 09:51:00.000    
2020-03-10 09:53:00.000    2    2020-03-10 09:51:00.000    
2020-03-10 09:54:00.000    0    2020-03-10 09:51:00.000    
2020-03-10 09:55:00.000    0        
2020-03-10 09:56:00.000    1        
2020-03-10 09:57:00.000    3    2020-03-10 09:57:00.000    5
2020-03-10 09:58:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 09:59:00.000    3    2020-03-10 09:57:00.000    
2020-03-10 10:00:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 10:01:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 10:02:00.000    0    2020-03-10 09:57:00.000    
2020-03-10 10:03:00.000    3    2020-03-10 10:03:00.000    3
2020-03-10 10:04:00.000    1    2020-03-10 10:03:00.000    
2020-03-10 10:05:00.000    1    2020-03-10 10:03:00.000    
2020-03-10 10:06:00.000    1    2020-03-10 10:03:00.000    

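Here is how it works: we find the trigger times and reset times, then work out max_time for the last-row edge case. After that we find the next reset_time going forward, falling back to max_time if there is none (lead_reset_time), and then pick the row's own reset_time if it has one, otherwise lead_reset_time (reset_time_group). For what you are doing here that last step could be skipped, because your data cannot trigger and reset on the same row, and since we do the math on the trigger row, it does not matter whether the reset row knows which group it belongs to.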

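Then we move to a new select layer, because we have reached Snowflake's limit on nested/interrelated SQL, and take a min over reset_time_group to find the first trigger time, which we then compare to the row time and run the datediff on.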
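
The same steps can be traced outside SQL with a minimal JavaScript sketch. It assumes rows sorted by time, one row per minute, and times expressed as minutes since 9:50; `triggerDurations` and its `extendLast` parameter are hypothetical names (`extendLast = 1` models pushing max_time forward by one minute).

```javascript
// Sample values from the question, one per minute starting at 9:50.
const trigValues = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];
const trigRows = trigValues.map((value, i) => ({ minute: i, value }));

function triggerDurations(rows, extendLast = 0) {
  // max_time: fallback reset time when no zero follows (last-row edge case).
  const maxTime = rows[rows.length - 1].minute + extendLast;
  // reset_time_group -> first trigger time seen so far in that group.
  const firstTrigger = new Map();
  return rows.map((row, i) => {
    const resetTime = row.value === 0 ? row.minute : null;
    // lead_reset_time: next zero strictly after this row, else max_time.
    const nextReset = rows.slice(i + 1).find((r) => r.value === 0);
    const leadResetTime = nextReset ? nextReset.minute : maxTime;
    // reset_time_group: own reset time if present, otherwise the next one.
    const group = resetTime !== null ? resetTime : leadResetTime;
    if (row.value >= 3 && !firstTrigger.has(group)) {
      firstTrigger.set(group, row.minute);
    }
    // Only the first trigger row of a group reports a duration.
    return firstTrigger.get(group) === row.minute ? group - row.minute : null;
  });
}

const trigPlain = triggerDurations(trigRows).filter((d) => d !== null);
const trigExtended = triggerDurations(trigRows, 1).filter((d) => d !== null);
console.log(trigPlain, trigExtended); // [ 3, 5, 3 ] [ 3, 5, 4 ]
```

The sketch keeps the first trigger per group in a map instead of a windowed min, which is equivalent here only because the rows are processed in time order.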

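As a side note, datediff is a little naive in its math: '2020-01-01 23:59:59' and '2020-01-02 00:00:01' are 2 seconds apart, but they are also 1 minute, 1 hour, and 1 day apart, because the function truncates the timestamps to the selected unit and then takes the difference of those results.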
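
A small JavaScript sketch of that truncate-then-subtract behaviour, using the two timestamps from the note. It models the behaviour as described here rather than Snowflake's actual implementation, treats the timestamps as UTC, and `truncatedDiff` is a hypothetical helper name.

```javascript
// Size of each unit in milliseconds.
const UNIT_MS = { second: 1000, minute: 60000, hour: 3600000, day: 86400000 };

// Truncate both timestamps to the unit boundary, then subtract the results.
function truncatedDiff(unit, start, end) {
  const size = UNIT_MS[unit];
  return Math.floor(end.getTime() / size) - Math.floor(start.getTime() / size);
}

const beforeMidnight = new Date(Date.UTC(2020, 0, 1, 23, 59, 59));
const afterMidnight = new Date(Date.UTC(2020, 0, 2, 0, 0, 1));

for (const unit of ["second", "minute", "hour", "day"]) {
  console.log(unit, truncatedDiff(unit, beforeMidnight, afterMidnight));
}
// second 2, minute 1, hour 1, day 1
```

If elapsed time matters more than boundary crossings, one option is to take the difference in a finer unit such as seconds and divide.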

To get the final batch with the value of 4 as asked in the question, change the lead_reset_time line to:

,coalesce(lead(reset_time)ignore nulls over(order by d.time), dateadd('minute', 1, max_time)) as lead_reset_time

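This moves max_time forward by one minute, on the assumption that, in the absence of later data, the row state at 10:06 is valid for 1 minute. That is not how I would do it, but it is the code you asked for.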

【Comments】:

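    【Solution 3】:

    I'm not the proudest of this code, but it works and gives a starting point. I'm sure it can be cleaned up or simplified, and I have not assessed its performance on large tables.

    The key insight I used is that if you add the datediff result back to the datetime, you can spot cases where two rows add up to the same value, which means they are both counting toward the same "0" record. Hopefully that concept is helpful if nothing else.

    Also, the first CTE is a semi-hacky way of getting that 4 at the end of your results.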

    --Add a fake zero at the end of the table to provide a value for
    -- comparing high values that have not been resolved
    -- added a flag so this fake value can be removed later
    with fakezero as
    (
    SELECT datetime, value, 1 flag
    FROM test
    
    UNION ALL
    
    SELECT dateadd(minute, 1, max(datetime)) datetime, 0 value, 0 flag
    FROM test  
    )
    
    -- Find date diffs between high values and subsequent low values
    ,diffs as (
    select t.*,
           (case when value >= 3
                 then datediff(minute,
                               datetime,
                               min(case when value = 0 then datetime end) over (order by datetime desc)
                              )
            end) as diff_minutes
    from fakezero t
    )
    
    --Fix cases where two High values are "resolved" by the same low value
    --i.e. when adding the date_diff to the datetime results in the same timestamp
    -- this means that the prior high value record that still hasn't been "resolved"
    select
      datetime
      ,value
      ,case when 
          lag(dateadd(minute, diff_minutes, datetime)) over(partition by value order by datetime)
          = dateadd(minute, diff_minutes, datetime)
        then null 
        else diff_minutes 
      end as diff_minutes
    from diffs
    where flag = 1
    order by datetime;
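
The idea behind this query can also be sketched in plain JavaScript: pad the data with a fake zero one minute after the last row, find the zero each high value counts toward, and drop a high value when the previous one resolved to the same zero. This is a simplified illustration, assuming one row per minute; unlike the SQL, which uses lag() partitioned by value, it compares against the previous high row regardless of its value. `dedupedDiffs` and `dedupRows` are hypothetical names.

```javascript
// Sample values from the question, one per minute starting at 9:50.
const dedupValues = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];
const dedupRows = dedupValues.map((value, i) => ({ minute: i, value }));

function dedupedDiffs(rows) {
  // Fake zero one minute after the last row, as the fakezero CTE adds.
  const last = rows[rows.length - 1];
  const padded = [...rows, { minute: last.minute + 1, value: 0, fake: true }];
  let prevResolvedAt = null; // zero row the previous high value counted toward
  const out = [];
  padded.forEach((row, i) => {
    if (row.fake) return; // the fake row is dropped from the output
    if (row.value < 3) {
      out.push(null);
      return;
    }
    // The next zero at or after this row; equals minute + diff_minutes.
    const resolvedAt = padded.slice(i).find((r) => r.value === 0).minute;
    out.push(resolvedAt === prevResolvedAt ? null : resolvedAt - row.minute);
    prevResolvedAt = resolvedAt;
  });
  return out;
}

const dedupResult = dedupedDiffs(dedupRows);
console.log(dedupResult.filter((d) => d !== null)); // [ 3, 5, 4 ]
```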
    

    【Comments】:
