Oracle window function - retrieve elements up to next occurrence of value

[Question title]: Oracle window function - retrieve elements up to next occurrence of value
[Posted]: 2019-12-02 18:52:22
[Question description]:

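I have a list of event logs for a lot of ids, given a start date. For each id and starting point, chronologically ordered sequences are selected that start with a "success" entry. I need to retrieve only the ids that have a failure as the next event, and only the transactions up to the next "success" (or up to the last entry, if there are only failures). "Failure" entries are rare. There is no "natural" bound on the number of potential subsequent failures.

Simplified input (date format dd.mm.yyyy):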

id          timestamp            event
123         12.09.2019          success
123         13.09.2019          success
124         12.09.2019          success
124         15.09.2019          failure
124         16.09.2019          success
124         17.09.2019          success
124         18.09.2019          failure
126         12.09.2019          success
126         16.09.2019          failure
126         17.09.2019          failure
128         …           

Expected Output:
    124         12.09.2019          success
    124         15.09.2019          failure
    124         16.09.2019          success
    126         12.09.2019          success
    126         16.09.2019          failure
    126         17.09.2019          failure

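123 is discarded because its next event is a success. For 124, everything after the first success that follows the failure is discarded. 126 never reaches a "success" stage again, so everything is retrieved.

I can test via lag/lead whether the next transaction is a success and exclude those ids, but how do I find the next "success" row (given that it may not even exist)? I solved this easily in Python by counting the "success" entries per id group, but I generate a lot of IO by transferring all the data. Is there a way to count the number of successes per id, perhaps in a partition-over clause, and cut after 2?

I access Oracle 11g from a Jupyter notebook via cx_Oracle/Python (i.e. I pass SQL statements to the database). There are about 50k ids and up to several million transactions per day.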
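For reference, the client-side approach mentioned in the question (count the "success" entries per id and cut after the second one) could look roughly like the sketch below. This is not the asker's actual code: the names `failed_sequences` and `SAMPLE` and the tuple layout are assumptions made for illustration, and the sketch assumes every sequence starts with a "success" row, as the question states.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (id, 'dd.mm.yyyy', event)
SAMPLE = [
    (123, "12.09.2019", "success"), (123, "13.09.2019", "success"),
    (124, "12.09.2019", "success"), (124, "15.09.2019", "failure"),
    (124, "16.09.2019", "success"), (124, "17.09.2019", "success"),
    (124, "18.09.2019", "failure"),
    (126, "12.09.2019", "success"), (126, "16.09.2019", "failure"),
    (126, "17.09.2019", "failure"),
]


def parse(ts):
    """Turn a dd.mm.yyyy string into a datetime so sorting is chronological."""
    return datetime.strptime(ts, "%d.%m.%Y")


def failed_sequences(rows):
    """Keep, per id, the rows up to and including the second success,
    but only for ids whose second event is a failure."""
    ordered = sorted(rows, key=lambda r: (r[0], parse(r[1])))
    kept = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        events = list(group)
        # Drop ids whose event right after the starting success is not a failure.
        if len(events) < 2 or events[1][2] != "failure":
            continue
        successes = 0
        for row in events:
            kept.append(row)
            if row[2] == "success":
                successes += 1
                if successes == 2:  # the success that ends the failure run
                    break
    return kept


if __name__ == "__main__":
    for row in failed_sequences(SAMPLE):
        print(*row)
```

On the sample data this keeps the six rows listed under "Expected Output". The drawback described above remains: all rows have to be transferred to the client first.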

[Question discussion]:

Tags: oracle oracle11g time-series window-functions

[Solution 1]:

It's ugly, but it returns the result you want:

with
test (id, timestamp, event) as
  -- sample data
  (select 123, '12.09.2019', 'success' from dual union all
   select 123, '13.09.2019', 'success' from dual union all
   --
   select 124, '12.09.2019', 'success' from dual union all
   select 124, '15.09.2019', 'failure' from dual union all
   select 124, '16.09.2019', 'success' from dual union all
   select 124, '17.09.2019', 'success' from dual union all
   select 124, '18.09.2019', 'failure' from dual union all
   --
   select 126, '12.09.2019', 'success' from dual union all
   select 126, '16.09.2019', 'failure' from dual union all
   select 126, '17.09.2019', 'failure' from dual
  ),
valids_both as
  -- IDs have to have both success and failure events to be valid
  -- (eliminates 123)
  (select id
   from test
   group by id
   having count(distinct event) = 2
  ),
valids_succ as
  -- search for timestamp of success which is not the starting success
  (select t.id, min(t.timestamp) timestamp
   from test t join valids_both v on v.id = t.id
   where t.event = 'success'
     and t.timestamp > (select min(t1.timestamp) from test t1
                        where t1.id = t.id
                       )
   group by t.id
  )
-- this is ID = 124
select t.id, t.timestamp, t.event
  from test t join valids_succ v on v.id = t.id
    and t.timestamp <= v.timestamp
union
-- this is ID = 126
select t.id, t.timestamp, t.event
  from test t join valids_both v on v.id = t.id
  where not exists (select null from valids_succ v1
                    where v1.id = v.id
                   )
order by id, timestamp;

        ID TIMESTAMP  EVENT
---------- ---------- -------
       124 12.09.2019 success
       124 15.09.2019 failure
       124 16.09.2019 success
       126 12.09.2019 success
       126 16.09.2019 failure
       126 17.09.2019 failure

6 rows selected.

SQL>

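How will it cope with a large data set? I'm afraid to ask; you will find out once you test it.

[Discussion]:

Thank you; indeed, I was hoping to avoid "stitching" the parts together, which looks very expensive to me, although it is correct as far as I can tell. Transferring it to my real data is not easy, because they are much more complex (more than just two events). I will try it if Ponder Stibbons' suggestion turns out to be harder to fix than I currently think.
You're welcome. Everything you (and I) said holds true: it is ugly and probably won't be fast. Anyway, good luck!
There is a problem with the count(distinct) part in valids_both: if a failure occurs after several successes - e.g. add a third row "failure" on 15.09.2019 for 123 - the entries for 123 reappear. I posted a solution without union - it can probably still be tuned, but as far as I can tell it works and is fast enough.

[Solution 2]: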

select id, timestamp, event
  from (
    select id, timestamp, event, 
           lag(event, 1, 'x') over (partition by id order by timestamp) lg_event,
           count(case event when 'failure' then 1 end) over (partition by id) cf 
      from t)
  where cf <> 0 and (event = 'failure' or lg_event <> event)

dbfiddle demo

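Use an analytic count to find the ids that have failures. Show only failure rows or rows where the event changes, ignoring consecutive successes.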
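To trace this filter by hand, here is a minimal Python emulation of the condition `cf <> 0 and (event = 'failure' or lg_event <> event)`. It is only a sketch: the names `lag_count_filter` and `ROWS` are made up for illustration, and the rows are assumed to be already sorted by id and then chronologically.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question, already sorted by id and date.
ROWS = [
    (123, "12.09.2019", "success"), (123, "13.09.2019", "success"),
    (124, "12.09.2019", "success"), (124, "15.09.2019", "failure"),
    (124, "16.09.2019", "success"), (124, "17.09.2019", "success"),
    (124, "18.09.2019", "failure"),
    (126, "12.09.2019", "success"), (126, "16.09.2019", "failure"),
    (126, "17.09.2019", "failure"),
]


def lag_count_filter(rows):
    """Emulate the analytic count plus lag filter, one id partition at a time."""
    kept = []
    for _, group in groupby(rows, key=itemgetter(0)):
        events = list(group)
        # count(case event when 'failure' then 1 end) over (partition by id)
        cf = sum(1 for r in events if r[2] == "failure")
        if cf == 0:
            continue
        previous = "x"  # default value of lag(event, 1, 'x')
        for row in events:
            if row[2] == "failure" or previous != row[2]:
                kept.append(row)
            previous = row[2]
    return kept


if __name__ == "__main__":
    for row in lag_count_filter(ROWS):
        print(*row)
```

On the sample data the emulation drops 123 and the repeated success of 124 on 17.09.2019, but it still keeps the 124 failure on 18.09.2019, so the result has one row more than the expected output.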

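[Discussion]:

Thanks, this goes in exactly the right direction. I wasn't aware of the count(case ...) possibility. However, it does not currently give the correct result for case 124 - failure events after the first subsequent success are still output. I'll try to fix that; it looks solvable.

[Solution 3]:

Building on Ponder Stibbons' approach, I derived two flags. The simpler one covers the case where the next event is already a success. In that case everything should be discarded, even if some failures follow later. The success_flag is trickier: I use row_number to identify rows that combine "failure" as the current event with a success in the next (lead) row. Taking the minimum over these rows yields the position "critical_line" at which I need to cut and discard the following rows. If there is no such success, critical_line is null. Performance is OK, less than 3 minutes. Since the final code differs considerably from Ponder Stibbons', I chose to post it as an answer.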

with
  test (id, timestamp, event) as
    -- sample data
    (select 123, '12.09.2019', 'success' from dual union all
     select 123, '13.09.2019', 'success' from dual union all
     select 123, '15.09.2019', 'failure' from dual union all
     select 124, '12.09.2019', 'success' from dual union all
     select 124, '15.09.2019', 'failure' from dual union all
     select 124, '16.09.2019', 'success' from dual union all
     select 124, '17.09.2019', 'success' from dual union all
     select 124, '18.09.2019', 'failure' from dual union all
     select 126, '12.09.2019', 'success' from dual union all
     select 126, '16.09.2019', 'failure' from dual union all
     select 126, '17.09.2019', 'failure' from dual
    )
select
    id, timestamp, event
from
(
  select id, rn, timestamp, event,
    -- flag_simple > 0: the second event is already a success, so the whole id is dropped
    count(case when rn=2 and event='success' then 1 end) over (partition by id) as flag_simple,
    success_flag,
    -- critical_line: row number of the first success that directly follows a failure (null if none)
    min(case when success_flag>0 then success_flag end) over (partition by id) as critical_line
  from
  (
    select
        id,
        timestamp,
        event,
        -- note: the sample stores dates as dd.mm.yyyy strings; on real data order by a DATE/TIMESTAMP column
        row_number() over (partition by id order by timestamp) as rn,
        -- for a failure followed by a success, remember the row number of that success
        case when event='failure' and
            lead(event,1,'x') over (partition by id order by timestamp)='success'
            then 1+row_number() over (partition by id order by timestamp) else 0 end
            as success_flag
    from
        test
  )
)
where
    flag_simple=0
    and
    (rn<=critical_line or critical_line is NULL)
order by id, timestamp;
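The two flags can also be traced outside the database. The following Python sketch mirrors flag_simple, success_flag and critical_line on the sample data of this answer. It is only an illustration, not part of the posted solution: the names `two_flag_filter` and `ROWS` are assumptions, and the rows are assumed to be pre-sorted by id and then chronologically.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows of this answer (123 has an extra trailing failure).
ROWS = [
    (123, "12.09.2019", "success"), (123, "13.09.2019", "success"),
    (123, "15.09.2019", "failure"),
    (124, "12.09.2019", "success"), (124, "15.09.2019", "failure"),
    (124, "16.09.2019", "success"), (124, "17.09.2019", "success"),
    (124, "18.09.2019", "failure"),
    (126, "12.09.2019", "success"), (126, "16.09.2019", "failure"),
    (126, "17.09.2019", "failure"),
]


def two_flag_filter(rows):
    """Apply the flag_simple / critical_line logic per id partition."""
    kept = []
    for _, group in groupby(rows, key=itemgetter(0)):
        part = list(group)
        # flag_simple: the row with rn = 2 is already a success, drop the id.
        if len(part) >= 2 and part[1][2] == "success":
            continue
        # success_flag: rn + 1 for every failure directly followed by a success.
        flags = [
            rn + 1
            for rn, (row, nxt) in enumerate(zip(part, part[1:]), start=1)
            if row[2] == "failure" and nxt[2] == "success"
        ]
        # critical_line: smallest success_flag, or None if no success follows.
        critical_line = min(flags) if flags else None
        for rn, row in enumerate(part, start=1):
            if critical_line is None or rn <= critical_line:
                kept.append(row)
    return kept


if __name__ == "__main__":
    for row in two_flag_filter(ROWS):
        print(*row)
```

For 124 the only failure followed by a success sits at rn = 2, so critical_line is 3 and rows 4 and 5 are cut. For 126 no success follows a failure, so critical_line stays empty and all rows are kept. 123 is dropped by flag_simple despite its trailing failure.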

[Discussion]: