【Question title】: In PostgreSQL, how can I find 3 consecutive events where only the first one occurs within a time frame?
【Posted】: 2019-09-04 15:50:48
【Question】:

I have the following table with the columns user_id, timestamp, and event_id. The tag column indicates whether a row is a desired one (tag = 1) or not (tag = 0):

user_id | timestamp                 | event_id | tag 

    46  | 2018-12-21 08:42:35.000   | 1        | 0
    46  | 2018-12-21 09:58:35.000   | 2        | 1
    46  | 2018-12-22 06:42:35.000   | 3        | 0
    46  | 2018-12-22 07:18:35.000   | 4        | 1
    46  | 2018-12-22 08:30:35.000   | 5        | 1
    46  | 2018-12-23 06:42:35.000   | 6        | 0
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1
    46  | 2018-12-23 09:42:35.000   | 14       | 0
    46  | 2018-12-23 10:17:35.000   | 15       | 1   
    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0
    46  | 2018-12-24 11:51:35.000   | 19       | 1
    122 | 2018-12-22 08:30:35.000   | 1        | 1
    122 | 2018-12-23 06:42:35.000   | 2        | 0
    122 | 2018-12-23 06:11:35.000   | 3        | 1
    122 | 2018-12-23 07:51:35.000   | 4        | 1
    122 | 2018-12-23 07:26:35.000   | 5        | 1
    122 | 2018-12-23 07:37:35.000   | 6        | 1
    122 | 2018-12-28 06:42:35.000   | 1        | 0
    122 | 2018-12-28 06:38:35.000   | 2        | 1
    122 | 2018-12-28 07:51:35.000   | 3        | 1
    122 | 2018-12-28 07:26:35.000   | 4        | 1
    122 | 2018-12-28 08:42:35.000   | 5        | 0
    122 | 2018-12-28 09:38:35.000   | 6        | 0
    122 | 2018-12-28 10:51:35.000   | 7        | 0
    122 | 2018-12-28 11:26:35.000   | 8        | 0

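So I would like to find:

The number of times a user had 3 consecutive correct (tag = 1) events, i.e. a triplet, on the same date.
The timestamp of the first event of each of those 3 consecutive events.

Ideally, the returned table would look like this: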

user_id | first_occurrence           |event_id | consecutive_events 
     46 | 2018-12-23 06:11:35.000    | 7       | 2  <-- 2 consecutive triplets 
     46 | 2018-12-23 07:37:35.000    | 10      | 2  <-- this has 4 consecutive events  but I am only interested in triplets of events.
     122| 2018-12-23 06:11:35.000    | 4       | 1
     122| 2018-12-28 06:38:35.000    | 2       | 1

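In other words, the consecutive_events column has to show the number of triplets per user and day, whereas the first_occurrence and event_id columns should show the first timestamp and event_id of every triplet per user ID and date.

Note: user_id 46 has a triplet of zeros (tag = 0). Such triplets should be excluded.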

    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0

I tried using the DENSE_RANK() function, but the result is far from what I need:

dense_rank() over (partition by user_id, date(timestamp) order by tag,date(timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

[Update]

The example I mentioned in the first comment on Gordon's answer is the following. For these consecutive events:

user_id | timestamp                 | event_id | tag 
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1

the query returns:

 user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2

whereas it should also return the second triplet:

user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2
     46  | 2018-12-23 07:37:35.000   | 10           | 2

Do you think it is possible to retrieve this as well?

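【Comments on the question】:

How is "consecutive" defined? By event_id or by timestamp?
@GordonLinoff By timestamp. Thank you very much, Gordon. This is the first time I have faced a gaps-and-islands problem, and I really like your answer. However, it solves about 90% of the problem. The only issue is that min(timestamp) does not return the "first" timestamp of the second triplet. Please see my updated question for a detailed example of this case. Thanks again!

Tags: sql postgresql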

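【Solution 1】:

This is a gaps-and-islands problem. The difference of row numbers seems like the best approach.

To get all runs of adjacent rows with the same tag: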

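select user_id, min(timestamp) as timestamp,
       count(*) as num_consecutive,
       min(event_id) as min_event_id
from (select t.*,
             -- position of the row among all of the user's rows on that day
             row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
             -- position of the row among that day's rows with the same tag
             row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
      from t
     ) t
-- within an unbroken run of equal tags, seqnum - seqnum_t stays constant
group by user_id, timestamp::date, tag, (seqnum - seqnum_t);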
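
To see why the difference of the two row numbers identifies a run, it can help to look at the intermediate values. The following is a minimal sketch on hypothetical toy data (one user, one day, tags 1,1,0,1,1,1 in time order); the CTE contents, the column name ts (used instead of timestamp to keep the literals readable), and the alias grp are assumptions made for illustration and are not part of the question's table.

```sql
-- Hypothetical toy data, not the question's table.
with t (user_id, ts, event_id, tag) as (
    values
        (1, timestamp '2018-12-23 06:00:00', 1, 1),
        (1, timestamp '2018-12-23 07:00:00', 2, 1),
        (1, timestamp '2018-12-23 08:00:00', 3, 0),
        (1, timestamp '2018-12-23 09:00:00', 4, 1),
        (1, timestamp '2018-12-23 10:00:00', 5, 1),
        (1, timestamp '2018-12-23 11:00:00', 6, 1)
)
select event_id, tag,
       row_number() over (partition by user_id, ts::date order by ts) as seqnum,
       row_number() over (partition by user_id, ts::date, tag order by ts) as seqnum_t,
       row_number() over (partition by user_id, ts::date order by ts)
         - row_number() over (partition by user_id, ts::date, tag order by ts) as grp
from t
order by ts;
-- event_id | tag | seqnum | seqnum_t | grp
--        1 |   1 |      1 |        1 |   0
--        2 |   1 |      2 |        2 |   0
--        3 |   0 |      3 |        1 |   2
--        4 |   1 |      4 |        3 |   1
--        5 |   1 |      5 |        4 |   1
--        6 |   1 |      6 |        5 |   1
```

Rows 1 and 2 share (tag = 1, grp = 0) and rows 4 to 6 share (tag = 1, grp = 1), so grouping by tag and grp separates the two runs that the tag = 0 row interrupts.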

If you want each sequence individually, just add where tag = 1 and having count(*) >= 3 to this query.
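
As a hedged sketch of that variant, the query below applies both conditions to hypothetical toy data (the CTE, the column name ts, and the aliases first_ts and s are illustrative assumptions). It assumes the tag filter belongs in the outer query: if it were placed inside the subquery, the row numbers would be computed over tag = 1 rows only, and the difference would no longer mark where a run is interrupted.

```sql
-- Hypothetical toy data: tags 1,1,0,1,1,1 for one user on one day.
with t (user_id, ts, event_id, tag) as (
    values
        (1, timestamp '2018-12-23 06:00:00', 1, 1),
        (1, timestamp '2018-12-23 07:00:00', 2, 1),
        (1, timestamp '2018-12-23 08:00:00', 3, 0),
        (1, timestamp '2018-12-23 09:00:00', 4, 1),
        (1, timestamp '2018-12-23 10:00:00', 5, 1),
        (1, timestamp '2018-12-23 11:00:00', 6, 1)
)
select user_id,
       min(ts)       as first_ts,
       min(event_id) as min_event_id,
       count(*)      as num_consecutive
from (select t.*,
             row_number() over (partition by user_id, ts::date order by ts) as seqnum,
             row_number() over (partition by user_id, ts::date, tag order by ts) as seqnum_t
      from t
     ) s
where tag = 1                      -- filter after the row numbers exist
group by user_id, ts::date, tag, (seqnum - seqnum_t)
having count(*) >= 3;              -- keep only runs of at least three events
-- user_id | first_ts            | min_event_id | num_consecutive
--       1 | 2018-12-23 09:00:00 |            4 |               3
```

The run made of events 1 and 2 has only two rows, so the having clause drops it.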

To turn this into the result set you want, use a subquery:

select user_id,
       min(min_event_id) as min_event_id,  -- the subquery exposes min_event_id, not event_id
       min(timestamp) as first_occurrence,
       -- integer division per run, so only whole triplets are counted
       sum(num_consecutive / 3) as consecutive_events
from (select user_id, min(timestamp) as timestamp,
             count(*) as num_consecutive,
             min(event_id) as min_event_id
      from (select t.*,
                   row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
                   row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
            from t
           ) t
      where tag = 1
      group by user_id, timestamp::date, tag, (seqnum - seqnum_t)
     ) t
where num_consecutive >= 3
group by user_id, timestamp::date;
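
The query above reports one row per user and day, so it cannot show the first timestamp of a second triplet inside the same run, which is what the question's update asks for. One possible extension, which is not part of the original answer, is to number the rows inside each run and cut the run into blocks of three. The sketch below does this on hypothetical toy data (seven consecutive tag = 1 events followed by one tag = 0 event); the CTE names numbered and islands, the column names ts and pos, and the data itself are assumptions made for illustration.

```sql
-- Hypothetical toy data, not the question's table.
with t (user_id, ts, event_id, tag) as (
    values
        (1, timestamp '2018-12-23 06:00:00', 1, 1),
        (1, timestamp '2018-12-23 07:00:00', 2, 1),
        (1, timestamp '2018-12-23 08:00:00', 3, 1),
        (1, timestamp '2018-12-23 09:00:00', 4, 1),
        (1, timestamp '2018-12-23 10:00:00', 5, 1),
        (1, timestamp '2018-12-23 11:00:00', 6, 1),
        (1, timestamp '2018-12-23 12:00:00', 7, 1),
        (1, timestamp '2018-12-23 13:00:00', 8, 0)
),
numbered as (
    select t.*,
           row_number() over (partition by user_id, ts::date order by ts) as seqnum,
           row_number() over (partition by user_id, ts::date, tag order by ts) as seqnum_t
    from t
),
islands as (
    -- pos is the position of each tag = 1 row inside its run
    select n.*,
           row_number() over (partition by user_id, ts::date, seqnum - seqnum_t
                              order by ts) as pos
    from numbered n
    where tag = 1
)
select user_id,
       min(ts) as first_occurrence,
       (array_agg(event_id order by ts))[1] as event_id,   -- event_id of the earliest row
       count(*) over (partition by user_id, ts::date) as consecutive_events
from islands
group by user_id, ts::date, (seqnum - seqnum_t), (pos - 1) / 3  -- blocks of three rows
having count(*) = 3                                             -- drop incomplete blocks
order by user_id, first_occurrence;
-- user_id | first_occurrence    | event_id | consecutive_events
--       1 | 2018-12-23 06:00:00 |        1 |                  2
--       1 | 2018-12-23 09:00:00 |        4 |                  2
```

Event 7 forms an incomplete block and is discarded by the having clause. The window count runs after grouping and having, so consecutive_events is the number of complete triplets for that user and day.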

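【Comments】:

Thanks for the update! I tried adding where tag = 1 and having count(*) >= 3 to the first subquery, but it returns the wrong results. Apart from the fact that some triplets are missing, this query returns the timestamp of the last consecutive event (which is not a triplet) instead of the first timestamp of the second triplet.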