【Question title】: In PostgreSQL, how can I find 3 consecutive events where only the first one occurs within a time frame?
【Posted】: 2019-09-04 15:50:48
【Question】:

I have the following table, containing user_id, timestamp and event_id. The "tag" column indicates whether a row is a desired one (tag = 1) or not (tag = 0):

user_id | timestamp                 | event_id | tag 

    46  | 2018-12-21 08:42:35.000   | 1        | 0
    46  | 2018-12-21 09:58:35.000   | 2        | 1
    46  | 2018-12-22 06:42:35.000   | 3        | 0
    46  | 2018-12-22 07:18:35.000   | 4        | 1
    46  | 2018-12-22 08:30:35.000   | 5        | 1
    46  | 2018-12-23 06:42:35.000   | 6        | 0
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1
    46  | 2018-12-23 09:42:35.000   | 14       | 0
    46  | 2018-12-23 10:17:35.000   | 15       | 1   
    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0
    46  | 2018-12-24 11:51:35.000   | 19       | 1
    122 | 2018-12-22 08:30:35.000   | 1        | 1
    122 | 2018-12-23 06:42:35.000   | 2        | 0
    122 | 2018-12-23 06:11:35.000   | 3        | 1
    122 | 2018-12-23 07:51:35.000   | 4        | 1
    122 | 2018-12-23 07:26:35.000   | 5        | 1
    122 | 2018-12-23 07:37:35.000   | 6        | 1
    122 | 2018-12-28 06:42:35.000   | 1        | 0
    122 | 2018-12-28 06:38:35.000   | 2        | 1
    122 | 2018-12-28 07:51:35.000   | 3        | 1
    122 | 2018-12-28 07:26:35.000   | 4        | 1
    122 | 2018-12-28 08:42:35.000   | 5        | 0
    122 | 2018-12-28 09:38:35.000   | 6        | 0
    122 | 2018-12-28 10:51:35.000   | 7        | 0
    122 | 2018-12-28 11:26:35.000   | 8        | 0

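So I want to find:

  1. The number of times a user has 3 correct (tag = 1) consecutive events (i.e. triplets) on the same date.
  2. The timestamp of the first event of each of these groups of 3 consecutive events.

Ideally, the returned table should look like this: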

user_id | first_occurrence           |event_id | consecutive_events 
     46 | 2018-12-23 06:11:35.000    | 7       | 2  <-- 2 consecutive triplets 
     46 | 2018-12-23 07:37:35.000    | 10      | 2  <-- this has 4 consecutive events  but I am only interested in triplets of events.
     122| 2018-12-23 06:11:35.000    | 4       | 1
     122| 2018-12-28 06:38:35.000    | 2       | 1  

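In other words, the consecutive_events column must show the number of triplets a user has per day, while the first_occurrence and event_id columns should show the first timestamp and event_id of each triplet, per user_id and date.

Note: user_id 46 has a triplet of zeros (tag = 0). Such triplets should be excluded.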

    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0

I tried using the DENSE_RANK() function, but the result is far from optimal:

dense_rank() over (partition by user_id, date(timestamp) order by tag,date(timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

[UPDATE]

The example I mentioned in the first comment on Gordon's answer is the following. For these consecutive events:

user_id | timestamp                 | event_id | tag 
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1

The query returns:

 user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2

It should return the second triplet as well:

user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2
     46  | 2018-12-23 07:37:35.000   | 10           | 2

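Do you think this can be fetched as well?

【Comments】:

  • How is "consecutive" defined? By event_id or timestamp?
  • @GordonLinoff By timestamp. Thanks a lot, Gordon. This is the first time I have come across a gaps-and-islands problem, and I really like your answer. However, it solves about 90% of the problem. The only issue is that min(timestamp) does not return the "first" timestamp of the second triplet. Please see my updated question for a detailed example of this case. Thanks again!

Tags: sql postgresql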


【Solution 1】:

This is a gaps-and-islands problem. The difference of row numbers seems like the best approach.

To get all the adjacent values:

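-- One row per run of adjacent rows sharing the same tag, per user and day.
select user_id, min(timestamp) as timestamp,
       count(*) as num_consecutive,
       min(event_id) as min_event_id
from (select t.*,
             -- position among all of the user's rows on that day
             row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
             -- position among that day's rows with the same tag
             row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
      from t
     ) t
-- seqnum - seqnum_t stays constant within a run of equal tags
group by user_id, timestamp::date, tag, (seqnum - seqnum_t);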
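
To see why the difference of row numbers identifies a run, it can help to look at the two counters side by side. The sketch below uses a made-up inline sample (one user, one day, tags 1, 1, 0, 1, 1, 1) rather than the question's table. The CTE name demo and the alias grp are illustrative names only, and the column is written as "timestamp" in double quotes purely as a precaution.

```sql
-- Hypothetical sample data, not taken from the question.
with demo(user_id, "timestamp", event_id, tag) as (
    values (1, '2018-12-23 06:00:00'::timestamp, 1, 1),
           (1, '2018-12-23 07:00:00'::timestamp, 2, 1),
           (1, '2018-12-23 08:00:00'::timestamp, 3, 0),
           (1, '2018-12-23 09:00:00'::timestamp, 4, 1),
           (1, '2018-12-23 10:00:00'::timestamp, 5, 1),
           (1, '2018-12-23 11:00:00'::timestamp, 6, 1)
)
select event_id, tag, seqnum, seqnum_t,
       seqnum - seqnum_t as grp          -- constant within each run of equal tags
from (select d.*,
             row_number() over (partition by user_id, "timestamp"::date
                                order by "timestamp") as seqnum,
             row_number() over (partition by user_id, "timestamp"::date, tag
                                order by "timestamp") as seqnum_t
      from demo d
     ) numbered
order by "timestamp";
-- event_id | tag | seqnum | seqnum_t | grp
--        1 |   1 |      1 |        1 |   0
--        2 |   1 |      2 |        2 |   0
--        3 |   0 |      3 |        1 |   2
--        4 |   1 |      4 |        3 |   1
--        5 |   1 |      5 |        4 |   1
--        6 |   1 |      6 |        5 |   1
```

Rows 1 and 2 share grp = 0 and rows 4 to 6 share grp = 1, so grouping by (tag, grp) separates the two runs of tag = 1.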

If you want each sequence individually, just add where tag = 1 and having count(*) >= 3 to the grouping query above.

To convert this to the result set you want, use a subquery:
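
As a rough sketch, the filter and HAVING clause mentioned above could be applied as follows. It assumes the same table t as in the answer; the aliases numbered and first_occurrence are illustrative, and "timestamp" is double-quoted only as a precaution.

```sql
-- Keep only runs of tag = 1 that contain at least three rows.
select user_id,
       min("timestamp") as first_occurrence,
       min(event_id)    as min_event_id,
       count(*)         as num_consecutive
from (select t.*,
             row_number() over (partition by user_id, "timestamp"::date
                                order by "timestamp") as seqnum,
             row_number() over (partition by user_id, "timestamp"::date, tag
                                order by "timestamp") as seqnum_t
      from t
     ) numbered
where tag = 1                                   -- applied after the row numbers are computed
group by user_id, "timestamp"::date, tag, (seqnum - seqnum_t)
having count(*) >= 3;                           -- drop runs shorter than a triplet
```

The tag filter has to sit outside the subquery: filtering before the row numbers are computed would make all tag = 1 rows of a day look adjacent.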

select user_id,
       min(min_event_id) as min_event_id,   -- the subquery exposes min_event_id, not event_id
       min(timestamp) as first_occurrence,
       -- integer division counts the complete triplets in each run before summing
       sum(num_consecutive / 3) as consecutive_events
from (select user_id, min(timestamp) as timestamp,
             count(*) as num_consecutive,
             min(event_id) as min_event_id
      from (select t.*,
                   row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
                   row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
            from t
           ) t
      where tag = 1
      group by user_id, timestamp::date, tag, (seqnum - seqnum_t)
     ) t
where num_consecutive >= 3
group by user_id, timestamp::date;
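
The query above returns one row per user and day, so it cannot report the first timestamp of a second triplet inside the same run, which is what the question's update asks for. One possible way to get a row per triplet, offered only as a sketch and not as part of the original answer, is to number the rows inside each run and keep every third row that still has two rows after it. The CTE names (numbered, islands, triplets) and the columns pos, island_size and event_day are illustrative. Note also that the sample timestamps are not monotonic in event_id, so the result depends on the ordering column; replacing order by "timestamp" with order by event_id changes which rows count as adjacent.

```sql
with numbered as (
    select t.*,
           row_number() over (partition by user_id, "timestamp"::date
                              order by "timestamp") as seqnum,
           row_number() over (partition by user_id, "timestamp"::date, tag
                              order by "timestamp") as seqnum_t
    from t
),
islands as (
    -- Position of each row inside its run of tag = 1 rows, plus the run length.
    select n.*,
           row_number() over (partition by user_id, "timestamp"::date, seqnum - seqnum_t
                              order by "timestamp") as pos,
           count(*)     over (partition by user_id, "timestamp"::date, seqnum - seqnum_t) as island_size
    from numbered n
    where tag = 1
),
triplets as (
    select user_id,
           "timestamp"       as first_occurrence,
           event_id,
           "timestamp"::date as event_day
    from islands
    where (pos - 1) % 3 = 0          -- rows 1, 4, 7, ... open a new chunk of three
      and pos + 2 <= island_size     -- keep only chunks that are complete triplets
)
select user_id, first_occurrence, event_id,
       count(*) over (partition by user_id, event_day) as consecutive_events
from triplets
order by user_id, first_occurrence;
```

A run of seven rows yields two rows here (positions 1 and 4), and the leftover seventh row is ignored because it cannot start a complete triplet.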

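【Comments】:

  • Thanks for the update! I already tried adding where tag = 1 and having count(*) >= 3 to the first subquery, but it fetches wrong results. Apart from the fact that some triplets are missing, instead of the first timestamp of the second triplet, this query fetches the timestamp of the last consecutive event (which is not part of a triplet).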