【Question title】: Count number of less than 5 minute apart intervals per group
【Posted】: 2019-05-02 07:19:57
【Question】:

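I have a table like the following (the raw data is listed at the end of the question):

I need to group the data by Cat and Timestamp and produce a count for each group. A group is defined by a dynamic 5-minute time window, which means it can span different hours.

The query result should look like this (a count of 3 for Cat1 and 5 for Cat2):

Look at the rows highlighted in yellow in the first table. Each highlighted run should be detected and counted as one group, and each non-highlighted row should also count as one.

I have read many solutions on Stack Overflow; here are the relevant ones I tried:

  • Creating fixed 5-minute interval groups. This does not work, because timestamps that fall on either side of an hour boundary are not matched into the same group.
  • Using ROW_NUMBER() OVER (PARTITION BY Category ORDER BY Timestamp), self-joining on t1.Cat = t2.Cat AND t1.rn + 1 = t2.rn, and filtering by DATEDIFF. This does not work, because it only detects pairs. What if 5 timestamps occur one after another within 5 minutes?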
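
To illustrate why fixed 5-minute buckets can split rows that are close together, here is a minimal T-SQL sketch. It is not part of the original question; the table variable @demo and the columns ts and bucket_start are hypothetical names, and the two timestamps are taken from the Cat2 sample data.

```sql
-- Hypothetical demo: two Cat2 timestamps that are 3 minutes 2 seconds apart.
DECLARE @demo TABLE (ts DATETIME);
INSERT INTO @demo VALUES
('2018-10-01 07:59:13'),
('2018-10-01 08:02:15');

-- Floor each timestamp to the start of its fixed 5-minute bucket.
-- DATEDIFF(MINUTE, 0, ts) is the number of whole minutes since 1900-01-01,
-- and the integer division by 5 rounds that down to a multiple of 5.
SELECT ts,
       DATEADD(MINUTE, DATEDIFF(MINUTE, 0, ts) / 5 * 5, 0) AS bucket_start
FROM @demo
ORDER BY ts;
-- 07:59:13 lands in the bucket starting at 07:55:00,
-- 08:02:15 lands in the bucket starting at 08:00:00.
```

Although the two rows are well within 5 minutes of each other, they end up in different buckets, which is why the grouping has to be based on the gap to the previous row instead.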

I would greatly appreciate any help with this.

The raw data is given below as an ASCII table.

Raw data

+---------------------+----------+
|      Timestamp      | Category |
+---------------------+----------+
| 2018-10-01 04:06:12 | Cat1     |
| 2018-10-01 05:07:18 | Cat1     |
| 2018-10-01 05:07:19 | Cat1     |
| 2018-10-01 05:07:20 | Cat1     |
| 2018-10-01 06:09:29 | Cat1     |
| 2018-10-01 07:24:12 | Cat2     |
| 2018-10-01 07:30:43 | Cat2     |
| 2018-10-01 07:59:13 | Cat2     |
| 2018-10-01 08:02:15 | Cat2     |
| 2018-10-01 10:09:25 | Cat2     |
| 2018-10-01 11:13:42 | Cat2     |
+---------------------+----------+

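【Comments】:

  • Should the first record "2018-10-01 05:06:12" be included in the group, since the next available value is 05:07, which is within the 5-minute window?
  • Yes, that is correct. Sorry, I will fix the picture.
  • I have changed the first timestamp, so the table should be correct now.

Tags: sql sql-server datetime group-by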


【Solution 1】:

This can be done easily with LAG:

DECLARE @t TABLE (timestamp DATETIME, category VARCHAR(100));
INSERT INTO @t VALUES
('2018-10-01 04:06:12', 'CAT1'),
('2018-10-01 05:07:18', 'CAT1'),
('2018-10-01 05:07:19', 'CAT1'),
('2018-10-01 05:07:20', 'CAT1'),
('2018-10-01 06:09:29', 'CAT1'),
('2018-10-01 07:24:12', 'CAT2'),
('2018-10-01 07:30:43', 'CAT2'),
('2018-10-01 07:59:13', 'CAT2'),
('2018-10-01 08:02:15', 'CAT2'),
('2018-10-01 10:09:25', 'CAT2'),
('2018-10-01 11:13:42', 'CAT2');

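WITH cte1 AS (
    SELECT timestamp, category,
           -- chg = 1 marks the first row of a new group: there is no previous row in the
           -- category, or the previous timestamp is 5 minutes or more earlier
           CASE
               WHEN LAG(timestamp) OVER (PARTITION BY category ORDER BY timestamp) > DATEADD(MINUTE, -5, timestamp) THEN 0
               ELSE 1
           END AS chg
    FROM @t
)
-- each group has exactly one row with chg = 1, so counting those rows counts the groups
SELECT category, COUNT(CASE WHEN chg = 1 THEN 1 END) AS cnt
FROM cte1
GROUP BY category;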

To understand how it works, look at how the chg column is computed and inspect the result of the CTE:

timestamp                  category    chg
2018-10-01 04:06:12.000    CAT1        1
2018-10-01 05:07:18.000    CAT1        1
2018-10-01 05:07:19.000    CAT1        0
2018-10-01 05:07:20.000    CAT1        0
2018-10-01 06:09:29.000    CAT1        1
2018-10-01 07:24:12.000    CAT2        1
2018-10-01 07:30:43.000    CAT2        1
2018-10-01 07:59:13.000    CAT2        1
2018-10-01 08:02:15.000    CAT2        0
2018-10-01 10:09:25.000    CAT2        1
2018-10-01 11:13:42.000    CAT2        1
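
The chg flags can also be turned into explicit group numbers with a running sum, which makes it easier to inspect what each group contains. The following is a minimal sketch rather than part of the original answer; the CTE names flagged and grouped and the columns grp, first_ts, last_ts, and rows_in_group are illustrative.

```sql
DECLARE @t TABLE (timestamp DATETIME, category VARCHAR(100));
INSERT INTO @t VALUES
('2018-10-01 04:06:12', 'CAT1'),
('2018-10-01 05:07:18', 'CAT1'),
('2018-10-01 05:07:19', 'CAT1'),
('2018-10-01 05:07:20', 'CAT1'),
('2018-10-01 06:09:29', 'CAT1'),
('2018-10-01 07:24:12', 'CAT2'),
('2018-10-01 07:30:43', 'CAT2'),
('2018-10-01 07:59:13', 'CAT2'),
('2018-10-01 08:02:15', 'CAT2'),
('2018-10-01 10:09:25', 'CAT2'),
('2018-10-01 11:13:42', 'CAT2');

WITH flagged AS (
    -- Same flag as in the answer: 1 on the first row of each group, 0 otherwise.
    SELECT timestamp, category,
           CASE
               WHEN LAG(timestamp) OVER (PARTITION BY category ORDER BY timestamp) > DATEADD(MINUTE, -5, timestamp) THEN 0
               ELSE 1
           END AS chg
    FROM @t
),
grouped AS (
    -- A running sum of the flags gives every row the number of the group it belongs to.
    SELECT timestamp, category,
           SUM(chg) OVER (PARTITION BY category ORDER BY timestamp
                          ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
)
-- One output row per group, with its time span and size.
SELECT category, grp,
       MIN(timestamp) AS first_ts,
       MAX(timestamp) AS last_ts,
       COUNT(*)       AS rows_in_group
FROM grouped
GROUP BY category, grp
ORDER BY category, grp;
```

On the sample data this should list three groups for CAT1 and five for CAT2, the same totals that the counting query returns.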

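【Comments】:

    【Solution 2】:

    Here is one way to do it.

    The first step classifies each record by whether the previous timestamp in the same category is within 5 minutes. If it is not (or there is no previous record), the row is assigned its row_number; otherwise grps_of_5 is left NULL.

    This gives values like the following: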

    +---------------------+----------+-----------+
    |     timestamp1      | category | grps_of_5 |
    +---------------------+----------+-----------+
    | 01/10/2018 04:06:12 | Cat1     |         1 |
    | 01/10/2018 05:07:18 | Cat1     |         2 |
    | 01/10/2018 05:07:19 | Cat1     |           |
    | 01/10/2018 05:07:20 | Cat1     |           |
    | 01/10/2018 06:09:29 | Cat1     |         5 |
    | 01/10/2018 07:24:12 | Cat2     |         1 |
    | 01/10/2018 07:30:43 | Cat2     |         2 |
    | 01/10/2018 07:59:13 | Cat2     |         3 |
    | 01/10/2018 08:02:15 | Cat2     |           |
    | 01/10/2018 10:09:25 | Cat2     |         5 |
    | 01/10/2018 11:13:42 | Cat2     |         6 |
    +---------------------+----------+-----------+
    
    
    After that, I "copy" the values forward to fill the NULLs within each group, using
    max(grps_of_5) over(partition by category order by timestamp1)
    
    
    This is done in the curated_data block, and the result looks like this:
    
    +---------------------+----------+-----------+---------+
    |     timestamp1      | category | grps_of_5 | max_val |
    +---------------------+----------+-----------+---------+
    | 01/10/2018 04:06:12 | Cat1     |         1 |       1 |
    | 01/10/2018 05:07:18 | Cat1     |         2 |       2 |
    | 01/10/2018 05:07:19 | Cat1     |           |       2 |
    | 01/10/2018 05:07:20 | Cat1     |           |       2 |
    | 01/10/2018 06:09:29 | Cat1     |         5 |       5 |
    | 01/10/2018 07:24:12 | Cat2     |         1 |       1 |
    | 01/10/2018 07:30:43 | Cat2     |         2 |       2 |
    | 01/10/2018 07:59:13 | Cat2     |         3 |       3 |
    | 01/10/2018 08:02:15 | Cat2     |           |       3 |
    | 01/10/2018 10:09:25 | Cat2     |         5 |       5 |
    | 01/10/2018 11:13:42 | Cat2     |         6 |       6 |
    +---------------------+----------+-----------+---------+
    
    
    Finally, I count the distinct max_val values per category, which counts each run of rows within 5 minutes as a single group and every other row separately.
    
    with raw_data
      as (select timestamp1
                ,category
                 -- start-of-group marker: the row number when there is no previous row or
                 -- datediff reports more than 5 minutes since the previous row, else NULL
                ,case when datediff(mi, lag(timestamp1) over(partition by category order by timestamp1), timestamp1) > 5
                        or lag(timestamp1) over(partition by category order by timestamp1) is null
                      then row_number() over(partition by category order by timestamp1)
                  end as grps_of_5
            from t
         )
       ,curated_data
          as (-- the running max carries the last start-of-group marker forward over the NULLs
              select max(grps_of_5) over(partition by category order by timestamp1) as max_val
                     ,x.*
                from raw_data x
             )
    select category, count(distinct max_val) as cnt
      from curated_data
     group by category;
    
    +----------+------+
    | category | cnt  |
    +----------+------+
    | Cat1     |    3 |
    | Cat2     |    5 |
    +----------+------+
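
One detail that may be worth checking in the query above: DATEDIFF(mi, ...) counts the minute boundaries crossed between two values rather than the elapsed time, so a gap slightly longer than 5 minutes can still report 5. The following minimal sketch uses two made-up timestamps (the variables @prev and @curr are hypothetical, not taken from the sample data) to compare that test with an elapsed-time test based on DATEADD.

```sql
-- Hypothetical pair of timestamps that are 5 minutes 30 seconds apart.
DECLARE @prev DATETIME = '2018-10-01 07:59:00';
DECLARE @curr DATETIME = '2018-10-01 08:04:30';

SELECT DATEDIFF(MINUTE, @prev, @curr) AS minute_boundaries,  -- 5 boundaries crossed
       DATEDIFF(SECOND, @prev, @curr) AS elapsed_seconds,    -- 330 seconds elapsed
       -- Boundary-based test: 5 > 5 is false, so the rows stay in one group.
       CASE WHEN DATEDIFF(MINUTE, @prev, @curr) > 5
            THEN 'new group' ELSE 'same group' END AS by_datediff,
       -- Elapsed-time test: @prev is not later than @curr minus 5 minutes,
       -- so this row starts a new group.
       CASE WHEN @prev > DATEADD(MINUTE, -5, @curr)
            THEN 'same group' ELSE 'new group' END AS by_dateadd;
```

On the sample data in the question both tests agree, so the counts of 3 and 5 are unaffected; the difference only shows up for gaps between 5 and 6 minutes.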
    

    Revised version

    Demo link

    https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=556e0ec16bb040b96b637e3da3e8178b

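    【Comments】:

    • The first timestamp has been changed to "2018-10-01 04:06:12", so please make sure you use the same timestamps when building the table. Running the query at the link you gave me returns counts of 5 and 6, rather than 3 and 5 for Cat1 and Cat2.
    • How is the expected count for Cat2 equal to 5? There are only two rows in that group.
    • Cat2 has 6 rows. One pair is a "duplicate" (within 5 minutes), so the total should be 5.
    • Timestamps within 5 minutes of each other count as one group.
    • So you want the total number of counts within 5 minutes, i.e. Cat2 = (3 from Cat1 + 2 from Cat2)?
    【Solution 3】:

    Please try the following code: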

    SELECT * INTO #temp
    FROM(
        SELECT '2018-10-01 05:06:12' AS Timestamp , 'Cat1' AS Category   
        UNION ALL
        SELECT '2018-10-01 05:07:18' AS Timestamp , 'Cat1' AS Category  
        UNION ALL
        SELECT '2018-10-01 05:07:19' AS Timestamp , 'Cat1' AS Category  
        UNION ALL
        SELECT '2018-10-01 05:07:20' AS Timestamp , 'Cat1' AS Category 
        UNION ALL
        SELECT '2018-10-01 06:09:29' AS Timestamp , 'Cat1' AS Category 
        UNION ALL
        SELECT '2018-10-01 07:24:12' AS Timestamp , 'Cat2' AS Category   
        UNION ALL
        SELECT '2018-10-01 07:30:43' AS Timestamp , 'Cat2' AS Category  
        UNION ALL
        SELECT '2018-10-01 07:59:13' AS Timestamp , 'Cat2' AS Category  
        UNION ALL
        SELECT '2018-10-01 08:02:15' AS Timestamp , 'Cat2' AS Category 
        UNION ALL
        SELECT '2018-10-01 10:09:25' AS Timestamp , 'Cat2' AS Category 
        UNION ALL
        SELECT '2018-10-01 11:13:42' AS Timestamp , 'Cat2' AS Category 
    ) AS T
    
    SELECT  Category AS [Group], COUNT(CONVERT(DATE,Timestamp)) AS [Count]  FROM #temp GROUP By Category
    

    【Comments】:
