【Question Title】: Identify Missing Timestamp Over BigQuery
【Posted】: 2021-03-10 10:48:06
【Question】:

I have a requirement to find missing timestamps. The input data is as follows:

Row      id       date  
1        x        2021-01-01 10:00:00 UTC
2        x        2021-01-01 10:03:00 UTC
3        x        2021-01-01 10:05:00 UTC
4        x        2021-01-01 10:08:00 UTC
5        y        2021-01-06 10:05:00 UTC
6        y        2021-01-06 10:07:00 UTC
7        y        2021-01-06 10:10:00 UTC

I want output that lists the missing timestamps between each pair of consecutive timestamps:

1        x        2021-01-01 10:01:00 UTC
2        x        2021-01-01 10:02:00 UTC
3        x        2021-01-01 10:04:00 UTC
4        x        2021-01-01 10:06:00 UTC
5        x        2021-01-01 10:07:00 UTC
6        y        2021-01-06 10:06:00 UTC
7        y        2021-01-06 10:08:00 UTC
8        y        2021-01-06 10:09:00 UTC

【Comments】:

    Tags: google-bigquery


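    【Solution 1】:

    Consider the solution below. It uses fewer joins and, most importantly, does not generate a huge array covering every minute between the very first and very last timestamps; instead it generates small arrays only for the missing minutes. Arrays consume memory and affect query performance.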

    select id, missing_date
    from (
      -- pair each row with the previous timestamp of the same id
      select *,
        lag(date) over(partition by id order by date) prev_date
      from `project.dataset.table` t
    ),
    -- expand only the gap between the two neighbours, one minute at a time
    unnest(generate_timestamp_array(
      timestamp_add(prev_date, interval 1 minute),
      timestamp_sub(date, interval 1 minute),
      interval 1 minute
    )) missing_date
    where timestamp_diff(date, prev_date, minute) > 1
    

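    When applied to the sample data in your question, the output is the eight missing rows listed there as the desired result (without the row-number column).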
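
The gap logic of the query above can be checked outside BigQuery. The following is a minimal Python sketch, not the BigQuery implementation: `missing_minutes` and `sample_rows` are hypothetical names introduced here, and the rows are the sample data from the question. It mirrors the `lag` step by remembering the previous timestamp per id, and it only walks the minutes strictly between two neighbouring timestamps.

```python
from datetime import datetime, timedelta, timezone
from itertools import groupby

STEP = timedelta(minutes=1)


def missing_minutes(rows):
    """Return (id, timestamp) pairs for every minute missing between
    consecutive timestamps of the same id."""
    result = []
    # Sorting by (id, timestamp) plays the role of
    # "partition by id order by date" in the SQL.
    for key, group in groupby(sorted(rows), key=lambda r: r[0]):
        prev = None  # stands in for lag(date); None for the first row of an id
        for _, ts in group:
            if prev is not None:
                t = prev + STEP
                # Only the minutes strictly between prev and ts are visited.
                while t < ts:
                    result.append((key, t))
                    t += STEP
            prev = ts
    return result


def utc(day, hour, minute):
    """Helper for the sample data: a UTC timestamp in January 2021."""
    return datetime(2021, 1, day, hour, minute, tzinfo=timezone.utc)


sample_rows = [
    ("x", utc(1, 10, 0)), ("x", utc(1, 10, 3)),
    ("x", utc(1, 10, 5)), ("x", utc(1, 10, 8)),
    ("y", utc(6, 10, 5)), ("y", utc(6, 10, 7)), ("y", utc(6, 10, 10)),
]

for id_, ts in missing_minutes(sample_rows):
    print(id_, ts.isoformat())
```

For the sample data this yields eight rows: minutes 01, 02, 04, 06 and 07 for `x`, and 06, 08 and 09 for `y`.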

    【Comments】:

      【Solution 2】:

      Try GENERATE_TIMESTAMP_ARRAY:

      with mytable as (
        select 'x' as id, timestamp '2021-01-01 10:00:00 UTC' as date union all
        select 'x', timestamp '2021-01-01 10:03:00 UTC' union all
        select 'x', timestamp '2021-01-01 10:05:00 UTC' union all
        select 'x', timestamp '2021-01-01 10:08:00 UTC' union all
        select 'y', timestamp '2021-01-06 10:05:00 UTC' union all
        select 'y', timestamp '2021-01-06 10:07:00 UTC' union all
        select 'y', timestamp '2021-01-06 10:10:00 UTC'
      )
      select id, missing.date 
      from mytable full join (
        select * 
        from (
          select id, GENERATE_TIMESTAMP_ARRAY(min(date), max(date), interval 1 minute) as date_array
          from mytable
          group by id
        ), unnest(date_array) as date
      ) as missing using (id, date)
      where mytable.date is null
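
For comparison, the full-range approach of this query can be modelled in Python as well. This is a rough sketch under stated assumptions, not BigQuery code: `missing_by_full_range` and `sample_rows` are hypothetical names, and the rows are the sample data from the question. It builds every minute between the earliest and latest timestamp of each id and keeps those that are absent from the input, which is what the join with `where mytable.date is null` filters for.

```python
from datetime import datetime, timedelta, timezone


def missing_by_full_range(rows, step=timedelta(minutes=1)):
    """Return (id, timestamp) pairs that fall on the minute grid between
    min and max timestamp of an id but are absent from the input."""
    present_by_id = {}
    for key, ts in rows:
        present_by_id.setdefault(key, set()).add(ts)

    missing = []
    for key in sorted(present_by_id):
        present = present_by_id[key]
        t, last = min(present), max(present)
        # Walks the whole min..max range, like the generated array per id.
        while t <= last:
            if t not in present:  # counterpart of "mytable.date is null"
                missing.append((key, t))
            t += step
    return missing


def utc(day, hour, minute):
    """Helper for the sample data: a UTC timestamp in January 2021."""
    return datetime(2021, 1, day, hour, minute, tzinfo=timezone.utc)


sample_rows = [
    ("x", utc(1, 10, 0)), ("x", utc(1, 10, 3)),
    ("x", utc(1, 10, 5)), ("x", utc(1, 10, 8)),
    ("y", utc(6, 10, 5)), ("y", utc(6, 10, 7)), ("y", utc(6, 10, 10)),
]

for id_, ts in missing_by_full_range(sample_rows):
    print(id_, ts.isoformat())
```

The loop visits every minute in the range, so its cost grows with the span between the first and last timestamp of an id rather than with the size of the gaps. The membership test also relies on exact equality, so in this sketch the input timestamps are assumed to sit on the same minute grid as the earliest timestamp of their id.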
      

      【Comments】:
