【Question title】: Counting unique values within a time window
【Posted】: 2021-11-19 11:17:14
【Question】:

My data looks like this (more than 100,000 rows):

timestamp               Location person
2017-09-04 08:07:00 UTC A        x
2017-09-04 08:08:00 UTC B        y
2017-09-04 08:09:00 UTC A        y
2017-09-04 08:07:00 UTC A        x
2017-09-04 08:27:00 UTC B        x

What I want:

Location  Nr_of_persons_working_at_the_same_time
A         2
B         1

Explanation

timestamp               Location person
2017-09-04 08:07:00 UTC A        x       <--- first action in A by person x
2017-09-04 08:08:00 UTC B        y       <--- different first action in B by person y
2017-09-04 08:09:00 UTC A        y       <--- second action in A, but could be different action as person x might be gone
2017-09-04 08:07:00 UTC A        x       <--- person x is still there, so count of persons in A is 2
2017-09-04 08:27:00 UTC B        x       <--- not a different action, person x coming in after 20 minutes, count of persons working at the same time remains 1

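Context

I want to find out how many people (person) are working at the same location (Location) at the same time. To do that, I look at a time window of at most 10 minutes (timestamp) and check whether two people are really working concurrently or whether one is just taking over the other's work within that window. I get the data through a SQL query and can process it with either SQL or Python. SQL is preferred.

Attempted solutions

  • Grouping by location and timestamp results in "hard cuts"
  • A so-called window function is probably needed. But after ordering by timestamp, how do I keep the locations from getting mixed up?

Note: if it is easier, I could also try this in Python, but I would rather not, given the size of the dataset and the limited options for running it in the cloud.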

【Question comments】:

    Tags: python sql group-by google-bigquery


    【Solution 1】:

    This should work:

    with mytable as (
    select cast('2017-09-04 08:07:00' as datetime) as _timestamp ,'A' as Location,'x' as person union all
    select cast('2017-09-04 08:08:00' as datetime) as _timestamp ,'B' as Location,'y' as person union all
    select cast('2017-09-04 08:09:00' as datetime) as _timestamp ,'A' as Location,'y' as person union all
    select cast('2017-09-04 08:07:00' as datetime) as _timestamp ,'A' as Location,'x' as person union all
    select cast('2017-09-04 08:27:00' as datetime) as _timestamp ,'B' as Location,'x' as person 
    ),
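    sorted_entry
    as (
    -- For each row, look up the previous event at the same Location.
    -- lag() is null on the first row of a partition, so fall back to the row's own values.
    select  *,
            ifnull(lag(_timestamp) over(partition by Location  order by _timestamp ),_timestamp ) as prev_timestamp ,
            ifnull(lag(person) over(partition by Location  order by _timestamp ),person ) as another_person
            
    from mytable 
    )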
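    ,flagged 
    as 
    (
    -- flag = 1 when a different person acts at the same Location
    -- within 10 minutes of prev_timestamp (an overlap rather than a handover).
    select *,
            case when person <> another_person then (
                case when datetime_diff(_timestamp,prev_timestamp,minute) <= 10 then 1
                else 0 end
            )
            else 0
            end as flag
    from sorted_entry 
    )
    -- Every location has at least one person, so add 1 to the number of flagged overlaps.
    select location ,sum(flag) + 1 as _count
    from flagged
    group by location 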
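
Since the question also allows Python, here is a rough sketch of the same flag-and-sum idea using only the standard library. It is an illustration, not a tested port of the query: the function name `count_concurrent` and the `ROWS` list are made up for this example, and it assumes each event is compared with the immediately preceding event at the same location. It also compares elapsed time directly, which may differ slightly from `datetime_diff(..., minute)` when timestamps have non-zero seconds.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample rows from the question: (timestamp, Location, person).
ROWS = [
    ("2017-09-04 08:07:00", "A", "x"),
    ("2017-09-04 08:08:00", "B", "y"),
    ("2017-09-04 08:09:00", "A", "y"),
    ("2017-09-04 08:07:00", "A", "x"),
    ("2017-09-04 08:27:00", "B", "x"),
]


def count_concurrent(rows, window=timedelta(minutes=10)):
    """Per location: 1 + number of person changes within `window` of the previous event."""
    # Group events by location so that locations never get mixed up
    # (the role of PARTITION BY in the SQL).
    by_location = defaultdict(list)
    for ts, location, person in rows:
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        by_location[location].append((parsed, person))

    counts = {}
    for location, events in by_location.items():
        events.sort()  # order by timestamp (ties broken by person)
        flags = 0
        # Pair each event with the one before it (the role of lag() in the SQL).
        for (prev_ts, prev_person), (ts, person) in zip(events, events[1:]):
            if person != prev_person and ts - prev_ts <= window:
                flags += 1
        counts[location] = flags + 1
    return counts


print(count_concurrent(ROWS))  # {'A': 2, 'B': 1}
```

Like the query, this counts changes of person rather than distinct persons, so two people alternating within the window would push the count above 2.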
    

    【Comments】:
