[Question title]: How to fill irregularly missing time-series values with linear interpolation by each user in BigQuery?
[Posted]: 2020-11-14 00:37:49
[Question]:

I have data in which time-series values are missing at irregular points for each user, and I would like to fill it in at a fixed interval by linear interpolation, using BigQuery Standard SQL.

+------+---------------------+-------+
| name |        time         | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 |     3 |
| Jane | 2020-11-14 09:05:00 |     5 |
| Jane | 2020-11-14 09:07:00 |     1 |
| Jane | 2020-11-14 09:09:00 |     8 |
| Jane | 2020-11-14 09:10:00 |     4 |
| Kay  | 2020-11-14 09:01:00 |     7 |
| Kay  | 2020-11-14 09:04:00 |     1 |
| Kay  | 2020-11-14 09:05:00 |    10 |
| Kay  | 2020-11-14 09:09:00 |     6 |
| Kay  | 2020-11-14 09:10:00 |     7 |
+------+---------------------+-------+

I would like to convert it as follows:

+------+---------------------+-------+-----------------+
| name |        time         | value |                 |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3     |                 |
| Jane | 2020-11-14 09:02:00 | 3.5   | <= interpolated |
| Jane | 2020-11-14 09:03:00 | 4     | <= interpolated |
| Jane | 2020-11-14 09:04:00 | 4.5   | <= interpolated |
| Jane | 2020-11-14 09:05:00 | 5     |                 |
| Jane | 2020-11-14 09:06:00 | 3     | <= interpolated |
| Jane | 2020-11-14 09:07:00 | 1     |                 |
| Jane | 2020-11-14 09:08:00 | 4.5   | <= interpolated |
| Jane | 2020-11-14 09:09:00 | 8     |                 |
| Jane | 2020-11-14 09:10:00 | 4     |                 |
| Kay  | 2020-11-14 09:01:00 | 7     |                 |
| Kay  | 2020-11-14 09:02:00 | 5     | <= interpolated |
| Kay  | 2020-11-14 09:03:00 | 3     | <= interpolated |
| Kay  | 2020-11-14 09:04:00 | 1     |                 |
| Kay  | 2020-11-14 09:05:00 | 10    |                 |
| Kay  | 2020-11-14 09:06:00 | 9     | <= interpolated |
| Kay  | 2020-11-14 09:07:00 | 8     | <= interpolated |
| Kay  | 2020-11-14 09:08:00 | 7     | <= interpolated |
| Kay  | 2020-11-14 09:09:00 | 6     |                 |
| Kay  | 2020-11-14 09:10:00 | 7     |                 |
+------+---------------------+-------+-----------------+

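Is there a clean way to do this?

Added: this is a follow-up to this stackoverflow question. It is very similar, but differs in that the data here is timestamped time-series data and has a name for each user.

Thanks.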

[Comments]:

    Tags: sql google-bigquery interpolation missing-data


    [Solution 1]:

    Below is for BigQuery Standard SQL:

    #standardSQL
    select name, time,
        -- keep observed values; otherwise interpolate linearly between the
        -- nearest known points before (start_*) and after (end_*) this minute
        ifnull(value, start_value
          + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
        ) as value_interpolated
    from (
        select name, time, value,
            -- win1 scans backwards in time: latest known point at or before this row
            first_value(tick ignore nulls) over win1 as start_tick,
            first_value(value ignore nulls) over win1 as start_value,
            -- win2 scans forwards in time: earliest known point at or after this row
            first_value(tick ignore nulls) over win2 as end_tick,
            first_value(value ignore nulls) over win2 as end_value
        from (
            -- tick is null for generated minutes that have no row in the table
            select name, time, t.time as tick, value
            from (
                -- one-minute grid per user, from that user's first to last observation
                select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
                from `project.dataset.table`
                group by name
            )
            cross join unnest(times) time
            left join `project.dataset.table` t
            using(name, time)
        )
        window
            win1 as (partition by name order by time desc rows between current row and unbounded following),
            win2 as (partition by name order by time rows between current row and unbounded following)
    )
    

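    If applied to the sample data from your question, this yields the interpolated values shown in the desired table above (for example, 3.5 for Jane at 09:02 and 9 for Kay at 09:06).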
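
    One way to try this without a real table is to inline the question's rows as a CTE. The following is a minimal sketch, not part of the original answer: the CTE name `sample` is a hypothetical stand-in for `project.dataset.table`, and the final `order by` is added only to make the result easier to read. The rest repeats the same query.

```sql
#standardSQL
-- `sample` is a hypothetical inline stand-in for `project.dataset.table`,
-- populated with the rows from the question.
with sample as (
    select 'Jane' as name, timestamp '2020-11-14 09:01:00' as time, 3 as value union all
    select 'Jane', timestamp '2020-11-14 09:05:00', 5 union all
    select 'Jane', timestamp '2020-11-14 09:07:00', 1 union all
    select 'Jane', timestamp '2020-11-14 09:09:00', 8 union all
    select 'Jane', timestamp '2020-11-14 09:10:00', 4 union all
    select 'Kay',  timestamp '2020-11-14 09:01:00', 7 union all
    select 'Kay',  timestamp '2020-11-14 09:04:00', 1 union all
    select 'Kay',  timestamp '2020-11-14 09:05:00', 10 union all
    select 'Kay',  timestamp '2020-11-14 09:09:00', 6 union all
    select 'Kay',  timestamp '2020-11-14 09:10:00', 7
)
select name, time,
    -- observed value if present, otherwise the linear interpolation
    ifnull(value, start_value
      + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
    ) as value_interpolated
from (
    select name, time, value,
        first_value(tick ignore nulls) over win1 as start_tick,
        first_value(value ignore nulls) over win1 as start_value,
        first_value(tick ignore nulls) over win2 as end_tick,
        first_value(value ignore nulls) over win2 as end_value
    from (
        select name, time, t.time as tick, value
        from (
            -- one-minute grid per user between first and last observation
            select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
            from sample
            group by name
        )
        cross join unnest(times) time
        left join sample t
        using(name, time)
    )
    window
        win1 as (partition by name order by time desc rows between current row and unbounded following),
        win2 as (partition by name order by time rows between current row and unbounded following)
)
order by name, time
```

    As a hand check of one row: for Jane at 09:06 the surrounding known points are 5 at 09:05 and 1 at 09:07, so the expression evaluates to 5 + (1 - 5) / 2 * 1 = 3, which agrees with the desired table.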

    [Comments]:

      [Solution 2]:

      This is not much different from your previous question. Starting from the accepted answer there, you can do:

      select name, time,
          -- time is a TIMESTAMP here, so the gaps are measured with timestamp_diff
          -- (plain subtraction only works for the integer time of the previous question)
          ifnull(value, start_value
            + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
          ) as value_interpolated
      from (
          select name, time, value,
              first_value(tick ignore nulls) over win1 as start_tick,
              first_value(value ignore nulls) over win1 as start_value,
              first_value(tick ignore nulls) over win2 as end_tick,
              first_value(value ignore nulls) over win2 as end_value
          from (
              select name, time, t.time as tick, value
              from (
                  -- generate_array does not accept timestamps; build the per-user grid
                  -- with generate_timestamp_array instead
                  select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
                  from `project.dataset.table`
                  group by name
              )
              cross join unnest(times) time
              left join `project.dataset.table` t using(name, time)
          )
          -- the only structural change from the previous answer: partition by name
          window
              win1 as (partition by name order by time desc rows between current row and unbounded following),
              win2 as (partition by name order by time rows between current row and unbounded following)
      )
      

      [Comments]:
