【Question title】: How to fill irregularly missing values with linear interpolation in BigQuery?
【Posted】: 2020-11-13 07:15:52
【Question】:

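I have data with irregularly missing values, and I want to resample it to a fixed interval using linear interpolation in BigQuery Standard SQL.

Specifically, I have data like this:

# data is missing irregularly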
+------+-------+
| time | value |
+------+-------+
|    1 | 3.0   |
|    5 | 5.0   |
|    7 | 1.0   |
|    9 | 8.0   |
|   10 | 4.0   |
+------+-------+

I want to transform this table as follows:

# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
|    1 | 3.0                |
|    2 | 3.5                |
|    3 | 4.0                |
|    4 | 4.5                |
|    5 | 5.0                |
|    6 | 3.0                |
|    7 | 1.0                |
|    8 | 4.5                |
|    9 | 8.0                |
|   10 | 4.0                |
+------+--------------------+

Is there a smart solution?

Addendum: this question is similar to this question on Stack Overflow, except that here the data is missing at irregular intervals.

Thank you.

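【Question comments】:

  • What is the logic for putting 3.0 at time=6?
  • Thanks for your comment. It is calculated as the average of time=5 (value 5.0) and time=7 (value 1.0).
  • Can you explain how you arrived at 3.5, 4, 4.5 for times (2, 3, 4)?
  • Thanks. It linearly interpolates the data between time=1 (value 3.0) and time=5 (value 5.0). So the step of 0.5 in the first 3.5, 4.0, 4.5 is calculated as (value 5.0 - value 3.0) / (time 5 - time 1) = 2/4 = 0.5.
  • Thanks. By that logic, the value at time=8 should be (value 8.0 - value 1.0) / (time 9 - time 7) = 7/2 = 3.5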
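
To make the arithmetic in this exchange concrete, here is the linear interpolation formula worked for time = 8 (an illustration added for clarity; the symbols t_0, v_0, t_1, v_1 are notation introduced here for the known points on either side of the gap). The quantity 7/2 = 3.5 is the per-step slope between time 7 and time 9, so the interpolated value is the value at time 7 plus one step of that slope, which agrees with the 4.5 in the desired table:

```latex
v(t) = v_0 + \frac{v_1 - v_0}{t_1 - t_0}\,(t - t_0),
\qquad
v(8) = 1.0 + \frac{8.0 - 1.0}{9 - 7}\,(8 - 7) = 1.0 + 3.5 = 4.5
```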

Tags: sql google-bigquery interpolation linear-interpolation


【Solution 1】:

Below is for BigQuery Standard SQL

#standardSQL
select time,
  -- keep known values; otherwise interpolate between the nearest known neighbors
  ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
    -- nearest known point at or before the current time
    first_value(tick ignore nulls) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    -- nearest known point at or after the current time
    first_value(tick ignore nulls) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value,
  from (
    -- build the full grid min(time)..max(time) and attach the known values
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
  win2 as (order by time rows between current row and unbounded following)
)

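If applied to the sample data from your question, the output is the interpolated table shown in the question (time 1 through 10, with value_interpolated filled in).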
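
For readers who want to check the expected numbers without running BigQuery, here is a minimal Python sketch of the same idea: build the full integer grid, and for each missing time look up the nearest known point on each side and interpolate between them. It is not the answer's code; the function name `interpolate_grid` and the in-memory list of tuples are assumptions made for illustration.

```python
def interpolate_grid(points):
    """Fill every integer time between min and max by linear interpolation.

    points: list of (time, value) pairs with integer times (hypothetical input shape).
    """
    known = dict(points)
    result = []
    for t in range(min(known), max(known) + 1):
        if t in known:
            # Known rows pass through unchanged.
            result.append((t, known[t]))
            continue
        # Nearest known time before and after t.
        start = max(k for k in known if k < t)
        end = min(k for k in known if k > t)
        slope = (known[end] - known[start]) / (end - start)
        result.append((t, known[start] + slope * (t - start)))
    return result


sample = [(1, 3.0), (5, 5.0), (7, 1.0), (9, 8.0), (10, 4.0)]
for t, v in interpolate_grid(sample):
    print(t, v)
```

On the sample data this prints the ten rows of the desired table, including 3.0 at time 6 and 4.5 at time 8.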

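【Comments】:

  • Thank you. This answer solved my problem. I have posted another, harder question related to this one, and I would appreciate it if you could take a look. The link to the new question: stackoverflow.com/questions/64829772/…
  • Sure. Answered that one too :o)
【Solution 2】: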

Here is an example of how to solve this in PostgreSQL.

https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c560dd9a8db095920d0a15834b6768f1

with data
   as (select time
              ,lead(time) over(order by time) as next_time
              ,value
              ,lead(value) over(order by time) as next_value
              ,(lead(value) over(order by time)- value) as val_diff
              ,(lead(time) over(order by time)- time) as time_diff
          from t
      )
select *
       ,generate_series- time as grp
       ,case when generate_series- time = 0 then
                  value
             else value + (val_diff*1.0/time_diff)*(generate_series-time)*1.0
         end as val_grp
  from data
cross join generate_series(time, coalesce(next_time-1,time))


+------+-----------------+-----+-------------------------+
| time | generate_series | grp |         val_grp         |
+------+-----------------+-----+-------------------------+
|    1 |               1 |   0 |                     3.0 |
|    1 |               2 |   1 | 3.500000000000000000000 |
|    1 |               3 |   2 | 4.000000000000000000000 |
|    1 |               4 |   3 | 4.500000000000000000000 |
|    5 |               5 |   0 |                     5.0 |
|    5 |               6 |   1 |     3.00000000000000000 |
|    7 |               7 |   0 |                     1.0 |
|    7 |               8 |   1 |     4.50000000000000000 |
|    9 |               9 |   0 |                     8.0 |
|   10 |              10 |   0 |                     4.0 |
+------+-----------------+-----+-------------------------+

I believe the syntax differs in BigQuery, which uses UNNEST and GENERATE_ARRAY, as shown below. You can give it a try.

with data
   as (select time
              ,lead(time) over(order by time) as next_time
              ,value
              ,lead(value) over(order by time) as next_value
              ,(lead(value) over(order by time)- value) as val_diff
              ,(lead(time) over(order by time)- time) as time_diff
          from t
      )
select *
       ,generate_series- time as grp
       ,case when generate_series- time = 0 then
                  value
             else value + (val_diff*1.0/time_diff)*(generate_series-time)*1.0
         end as val_grp
  from data
cross join UNNEST(GENERATE_ARRAY(time, coalesce(next_time-1,time))) as generate_series
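
The segment-by-segment logic of this answer can also be traced in plain Python. The sketch below is an illustration rather than a translation of the SQL: it pairs each known point with the next one (the role of `lead()`), expands the gap into integer steps (the role of `generate_series` / `GENERATE_ARRAY`), and reproduces the `time`, `generate_series`, `grp`, `val_grp` columns of the PostgreSQL result table. The function name `expand_segments` is hypothetical.

```python
def expand_segments(points):
    """Return rows of (time, generate_series, grp, val_grp) for integer times."""
    pts = sorted(points)
    rows = []
    for i, (t, v) in enumerate(pts):
        if i + 1 < len(pts):
            next_t, next_v = pts[i + 1]
            steps = range(t, next_t)  # t .. next_t - 1, as in the SQL
            slope = (next_v - v) / (next_t - t)
        else:
            # Last point has no successor: emit it once, unchanged.
            steps = range(t, t + 1)
            slope = 0.0
        for g in steps:
            rows.append((t, g, g - t, v + slope * (g - t)))
    return rows


sample = [(1, 3.0), (5, 5.0), (7, 1.0), (9, 8.0), (10, 4.0)]
for row in expand_segments(sample):
    print(row)
```

Each known point owns the rows from its own time up to just before the next known time, so no grid row is produced twice.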

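【Comments】:

【Solution 3】:

In BigQuery, you can use generate_array() to generate the extra rows for each row. Then you can use lead() to get information from the next row, together with some interpolation arithmetic: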

    with t as (
          select 1 as time, 3.0 as value union all
          select 5 , 5.0 union all  
          select 7 , 1.0 union all
          select 9 , 8.0 union all
          select 10 , 4.0 
         ),
         tt as (
          select t.*,
                 lead(time) over (order by time) as next_time,
                 lead(value) over (order by time) as next_value
          from t
         )
    select coalesce(n, tt.time) as time, 
           (case when n = tt.time or n is null then value
                 else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
            end) as value
    from tt left join
         unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
         on true
    order by 1;
    

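Note: you have a column named time that contains an integer. If it is really some kind of date/time data type, I suggest you ask a new question with more appropriate sample data and desired results, in case you cannot figure out how to adapt this answer.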
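
As a rough illustration of what such an adaptation involves, the Python sketch below interpolates over real timestamps by converting each offset to seconds before applying the same formula. Everything in it is an assumption made for the example: the function name `interpolate_timestamps`, the hourly step, the sample values, and the requirement that every known timestamp falls on the step grid measured from the previous known point. It is not BigQuery code and is not taken from the answer.

```python
from datetime import datetime, timedelta


def interpolate_timestamps(points, step):
    """Linearly interpolate (timestamp, value) pairs onto a regular step."""
    pts = sorted(points)
    out = []
    for (t0, v0), (t1, v1) in zip(pts, pts[1:]):
        span = (t1 - t0).total_seconds()
        t = t0
        while t < t1:
            # Fraction of the way from t0 to t1, measured in seconds.
            frac = (t - t0).total_seconds() / span
            out.append((t, v0 + (v1 - v0) * frac))
            t += step
    out.append(pts[-1])  # the final known point closes the series
    return out


readings = [
    (datetime(2020, 11, 13, 0), 3.0),
    (datetime(2020, 11, 13, 4), 5.0),
]
for ts, v in interpolate_timestamps(readings, timedelta(hours=1)):
    print(ts.isoformat(), v)
```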

【Comments】:
