Filling Out & Filtering Irregular Time Series Data: Answers

【Question title】: Filling Out & Filtering Irregular Time Series Data
【Posted】: 2015-10-21 01:04:43
【Question】:

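Using PostgreSQL 9.4, I'm trying to query time series log data in which a new value is logged whenever the value changes (not on a schedule). The log can be updated anywhere from several times a minute to once a day.

I need the query to do the following:

1. Filter out excess data by selecting only the first entry in each timestamp range.
2. Fill in sparse data by using the last reading as the log value. For example, if I group the data by hour and there is an entry at 8 AM with a log value of 10, and the next entry is not until 11 AM with a log value of 15, I want the query to return something like this: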

Timestamp        | Value
2015-07-01 08:00 | 10 
2015-07-01 09:00 | 10 
2015-07-01 10:00 | 10 
2015-07-01 11:00 | 15

I have a query that accomplishes the first of these goals:

with time_range as (
    select hour
    from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
    select 
        date_trunc('hour', time_stamp) as log_hour,
        log_val,
        rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
    from time_series
)
select 
    time_range.hour,
    ranked_logs.log_val
from time_range
left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1;
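
For readers who want to see the first goal outside SQL, here is a minimal Python sketch of "keep the first reading in each hour". The function name `first_per_hour` and the sample rows are made up for illustration. Unlike `rank()`, which returns every row tied for the earliest timestamp, this sketch keeps exactly one row per hour.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def first_per_hour(rows: List[Tuple[datetime, int]]) -> Dict[datetime, int]:
    """Keep the earliest reading in each hour bucket (roughly the rank = 1 rows)."""
    first: Dict[datetime, int] = {}
    for ts, val in sorted(rows):  # like: order by time_stamp asc
        # Like date_trunc('hour', time_stamp): drop minutes and smaller units.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        first.setdefault(bucket, val)  # only the first value per bucket is kept
    return first


# Hypothetical sample rows, not taken from the SQLFiddle.
rows = [
    (datetime(2015, 7, 1, 8, 40), 12),
    (datetime(2015, 7, 1, 8, 5), 10),
    (datetime(2015, 7, 1, 11, 15), 15),
]
print(first_per_hour(rows))
```

Hours with no reading simply do not appear in the result, which is why the query above left-joins against a generated series of hours.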

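But I can't figure out how to fill in the nulls where there is no value. I tried using lag() from PostgreSQL's window functions, but it doesn't work when there are multiple nulls in a row.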
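
To make the lag() limitation concrete, here is a small Python sketch (not PostgreSQL). It assumes the attempt was something like `coalesce(log_val, lag(log_val) over (order by hour))`, which is a guess at the query. The helper name `lag_fill` is hypothetical.

```python
from typing import List, Optional


def lag_fill(values: List[Optional[int]]) -> List[Optional[int]]:
    """Mimic coalesce(v, lag(v) over (...)): fall back to the previous row only."""
    filled = []
    for i, value in enumerate(values):
        # lag() looks back exactly one row in the original column,
        # not in the already-filled result.
        previous = values[i - 1] if i > 0 else None
        filled.append(value if value is not None else previous)
    return filled


# Hourly readings from 08:00 to 11:00: a value, two gaps, then a value.
hourly = [10, None, None, 15]
print(lag_fill(hourly))  # [10, 10, None, 15]
```

The second gap stays empty because the row before it is itself null, so a single look-back has nothing to copy.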

Here is a SQLFiddle demonstrating the problem: http://sqlfiddle.com/#!15/f4d13/5/0

【Question comments】:

Tags: sql postgresql

【Solution 1】:

The columns you want are log_hour and first_value:

with time_range as (
    select hour
    from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
    select 
        date_trunc('hour', time_stamp) as log_hour,
        log_val,
        rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
    from time_series
),
base as (
    select
        time_range.hour as lh,
        ranked_logs.log_val
    from time_range
    left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1
)
select
    log_hour,
    log_val,
    value_partition,
    -- the first row of each partition holds the non-null reading to carry forward
    first_value(log_val) over (partition by value_partition order by log_hour)
from (
    select
        date_trunc('hour', base.lh) as log_hour,
        log_val,
        -- running count of non-null readings; a null row keeps the count of the last reading
        sum(case when log_val is null then 0 else 1 end) over (order by base.lh) as value_partition
    from base
) as q;

Update

This is what your query returns:

Timestamp        | Value
2015-07-01 01:00 | 10 
2015-07-01 02:00 | null 
2015-07-01 03:00 | null 
2015-07-01 04:00 | 15 
2015-07-01 05:00 | null
2015-07-01 06:00 | 19 
2015-07-01 08:00 | 13

I want to split this result set into groups like this:

2015-07-01 01:00 | 10       
2015-07-01 02:00 | null     
2015-07-01 03:00 | null    

2015-07-01 04:00 | 15     
2015-07-01 05:00 | null

2015-07-01 06:00 | 19     

2015-07-01 08:00 | 13

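and assign every row in a group the value of the first row of that group (this is what the outermost select does).

In this case, one way to get the grouping is to create a column that holds the number of non-null values counted up to and including the current row, and to split on that value (using sum(case)):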

value | sum(case)
10    | 1
null  | 1
null  | 1
15    | 2   <-- new non-null value, increment
null  | 2
19    | 3   <-- new non-null value, increment
13    | 4   <-- new non-null value, increment

Now I can partition by sum(case).
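
The same two steps can be sketched in plain Python to check the idea on the sample values. This only illustrates the grouping trick and is not a replacement for the SQL. The function names `running_partitions` and `forward_fill` are made up for this sketch.

```python
from itertools import accumulate
from typing import Dict, List, Optional


def running_partitions(values: List[Optional[int]]) -> List[int]:
    """Running count of non-null values, like
    sum(case when v is null then 0 else 1 end) over (order by ...)."""
    return list(accumulate(0 if v is None else 1 for v in values))


def forward_fill(values: List[Optional[int]]) -> List[Optional[int]]:
    """Give every row the first value of its partition, like
    first_value(v) over (partition by value_partition order by ...)."""
    partitions = running_partitions(values)
    first_in_partition: Dict[int, Optional[int]] = {}
    for part, v in zip(partitions, values):
        first_in_partition.setdefault(part, v)  # keep only the first value seen
    return [first_in_partition[part] for part in partitions]


values = [10, None, None, 15, None, 19, 13]
print(running_partitions(values))  # [1, 1, 1, 2, 2, 3, 4]
print(forward_fill(values))        # [10, 10, 10, 15, 15, 19, 13]
```

Rows that come before the first non-null reading fall into partition 0 and stay null, which matches what the SQL does for the hours before the first log entry.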

【Comments】:

Wow, this seems to work really well. Can you explain what is happening in the subselect (sum(case when ... etc.))?