Using Timescale to find the latest value per interval

[Question title]: Using Timescale to find the latest value per interval
[Posted]: 2021-12-22 04:37:31
[Question description]:

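I have time series data with millisecond precision. Some timestamps may coincide exactly, in which case the database id column can be used to determine which row is the latest.

I am trying to use Timescale to get the latest value per second. Here is an example of the data I am looking at: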

time                    | db_id | value  |
2020-01-01 08:39:23.293 | 4460  | 136.01 |
2020-01-01 08:39:23.393 | 4461  | 197.95 |
2020-01-01 08:40:38.973 | 4462  |  57.95 |
2020-01-01 08:43:01.223 | 4463  |    156 |
2020-01-01 08:43:26.577 | 4464  | 253.43 |
2020-01-01 08:43:26.577 | 4465  |  53.68 |
2020-01-01 08:43:26.577 | 4466  | 160.00 |

When fetching the latest per-second price, my result should look like this:

time                 value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:39:24 | 197.95 |
.
.
.
2020-01-01 08:40:37 | 197.95 |
2020-01-01 08:40:38 | 57.95  |
2020-01-01 08:40:39 | 57.95  |
.
.
.
2020-01-01 08:43:25 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |  
2020-01-01 08:43:27 | 160.00 |
.
.
.

I have managed to get the latest result per second using Timescale's time_bucket:

SELECT last(value, db_id), time_bucket('1 seconds', time) AS per_second FROM timeseries GROUP BY per_second ORDER BY per_second DESC;

But it leaves holes in the time column:

time                 value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:40:38 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |
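
To make the behaviour of that query concrete, here is a minimal Python sketch (not Timescale code) that emulates `last(value, db_id)` grouped by `time_bucket('1 second', time)` on the sample rows from the question. The function name `last_per_second` is hypothetical. Only seconds that actually contain data appear in the result, which is where the holes come from; on the sample rows this also yields a bucket for 08:43:01.

```python
from datetime import datetime

# Sample rows from the question: (time, db_id, value)
rows = [
    ("2020-01-01 08:39:23.293", 4460, 136.01),
    ("2020-01-01 08:39:23.393", 4461, 197.95),
    ("2020-01-01 08:40:38.973", 4462, 57.95),
    ("2020-01-01 08:43:01.223", 4463, 156.0),
    ("2020-01-01 08:43:26.577", 4464, 253.43),
    ("2020-01-01 08:43:26.577", 4465, 53.68),
    ("2020-01-01 08:43:26.577", 4466, 160.00),
]


def last_per_second(rows):
    """Emulate last(value, db_id) grouped by a one-second time bucket."""
    best = {}  # bucket -> (db_id, value) for the highest db_id seen so far
    for ts, db_id, value in rows:
        # Truncating to whole seconds plays the role of time_bucket('1 second', time).
        bucket = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").replace(microsecond=0)
        if bucket not in best or db_id > best[bucket][0]:
            best[bucket] = (db_id, value)
    # One (bucket, value) pair per second that has data, in time order.
    return [(bucket, value) for bucket, (_, value) in sorted(best.items())]


for bucket, value in last_per_second(rows):
    print(bucket, value)
```

Seconds with no rows never enter the dictionary, so nothing is emitted for them.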

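The solution I came up with is to create a database with a timestamp for every second and null values, migrate the data from the previous result table, and then replace the nulls with the last occurring value, but that seems like a lot of intermediate steps.

I would like to know whether there is a better way to find the "latest value" per second, minute, hour, and so on. I originally tried to solve this in Python, since it seemed like a simple problem, but it took a lot of computation time.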

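[Question discussion]:

Hi, thanks for the detailed question. For transparency, I work for Timescale. I don't have a direct answer to the question above, but wanted to share this recent video in which one of our developer advocates tackles the problem of finding the latest value and reviews a number of options: youtube.com/watch?v=HwJrmYJoIw4
Hi! A colleague just mentioned that time bucket gapfill might be one of the features you find valuable in this situation; it is worth exploring in the Timescale docs.
Thanks for the replies @greenweeds! Good to know my problem isn't as simple as I thought :) I will definitely check out the video and time_bucket_gapfill().

Tags: postgresql time-series data-science etl timescaledb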

[Solution 1]:

Found a good solution to my problem. It involves four main steps:

Get the latest values

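    -- Table and column names (timeseries, value) follow the question.
    select 
        -- shift by one second so each bucket holds the rows before that second
        time_bucket('1 second', time + '1 second') as interval,
        -- latest value per bucket; db_id breaks ties between equal timestamps
        last(value, db_id) as last_value
    from timeseries
    where time > <date_start> and time < <date_end>
    group by interval
    order by interval;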

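This produces a table of the latest values. last takes a second column that determines the ordering, which is useful when another level of sorting is needed (here db_id). For example: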

time                 last_value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:40:38 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |

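Note that I shift the time by one second with + '1 second', because I only want data from before a given second; without the shift, data stamped on that second itself would be considered part of its last price.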
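
A small Python sketch of what the shift does to the bucket a row lands in. It is an emulation of the arithmetic, not Timescale code, and the helper names `plain_bucket` and `shifted_bucket` are hypothetical. With the shift, the bucket labelled T collects rows from the second before T, and a row stamped exactly at T moves on to the next bucket.

```python
from datetime import datetime, timedelta


def plain_bucket(ts: datetime) -> datetime:
    """Emulate time_bucket('1 second', ts): truncate to the whole second."""
    return ts.replace(microsecond=0)


def shifted_bucket(ts: datetime) -> datetime:
    """Emulate time_bucket('1 second', ts + '1 second'): shift, then truncate."""
    return (ts + timedelta(seconds=1)).replace(microsecond=0)


mid_second = datetime(2020, 1, 1, 8, 39, 23, 393000)
on_the_second = datetime(2020, 1, 1, 8, 39, 24)

for ts in (mid_second, on_the_second):
    print(ts, "plain:", plain_bucket(ts), "shifted:", shifted_bucket(ts))
```

Here the 08:39:23.393 row is reported under 08:39:24, while the row at exactly 08:39:24 is reported under 08:39:25.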

Create a table with a timestamp for every second

    -- time_bucket_gapfill also emits the seconds that have no rows
    select 
        time_bucket_gapfill('1 second', time) as per_second
    from timeseries
    where time > <date_start> and time < <date_end>
    group by per_second
    order by per_second;

Here I generate a table in which each row is a per-second timestamp.

For example:

per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000
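
For intuition, the grid of per-second timestamps can be sketched in plain Python. This is only an approximation of the idea: the helper `per_second_grid` is hypothetical, and the half-open range used here is an assumption, so the exact boundary handling of time_bucket_gapfill may differ.

```python
from datetime import datetime, timedelta


def per_second_grid(start: datetime, end: datetime):
    """Yield one timestamp per whole second in the half-open range [start, end)."""
    current = start.replace(microsecond=0)
    while current < end:
        yield current
        current += timedelta(seconds=1)


grid = list(per_second_grid(datetime(2020, 1, 1, 0, 0, 0),
                            datetime(2020, 1, 1, 0, 0, 6)))
for ts in grid:
    print(ts)
```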

Join them together and add a value_partition column

select
    per_second,
    last_value,
    -- running count of non-null values: increments only on rows that have data
    sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
    (
        -- latest value per (shifted) second
        select 
            time_bucket('1 second', time + '1 second') as interval,
            last(value, db_id) as last_value
        from timeseries
        where time > <date_start> and time < <date_end>
        group by interval  -- group by the bucket only, so each second yields one row
    ) a
right join
    (
        -- every second in the range, including those without data
        select 
            time_bucket_gapfill('1 second', time) as per_second
        from timeseries
        where time > <date_start> and time < <date_end>
        group by per_second
    ) b
on a.interval = b.per_second

Inspired by this answer, the goal is to have a counter (value_partition) that increments only when the value is not null.

For example:

per_second              last_value   value_partition
2020-01-01 00:00:00.000 NULL         0         
2020-01-01 00:00:01.000 15.82        1         
2020-01-01 00:00:02.000 NULL         1         
2020-01-01 00:00:03.000 NULL         1         
2020-01-01 00:00:04.000 NULL         1         
2020-01-01 00:00:05.000 NULL         1         
2020-01-01 00:00:06.000 NULL         1         
2020-01-01 00:00:07.000 NULL         1         
2020-01-01 00:00:08.000 NULL         1         
2020-01-01 00:00:09.000 NULL         1         
2020-01-01 00:00:10.000 15.72        2 
2020-01-01 00:00:10.000 14.67        3
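
The counter is easy to reason about outside SQL. Below is a minimal Python sketch of the same running sum, assuming the rows are already ordered by per_second and each second appears once; the function name `value_partition` simply mirrors the column name and is not part of any library.

```python
from itertools import accumulate


def value_partition(values):
    """Running count of non-null entries, mirroring
    sum(case when v is null then 0 else 1 end) over (order by per_second)."""
    return list(accumulate(0 if v is None else 1 for v in values))


last_values = [None, 15.82, None, None, 15.72]
print(value_partition(last_values))
```

Every null row inherits the counter of the most recent non-null row before it, so each partition starts with a real value followed by the gaps it should fill.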

Fill in the null values

select
    per_second,
    -- the first row of each partition holds the real value; copy it over the nulls
    first_value(last_value) over (partition by value_partition order by per_second) as latest_value
from
    (
        select
            per_second,
            last_value,
            sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
        from
            (
                select 
                    time_bucket('1 second', time + '1 second') as interval,
                    last(value, db_id) as last_value
                from timeseries
                where time > <date_start> and time < <date_end>
                group by interval
            ) a
        right join
            (
                select 
                    time_bucket_gapfill('1 second', time) as per_second
                from timeseries
                where time > <date_start> and time < <date_end>
                group by per_second
            ) b
        on a.interval = b.per_second
    ) as q
order by per_second;

This last step brings everything together. It uses the value_partition column and overwrites the null values accordingly.

For example:

per_second              latest_value
2020-01-01 00:00:00.000 NULL        
2020-01-01 00:00:01.000 15.82       
2020-01-01 00:00:02.000 15.82       
2020-01-01 00:00:03.000 15.82       
2020-01-01 00:00:04.000 15.82       
2020-01-01 00:00:05.000 15.82       
2020-01-01 00:00:06.000 15.82       
2020-01-01 00:00:07.000 15.82       
2020-01-01 00:00:08.000 15.82       
2020-01-01 00:00:09.000 15.82       
2020-01-01 00:00:10.000 15.72       
2020-01-01 00:00:10.000 14.67
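
The fill step can be sketched the same way in Python: group rows by the running counter and repeat the first value of each group, which is what first_value(...) over (partition by value_partition ...) does. This is an illustration of the technique on a plain list, assuming rows are ordered by per_second; `fill_nulls` is a hypothetical helper name.

```python
from itertools import accumulate, groupby


def fill_nulls(values):
    """Carry the last non-null value forward using partition ids."""
    # Partition id: running count of non-null values (the value_partition column).
    partitions = accumulate(0 if v is None else 1 for v in values)
    filled = []
    for _, group in groupby(zip(partitions, values), key=lambda pair: pair[0]):
        group = list(group)
        first = group[0][1]  # first_value(...) within the partition
        filled.extend(first for _ in group)
    return filled


print(fill_nulls([None, 15.82, None, None, 15.72, None]))
```

Leading nulls stay null because partition 0 has no real value to copy, matching the first row of the example output.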

[Discussion]: