【Question title】: Most efficient way to join two time series
【Posted】: 2018-06-14 16:43:44
【Question】:

Imagine I have a table like this:

CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date)
);

With values like these:

INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;

I want to be able to do this:

SELECT a.snapshot_date, 
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS COUNT
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date
 GROUP BY a.snapshot_date

Which produces a result like this:

*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
|  2017-01-02   |   10.0    |    1  |
|  2017-01-03   |   7.0     |    2  |
|  2017-01-04   |   9.0     |    3  |
|  2017-01-05   |   8.5     |    4  |
|  2017-01-06   |   9.8     |    5  |
*---------------*-----------*-------*

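With a small number of rows, as in this example, the query runs very quickly. The problem is that I have to do this for millions of rows, and on Redshift (whose syntax is similar to Postgres) my query takes days to run. It is extremely slow, yet it is one of my most common query patterns. I suspect the cause is that the self-join makes the work grow as O(n^2) in the number of rows, rather than the preferable O(n).

My O(n) implementation in Python looks like this: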
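A rough sketch of where the quadratic growth comes from: with n distinct dates, the condition `a.snapshot_date > b.snapshot_date` pairs every row with all earlier rows, so the join emits n(n-1)/2 intermediate rows before `GROUP BY` collapses them. The helper name `self_join_pairs` is hypothetical and only illustrates the count; it does not model how Redshift actually executes the join.

```python
def self_join_pairs(n):
    """Count (a, b) pairs with a > b among n distinct dates (brute force)."""
    return sum(1 for a in range(n) for b in range(n) if a > b)


for n in (6, 1_000):
    # The brute-force count agrees with the closed form n(n-1)/2.
    assert self_join_pairs(n) == n * (n - 1) // 2
    print(n, self_join_pairs(n))
```

For the six sample rows this gives 15 pairs, which matches the sum of the `count` column in the result table (1 + 2 + 3 + 4 + 5). For 3 million rows the same formula gives roughly 4.5 × 10^12 pairs.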

rows = [('2017-01-01',10),
        ('2017-01-02',4),
        ('2017-01-03',13),
        ('2017-01-04',7),
        ('2017-01-05',15),
        ('2017-01-06',8)]
sales_total_previous = 0
count = 0
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date,sales_avg, count))
    sales_total_previous += sales

With results like this (the same as the SQL query):

('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)

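I am considering switching to Apache Spark so that I can run exactly that Python logic, but a few million rows is not that large (3-4 GB at most), and using a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient and easy-to-read way to get O(n) efficiency in SQL, preferably in Postgres / Redshift?

【Question discussion】:

    Tags: python sql postgresql amazon-redshift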


    【Solution 1】:

    You seem to want:

    SELECT ts.snapshot_date, 
           AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg,
           ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT
    FROM time_series ts;
    

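    You will find that using window functions is much more efficient.

    【Comments】:

    • Wow. This is amazing. It took my runtime from a week down to just 23 seconds.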
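One caveat worth checking: with only `ORDER BY` in the window, the default frame includes the current row, and the query also returns a row for the first date, whereas the self-join in the question averages only strictly earlier rows and omits the first date. A minimal sketch of a variant that reproduces the question's output uses an explicit frame, `ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING`. It is run here against an in-memory SQLite database (assumed version 3.25 or later, for window function support) purely as a stand-in for Postgres / Redshift. The function name `running_avg_excluding_current` and the aliases `cnt` and `t` are hypothetical. I believe Redshift requires an explicit frame clause for aggregate window functions that use `ORDER BY`, so spelling the frame out may be needed there anyway.

```python
import sqlite3


def running_avg_excluding_current(rows):
    """Return (snapshot_date, sales_avg, cnt) computed over strictly earlier rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE time_series (snapshot_date TEXT PRIMARY KEY, sales INTEGER)"
    )
    con.executemany("INSERT INTO time_series VALUES (?, ?)", rows)
    query = """
        SELECT snapshot_date, sales_avg, cnt
        FROM (
            SELECT snapshot_date,
                   AVG(sales) OVER (ORDER BY snapshot_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND 1 PRECEDING) AS sales_avg,
                   COUNT(*)   OVER (ORDER BY snapshot_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND 1 PRECEDING) AS cnt
            FROM time_series
        ) AS t
        WHERE cnt > 0  -- drop the first date, which has no earlier rows
        ORDER BY snapshot_date
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


rows = [("2017-01-01", 10), ("2017-01-02", 4), ("2017-01-03", 13),
        ("2017-01-04", 7), ("2017-01-05", 15), ("2017-01-06", 8)]
for row in running_avg_excluding_current(rows):
    print(row)
```

On the sample data this yields five rows, from 2017-01-02 through 2017-01-06, with counts 1 through 5, in line with the table in the question.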