SQL: Building Features Using a Rolling Window in Non-O(N^2) Time

【Question title】: SQL Building Features Using Rolling Window in Non O(N^2) Time
【Posted】: 2020-05-01 21:41:48
【Question】:

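I am building features on top of a fact table (for example, an invoice history) that simply keeps being appended to. A basic invoice history table might look like this: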

|   date     |   customer   | product  | amount  | feature c-p (past 5 days) |  ...
-----------------------------------------------------------------------------------
| 2020/01/01 |      CA      |   P1     | 10      |    NA                     |
| 2020/01/02 |      CA      |   P1     | 5       |    10   = 10              |
| 2020/01/05 |      CA      |   P1     | 20      |    15   = 5 + 10          |
| 2020/01/07 |      CA      |   P1     | 20      |    25   = 20 + 5          |
                                                  (01/01 out of range above) |
| 2020/01/15 |      CA      |   P1     | 100     |    0    = 0               |
| 2020/01/17 |      CA      |   P1     | 200     |    100  = 100             |
| 2020/01/31 |      CA      |   P1     | 20      |    0    = 0               |

Initially, we wrote the logic with a self-join, something like this:

select 
    c.date, 
    c.customer, 
    c.product, 
    c.amount, 
    sum(c.amount2)
from
    (select 
        i1.*,
        i2.date as date2, 
        i2.amount as amount2
    from invoice i1
    inner join invoice i2
    on i1.customer = i2.customer 
    and i1.product = i2.product 
    and i2.date < i1.date and i2.date >= i1.date - 5    -- where we customize the window (i2 = earlier rows within the past 5 days)
    ) c   
group by 
    c.date, 
    c.customer, 
    c.product, 
    c.amount

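If I remember correctly, this self-join is itself O(N^2), but the logic is simple and everyone can understand it. That was fine until recently, when we started working with a large table and the approach blew up.

I had been considering window functions, but I am not sure whether there is a more efficient way (in both compute and storage)?

My idea is to use a window function, but my window is a custom range rather than a fixed look-back of N rows: it should look back 5 days. Is that possible in Hive/Impala? If not, I suppose I will have to fill in the missing dates and then use a window function. Any suggestions are welcome.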
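The fallback mentioned above (fill in the missing dates, then use a fixed row look-back) could be sketched roughly as follows. This is only an illustration under assumptions: the table is assumed to be `invoice(invoice_date, customer, product, amount)`, and `calendar(d)` is a hypothetical helper table with one row per calendar day; neither name comes from the original post. Once every day is present, "5 rows back" and "5 days back" mean the same thing.

```sql
-- Assumed names (not from the original post):
--   invoice(invoice_date, customer, product, amount)
--   calendar(d): hypothetical table with one row per calendar day
with daily as (      -- collapse same-day invoices to one row per day
    select customer, product, invoice_date, sum(amount) as day_amount
    from invoice
    group by customer, product, invoice_date
),
bounds as (          -- date span to densify for each customer/product
    select customer, product,
           min(invoice_date) as first_d,
           max(invoice_date) as last_d
    from daily
    group by customer, product
),
span as (            -- one row per customer/product/day inside that span
    select b.customer, b.product, c.d
    from bounds b
    cross join calendar c
    where c.d between b.first_d and b.last_d
),
dense as (           -- days without invoices contribute 0
    select s.customer, s.product, s.d,
           coalesce(dl.day_amount, 0) as day_amount
    from span s
    left join daily dl
      on dl.customer = s.customer
     and dl.product = s.product
     and dl.invoice_date = s.d
),
feat as (            -- with no gaps, 5 rows back equals 5 days back
    select customer, product, d,
           sum(day_amount) over (partition by customer, product
                                 order by d
                                 rows between 5 preceding and 1 preceding) as feature_cp
    from dense
)
select i.invoice_date, i.customer, i.product, i.amount, f.feature_cp
from invoice i
join feat f
  on f.customer = i.customer
 and f.product = i.product
 and f.d = i.invoice_date;
```

The trade-off is storage: the densified set has one row per customer, product, and day, which can be much larger than the invoice table when purchases are sparse. In this sketch the first day of each customer/product pair yields NULL, while later days with no recent invoices yield 0.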

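(We use Hive/Impala today, but if another database offers a more efficient way, I am certainly open to it.)

Update

I just ran a benchmark on 20 million rows of real data, and the time savings were substantial:

Self-join with filter: 128 minutes
Window function with date conversion: 15 minutes (Gordon's answer). Most importantly, this approach is guaranteed not to introduce duplicates, which matters because the same customer may buy the same product more than once on the same day
Hive does not support inline correlated subqueries, but GBM's solution should be able to avoid the full Cartesian join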

【Comments】:

Tags: sql hive impala

【Answer 1】:

Hive supports range, but I think only with numbers. Fortunately, you can convert the date to a number and still use it:

select t.*,
       sum(amount) over (partition by customer, product
                         order by days
                         range between 5 preceding and 1 preceding
                        )
from (select t.*,
             datediff(date, '2000-01-01') as days
      from t
     ) t;

One issue is that it is hard to distinguish 2020-01-01 from 2020-01-31: both return NULL. If you really want to tell them apart, you can use lag() and case:

select t.*,
       (case when datediff(date, prev_date) > 5 then 0
             when prev_date is null then null
             else sum(amount) over (partition by customer, product
                                    order by days
                                    range between 5 preceding and 1 preceding
                                   )
        end)
from (select t.*,
             datediff(date, '2000-01-01') as days,
             lag(date) over (partition by customer, product order by date) as prev_date
      from t
     ) t;
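A possible variant of the same idea, offered only as a sketch: treat a row as having "no history" when it falls on the first day seen for its customer and product, and turn every other empty window into 0 with coalesce(). The column name `invoice_date` and the aliases `first_days` and `feature_cp` are assumptions introduced here, not part of the answer above.

```sql
-- Assumes t(invoice_date, customer, product, amount); invoice_date is a
-- hypothetical column name standing in for the date column.
select d.*,
       (case when d.days = d.first_days then null   -- first day: no history yet
             else coalesce(sum(d.amount) over (partition by d.customer, d.product
                                               order by d.days
                                               range between 5 preceding and 1 preceding
                                              ), 0)    -- empty window: 0
        end) as feature_cp
from (select t.*,
             datediff(invoice_date, '2000-01-01') as days,
             min(datediff(invoice_date, '2000-01-01'))
                 over (partition by customer, product) as first_days
      from t
     ) d;
```

Compared with the lag() version, this keeps the result NULL for every invoice on the first day, including repeated purchases on that day, and returns 0 whenever earlier history exists but none of it falls inside the 5-day window.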

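【Comments】:

There are always edge cases at the start and the end. I am trying out a few approaches such as LOCF or LOCB, but thank you for the case expression; it is a great framework.

【Answer 2】:

If you are lucky enough to be running a database that supports window functions with an interval-based range clause (e.g. Postgres, starting with version 11), you can do: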

select
    t.*,
    sum(amount) over(
        partition by customer, product
        order by date
        range between interval '5 day' preceding and interval '1 day' preceding
    ) feature_cp
from mytable t

Demo on DB Fiddle:

date       | customer | product | amount | feature_cp
:--------- | :------- | :------ | -----: | ---------:
2020-01-01 | CA       | P1      |     10 |       null
2020-01-02 | CA       | P1      |      5 |         10
2020-01-05 | CA       | P1      |     20 |         15
2020-01-07 | CA       | P1      |     20 |         25
2020-01-15 | CA       | P1      |    100 |       null
2020-01-17 | CA       | P1      |    200 |        100
2020-01-31 | CA       | P1      |     20 |       null

Otherwise, I would suggest a correlated subquery. It is more efficient than your join query because it avoids the need for an outer aggregation:

select
    t.*,
    (
        select sum(amount) 
        from mytable t1 
        where 
            t1.customer = t.customer 
            and t1.product = t.product
            and t1.date < t.date
            and t1.date >= t.date - interval '5 day'
    ) feature_cp
from mytable t
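For engines that reject a correlated subquery in the select list (as reported for Hive in this thread), the same per-row sum can be expressed with an equi-join plus conditional aggregation. The following is a minimal sketch, assuming a hypothetical unique key `invoice_id` and a date column `invoice_date` stored as a DATE or a 'yyyy-MM-dd' string; neither name appears in the original posts.

```sql
-- Assumes invoice(invoice_id, invoice_date, customer, product, amount),
-- where invoice_id is a hypothetical unique key per invoice row.
select i1.invoice_id,
       i1.invoice_date,
       i1.customer,
       i1.product,
       i1.amount,
       -- only rows strictly before i1 and within the past 5 days are summed;
       -- the result is NULL when no such row exists
       sum(case when i2.invoice_date < i1.invoice_date
                 and i2.invoice_date >= date_sub(i1.invoice_date, 5)
                then i2.amount
           end) as feature_cp
from invoice i1
join invoice i2                      -- equi-join only, so every row matches at least itself
  on i1.customer = i2.customer
 and i1.product = i2.product
group by i1.invoice_id,              -- keeps same-day duplicate invoices as separate rows
         i1.invoice_date,
         i1.customer,
         i1.product,
         i1.amount;
```

Grouping by the unique key keeps two identical invoices on the same day from collapsing into one row. The join still pairs every row with every other row of the same customer and product, so it avoids the full Cartesian product but remains quadratic within each group; the window-function approach is likely to scale better where it is available.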

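【Comments】:

Yes. Unfortunately, Hive does not support inline subqueries, at least not in the version I am currently using, but it certainly looks more efficient than cross joining and then filtering.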