【Question title】: SQL Left Join Within Timeframe Window
【Posted】: 2021-09-11 02:49:58
【Question】:

I have two datasets:

dataset_a
time_stamp                   user    group    value
2021-06-20 12:48:24.521         A    video        1
2021-06-15 12:50:24.521         A    video        1
2021-06-10 12:48:24.521         A    video        1    

dataset_b
time_stamp                   user    group    label
2021-06-20 09:40:24.521         A    video       BA
2021-06-19 13:30:24.521         A    video       BB  
2021-06-13 12:48:24.521         A    video       BC  
2021-06-09 12:55:24.521         A    video       BD   

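I want to build a dataset in which a row of dataset_b matches a row of dataset_a when user and group are equal and the dataset_b timestamp falls within 1 day before the dataset_a timestamp. Has anyone done something similar, for example left join on dataset_b.timestamp between dataset_a.timestamp and date_add(dataset_a.timestamp,-1)? I would like it to be flexible, so that I can easily change it to test -7 days in the future.

The expected output is: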

 dataset_a
time_stamp                   user    group    value    timestamp_b               label
2021-06-20 12:48:24.521         A    video      0.5    2021-06-20 09:40:24.521      BA
2021-06-20 12:48:24.521         A    video      0.5    2021-06-19 13:30:24.521      BB
2021-06-15 12:50:24.521         A    video        1    NULL                       NULL   
2021-06-10 12:48:24.521         A    video        1    2021-06-09 12:55:24.521      BD    

【Discussion】:

    Tags: sql left-join snowflake-cloud-data-platform dateadd


    【Solution 1】:

    A JOIN condition does not have to consist only of equality operators, so:

    SELECT *
    FROM dataset_a
    LEFT JOIN dataset_b
      ON dataset_b.user = dataset_a.user
     AND dataset_b.group = dataset_a.group
     AND dataset_b.time_stamp BETWEEN dataset_a.time_stamp - INTERVAL '1 day'
                                  AND dataset_a.time_stamp ;
    

    is a valid join.

    db<>fiddle demo
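
    The question also asks for a window that can later be switched from 1 day to 7 days. A minimal sketch of that idea follows, run against in-memory SQLite from Python so it can be tried without a warehouse. The columns are renamed to `usr` and `grp` to avoid reserved words, `window_join` is a hypothetical helper, and `julianday()` is SQLite-specific. In Snowflake the lower bound could instead be written along the lines of `DATEADD(day, -7, dataset_a.time_stamp)`.

```python
import sqlite3

# Sample rows copied from the question.
A_ROWS = [
    ("2021-06-20 12:48:24.521", "A", "video", 1),
    ("2021-06-15 12:50:24.521", "A", "video", 1),
    ("2021-06-10 12:48:24.521", "A", "video", 1),
]
B_ROWS = [
    ("2021-06-20 09:40:24.521", "A", "video", "BA"),
    ("2021-06-19 13:30:24.521", "A", "video", "BB"),
    ("2021-06-13 12:48:24.521", "A", "video", "BC"),
    ("2021-06-09 12:55:24.521", "A", "video", "BD"),
]

# julianday() converts a text timestamp to a fractional number of days,
# so the look-back window is a plain subtraction of the :days parameter.
# Rows that sit exactly on the window edge are subject to float rounding.
QUERY = """
SELECT a.time_stamp, a.value, b.time_stamp, b.label
FROM dataset_a AS a
LEFT JOIN dataset_b AS b
  ON b.usr = a.usr
 AND b.grp = a.grp
 AND julianday(b.time_stamp) BETWEEN julianday(a.time_stamp) - :days
                                 AND julianday(a.time_stamp)
ORDER BY a.time_stamp DESC, b.time_stamp DESC
"""


def window_join(days):
    """Left join dataset_a to dataset_b rows at most `days` days older."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_a (time_stamp TEXT, usr TEXT, grp TEXT, value REAL)"
    )
    conn.execute(
        "CREATE TABLE dataset_b (time_stamp TEXT, usr TEXT, grp TEXT, label TEXT)"
    )
    conn.executemany("INSERT INTO dataset_a VALUES (?, ?, ?, ?)", A_ROWS)
    conn.executemany("INSERT INTO dataset_b VALUES (?, ?, ?, ?)", B_ROWS)
    rows = conn.execute(QUERY, {"days": days}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in window_join(1):
        print(row)
```

    Calling `window_join(7)` reruns the same join with a 7-day look-back, so the window size is the only thing that changes.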

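    【Comments】:

    • I always find it useful to provide code that can be copied and pasted into Snowflake. It takes longer, but it means the SQL has been validated and users can get up and running faster :-)
    • @AdrianWhite That is exactly why I provided a live demo :) The PostgreSQL syntax can be copied almost 1:1
    • Awesome... sorry, I did not see/understand that. Great work.
    • Thank you so much! Is there an easy way to distribute the value evenly when there are duplicates?
    • Normally I would distribute it by counting the duplicates of the matching key as a new column on each side before the join, then dividing after the join. But I don't know how to do that when the matching key is a range like the one above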
    【Solution 2】:

    A few tweaks... the same as Lukasz's answer just now, but you can copy/paste this and run it in Snowflake :-)

     with dataset_a as (
    select '2021-06-20 12:48:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 
    'video' groups,1 value
    union all select '2021-06-15 12:50:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 'video' groups,1 value
    union all select '2021-06-10 12:48:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 'video' groups,1 value 
    ) , dataset_b as( 
    select '2021-06-19 13:30:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BB' label  
    union all select '2021-06-13 12:48:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BC' label  
    union all select '2021-06-09 12:55:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BD' label  
    union all select '2021-06-20 09:40:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BA' label) 
    SELECT *
    FROM dataset_a
    LEFT JOIN dataset_b
    ON dataset_b.user = dataset_a.user
    AND dataset_b.groups = dataset_a.groups
    AND dataset_b.time_stamp between dataset_a.time_stamp - INTERVAL '1 day' 
    and dataset_a.time_stamp ; 
    

    Added avg(value) to clean up the dups... or just add an avg windowed over your key: avg(dataset_a.value) over (partition by dataset_a.time_stamp, dataset_a.user, dataset_a.groups, dataset_b.user)

     with dataset_a as (
     select '2021-06-20 12:48:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 
     'video' groups,1 value
     union all select '2021-06-15 12:50:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 'video' groups,1 value
     union all select '2021-06-10 12:48:24.521'::TIMESTAMP_LTZ time_stamp, 'A' user, 'video' groups,1 value 
     ) , dataset_b as( 
     select '2021-06-19 13:30:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BB' label  
     union all select '2021-06-13 12:48:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BC' label  
     union all select '2021-06-09 12:55:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BD' label  
     union all select '2021-06-20 09:40:24.521'::TIMESTAMP_LTZ time_stamp,'A' user,'video' groups,'BA' label) 
     SELECT dataset_a.time_stamp, dataset_a.user, dataset_a.groups, avg(dataset_a.value), dataset_b.time_stamp, dataset_b.user, dataset_b.groups,dataset_b.label
     FROM dataset_a
     LEFT JOIN dataset_b
     ON dataset_b.user = dataset_a.user
     AND dataset_b.groups = dataset_a.groups
     AND dataset_b.time_stamp between dataset_a.time_stamp - INTERVAL '1 day' 
     and dataset_a.time_stamp 
     group by 1,2,3,5,6,7,8
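
    Averaging leaves value at 1 on every duplicated row, whereas the expected output in the question shows 0.5 on each of the two rows matched to the 2021-06-20 record. One possible way to get that even split is to divide value by the number of joined rows per dataset_a row, using a windowed COUNT(*). The sketch below is an illustration, not part of the original answer. It runs on in-memory SQLite from Python (window functions need SQLite 3.25 or later), renames the columns to `usr` and `grp`, and assumes that (time_stamp, usr, grp) identifies a single dataset_a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset_a (time_stamp TEXT, usr TEXT, grp TEXT, value REAL);
CREATE TABLE dataset_b (time_stamp TEXT, usr TEXT, grp TEXT, label TEXT);
INSERT INTO dataset_a VALUES
  ('2021-06-20 12:48:24.521', 'A', 'video', 1),
  ('2021-06-15 12:50:24.521', 'A', 'video', 1),
  ('2021-06-10 12:48:24.521', 'A', 'video', 1);
INSERT INTO dataset_b VALUES
  ('2021-06-20 09:40:24.521', 'A', 'video', 'BA'),
  ('2021-06-19 13:30:24.521', 'A', 'video', 'BB'),
  ('2021-06-13 12:48:24.521', 'A', 'video', 'BC'),
  ('2021-06-09 12:55:24.521', 'A', 'video', 'BD');
""")

# COUNT(*) OVER (...) counts the joined rows that share one dataset_a row.
# An unmatched dataset_a row still yields one NULL-extended row, so the
# divisor is never zero and such a row keeps its full value.
QUERY = """
SELECT a.time_stamp,
       a.value / COUNT(*) OVER (PARTITION BY a.time_stamp, a.usr, a.grp)
           AS value_share,
       b.label
FROM dataset_a AS a
LEFT JOIN dataset_b AS b
  ON b.usr = a.usr
 AND b.grp = a.grp
 AND julianday(b.time_stamp) BETWEEN julianday(a.time_stamp) - 1
                                 AND julianday(a.time_stamp)
ORDER BY a.time_stamp DESC, b.time_stamp DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

    With the sample data, the two rows for the 2021-06-20 record each receive 0.5, and the other two records keep 1.0, which matches the value column of the expected output in the question.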
    

    【Comments】:
