【Question title】: How to combine data from different variables together in SQL?
【Posted】: 2021-05-21 23:32:31
【Question description】:

Suppose I have data like this:

USER_ID                 TIMESTAMP   data  data2
   0001   2021-05-09 12:13:03.445            44
   0001   2021-05-09 13:13:03.445    rob    
   0001   2021-05-09 11:13:03.445       
   0002   2021-05-09 09:13:03.445  perry    333
   0002   2021-05-09 12:13:03.445   carl    333
   0003   2021-05-09 16:13:03.445  mitch      1
   0003   2021-05-09 17:13:03.445
   0002   2021-05-09 16:13:03.445  mitch      5

All I want to do is collect the most recent non-null value from each column and condense them into one table, with one row per user.

Final result:

USER_ID   data  data2
   0001    rob     44 
   0003  mitch      1
   0002  mitch      5

This is what I have, but it is incomplete:

WITH form AS (
    SELECT b.*,
        RANK() OVER (
            PARTITION BY user_id
            ORDER BY timestamp DESC
        ) AS num
    FROM b
)
SELECT *
FROM form
WHERE num = 1

【Question comments】:

    Tags: sql snowflake-cloud-data-platform partition


    【Solution 1】:

    Related: Equivalent for Keep in Snowflake

    It can be achieved with:

    WITH cte(user_id, timestamp, "data", data2) AS (
      SELECT *
      FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
       ('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
       ('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL),
       ('0002','2021-05-09 09:13:03.445'::timestamp,'perry',333),
       ('0002','2021-05-09 12:13:03.445'::timestamp,'carl',333),
       ('0003','2021-05-09 16:13:03.445'::timestamp,'mitch',1),
       ('0003','2021-05-09 17:13:03.445'::timestamp,NULL,NULL),
       ('0002','2021-05-09 16:13:03.445'::timestamp,'mitch',5)
      ) 
    )
    SELECT user_id,
      (ARRAY_AGG("data") WITHIN GROUP (ORDER BY timestamp DESC))[0]::STRING AS "data",
      (ARRAY_AGG(data2)  WITHIN GROUP (ORDER BY timestamp DESC))[0] AS data2
    FROM cte
    GROUP BY user_id
    ORDER BY user_id;
    

    Output:

    +---------+----------+-------+
    | USER_ID |   data   | data2 |
    +---------+----------+-------+
    |    0001 | rob      |    44 |
    |    0002 | mitch    |     5 |
    |    0003 | mitch    |     1 |
    +---------+----------+-------+
    

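    ARRAY_AGG omits NULLs by default, and WITHIN GROUP sorts the remaining values by timestamp in descending order. Once the array for each user_id has been built, you only need to access its first element (the one at index [0]), which is the most recent non-null value.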
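
The same "drop the NULLs, sort newest first, keep the first element" idea can be tried without a Snowflake account. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module and portable SQL (correlated subqueries with LIMIT 1) rather than Snowflake's ARRAY_AGG, the table name b is borrowed from the question, and the column ts stands in for timestamp.

```python
import sqlite3

# Sample rows from the question: (user_id, ts, data, data2).
ROWS = [
    ("0001", "2021-05-09 12:13:03.445", None, 44),
    ("0001", "2021-05-09 13:13:03.445", "rob", None),
    ("0001", "2021-05-09 11:13:03.445", None, None),
    ("0002", "2021-05-09 09:13:03.445", "perry", 333),
    ("0002", "2021-05-09 12:13:03.445", "carl", 333),
    ("0003", "2021-05-09 16:13:03.445", "mitch", 1),
    ("0003", "2021-05-09 17:13:03.445", None, None),
    ("0002", "2021-05-09 16:13:03.445", "mitch", 5),
]

# Portable SQL, not Snowflake syntax: for each column, filter out NULLs,
# sort what is left newest-first, and keep only the first row.
QUERY = """
SELECT u.user_id,
       (SELECT data  FROM b
         WHERE b.user_id = u.user_id AND data IS NOT NULL
         ORDER BY ts DESC LIMIT 1) AS data,
       (SELECT data2 FROM b
         WHERE b.user_id = u.user_id AND data2 IS NOT NULL
         ORDER BY ts DESC LIMIT 1) AS data2
FROM (SELECT DISTINCT user_id FROM b) AS u
ORDER BY u.user_id
"""


def latest_non_null(rows):
    """Load rows into an in-memory table and return one tuple per user."""
    conn = sqlite3.connect(":memory:")
    # Timestamps are stored as ISO-8601 text, which sorts chronologically.
    conn.execute("CREATE TABLE b (user_id TEXT, ts TEXT, data TEXT, data2 INTEGER)")
    conn.executemany("INSERT INTO b VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


RESULT = latest_non_null(ROWS)
```

RESULT holds one (user_id, data, data2) tuple per user. A user whose column is NULL in every row would get NULL for that column, because the subquery returns no row.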

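    【Comments】:

      【Solution 2】:

      Hmm . . . this is where ignore nulls is really useful, but Postgres does not support it (yet??).

      Instead, you can use arrays, ordering the non-NULL values first and then by timestamp descending: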

      -- Non-NULL values sort first, newest first; Postgres arrays are 1-based.
      select user_id,
             (array_agg(data order by (data is not null) desc, timestamp desc))[1] as data,
             (array_agg(data2 order by (data2 is not null) desc, timestamp desc))[1] as data2
      from t
      group by user_id;
      

      Here is a dbfiddle.
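
To see why the first array element is the wanted value, it can help to look at the row order that the two-part sort key produces. The sketch below is an illustration under assumptions, not the answer's Postgres query: it runs in SQLite through Python's sqlite3 module, uses a reduced copy of the sample data with a column named ts, and adds a hypothetical user 0004 whose data is always NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id TEXT, ts TEXT, data TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("0001", "2021-05-09 12:13:03.445", None),
        ("0001", "2021-05-09 13:13:03.445", "rob"),
        ("0003", "2021-05-09 16:13:03.445", "mitch"),
        ("0003", "2021-05-09 17:13:03.445", None),
        ("0004", "2021-05-09 18:13:03.445", None),  # hypothetical: no data at all
    ],
)

# (data IS NOT NULL) evaluates to 1 or 0, so DESC puts non-NULL rows first;
# ts DESC then orders rows newest-first within each of the two groups.
ORDERED = conn.execute(
    """
    SELECT user_id, ts, data
    FROM t
    ORDER BY user_id, (data IS NOT NULL) DESC, ts DESC
    """
).fetchall()
conn.close()

# The first row seen for each user plays the role of array element [1].
FIRST_PER_USER = {}
for user_id, _ts, data in ORDERED:
    FIRST_PER_USER.setdefault(user_id, data)
```

For user 0003 the newest row has NULL data, yet the older 'mitch' row sorts ahead of it. User 0004 has only NULL values, so its first element is NULL.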

      【Comments】:

      • Useful or useless?
      • @JohnThomas . . . I know what I meant. Now I have also rephrased it to convey that.
      【Solution 3】:

      You can use the LAST_VALUE or FIRST_VALUE function with IGNORE NULLS. For your data set:

      WITH x AS (
      SELECT *
      FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
         ('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
         ('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL),
         ('0002','2021-05-09 09:13:03.445'::timestamp,'perry',333),
         ('0002','2021-05-09 12:13:03.445'::timestamp,'carl',333),
         ('0003','2021-05-09 16:13:03.445'::timestamp,'mitch',1),
         ('0003','2021-05-09 17:13:03.445'::timestamp,NULL,NULL),
         ('0002','2021-05-09 16:13:03.445'::timestamp,'mitch',5)
        ) x (id, ts, data, data2)
      )
      

      You would do this:

      SELECT id,
             LAST_VALUE(data) IGNORE NULLS OVER (PARTITION BY ID ORDER BY ts) as data_last,
             LAST_VALUE(data2) IGNORE NULLS OVER (PARTITION BY ID ORDER BY ts) as data2_last
      FROM x
      QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) = 1;
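
What LAST_VALUE ... IGNORE NULLS computes per partition can be emulated in a few lines of plain Python. This is a rough sketch that assumes the window frame spans the whole partition; the function name last_value_ignore_nulls and the dictionary layout are made up for illustration.

```python
from operator import itemgetter


def last_value_ignore_nulls(rows, column):
    """Return {id: last non-null value of `column`, ordered by ts ascending}."""
    latest = {}
    for row in sorted(rows, key=itemgetter("id", "ts")):
        latest.setdefault(row["id"], None)    # ids with only NULLs keep None
        if row[column] is not None:
            latest[row["id"]] = row[column]   # a later non-null value overwrites
    return latest


COLUMNS = ("id", "ts", "data", "data2")
SAMPLE = [
    dict(zip(COLUMNS, values))
    for values in [
        ("0001", "2021-05-09 12:13:03.445", None, 44),
        ("0001", "2021-05-09 13:13:03.445", "rob", None),
        ("0001", "2021-05-09 11:13:03.445", None, None),
        ("0002", "2021-05-09 09:13:03.445", "perry", 333),
        ("0002", "2021-05-09 12:13:03.445", "carl", 333),
        ("0003", "2021-05-09 16:13:03.445", "mitch", 1),
        ("0003", "2021-05-09 17:13:03.445", None, None),
        ("0002", "2021-05-09 16:13:03.445", "mitch", 5),
    ]
]

DATA_LAST = last_value_ignore_nulls(SAMPLE, "data")
DATA2_LAST = last_value_ignore_nulls(SAMPLE, "data2")
```

Every row of a partition receives the same pair of values, so keeping any single row per id is enough to obtain one result row per user.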
      

      【Comments】:
