【Question title】: Selecting Distinct Consecutive Values of Timeseries
【Posted】: 2021-10-20 11:05:24
【Question】:

I have a table in Snowflake/dbt from which I want to select only the rows where consecutive entries differ, i.e. the last row of each run of identical actions. For example, if I have

user_id  session_id  action  timestamp
2        3           scroll  21-08-01 12:00:01
2        3           scroll  21-08-01 12:00:02
2        3           scroll  21-08-01 12:00:03
2        3           click   21-08-01 12:00:04
2        3           click   21-08-01 12:00:06
2        3           scroll  21-08-01 12:00:10
2        3           saved   21-08-01 12:00:10

I would like to end up with

user_id  session_id  action  timestamp
2        3           scroll  21-08-01 12:00:03
2        3           click   21-08-01 12:00:06
2        3           scroll  21-08-01 12:00:10
2        3           saved   21-08-01 12:00:10

I've tried using row_number() and then QUALIFY, but it numbers all rows of an action in order even when they are not consecutive.

【Comments】:

    Tags: sql snowflake-cloud-data-platform distinct


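    【Solution 1】:

    I tried a slightly different approach from ggordon's: I built an inline view that carries the contents of the "next" record on each row (using the LEAD function).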

    select user_id, session_id, action, ts
    from (
      -- attach the values of the following row (ordered by ts) to each row
      select abc.*, 
             lead(user_id) ignore nulls 
               over (order by ts, user_id, session_id, action) next_user_id, 
             lead(session_id) ignore nulls 
               over (order by ts, user_id, session_id, action) next_session_id, 
             lead(action) ignore nulls 
               over (order by ts, user_id, session_id, action) next_action, 
             lead(ts) ignore nulls 
               over (order by ts, user_id, session_id, action) next_ts
      from   abc)
    -- keep a row when it is the last row overall, or when the next row
    -- differs in user, session, or action (i.e. the row ends a run)
    where next_ts is null
    or    user_id <> next_user_id
    or    session_id <> next_session_id
    or    action <> next_action
    order by ts, user_id, session_id, action;
    

    This worked well; I got the same four records you wanted.
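
    Since the question targets Snowflake, the same "compare with the next row" idea can probably be written more compactly with QUALIFY, which filters on a window function without an inline view. This is only a sketch, not the query from this answer: it reuses the `abc` and `ts` names from above and assumes that runs should be tracked per user and session (the PARTITION BY is that assumption).

```sql
select user_id, session_id, action, ts
from abc
-- Keep a row when the next row in the same user/session has a different
-- action, or when there is no next row (LEAD returns NULL, and a non-NULL
-- action IS DISTINCT FROM NULL evaluates to TRUE).
qualify action is distinct from
        lead(action) over (partition by user_id, session_id order by ts, action)
order by ts, user_id, session_id, action;
```

    Including `action` as a tie-breaker in the window ordering keeps the result deterministic when two rows share a timestamp, as the last two sample rows do.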

    I hope this helps... Rich

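    P.S. If this (or another) answer helped you, please take a moment to "accept" it by clicking the check mark next to the answer, toggling it from greyed out to filled in.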

    【Discussion】:

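      【Solution 2】:

      You may try the following. It flags every row whose (user_id, session_id, action) combination differs from the previous row, turns the running sum of those flags into a group number (gn), and then keeps the most recent row of each group.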

      SELECT
          user_id,
          session_id,
          action,
          timestamp
      FROM (
          SELECT
              *,
              ROW_NUMBER() OVER (
                   PARTITION BY user_id,session_id,action,gn
                   ORDER BY timestamp DESC
              ) as rn
          FROM (
              SELECT
                  *,
                  SUM(continued) OVER (ORDER BY timestamp) as gn
              FROM (
                  SELECT
                      *,
                      CASE 
                          WHEN
                              LAG(
                                  CONCAT(user_id,session_id,action),
                                  1,
                                  CONCAT(user_id,session_id,action)
                              ) OVER (
                                  ORDER BY timestamp
                              ) = CONCAT(user_id,session_id,action) THEN 0
                          ELSE 1
                      END as continued
                  FROM
                      my_table
              ) t2
          ) t1
      ) t
      WHERE rn=1
      

      Let me know if this works for you.
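
      Snowflake also offers a window function, CONDITIONAL_CHANGE_EVENT, that produces this kind of running group number directly. The following is a hedged sketch of the same idea rather than a drop-in replacement: it assumes the `my_table` name from the query above and per-user/session partitioning, and `gn` and `last_timestamp` are illustrative aliases.

```sql
select user_id, session_id, action, max(timestamp) as last_timestamp
from (
    select
        user_id, session_id, action, timestamp,
        -- the counter increases each time action differs from the
        -- previous row within the same user/session
        conditional_change_event(action) over (
            partition by user_id, session_id
            order by timestamp, action
        ) as gn
    from my_table
) t
-- one output row per run, carrying the run's latest timestamp
group by user_id, session_id, action, gn
order by last_timestamp, user_id, session_id, action;
```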

      【Discussion】:

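        【Solution 3】:

        This is known as a gaps-and-islands problem. It is commonly solved by building a group key from the difference of two concurrent row numberings.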

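        select
          user_id, session_id, action, max(timestamp) as last_timestamp
        from
        (
          select
            user_id, session_id, action, timestamp,
            -- position in the whole sequence minus position among rows with the
            -- same user/session/action: constant within a consecutive run
            row_number() over (order by timestamp, user_id, session_id, action) -
            row_number() over (partition by user_id, session_id, action order by timestamp)
              as grp
          from mytable
        )
        group by grp, user_id, session_id, action
        -- grp is only a grouping key and is not chronological, so sort by time
        order by last_timestamp, user_id, session_id, action;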
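
        To see why the difference of the two row numbers works as a group key, it can help to expose both numberings side by side. The sketch below assumes the sample rows from the question are loaded into `mytable`; `rn_all` and `rn_same` are illustrative aliases, not names from the query above.

```sql
-- Illustrative only: rn_all and rn_same are hypothetical aliases.
select
  action, timestamp, rn_all, rn_same,
  rn_all - rn_same as grp
from
(
  select
    user_id, session_id, action, timestamp,
    -- position in the whole sequence
    row_number() over (order by timestamp, user_id, session_id, action) as rn_all,
    -- position among rows with the same user/session/action
    row_number() over (partition by user_id, session_id, action order by timestamp) as rn_same
  from mytable
)
order by rn_all;

-- For the question's sample data (timestamps in the question's notation):
--   action  timestamp          rn_all  rn_same  grp
--   scroll  21-08-01 12:00:01  1       1        0
--   scroll  21-08-01 12:00:02  2       2        0
--   scroll  21-08-01 12:00:03  3       3        0
--   click   21-08-01 12:00:04  4       1        3
--   click   21-08-01 12:00:06  5       2        3
--   saved   21-08-01 12:00:10  6       1        5
--   scroll  21-08-01 12:00:10  7       4        3
```

        Note that grp = 3 appears both for the click run and for the final scroll row, which is why user_id, session_id, and action have to stay in the GROUP BY next to grp. At the tied timestamp 12:00:10, 'saved' sorts before 'scroll' because action is the last tie-breaker in the ordering.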
        

        【Discussion】:
