Creating a session ID based on a timestamp and an activity window (timeout)

【Question title】: Creating a session id based on a timestamp and an activity window (timeout)
【Posted】: 2022-12-18 21:53:30
【Question】:

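I'm trying to create a session_id for a dataset in Redash based on an activity window. Essentially, I have a dataset of hits that I want to split into sessions, where any chosen length of inactivity (I'm using 30 minutes, but it could be anything) marks the end of a session, and the next hit starts a new one.

I'm not a data expert (as the following will no doubt demonstrate). I tried using lag and lead plus case statements to identify the start and end of each session, but I also want to flag the rows in between as part of the same session (I want to understand which users use the site most often and which users have the longest "journeys" on the site).

Example dataset: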

User_ID	Timestamp
A1	2022-08-10 21:29:00
A1	2022-08-10 21:39:00
A1	2022-08-10 21:59:00
A1	2022-08-10 23:19:00
A1	2022-08-10 23:25:00
B2	2022-08-09 12:01:00
B2	2022-08-10 15:02:00
B2	2022-08-10 15:03:00
B2	2022-08-10 15:42:00

What I'd like to get:

User_ID	Timestamp	Visit_ID
A1	2022-08-10 21:29:00	1
A1	2022-08-10 21:39:00	1
A1	2022-08-10 21:59:00	1
A1	2022-08-10 23:19:00	2
A1	2022-08-10 23:25:00	2
B2	2022-08-09 12:01:00	1
B2	2022-08-10 15:02:00	2
B2	2022-08-10 15:03:00	2
B2	2022-08-10 15:42:00	3

What I have so far. Identifying the start of each session:

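-- Assumes timestamp holds Unix epoch seconds, so 1800 = 30 minutes.
SELECT
  a.user_id,
  a.timestamp,
  -- 0 when the previous hit is within the timeout, else this hit starts a session
  case when a.timestamp - coalesce(lag(a.timestamp, 1) over (partition by a.user_id order by a.timestamp), 0) <= 1800 then 0
       else a.timestamp
  end as session_start
from example_dataset a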

Identifying the end of each session:

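-- Assumes timestamp holds Unix epoch seconds, so 1800 = 30 minutes.
SELECT
  a.user_id,
  a.timestamp,
  -- No coalesce(..., 0) here: with it, a user's last hit gives 0 - timestamp <= 1800
  -- and is never marked as an end. A NULL lead now falls through to ELSE.
  case when lead(a.timestamp, 1) over (partition by a.user_id order by a.timestamp) - a.timestamp <= 1800 then 0
       else a.timestamp
  end as session_end
from example_dataset a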

What I can't figure out is how to get from there to the nice, tidy dataset I want. Can you help?

Thanks in advance!

【Comments】:

Tags: sql sqlite

【Solution 1】:

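Nobody answered, which I assume means I didn't tag this properly or something. To help any poor lost searcher who stumbles on this post in the future: I found a way to solve the problem.

Basically, I:

Built the session starts and ends as described above
Used the rank() over () function to give each of them an incrementing visit_id
Joined the starts and ends together on user_id and their rank
Joined that back to the hits dataset using a messy time comparison
Did my analysis
Had a glass of wine

My current problem is that the group_concat function I planned to use for path analysis doesn't seem to work.

Hope this helps, O future internetanaut.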
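
The steps listed in this answer might look roughly like the following sketch. It is not the answerer's actual query, which was not posted. It runs SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer), and the table name `hits`, the column `ts`, and the function name `sessionize_by_join` are assumptions made for illustration. Timestamps are assumed to be stored as ISO-8601 text.

```python
import sqlite3

# Sample hits from the question: (user_id, timestamp as ISO-8601 text).
HITS = [
    ("A1", "2022-08-10 21:29:00"),
    ("A1", "2022-08-10 21:39:00"),
    ("A1", "2022-08-10 21:59:00"),
    ("A1", "2022-08-10 23:19:00"),
    ("A1", "2022-08-10 23:25:00"),
    ("B2", "2022-08-09 12:01:00"),
    ("B2", "2022-08-10 15:02:00"),
    ("B2", "2022-08-10 15:03:00"),
    ("B2", "2022-08-10 15:42:00"),
]

# Hypothetical schema: hits(user_id TEXT, ts TEXT).
JOIN_SQL = """
WITH flagged AS (
  -- epoch seconds of each hit and of the same user's previous/next hit
  SELECT
    user_id,
    ts,
    CAST(strftime('%s', ts) AS INTEGER) AS t,
    CAST(strftime('%s', LAG(ts)  OVER (PARTITION BY user_id ORDER BY ts)) AS INTEGER) AS prev_t,
    CAST(strftime('%s', LEAD(ts) OVER (PARTITION BY user_id ORDER BY ts)) AS INTEGER) AS next_t
  FROM hits
),
starts AS (
  -- a hit starts a session if it has no predecessor or follows a long gap
  SELECT user_id, ts AS session_start,
         RANK() OVER (PARTITION BY user_id ORDER BY ts) AS visit_id
  FROM flagged
  WHERE prev_t IS NULL OR t - prev_t > :timeout_s
),
ends AS (
  -- a hit ends a session if it has no successor or precedes a long gap
  SELECT user_id, ts AS session_end,
         RANK() OVER (PARTITION BY user_id ORDER BY ts) AS visit_id
  FROM flagged
  WHERE next_t IS NULL OR next_t - t > :timeout_s
),
sessions AS (
  -- the n-th start and the n-th end of a user bound the same session
  SELECT s.user_id, s.visit_id, s.session_start, e.session_end
  FROM starts s
  JOIN ends e ON e.user_id = s.user_id AND e.visit_id = s.visit_id
)
SELECT h.user_id, h.ts, s.visit_id
FROM hits h
JOIN sessions s
  ON s.user_id = h.user_id
 AND h.ts BETWEEN s.session_start AND s.session_end
ORDER BY h.user_id, h.ts
"""


def sessionize_by_join(hits, timeout_s=1800):
    """Return (user_id, ts, visit_id) rows using the start/end/rank/join steps."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE hits (user_id TEXT, ts TEXT)")
        con.executemany("INSERT INTO hits VALUES (?, ?)", hits)
        return con.execute(JOIN_SQL, {"timeout_s": timeout_s}).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in sessionize_by_join(HITS):
        print(*row, sep="\t")
```

Every session has exactly one start and one end, so ranking both per user lets them pair up by rank. The join back to the hits relies on ISO-8601 text comparing in chronological order.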

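【Comments】:

【Solution 2】:

Here's one approach (the query uses BigQuery-style timestamp() and timestamp_diff(), which SQLite does not provide):

1. Use the lag() window function to find the difference between consecutive timestamps for each user_id. The first row in each window partition yields NULL, so default those to -1.
2. Flag every timestamp difference that is -1 or greater than 30 minutes with 1, and the rest with 0.
3. Use the sum() window function, ordered by user_id and time, over the column produced in step 2.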

with cte as (
  select 'A1' user_id, timestamp('2022-08-10 21:29:00') time union all
  select 'A1' user_id, timestamp('2022-08-10 21:39:00') time union all
  select 'A1' user_id, timestamp('2022-08-10 21:59:00') time union all
  select 'A1' user_id, timestamp('2022-08-10 23:19:00') time union all
  select 'A1' user_id, timestamp('2022-08-10 23:25:00') time union all
  select 'B2' user_id, timestamp('2022-08-09 12:01:00') time union all
  select 'B2' user_id, timestamp('2022-08-10 15:02:00') time union all
  select 'B2' user_id, timestamp('2022-08-10 15:03:00') time union all
  select 'B2' user_id, timestamp('2022-08-10 15:42:00') time
)
select
  *,
  sum(case when min_diff = -1 or min_diff > 30 then 1 else 0 end) over (order by user_id, time) as visit_id
from (
  select
    *,
    coalesce(timestamp_diff(time, lag(time) over (partition by user_id order by time), minute),-1) min_diff
  from cte
)

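Result (visit_id keeps counting across users because the final sum() window is not partitioned by user_id):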

user_id	time	min_diff	visit_id
A1	2022-08-10 21:29:00	-1	1
A1	2022-08-10 21:39:00	10	1
A1	2022-08-10 21:59:00	20	1
A1	2022-08-10 23:19:00	80	2
A1	2022-08-10 23:25:00	6	2
B2	2022-08-09 12:01:00	-1	3
B2	2022-08-10 15:02:00	1621	4
B2	2022-08-10 15:03:00	1	4
B2	2022-08-10 15:42:00	39	5
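
Since the question is tagged sqlite, here is a hedged adaptation of the same lag-then-running-sum idea for SQLite, run through Python's sqlite3 module (window functions need SQLite 3.25 or newer). It differs from the query above in two ways that are my assumptions, not the answerer's code: gaps are measured in seconds via strftime('%s', ...), and the running sum is partitioned by user_id so that numbering restarts at 1 for each user, as in the question's desired output. The table name `hits`, the column `ts`, and the function name `sessionize` are hypothetical.

```python
import sqlite3

# Sample hits from the question: (user_id, timestamp as ISO-8601 text).
HITS = [
    ("A1", "2022-08-10 21:29:00"),
    ("A1", "2022-08-10 21:39:00"),
    ("A1", "2022-08-10 21:59:00"),
    ("A1", "2022-08-10 23:19:00"),
    ("A1", "2022-08-10 23:25:00"),
    ("B2", "2022-08-09 12:01:00"),
    ("B2", "2022-08-10 15:02:00"),
    ("B2", "2022-08-10 15:03:00"),
    ("B2", "2022-08-10 15:42:00"),
]

# Hypothetical schema: hits(user_id TEXT, ts TEXT).
SESSIONIZE_SQL = """
WITH gaps AS (
  SELECT
    user_id,
    ts,
    -- seconds since the same user's previous hit; NULL for the first hit
    CAST(strftime('%s', ts) AS INTEGER)
      - CAST(strftime('%s', LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)) AS INTEGER)
      AS gap_s
  FROM hits
)
SELECT
  user_id,
  ts,
  -- running count of session starts, restarted for every user
  SUM(CASE WHEN gap_s IS NULL OR gap_s > :timeout_s THEN 1 ELSE 0 END)
    OVER (PARTITION BY user_id ORDER BY ts) AS visit_id
FROM gaps
ORDER BY user_id, ts
"""


def sessionize(hits, timeout_s=1800):
    """Return (user_id, ts, visit_id) rows, numbering visits per user."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE hits (user_id TEXT, ts TEXT)")
        con.executemany("INSERT INTO hits VALUES (?, ?)", hits)
        return con.execute(SESSIONIZE_SQL, {"timeout_s": timeout_s}).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in sessionize(HITS):
        print(*row, sep="\t")
```

With the default 1800-second timeout this yields visit_id values 1, 1, 1, 2, 2 for A1 and 1, 2, 2, 3 for B2. Testing for NULL directly avoids the -1 sentinel, and dropping the PARTITION BY from the outer window would reproduce the cross-user numbering shown in the table above.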

【Comments】: