【Question title】: Finding the most visited place at a particular time in SQL
【Posted】: 2022-01-11 04:11:36
【Question】:

I have a Users table that holds the user_id, the place where the user bought a ticket, and the time of the purchase.

Users:

|------------|-------------|----------------------|
|  user_id   |  place      | purchase_time        |
|------------|-------------|----------------------|
|     1      |  New York   | 2021-11-27:17:00:21  |
|     1      |  Chicago    | 2021-11-25:19:00:21  |
|     1      |  Chicago    | 2021-11-23:03:00:21  |
|     1      |  Washington | 2021-11-21:07:00:21  |
|     1      |  Washington | 2021-11-19:12:00:21  |
|     1      |  Washington | 2021-11-17:00:00:21  |
|     1      |  Washington | 2021-11-15:23:00:21  |
|     1      |  Washington | 2021-11-12:21:00:21  |
|     2      |  Chicago    | 2021-09-25:01:00:21  |
|     2      |  Milwaukee  | 2021-09-24:02:00:21  |
|     2      |  Milwaukee  | 2021-09-23:03:00:21  |
|     2      |  New York   | 2021-09-22:19:00:21  |
|     2      |  Chicago    | 2021-09-21:01:00:21  |
|     3      |  Milwaukee  | 2021-10-27:12:31:21  |
|     3      |  Washington | 2021-10-24:07:01:23  |
|     3      |  Chicago    | 2021-10-21:01:78:89  |
|------------|-------------|----------------------|

I want to add a new column showing the place the user had visited most as of each purchase time. Desired table (Snowflake):

|------------|-------------|----------------------|---------------------|
|  user_id   |  place      | purchase_time        | most_visited_place  |
|------------|-------------|----------------------|---------------------|
|     1      |  New York   | 2021-11-27:17:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Chicago    | 2021-11-25:19:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Chicago    | 2021-11-23:03:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Washington | 2021-11-21:07:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Washington | 2021-11-19:12:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Washington | 2021-11-17:00:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Washington | 2021-11-15:23:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     1      |  Washington | 2021-11-12:21:00:21  |    Washington       | <--- Washington, because at purchase_time This place was most visited by the user
|     2      |  Chicago    | 2021-09-21:01:00:25  |    Chicago          | <-- tie, break. Both Chicago and Milwaukee were most visited then take the recent most visited
|     2      |  Milwaukee  | 2021-09-21:02:00:24  |    Milwaukee        | <--- Milwaukee, because at purchase_time This place was most visited by the user
|     2      |  Milwaukee  | 2021-09-21:03:00:23  |    Milwaukee        | <--- Milwaukee, because at purchase_time This place was most visited by the user
|     2      |  New York   | 2021-09-21:19:00:22  |    New York         | <-- tie, break. Both Chicago and New York were most visited then take the recent most visited
|     2      |  Chicago    | 2021-09-21:01:00:21  |    Chicago          | <--- Chicago, because at purchase_time This place was most visited by the user
|     3      |  Milwaukee  | 2021-10-27:12:31:21  |    Milwaukee        |
|     3      |  Washington | 2021-10-24:07:01:23  |    Washington       |
|     3      |  Chicago    | 2021-10-21:01:78:89  |    Chicago          |
|------------|-------------|----------------------|---------------------|

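【Comments】:

  • You seem to have added two identical tables to your question. Please update it to show the result you want to achieve.
  • @NickW To the first table (Users), I want to add a new field, most_visited_place, based on the place the user has visited most.
  • You could try the mode window function, which also works on varchar. I have not tested it, though, because I do not have access to a Snowflake platform.

Tags: sql snowflake-cloud-data-platform window-functions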


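【Solution 1】:

You want to use the window version of COUNT to get a running "count of prior rows" for each place, then join every row to all of the prior counted rows, and use QUALIFY to keep only the "best" one (highest count, most recent on ties).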

WITH prior_user AS (
    SELECT 
        user_id,
        place,
        purchase_time,
        COUNT(place) OVER (PARTITION BY user_id, place ORDER BY purchase_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS place_count
    FROM users
)
SELECT 
    u.user_id,
    u.place,
    u.purchase_time,
    p.place AS most_visited_place
FROM users u
JOIN prior_user p
    ON u.user_id = p.user_id AND u.purchase_time >= p.purchase_time
QUALIFY ROW_NUMBER() OVER (PARTITION BY u.user_id, u.purchase_time ORDER BY p.place_count DESC, p.purchase_time DESC) = 1

*This SQL has not been run.
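
Since the query above is untested, here is a minimal sketch for checking its logic locally. It is not the answer's Snowflake query: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions), and because SQLite has no QUALIFY, the ROW_NUMBER filter is moved into an outer WHERE. The helper name most_visited and the CTE name ranked are made up for this illustration. The rows are users 1 and 2 from the question's first table, with purchase_time stored as text.

```python
import sqlite3

# Rows for users 1 and 2, copied from the question's Users table.
ROWS = [
    (1, "New York", "2021-11-27:17:00:21"),
    (1, "Chicago", "2021-11-25:19:00:21"),
    (1, "Chicago", "2021-11-23:03:00:21"),
    (1, "Washington", "2021-11-21:07:00:21"),
    (1, "Washington", "2021-11-19:12:00:21"),
    (1, "Washington", "2021-11-17:00:00:21"),
    (1, "Washington", "2021-11-15:23:00:21"),
    (1, "Washington", "2021-11-12:21:00:21"),
    (2, "Chicago", "2021-09-25:01:00:21"),
    (2, "Milwaukee", "2021-09-24:02:00:21"),
    (2, "Milwaukee", "2021-09-23:03:00:21"),
    (2, "New York", "2021-09-22:19:00:21"),
    (2, "Chicago", "2021-09-21:01:00:21"),
]

# Same shape as the answer: running count per (user, place), self-join to
# all earlier-or-equal rows, keep the best-ranked row. The QUALIFY clause
# is replaced by an outer filter on rn (assumption: SQLite port).
QUERY = """
WITH prior_user AS (
    SELECT user_id, place, purchase_time,
           COUNT(place) OVER (
               PARTITION BY user_id, place
               ORDER BY purchase_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS place_count
    FROM users
),
ranked AS (
    SELECT u.user_id, u.place, u.purchase_time,
           p.place AS most_visited_place,
           ROW_NUMBER() OVER (
               PARTITION BY u.user_id, u.purchase_time
               ORDER BY p.place_count DESC, p.purchase_time DESC
           ) AS rn
    FROM users u
    JOIN prior_user p
      ON u.user_id = p.user_id AND u.purchase_time >= p.purchase_time
)
SELECT user_id, place, purchase_time, most_visited_place
FROM ranked
WHERE rn = 1
ORDER BY user_id, purchase_time DESC
"""


def most_visited(rows):
    """Load rows into an in-memory table and run the ported query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (user_id INTEGER, place TEXT, purchase_time TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_visited(ROWS):
        print(row)
```

On this input, every row for user 1 should come back as Washington, and user 2 should show the tie-break picking the more recently visited place.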

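【Comments】:

  • What is the `o` on line 6 supposed to be?
  • Haha, a paste typo; will fix.
  • COUNT(place) could also just be ROW_NUMBER(), since place is in the PARTITION BY.
  • Thanks for the great approach to this business logic. Would it be possible to add another field, previous_most_visited_place, to the table above?
  • As in the "prior" best? You could keep row numbers 1 and 2, but you would then need another layer to lag the second onto the first, and then filter out the second.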
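
The COUNT versus ROW_NUMBER remark above can be checked with a small sketch. It assumes Python's sqlite3 module (SQLite 3.25 or later) rather than Snowflake, and the function name running_counts and the column aliases are made up for the illustration. The two expressions agree here because place is never NULL; COUNT(place) would skip NULL values, while ROW_NUMBER() would not.

```python
import sqlite3

# User 2's rows from the question's Users table.
ROWS = [
    (2, "Chicago", "2021-09-25:01:00:21"),
    (2, "Milwaukee", "2021-09-24:02:00:21"),
    (2, "Milwaukee", "2021-09-23:03:00:21"),
    (2, "New York", "2021-09-22:19:00:21"),
    (2, "Chicago", "2021-09-21:01:00:21"),
]

# Compute the running count both ways, side by side.
QUERY = """
SELECT place, purchase_time,
       COUNT(place) OVER (
           PARTITION BY user_id, place
           ORDER BY purchase_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS via_count,
       ROW_NUMBER() OVER (
           PARTITION BY user_id, place
           ORDER BY purchase_time
       ) AS via_row_number
FROM users
ORDER BY purchase_time
"""


def running_counts(rows):
    """Return (place, purchase_time, via_count, via_row_number) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (user_id INTEGER, place TEXT, purchase_time TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for place, ts, via_count, via_row_number in running_counts(ROWS):
        print(place, ts, via_count, via_row_number)
```
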

【Solution 2】:

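You can do this in Snowflake with a lateral join. The use of distinct is a bit ugly, but I think you can use it in place of qualify, and you might even get a better plan. I would be curious to know whether this is equivalent to the other answer from an execution standpoint.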

select *
from Users u, lateral (
    select distinct first_value(place) over (
        order by count(*) desc, max(u2.purchase_time) desc) as most_visited_place
    from Users u2
    where u2.user_id = u.user_id and u2.purchase_time <= u.purchase_time
    group by place
    --qualify row_number() over (order by u2.user_id) = 1 
) as mr
order by user_id, purchase_time desc

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=02784df13affab8027f7b052ad942d70
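
For a quick local check of the same idea, here is a sketch that swaps the lateral join for a correlated scalar subquery, because SQLite (used here through Python's sqlite3 module) has no LATERAL. This is a substitute technique under that assumption, not the Snowflake query above, and it says nothing about Snowflake's execution plan. The function name most_visited_correlated is made up for the illustration.

```python
import sqlite3

# User 2's rows from the question's Users table.
ROWS = [
    (2, "Chicago", "2021-09-25:01:00:21"),
    (2, "Milwaukee", "2021-09-24:02:00:21"),
    (2, "Milwaukee", "2021-09-23:03:00:21"),
    (2, "New York", "2021-09-22:19:00:21"),
    (2, "Chicago", "2021-09-21:01:00:21"),
]

# For each row, group that user's earlier-or-equal purchases by place and
# take the place with the highest count; ties go to the most recent visit.
QUERY = """
SELECT u.user_id, u.place, u.purchase_time,
       (SELECT u2.place
        FROM users u2
        WHERE u2.user_id = u.user_id
          AND u2.purchase_time <= u.purchase_time
        GROUP BY u2.place
        ORDER BY COUNT(*) DESC, MAX(u2.purchase_time) DESC
        LIMIT 1) AS most_visited_place
FROM users u
ORDER BY u.user_id, u.purchase_time DESC
"""


def most_visited_correlated(rows):
    """Load rows into an in-memory table and run the correlated query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (user_id INTEGER, place TEXT, purchase_time TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_visited_correlated(ROWS):
        print(row)
```
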

【Comments】:

  • When I tried the query above in Snowflake, I got the error: syntax error line 2 at position 34 unexpected 'lateral'
  • @R0bert It looks like Snowflake wants a comma before lateral.