[Question title]: Finding shortest geo-spatial distance from one point to all other points in SQL
[Posted]: 2022-01-10 18:29:57
[Question description]:

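There are two kinds of users: those who buy movie tickets in a town (Town_A, Town_B, or Town_C) and those who buy them online.

I have the following tables:

Location: this table contains the locations of the movie centers.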

|--------------|------------|------------|
|    Towns     |  latitude  | longitude  |
|--------------|------------|------------|
|  Town_A      |  72.92629  | -12.89272  |
|  Town_B      |  93.62789  | -83.10172  |
|  Town_C      |  68.92612  | -67.17242  |
|--------------|------------|------------|

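User: this table contains each user's purchase history, i.e. whether a purchase was made online or in a town. It also includes the user's latitude/longitude at the time of purchase.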

|------------|------------|------------|--------------|
|   user_id  |  latitude  | longitude  |    Towns     |
|------------|------------|------------|--------------|
|    1       |  21.89027  | -53.03772  |   Town_A     |
|    1       |  23.87847  | -41.78172  |   Town_C     |
|    1       |  39.62847  | -80.19892  |   online     |
|    1       |  77.87092  | -96.39242  |   Town_A     |
|    2       |  71.87782  | -38.03782  |   online     |
|    2       |  83.37847  | -62.78278  |   Town_B     |
|    3       |  89.81924  | -80.73892  |   Town_B     |
|    3       |  27.87282  | -18.39183  |   Town_A     |
|------------|------------|------------|--------------|

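I want to find the nearest town based on the user's latitude/longitude at the time of purchase. The final table should look like this: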

|------------|------------|------------|--------------|-----------------|
|   user_id  |  latitude  | longitude  |    Towns     | nearest_town    |
|------------|------------|------------|--------------|-----------------|
|    1       |  21.89027  | -53.03772  |   Town_A     |   Town_B        | <--- Town_B is near based on his lat/long (Irrespective of his purchase town)
|    1       |  23.87847  | -41.78172  |   Town_C     |   Town_A        | <--- Town_A is near based on his lat/long
|    1       |  39.62847  | -80.19892  |   online     |   Town_Online   |
|    1       |  77.87092  | -96.39242  |   Town_A     |   Town_A        |
|    2       |  71.87782  | -38.03782  |   online     |   Town_Online   |
|    2       |  83.37847  | -62.78278  |   Town_B     |   Town_C        |
|    3       |  89.81924  | -80.73892  |   Town_B     |   Town_A        |
|    3       |  27.87282  | -18.39183  |   Town_A     |   Town_A        |
|------------|------------|------------|--------------|-----------------|

My attempted SQL query (Snowflake):

With specific_location as
(
  select user_id,
         latitude,
     longitude,
     case when Towns in ('Town_A','Town_B','Town_C') then 'Town' else 'Town_Online' end as purchase_in
  from Locations
)
 select *, 
       case when purchase_in = 'Town' then
            (select Towns from Location qualify row_number() over (order by haversine(user.latitude,user.longitude,location.latitude,location.longitude))=1)
            else purchase_in
       end as nearest_town
 from specific_location

I get an error: syntax error unexpected 'when' and unexpected 'else'

[Question comments]:

    Tags: sql snowflake-cloud-data-platform window-functions


    [Solution 1]:

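    Your CTE specific_location is missing a JOIN to USERS, because Locations by itself has no user_id column.

    I would also create an enriched user CTE that adds a sequence number, so that the later location matching is clearly done per user row, and then do the user/location join in a second CTE, so that the final SELECT works with pre-computed values:

    I also swapped your two-valued CASE statement for an IFF.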

    WITH enriched_user AS (
        SELECT
            u.user_id,
            u.latitude,
            u.longitude,
            u.towns,
            seq4() AS seq,  -- per-row key, since purchase rows carry no ID of their own
            IFF(u.towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in
        FROM user AS u
    ), user_and_closest_location AS (
        SELECT
            u.user_id,
            u.latitude,
            u.longitude,
            u.towns,
            u.purchase_in,
            l.towns AS closest_town,
            haversine(u.latitude, u.longitude, l.latitude, l.longitude) AS distance_km
        FROM enriched_user AS u,
            location AS l  -- cross join: every purchase row against every town
        -- keep only the closest town for each purchase row
        QUALIFY row_number() OVER (PARTITION BY u.seq ORDER BY haversine(u.latitude, u.longitude, l.latitude, l.longitude)) = 1
    )
    SELECT
        u.user_id,
        u.latitude,
        u.longitude,
        u.towns,
        IFF(u.purchase_in = 'Town', u.closest_town, u.purchase_in) AS nearest_town
    FROM user_and_closest_location AS u
    ORDER BY 1,2,3;
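To make it easier to see what the cross join plus `QUALIFY row_number() ... = 1` is computing, the same logic can be sketched outside Snowflake. The Python below is only an illustration under stated assumptions: `haversine_km`, `closest_town`, `nearest_town`, and the sample coordinates are hypothetical and not part of the original answer, and the 6371 km Earth radius is an assumed mean value, so distances may differ slightly from Snowflake's HAVERSINE.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_town(row, towns):
    """Equivalent of ranking all towns by distance for one purchase row and keeping rank 1."""
    best = min(
        towns,
        key=lambda t: haversine_km(row["latitude"], row["longitude"], t["latitude"], t["longitude"]),
    )
    return best["towns"]


def nearest_town(row, towns):
    """Compute the closest town for every row, then apply the IFF on the purchase type."""
    closest = closest_town(row, towns)
    return closest if row["towns"] != "online" else "Town_Online"


# Hypothetical sample data (deliberately not the tables from the question).
towns = [
    {"towns": "Town_A", "latitude": 0.0, "longitude": 0.0},
    {"towns": "Town_B", "latitude": 0.0, "longitude": 10.0},
    {"towns": "Town_C", "latitude": 10.0, "longitude": 0.0},
]
purchases = [
    {"user_id": 1, "latitude": 0.0, "longitude": 9.0, "towns": "Town_A"},
    {"user_id": 1, "latitude": 9.0, "longitude": 0.0, "towns": "online"},
    {"user_id": 2, "latitude": 1.0, "longitude": 1.0, "towns": "Town_B"},
]
results = [nearest_town(p, towns) for p in purchases]
```

As in the query, the distance ranking runs for every purchase row, including online ones, and the purchase type is only consulted at the end.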
    

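    The rationale for running the distance-based join over all rows is that it will be faster. If there are rows you do not want to process, it is better to prune the input at that point, but you then need to re-join to the input to pick up the rows that were skipped.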

    WITH enriched_user AS (
        SELECT
            u.user_id,
            u.latitude,
            u.longitude,
            u.towns,
            seq4() AS seq,  -- per-row key, since purchase rows carry no ID of their own
            IFF(u.towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in
        FROM user AS u
    ), user_and_closest_location AS (
        SELECT
            u.seq,  -- carried through so the final SELECT can re-join on it
            u.user_id,
            u.latitude,
            u.longitude,
            u.towns,
            u.purchase_in,
            l.towns AS closest_town,
            haversine(u.latitude, u.longitude, l.latitude, l.longitude) AS distance_km
        FROM enriched_user AS u,
            location AS l
        WHERE u.purchase_in = 'Town'  -- prune online purchases before the distance join
        QUALIFY row_number() OVER (PARTITION BY u.seq ORDER BY haversine(u.latitude, u.longitude, l.latitude, l.longitude)) = 1
    )
    SELECT
        u.user_id,
        u.latitude,
        u.longitude,
        u.towns,
        IFF(u.purchase_in = 'Town', ucl.closest_town, u.purchase_in) AS nearest_town
    FROM enriched_user AS u
    -- LEFT JOIN so the pruned (online) rows are kept
    LEFT JOIN user_and_closest_location AS ucl
        ON u.seq = ucl.seq
    ORDER BY 1,2,3;
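The prune-then-re-join variant can be sketched the same way. This is again a hypothetical Python illustration rather than anything from the original answer: `nearest_towns_pruned` and the sample data are made up, `enumerate` stands in for `seq4()`, and the 6371 km radius is an assumed value.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (assumed mean Earth radius of 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_towns_pruned(purchases, towns):
    # enriched_user: give every purchase row a key (stand-in for seq4()).
    enriched = list(enumerate(purchases))
    # user_and_closest_location: only in-town purchases reach the distance step.
    closest = {
        seq: min(
            towns,
            key=lambda t: haversine_km(p["latitude"], p["longitude"], t["latitude"], t["longitude"]),
        )["towns"]
        for seq, p in enriched
        if p["towns"] != "online"
    }
    # LEFT JOIN back on the key: pruned (online) rows have no match and fall back.
    return [closest.get(seq, "Town_Online") for seq, _ in enriched]


# Hypothetical sample data (deliberately not the tables from the question).
towns = [
    {"towns": "Town_A", "latitude": 0.0, "longitude": 0.0},
    {"towns": "Town_B", "latitude": 0.0, "longitude": 10.0},
]
purchases = [
    {"user_id": 1, "latitude": 0.0, "longitude": 8.0, "towns": "Town_A"},
    {"user_id": 2, "latitude": 0.0, "longitude": 1.0, "towns": "online"},
]
pruned_results = nearest_towns_pruned(purchases, towns)
```

Here the online row never enters the distance calculation, and the lookup by key plays the role of the LEFT JOIN that brings it back into the output.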
    

    The IN check on towns can also be flipped into a test for not being 'online':

    IFF(towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in
    

    becomes:

    IFF(towns != 'online', 'Town', 'Town_Online')
    

    At that point the actual test can be moved to wherever it is used later.

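    [Comments]:

    • Thanks Simeon. I am curious why and how you came up with using seq4()
    • @R0bert I was thinking about how you wanted to get to the "nearest" location per user, but noticed that each user appears to have multiple purchases with no per-purchase ID, so something was needed to GROUP BY / PARTITION BY in the row_number. Given that you had a pre-processing step that works out the purchase type, I just poked a seq4() in there. But if the transactions have an ID, I would use that (assuming your data is not as simple as the example given).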