【Question Title】: For rows with the same A, B, C values, how to select only the "earliest" row (specified by column D) within a given time range?
【Posted】: 2021-05-14 13:20:36
【Question】:

I currently have a query (simplified below):

SELECT name, node, points, <bunch of other fields>,
  DATE(snapshot_date) AS snap_date,
  HOUR(snapshot_date) AS snap_hour,
  CASE
    WHEN MINUTE(snapshot_date) IN (5, 6) THEN 0
    WHEN MINUTE(snapshot_date) IN (35, 36) THEN 30
  END AS snap_mins
FROM some_table
WHERE
  <bunch of conditions here>

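Basically, the data source we pull from writes its timestamp when its job finishes running, not when the data actually comes in (and there is nothing we can do about that). Their job usually finishes 5-6 minutes after the data comes in, so we infer the actual minute value of the data by setting snap_mins ourselves. (For reasons I won't go into, we can only work with half-hour values, and we need one row per half hour for a given name/id/node.)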

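However, I know this is really fragile. Instead of looking for specific minute values, I'd like to take the first row whose snapshot_date falls 4-10 minutes after each half hour (XX:00 or XX:30). For example, if the table looked like this: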

| name  | node | points | ... | snapshot_date | 
| Jane  | 1    | 1      | ... | 1/1/21 22:02  |
| Jane  | 1    | 2      | ... | 1/1/21 22:05  | // take this value
| Jane  | 1    | 3      | ... | 1/1/21 22:09  |
| Jane  | 1    | 4      | ... | 1/1/21 22:38  | // take this value
| Jane  | 1    | 5      | ... | 1/1/21 22:41  |

| Jane  | 1    | 1      | ... | 1/1/21 23:05  | // take this value
| Jane  | 1    | 2      | ... | 1/1/21 23:06  |
| Jane  | 1    | 3      | ... | 1/1/21 23:35  | // take this value
| Jane  | 1    | 4      | ... | 1/1/21 23:38  |

| Jane  | 3    | 1      | ... | 1/1/21 23:02  |
| Jane  | 3    | 2      | ... | 1/1/21 23:07  | // take this value (current query wouldn't pick this up because it's only looking for snapshots where the minute value is 5 or 6) 
| Jane  | 3    | 3      | ... | 1/1/21 23:10  |
| Jane  | 3    | 4      | ... | 1/1/21 23:35  | // take this value
| Jane  | 3    | 5      | ... | 1/1/21 23:38  |

So the query result I'd like to get is:

| name  | node | points | ... | snap_date | snap_hour | snap_mins
| Jane  | 1    | 2      | ... | 1/1/21    | 22        | 0
| Jane  | 1    | 4      | ... | 1/1/21    | 22        | 30
| Jane  | 1    | 1      | ... | 1/1/21    | 23        | 0
| Jane  | 1    | 3      | ... | 1/1/21    | 23        | 30
| Jane  | 3    | 2      | ... | 1/1/21    | 23        | 0
| Jane  | 3    | 4      | ... | 1/1/21    | 23        | 30

However, I've been struggling to work out whether this is even possible. Any help would be greatly appreciated!

【Question Discussion】:

    Tags: sql presto trino


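    【Solution 1】:

    One option:

    1. Add columns that define the relevant 30-minute interval
    2. Order the rows within each 30-minute group
    3. Keep only the row with sequence number 1 from each group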
    WITH x AS (
      -- Step 1: tag each row with the half-hour bucket it falls in
      SELECT name, node, points, snapshot_date, <bunch of other fields>,
        DATE(snapshot_date) AS snap_date,
        HOUR(snapshot_date) AS snap_hour,
        CASE
          WHEN MINUTE(snapshot_date) < 30 THEN 0
          ELSE 30
        END AS snap_mins
      FROM some_table
      WHERE
        <bunch of conditions here>
    ),
    y AS (
      -- Step 2: number the rows in each bucket, earliest first
      SELECT x.*,
        ROW_NUMBER() OVER (
          PARTITION BY name, node, snap_date, snap_hour, snap_mins
          ORDER BY snapshot_date
        ) AS seq
      FROM x
    )
    -- Step 3: keep only the earliest row per bucket
    SELECT * FROM y WHERE seq = 1
    

    If you really do need to ignore rows in minutes 0-3 and 30-33, add a WHERE condition to the first CTE.
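
    A minimal sketch of such a WHERE condition, assuming Trino/Presto syntax. It uses the 4-10 minute window described in the question, which is stricter than only dropping minutes 0-3 and 30-33. The inline VALUES rows are an illustrative stand-in for some_table, copied from the sample data in the question, and the final ORDER BY is only there to make the output easy to compare.

    ```sql
    WITH some_table AS (
      -- Stand-in for the real table: sample rows from the question
      SELECT *
      FROM (
        VALUES
          ('Jane', 1, 1, TIMESTAMP '2021-01-01 22:02:00'),
          ('Jane', 1, 2, TIMESTAMP '2021-01-01 22:05:00'),
          ('Jane', 1, 3, TIMESTAMP '2021-01-01 22:09:00'),
          ('Jane', 1, 4, TIMESTAMP '2021-01-01 22:38:00'),
          ('Jane', 1, 5, TIMESTAMP '2021-01-01 22:41:00'),
          ('Jane', 1, 1, TIMESTAMP '2021-01-01 23:05:00'),
          ('Jane', 1, 2, TIMESTAMP '2021-01-01 23:06:00'),
          ('Jane', 1, 3, TIMESTAMP '2021-01-01 23:35:00'),
          ('Jane', 1, 4, TIMESTAMP '2021-01-01 23:38:00'),
          ('Jane', 3, 1, TIMESTAMP '2021-01-01 23:02:00'),
          ('Jane', 3, 2, TIMESTAMP '2021-01-01 23:07:00'),
          ('Jane', 3, 3, TIMESTAMP '2021-01-01 23:10:00'),
          ('Jane', 3, 4, TIMESTAMP '2021-01-01 23:35:00'),
          ('Jane', 3, 5, TIMESTAMP '2021-01-01 23:38:00')
      ) AS t (name, node, points, snapshot_date)
    ),
    x AS (
      SELECT name, node, points, snapshot_date,
        DATE(snapshot_date) AS snap_date,
        HOUR(snapshot_date) AS snap_hour,
        CASE WHEN MINUTE(snapshot_date) < 30 THEN 0 ELSE 30 END AS snap_mins
      FROM some_table
      -- Minutes past the half hour: keep only offsets 4 through 10
      WHERE MINUTE(snapshot_date) % 30 BETWEEN 4 AND 10
    ),
    y AS (
      SELECT x.*,
        ROW_NUMBER() OVER (
          PARTITION BY name, node, snap_date, snap_hour, snap_mins
          ORDER BY snapshot_date
        ) AS seq
      FROM x
    )
    SELECT name, node, points, snap_date, snap_hour, snap_mins
    FROM y
    WHERE seq = 1
    ORDER BY node, snap_hour, snap_mins
    ```

    Against the sample rows this should return the six rows listed as the desired result in the question. Because the filter also drops minutes 11-29 and 41-59, a half hour with no snapshot inside the 4-10 window produces no row at all.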

    【Comments】:
