【Question title】:How to obtain the most recent row per type and perform calculations, depending on the row type?
【Posted】:2017-01-16 14:31:01
【Question】:

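I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it is best illustrated with an example.

Given the following dataset: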

+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id    | event_type        | event_timestamp     | message_id  | sent_at             | status | rate     |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1     | create            | 2016-11-25 09:17:48 | 1           | 2016-11-25 09:17:48 | 0      | 0.500000 |
| 2     | status_update     | 2016-11-25 09:24:38 | 1           | 2016-11-25 09:28:49 | 1      | 0.500000 |
| 3     | create            | 2016-11-25 09:47:48 | 2           | 2016-11-25 09:47:48 | 0      | 0.500000 |
| 4     | status_update     | 2016-11-25 09:54:38 | 2           | 2016-11-25 09:48:49 | 1      | 0.500000 |
| 5     | rate_update       | 2016-11-25 09:55:07 | 2           | 2016-11-25 09:50:07 | 0      | 1.000000 |
| 6     | create            | 2016-11-26 09:17:48 | 3           | 2016-11-26 09:17:48 | 0      | 0.500000 |
| 7     | create            | 2016-11-27 09:17:48 | 4           | 2016-11-27 09:17:48 | 0      | 0.500000 |
| 8     | rate_update       | 2016-11-27 09:55:07 | 4           | 2016-11-27 09:50:07 | 0      | 2.000000 |
| 9     | rate_update       | 2016-11-27 09:55:07 | 2           | 2016-11-25 09:55:07 | 0      | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+

The expected result should be:

+------------+--------------------+--------------------+-----------------------+
| sent_at    | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total)       |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 |                  2 |                  2 |              2.500000 |
| 2016-11-26 |                  1 |                  0 |              0.500000 |
| 2016-11-27 |                  1 |                  0 |              2.000000 |
+------------+--------------------+--------------------+-----------------------+

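At the end of the post is the query used to obtain this result. I'm willing to bet there is a way to optimize it, since it uses subqueries with joins, and from what I have read about BigQuery, joins are best avoided. But first some background:

Essentially, the dataset represents an append-only table to which multiple events are written. The data size is in the hundreds of millions of rows and will grow to billions+. Since updates in BigQuery aren't practical and the data is being streamed to BQ, I need a way to retrieve the most recent event of each type, perform some calculations based on certain conditions, and return an accurate result. The query is generated dynamically from user input, so it can include more fields/calculations, but those are omitted here for simplicity.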

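  • There is only one create event per message, but there can be n events of any other type
  • For each group of events, only the latest one should be taken into account in the calculations.
    • status_update - updates the status
    • rate_update - updates the rate
    • create - self-explanatory
  • Any event that is not a create may not carry the rest of the original information, or that information may be inaccurate (apart from message_id and the field the event operates on). (The dataset is simplified, but imagine there are many more columns, and more events will be added later.)
    • For example, a rate_update may or may not have the status field set, or the value may not be final, so no calculation can be made on the status field from a rate_update event; the same goes for status_update
  • It can be assumed that the table is partitioned by date and that every query will make use of the partitions. Those conditions are omitted for now for simplicity.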
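
The rules above can be checked against the sample data with a small Python sketch. It only illustrates the intended semantics and is not a BigQuery query. The tuple layout and the `latest` and `daily_totals` helpers are hypothetical names, and the sketch assumes that the reporting day is the date part of the create event's `sent_at` and that no two events of one message share a timestamp.

```python
from collections import defaultdict

# (id, event_type, event_timestamp, message_id, sent_at, status, rate)
EVENTS = [
    (1, "create",        "2016-11-25 09:17:48", 1, "2016-11-25 09:17:48", 0, 0.5),
    (2, "status_update", "2016-11-25 09:24:38", 1, "2016-11-25 09:28:49", 1, 0.5),
    (3, "create",        "2016-11-25 09:47:48", 2, "2016-11-25 09:47:48", 0, 0.5),
    (4, "status_update", "2016-11-25 09:54:38", 2, "2016-11-25 09:48:49", 1, 0.5),
    (5, "rate_update",   "2016-11-25 09:55:07", 2, "2016-11-25 09:50:07", 0, 1.0),
    (6, "create",        "2016-11-26 09:17:48", 3, "2016-11-26 09:17:48", 0, 0.5),
    (7, "create",        "2016-11-27 09:17:48", 4, "2016-11-27 09:17:48", 0, 0.5),
    (8, "rate_update",   "2016-11-27 09:55:07", 4, "2016-11-27 09:50:07", 0, 2.0),
    (9, "rate_update",   "2016-11-27 09:55:07", 2, "2016-11-25 09:55:07", 0, 2.0),
]


def latest(rows, types):
    """Return the newest row among the given event types, or None."""
    picked = [r for r in rows if r[1] in types]
    # Timestamps are 'YYYY-MM-DD hh:mm:ss' strings, so string order is time order.
    return max(picked, key=lambda r: r[2]) if picked else None


def daily_totals(events):
    """Aggregate per day: (submitted, delivered, rate_total)."""
    by_message = defaultdict(list)
    for row in events:
        by_message[row[3]].append(row)

    totals = defaultdict(lambda: [0, 0, 0.0])
    for rows in by_message.values():
        create = latest(rows, {"create"})
        status = latest(rows, {"status_update"})
        rate = latest(rows, {"rate_update", "create"})
        day = create[4][:10]  # assumption: the day comes from the create event
        totals[day][0] += 1 if create[5] == 0 else 0
        totals[day][1] += 1 if status is not None and status[5] == 1 else 0
        totals[day][2] += rate[6]
    return {day: tuple(values) for day, values in sorted(totals.items())}


if __name__ == "__main__":
    for day, row in daily_totals(EVENTS).items():
        print(day, *row)
```

For the sample data this yields the three rows of the expected result table: status is read only from create and status_update events, and rate only from the newest rate_update or create event.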

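So I guess I have a few questions:

  • How can this query be optimized?
  • Would it be better to put the events other than create in their own tables, where the only available fields are those relevant to the event plus those needed for the join (message_id, event_timestamp)? Would that reduce the amount of data processed?
  • What is the best way to add more events in the future, each with its own conditions and calculations?

Really, any advice on how to query this dataset efficiently and in a maintainable way is more than welcome! Thank you! :)

The monster I came up with is below. The INNER JOINs are used to retrieve the latest version of each row, following this resource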

    select
    sent_at as sent_at,
    sum(submitted_msg) as submitted,
    sum(delivered_msg) as delivered,
    sum(sales_rate_total) as sales_rate_total
    FROM (

      #DELIVERED
        SELECT 
            d.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            sum(if(status=1,1,0)) as delivered_msg,
            0 as sales_rate_total
        FROM `events` d
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "status_update" 
                    group by 1
                    ) g on d.message_id = g.message_id and d.event_timestamp = g.ts
        GROUP BY 1,2

        UNION ALL

      #SALES RATE
        SELECT 
            s.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            0 as delivered_msg,
            sum(rate) as sales_rate_total
        FROM `events` s
        INNER JOIN 
                    (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type in ("rate_update", "create")  
                    group by 1
                    ) f on s.message_id = f.message_id and s.event_timestamp = f.ts
        GROUP BY 1,2

        UNION ALL

      #SUBMITTED & REST
        SELECT 
            r.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            sum(if(status=0,1,0)) as submitted_msg,
            0 as delivered_msg,
            0 as sales_rate_total
        FROM `events` r
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "create" 
                    group by 1
                    ) e on r.message_id = e.message_id and r.event_timestamp = e.ts
        GROUP BY 1, 2

    ) k
    group by 1

【Comments】:

Tags: sql performance google-bigquery query-optimization query-performance


【Solution 1】:

> How can this query be optimized?

Try the version below

#standardSQL
WITH types AS (
  SELECT 
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
    FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
  FROM events
), latest AS (
  SELECT 
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,  
    ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,  
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
SELECT   
  sent_at,
  SUM(submitted) AS submitted,  
  SUM(delivered) AS delivered,  
  SUM(sales_rate) AS sales_rate_total        
FROM latest
GROUP BY 1

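It is compact, manageable, free of redundancy, and uses no joins at all.
If your table is partitioned, you can take advantage of that by adjusting the query in just one place (the FROM events clause in the first CTE).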
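
The ordering trick in the window functions can be easier to follow outside SQL. The sketch below is a plain Python emulation, not part of the answer: `first_value` is a hypothetical helper that mimics `FIRST_VALUE(column) OVER (PARTITION BY message_id ORDER BY <condition> DESC, event_timestamp DESC)` for the rows of a single message. Sorting on the boolean first puts matching event types ahead of all others, and the timestamp then picks the newest among them.

```python
def first_value(rows, column, matches):
    """Emulate FIRST_VALUE(column) ordered by (matches DESC, event_timestamp DESC)."""
    ordered = sorted(
        rows,
        key=lambda r: (matches(r), r["event_timestamp"]),
        reverse=True,  # True sorts before False, newer before older
    )
    return ordered[0][column]


# Events of message_id 2 from the sample data.
MESSAGE_2 = [
    {"event_type": "create", "event_timestamp": "2016-11-25 09:47:48", "status": 0, "rate": 0.5},
    {"event_type": "status_update", "event_timestamp": "2016-11-25 09:54:38", "status": 1, "rate": 0.5},
    {"event_type": "rate_update", "event_timestamp": "2016-11-25 09:55:07", "status": 0, "rate": 1.0},
    {"event_type": "rate_update", "event_timestamp": "2016-11-27 09:55:07", "status": 0, "rate": 2.0},
]

# Events of message_id 3: only a create event exists.
MESSAGE_3 = [
    {"event_type": "create", "event_timestamp": "2016-11-26 09:17:48", "status": 0, "rate": 0.5},
]

submitted_status = first_value(MESSAGE_2, "status", lambda r: r["event_type"] == "create")
delivered_status = first_value(MESSAGE_2, "status", lambda r: r["event_type"] == "status_update")
sales_rate = first_value(MESSAGE_2, "rate", lambda r: r["event_type"] in ("rate_update", "create"))

# With no status_update present, every row compares equal on the boolean,
# so the value falls back to the newest row of any type.
fallback_status = first_value(MESSAGE_3, "status", lambda r: r["event_type"] == "status_update")
```

One consequence worth keeping in mind: when a message has no event of the requested type, the value comes from its newest event of any other type. In the sample data that fallback row always has status 0, so the delivered count is unaffected, but a dataset where rate_update rows carry an unreliable status could behave differently.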

If you want to check the above query on a small volume of data first, you can use the dummy data below

WITH events AS (
  SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
  SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
  SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate 
)

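【Comments】:

  • Cheers! This works really well and the improvement is great :)
【Solution 2】:

For every table that holds multiple events, where we need to pick the latest one, we keep a corresponding view.

View: user_profile_latest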

SELECT * FROM (
  SELECT
    RANK() OVER (PARTITION BY user_id ORDER BY bq.created DESC, bq.insert_id DESC) AS _rank,
    *
  FROM [user_profile_event]
) WHERE _rank = 1

We maintain a record, bq, with created and insert_id fields, which is used for deduplication.
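
What the view computes can be sketched in Python as follows. This is an illustration under assumptions, not the answerer's implementation: the sample rows and the `name` column are made up, the nested `bq.created` and `bq.insert_id` fields are flattened to plain `created` and `insert_id` keys, and `latest_per_user` is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for user_profile_event.
PROFILE_EVENTS = [
    {"user_id": "u1", "created": "2017-01-01 10:00:00", "insert_id": "a", "name": "old"},
    {"user_id": "u1", "created": "2017-01-02 10:00:00", "insert_id": "b", "name": "new"},
    {"user_id": "u2", "created": "2017-01-01 09:00:00", "insert_id": "c", "name": "only"},
]


def latest_per_user(events):
    """Keep, per user_id, the row with the greatest (created, insert_id) pair."""
    ordered = sorted(events, key=itemgetter("user_id"))  # groupby needs sorted input
    return [
        # max() returns a single row even on exact ties, whereas RANK() = 1
        # would keep every row tied for first place.
        max(rows, key=itemgetter("created", "insert_id"))
        for _, rows in groupby(ordered, key=itemgetter("user_id"))
    ]


if __name__ == "__main__":
    for row in latest_per_user(PROFILE_EVENTS):
        print(row["user_id"], row["name"])
```

The secondary sort key plays the same role as insert_id in the view: it decides between rows that share the same created value.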

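【Comments】:

  • I'm not sure I get this... It seems to me that it would simplify the query in terms of lines written, but not in terms of performance or data processed? Please correct me if I'm wrong... but in this case we can't take advantage of date partitioning. Assume the table's partitions = sent_at. If I want to query a specific date range, I can add _PARTITIONTIME to each query, and that would greatly reduce the amount of data processed?
  • As far as I know, you pay for the columns you use. Even though it looks as if all columns are read here, the ones you filter out later do have an effect, and you can also add WHERE conditions as needed
  • Yes, you pay for the selected columns, but also for the total amount of data processed. That amount can be reduced if you query only a subset of the data that sits in a partition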