【问题标题】:Group a sequence of rows by their first value in SQL按 SQL 中的第一个值对行序列进行分组
【发布时间】:2020-11-22 15:56:47
【问题描述】:

如何按 SQL 中每个序列的第一个值对序列数据集进行分组?

例如,我有以下数据集

id  name  key  metric
1   alice a    0   <- key = 'a', start of a sequence
2   alice b    1
3   alice b    1
-----------------
4   alice a    1   <- key = 'a', start of a sequence
5   alice b    0
6   alice b    0
7   alice b    0
-----------------
8   bob   a    1   <- key = 'a', start of a sequence
9   bob   b    1
-----------------
10  bob   a    0   <- key = 'a', start of a sequence

key = 'a' 的行开始一个新组。例如,我想对所有后续行的指标求和,直到达到另一个 key = 'a' 或另一个 name

数据集按id排序。

最终的结果应该是这样的:

id  name   metric
1   alice  2
4   alice  1
8   bob    2
10  bob    0

这是 JavaScript 中的等效操作,但我希望能够通过 SQL 查询获得相同的结果。

data.reduce((acc, a) => {
    if(a.key === 'a'){
      // key = 'a' starts a new group
      return [{id: a.id, name: a.name, metric: a.metric}].concat(acc)
    } else {
      // because the data is sorted, 
      // all the subsequent rows with key = 'b' belong to the latest group
      const [head, ...tail] = acc
      const head_updated = {...head, metric: head.metric + a.metric}
      return [head_updated, ...tail]
    }
  }, [])
  .reverse()

示例 SQL 数据集:

with dataset as (
  select 
    1       as id
  , 'alice' as name
  , 'a'     as key
  , 0       as metric
  union select
    2       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select
    3       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select 
    4       as id
  , 'alice' as name
  , 'a'     as key
  , 1       as metric
  union select
    5       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    6       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    7       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    8       as id
  , 'bob'   as name
  , 'a'     as key
  , 1       as metric
  union select
    9       as id
  , 'bob'   as name
  , 'b'     as key
  , 1       as metric
  union select
    10      as id
  , 'bob'   as name
  , 'a'     as key
  , 0       as metric
)

select * from dataset
order by name, id

【问题讨论】:

    标签: sql postgresql amazon-redshift window-functions


    【解决方案1】:

    您可以使用窗口函数sum()创建组然后聚合:

    select min(id) id, name, sum(metric) metric
    from (
      select *, sum((key = 'a')::int) over (partition by name order by id) grp 
      from dataset
    ) t
    group by name, grp
    order by id
    

    请参阅demo
    结果:

    > id | name  | metric
    > -: | :---- | -----:
    >  1 | alice |      2
    >  4 | alice |      1
    >  8 | bob   |      2
    > 10 | bob   |      0
    

    【讨论】:

      【解决方案2】:

      根据 OP 在 cmets 中写的内容,查询确实是这样的:

      SELECT MAX(t.head_id) AS id,
             t.head_name AS name,
             SUM(t.metric) AS metric
      FROM (
          SELECT SUM(CASE WHEN key = 'a' THEN 1 END) OVER (PARTITION BY name ORDER BY id) AS group_id,
                 CASE WHEN key = 'a' THEN id END AS head_id,
                 name AS head_name,
                 metric
          FROM dataset
      ) t
      GROUP BY t.head_name, t.group_id
      

      但是,如果可以按名称和id添加索引,则确实可以提高查询的性能。这是因为它在聚合之前不需要排序操作。

      使用一百万行的表进行测试,这是没有索引的解释分析的输出:

      HashAggregate  (cost=177154.34..177158.34 rows=400 width=25) (actual time=3374.878..3489.755 rows=400000 loops=1)
        Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
        ->  WindowAgg  (cost=132154.34..157154.34 rows=1000000 width=25) (actual time=1920.338..3000.218 rows=1000000 loops=1)
              ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=15) (actual time=1920.323..2232.936 rows=1000000 loops=1)
                    Sort Key: dataset.name, dataset.id
                    Sort Method: external merge  Disk: 28192kB
                    ->  Seq Scan on dataset  (cost=0.00..15406.00 rows=1000000 width=15) (actual time=0.020..172.746 rows=1000000 loops=1)
      
      Planning Time: 0.870 ms
      Execution Time: 3516.726 ms
      

      通过创建索引,查询计划变为如下:

      索引

      CREATE INDEX dataset__name_id__idx ON dataset(name, id);
      

      查询计划

      HashAggregate  (cost=90169.90..90173.90 rows=400 width=25) (actual time=1464.759..1567.778 rows=400000 loops=1)
        Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
        ->  WindowAgg  (cost=0.42..70169.90 rows=1000000 width=25) (actual time=0.033..1077.362 rows=1000000 loops=1)
              ->  Index Scan using dataset__name_id__idx on dataset  (cost=0.42..47669.90 rows=1000000 width=15) (actual time=0.022..225.445 rows=1000000 loops=1)
      
      Planning Time: 0.131 ms
      Execution Time: 1590.040 ms
      

      旧答案

      根据您的 javascript 代码,您不想在外部查询中按name 对窗口进行分区,也不想按name 分组。如果不这样做,您实际上会以一个更好的查询结束,该查询允许您仅使用主索引,假设 id 列已编入索引。

      SELECT t.head_id AS id,
             MAX(t.head_name) AS name,
             SUM(t.metric) AS metric
      FROM (
              SELECT MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id,
                     CASE WHEN key = 'a' THEN name END AS head_name,
                     metric
              FROM dataset
          ) t
      GROUP BY t.head_id
      

      这是一个有 100 万行的 dataset 的查询计划:

      HashAggregate  (cost=68889.43..68891.43 rows=200 width=44) (actual time=1277.469..1393.709 rows=400000 loops=1)
        Group Key: max(CASE WHEN (dataset.key = 'a'::text) THEN dataset.id ELSE NULL::integer END) OVER (?)
        ->  WindowAgg  (cost=0.42..51389.43 rows=1000000 width=44) (actual time=0.025..927.595 rows=1000000 loops=1)
              ->  Index Scan using dataset_pkey on dataset  (cost=0.42..31389.42 rows=1000000 width=15) (actual time=0.017..209.657 rows=1000000 loops=1)
      
      Planning Time: 0.127 ms
      Execution Time: 1411.975 ms
      

      【讨论】:

      • 在 JS sn-p 中,我假设数据已经按名称排序。我对这个假设发表了评论。如果数据没有排序,那么我必须使用地图。我明白你的意思,如果我有唯一和排序的 ID,这是有道理的。不幸的是,在我的真实数据集中,情况并非如此。
      • @homam 使用您评论的信息编辑了答案
      猜你喜欢
      • 2013-07-01
      • 2017-03-21
      • 1970-01-01
      • 1970-01-01
      • 2022-07-27
      • 2017-06-11
      • 1970-01-01
      • 1970-01-01
      • 2018-01-10
      相关资源
      最近更新 更多