【Question Title】: Group a sequence of rows by their first value in SQL
【Posted】: 2020-11-22 15:56:47
【Question】:

How can I group a sequential dataset by the first value of each sequence in SQL?

For example, I have the following dataset:

id  name  key  metric
1   alice a    0   <- key = 'a', start of a sequence
2   alice b    1
3   alice b    1
-----------------
4   alice a    1   <- key = 'a', start of a sequence
5   alice b    0
6   alice b    0
7   alice b    0
-----------------
8   bob   a    1   <- key = 'a', start of a sequence
9   bob   b    1
-----------------
10  bob   a    0   <- key = 'a', start of a sequence

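A row with key = 'a' starts a new group. For example, I want to sum the metric of that row and of all subsequent rows until I reach another row with key = 'a' or another name.

The dataset is sorted by id.

The final result should look like this: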

id  name   metric
1   alice  2
4   alice  1
8   bob    2
10  bob    0

Here is the equivalent operation in JavaScript, but I would like to get the same result with a SQL query.

data.reduce((acc, a) => {
    if(a.key === 'a'){
      // key = 'a' starts a new group
      return [{id: a.id, name: a.name, metric: a.metric}].concat(acc)
    } else {
      // because the data is sorted, 
      // all the subsequent rows with key = 'b' belong to the latest group
      const [head, ...tail] = acc
      const head_updated = {...head, metric: head.metric + a.metric}
      return [head_updated, ...tail]
    }
  }, [])
  .reverse()

Sample SQL dataset:

with dataset as (
  select 
    1       as id
  , 'alice' as name
  , 'a'     as key
  , 0       as metric
  union select
    2       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select
    3       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select 
    4       as id
  , 'alice' as name
  , 'a'     as key
  , 1       as metric
  union select
    5       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    6       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    7       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    8       as id
  , 'bob'   as name
  , 'a'     as key
  , 1       as metric
  union select
    9       as id
  , 'bob'   as name
  , 'b'     as key
  , 1       as metric
  union select
    10      as id
  , 'bob'   as name
  , 'a'     as key
  , 0       as metric
)

select * from dataset
order by name, id

【Question Comments】:

Tags: sql postgresql amazon-redshift window-functions

【Solution 1】:

You can use the window function sum() to create the groups and then aggregate:

select min(id) id, name, sum(metric) metric
from (
  select *, sum((key = 'a')::int) over (partition by name order by id) grp 
  from dataset
) t
group by name, grp
order by id

See the demo.
Result:

> id | name  | metric
> -: | :---- | -----:
>  1 | alice |      2
>  4 | alice |      1
>  8 | bob   |      2
> 10 | bob   |      0
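
To make the two steps of the query above concrete, here is a rough JavaScript sketch of the same idea, using the sample rows from the question. It is only an illustration of the technique, not what the database does internally. The helper names `assignGroups` and `sumByGroup` are hypothetical. The first function mimics the running `sum((key = 'a')::int) over (partition by name order by id)`, and the second mimics `group by name, grp` with `min(id)` and `sum(metric)`.

```javascript
// Sample rows from the question, already sorted by id.
const rows = [
  [1, 'alice', 'a', 0], [2, 'alice', 'b', 1], [3, 'alice', 'b', 1],
  [4, 'alice', 'a', 1], [5, 'alice', 'b', 0], [6, 'alice', 'b', 0],
  [7, 'alice', 'b', 0], [8, 'bob', 'a', 1], [9, 'bob', 'b', 1],
  [10, 'bob', 'a', 0],
].map(([id, name, key, metric]) => ({ id, name, key, metric }));

// Step 1 (hypothetical helper): running count of key === 'a' rows per name.
// Every row receives the count seen so far, so all rows of one sequence
// share the same grp value within their name.
function assignGroups(sortedRows) {
  const counters = new Map(); // name -> number of 'a' rows seen so far
  return sortedRows.map((row) => {
    const seen = (counters.get(row.name) ?? 0) + (row.key === 'a' ? 1 : 0);
    counters.set(row.name, seen);
    return { ...row, grp: seen };
  });
}

// Step 2 (hypothetical helper): aggregate per (name, grp),
// keeping the smallest id and the sum of metric.
function sumByGroup(rowsWithGroup) {
  const groups = new Map();
  for (const { id, name, metric, grp } of rowsWithGroup) {
    const groupKey = JSON.stringify([name, grp]);
    const current = groups.get(groupKey);
    if (current) {
      current.id = Math.min(current.id, id);
      current.metric += metric;
    } else {
      groups.set(groupKey, { id, name, metric });
    }
  }
  return [...groups.values()].sort((x, y) => x.id - y.id);
}

const result = sumByGroup(assignGroups(rows));
console.log(result);
```

Because the counter is kept per name, rows of different names never end up in the same group, which is the role `partition by name` plays in the SQL.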

【Comments】:

【Solution 2】:

Based on what the OP wrote in the comments, the query would indeed look like this:

SELECT MAX(t.head_id) AS id,
       t.head_name AS name,
       SUM(t.metric) AS metric
FROM (
    SELECT SUM(CASE WHEN key = 'a' THEN 1 END) OVER (PARTITION BY name ORDER BY id) AS group_id,
           CASE WHEN key = 'a' THEN id END AS head_id,
           name AS head_name,
           metric
    FROM dataset
) t
GROUP BY t.head_name, t.group_id

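However, if you can add an index on (name, id), it does improve the query's performance. With the index, the rows can be read already ordered by name and id, so no sort step is needed before the window aggregate.

Tested on a table with one million rows, this is the EXPLAIN ANALYZE output without the index: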

HashAggregate  (cost=177154.34..177158.34 rows=400 width=25) (actual time=3374.878..3489.755 rows=400000 loops=1)
  Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=132154.34..157154.34 rows=1000000 width=25) (actual time=1920.338..3000.218 rows=1000000 loops=1)
        ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=15) (actual time=1920.323..2232.936 rows=1000000 loops=1)
              Sort Key: dataset.name, dataset.id
              Sort Method: external merge  Disk: 28192kB
              ->  Seq Scan on dataset  (cost=0.00..15406.00 rows=1000000 width=15) (actual time=0.020..172.746 rows=1000000 loops=1)

Planning Time: 0.870 ms
Execution Time: 3516.726 ms

After creating the index, the query plan changes as follows:

Index:

CREATE INDEX dataset__name_id__idx ON dataset(name, id);

Query plan:

HashAggregate  (cost=90169.90..90173.90 rows=400 width=25) (actual time=1464.759..1567.778 rows=400000 loops=1)
  Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=0.42..70169.90 rows=1000000 width=25) (actual time=0.033..1077.362 rows=1000000 loops=1)
        ->  Index Scan using dataset__name_id__idx on dataset  (cost=0.42..47669.90 rows=1000000 width=15) (actual time=0.022..225.445 rows=1000000 loops=1)

Planning Time: 0.131 ms
Execution Time: 1590.040 ms

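Old answer

Judging by your JavaScript code, you do not need to partition the window by name, nor group by name in the outer query. Dropping both actually gives you a better query, one that can rely on the primary key index alone, assuming the id column is indexed.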

SELECT t.head_id AS id,
       MAX(t.head_name) AS name,
       SUM(t.metric) AS metric
FROM (
        SELECT MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id,
               CASE WHEN key = 'a' THEN name END AS head_name,
               metric
        FROM dataset
    ) t
GROUP BY t.head_id

Here is the query plan for a dataset with one million rows:

HashAggregate  (cost=68889.43..68891.43 rows=200 width=44) (actual time=1277.469..1393.709 rows=400000 loops=1)
  Group Key: max(CASE WHEN (dataset.key = 'a'::text) THEN dataset.id ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=0.42..51389.43 rows=1000000 width=44) (actual time=0.025..927.595 rows=1000000 loops=1)
        ->  Index Scan using dataset_pkey on dataset  (cost=0.42..31389.42 rows=1000000 width=15) (actual time=0.017..209.657 rows=1000000 loops=1)

Planning Time: 0.127 ms
Execution Time: 1411.975 ms
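
As a rough illustration of what the window expression in this older query computes, the JavaScript sketch below carries the id of the most recent key = 'a' row forward over the rows in id order, then sums metric per carried id. The function name `sumByHeadId` is hypothetical, and the data is the sample from the question. This is a sketch of the idea, not of the database's execution.

```javascript
// Sample rows from the question.
const rows = [
  [1, 'alice', 'a', 0], [2, 'alice', 'b', 1], [3, 'alice', 'b', 1],
  [4, 'alice', 'a', 1], [5, 'alice', 'b', 0], [6, 'alice', 'b', 0],
  [7, 'alice', 'b', 0], [8, 'bob', 'a', 1], [9, 'bob', 'b', 1],
  [10, 'bob', 'a', 0],
].map(([id, name, key, metric]) => ({ id, name, key, metric }));

// Hypothetical helper mirroring
//   MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id)
// followed by GROUP BY head_id.
function sumByHeadId(inputRows) {
  const sorted = [...inputRows].sort((x, y) => x.id - y.id); // ORDER BY id
  let headId = null; // id of the latest key === 'a' row seen so far
  const groups = new Map(); // headId -> { id, name, metric }
  for (const row of sorted) {
    // Ids ascend, so the latest head id is also the running maximum.
    if (row.key === 'a') headId = row.id;
    if (!groups.has(headId)) {
      groups.set(headId, { id: headId, name: null, metric: 0 });
    }
    const group = groups.get(headId);
    // Only head rows carry a name, as in the CASE expression for head_name.
    if (row.key === 'a') group.name = row.name;
    group.metric += row.metric;
  }
  return [...groups.values()];
}

const totals = sumByHeadId(rows);
console.log(totals);
```

Note that nothing here separates rows by name. The result only matches the partitioned version when each name's rows are contiguous in id order, as they are in the sample data.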

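【Comments】:

In the JS snippet, I assumed the data was already sorted by name, and I noted that assumption in a code comment. If the data were not sorted, I would have to use a map. I see your point, and it would make sense if I had unique, sorted ids. Unfortunately, that is not the case in my real dataset.
@homam I edited the answer using the information from your comment.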