Aggregate rows to get unique arrays without subsets

【Question title】: Aggregate rows to get unique arrays without subsets
【Posted】: 2021-04-16 02:11:45
【Question】:

Initial data (the actual table contains more than 2,000,000 rows):

+--------+--------+-------+
| note   | factor | label |
+--------+--------+-------+
| note_1 | 1      | 2     |
+--------+--------+-------+
| note_1 | 1      | 3     |
+--------+--------+-------+
| note_1 | 2      | 4     |
+--------+--------+-------+
| note_2 | 123    | 2     |
+--------+--------+-------+
| note_2 | 123    | 3     |
+--------+--------+-------+
| note_2 | 2      | 4     |
+--------+--------+-------+
| note_3 | 456    | 4     |
+--------+--------+-------+
| note_4 | 434    | 5     |
+--------+--------+-------+
| note_5 | 456    | 3     |
+--------+--------+-------+
| note_5 | 456    | 4     |
+--------+--------+-------+

What I want to get (referred to below as the final table):

+----+-----------------+
| id | notes           |
+----+-----------------+
| 1  | {note_1,note_2} |
+----+-----------------+
| 2  | {note_4}        |
+----+-----------------+
| 3  | {note_3,note_5} |
+----+-----------------+

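To be clearer:

I need to group notes by the factor and label columns. Each note may appear only once in the result table. The result table should contain two columns: id (the row number) and notes (an array of notes).

I have already written a query that groups by factor and label: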

select row_number() over (order by factor) as id
     , array_agg(note order by note) as notes
from test_brand
group by factor, label

It gives these results:

+---+-----------------+
| 1 | {note_1}        |
+---+-----------------+
| 2 | {note_1}        |
+---+-----------------+
| 3 | {note_2}        |
+---+-----------------+
| 4 | {note_2}        |
+---+-----------------+
| 5 | {note_1,note_2} |
+---+-----------------+
| 6 | {note_4}        |
+---+-----------------+
| 7 | {note_5}        |
+---+-----------------+
| 8 | {note_3,note_5} |
+---+-----------------+

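But I don't know how to get from here to the final table.

If we drop the identifiers and go back to plain numbers, the task looks like a union of sets (which it actually is).
Suppose we have 8 sets: {1}, {1}, {2}, {2}, {1,2}, {4}, {5}, {3,5}. We need to get three sets: {1,2}, {4}, {3,5}.

How I see it happening:
The sets {1}, {1}, {2}, {2}, {1,2} merge into one set {1,2}, because {1} and {2} each intersect with {1,2}.
The sets {3,5} and {5} merge into one set {3,5}, because {5} and {3,5} intersect.
The set {4} does not intersect with any other set, so it stays as it is.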

【Discussion】:

Tags: sql postgresql merge concatenation postgresql-performance

【Solution 1】:

There may be more efficient ways, but this does the job:

WITH cte AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn  -- ORDER BY factor, label?!
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   )
SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   cte c
WHERE  NOT EXISTS (
   SELECT FROM cte c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;

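db<>fiddle here

After your initial query, remove duplicates in the CTE and remember the smallest row number of each distinct array.

In the final SELECT, eliminate every row whose set is contained in another set (other than itself). Compact the row numbers with another instance of row_number().
Voilà.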
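A quick illustration of the containment test that the NOT EXISTS clause relies on, using literal arrays rather than the real table. The column aliases are made up for this example only.

```sql
-- "a @> b" is true when array a contains every element of array b.
SELECT '{note_1,note_2}'::text[] @> '{note_1}'::text[]        AS superset_contains_subset  -- true
     , '{note_1}'::text[]        @> '{note_1,note_2}'::text[] AS subset_contains_superset  -- false
     , '{note_4}'::text[]        @> '{note_4}'::text[]        AS set_contains_itself;      -- true
```

The third column shows why the query also needs c1.rn <> c.rn: every array contains itself, so without that condition each row would eliminate itself.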

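Optimize performance

The table has more than 2,000,000 rows.

If note can be an integer instead of a string type, the computation gets much faster, even more so after installing the additional module intarray, which provides a faster implementation of the @> operator for integer arrays.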
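One possible way to get integer notes is sketched below. It assumes you are allowed to install extensions in the database; the lookup table note_id and its column nid are hypothetical names introduced only for this example and do not come from the question.

```sql
-- intarray ships with the standard PostgreSQL contrib modules.
CREATE EXTENSION IF NOT EXISTS intarray;

-- Hypothetical lookup table: one integer id per distinct note.
CREATE TEMP TABLE note_id AS
SELECT note, row_number() OVER (ORDER BY note)::int AS nid
FROM  (SELECT DISTINCT note FROM test_brand) d;

-- Same aggregation as in the query above, but producing integer arrays.
SELECT array_agg(n.nid ORDER BY n.nid) AS notes
FROM   test_brand t
JOIN   note_id n USING (note)
GROUP  BY t.factor, t.label;
```

The cast to int matters because intarray operates on arrays of 4-byte integers, while row_number() returns bigint. The integer ids can be mapped back to the original notes with a final join against note_id.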

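If the derived table from the CTE is still big, it may pay to create a temporary table, add an index (and ANALYZE!), and then run the outer SELECT based on that temp table: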


CREATE TEMP TABLE tmp AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   );

-- gin__int_ops is provided by intarray and requires notes to be integer[]
CREATE INDEX ON tmp USING gin (notes gin__int_ops);
ANALYZE tmp;

SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   tmp c
WHERE  NOT EXISTS (
   SELECT FROM tmp c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;


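【Comments】:

Thanks for your answer! I'm not sure, but something goes wrong. You can see a running example here: fiddle. I just added two new rows: note_7, 123, 2 and note_8, 656, 8. The correct result would be: {note_1,note_2,note_7} {note_4} {note_3,note_5} {note_8}, but running the script gives this result: {note_1,note_2} {note_2,note_7} {note_4} {note_3,note_5} {note_8}
@ErwinBrandstetter . . . Somehow, it is cognitively dissonant for you to start an answer with "There may be more efficient ways". Almost by definition, I would expect your answer to be the most efficient way to do something in Postgres.
@Moon But note_1 and note_7 don't share a (factor, label)?
@Gordon: Not so sure this time. @> is an expensive operation without an index. The added section on optimization makes it sit better with me. Still, I feel there may be more potential to speed it up. (Also, thanks.)
@ErwinBrandstetter, yes, they don't. But note_1 shares a (factor, label) with note_2, and note_2 shares a (factor, label) with note_7. And we need to group them all together. In other words: {1}, {1,2}, {2,7} should produce the single set {1,2,7}.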
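The rule described in the last comment is transitive: notes belong together whenever a chain of shared (factor, label) pairs links them, which is a connected-components problem rather than a subset test. The following is only a minimal sketch of that idea using a recursive CTE. It is not part of the posted answer, the CTE names edges, walk and comp are made up for this example, and it is not tuned for 2,000,000 rows.

```sql
WITH RECURSIVE edges AS (
   -- Two notes are directly linked when they share a (factor, label) pair.
   -- Every note is also linked to itself, so isolated notes are kept.
   SELECT DISTINCT a.note AS n1, b.note AS n2
   FROM   test_brand a
   JOIN   test_brand b USING (factor, label)
   )
, walk AS (
   SELECT n1 AS root, n2 AS note
   FROM   edges
   UNION  -- not UNION ALL: dropping duplicates makes the recursion terminate
   SELECT w.root, e.n2
   FROM   walk  w
   JOIN   edges e ON e.n1 = w.note
   )
, comp AS (
   -- Label each note with the smallest note that can reach it.
   SELECT note, min(root) AS root
   FROM   walk
   GROUP  BY note
   )
SELECT row_number() OVER (ORDER BY root) AS id
     , array_agg(note ORDER BY note) AS notes
FROM   comp
GROUP  BY root;
```

With the two extra rows from the first comment, this groups note_1, note_2 and note_7 together. The ids follow the smallest note in each group, so the numbering can differ from the table in the question. The intermediate result in walk grows with the square of the group size, so large groups may make this approach impractical without further work.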