PostgreSQL exclude records crossing other table values (answers)

【Question title】: PostgreSQL exclude records crossing other table values
【Posted】: 2014-04-23 23:31:38
【Question description】:

Consider two PostgreSQL tables:

Table #1

id INT
secret_id INT
title VARCHAR

Table #2

id INT
secret_id INT

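I need to select all records from Table #1, excluding those whose secret_id value also appears in Table #2.

The following query is very slow with 1,000,000 records in Table #1 and 500,000 records in Table #2: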

select * from table_1 where secret_id not in (select secret_id from table_2);

What is the best way to achieve this?

【Comments】:

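What exactly is the question? How to make it faster, I guess? Have you looked at the query plan (EXPLAIN/EXPLAIN ANALYZE)? How slow is "very slow"? Do you have indexes? What is the distribution of these IDs: if every row has a unique secret_id, your example would return 500000 rows, which will be slow regardless.
See stackoverflow.com/questions/7125291/…
There's a good chance you need to raise your work_mem setting, but you may find that SELECT * FROM table_1 EXCEPT SELECT t1.* FROM table_1 t1 JOIN table_2 t2 ON t1.secret_id = t2.secret_id is faster.
You could also try the unintuitive SELECT t1.* FROM table_1 t1 LEFT JOIN table_2 t2 ON t1.secret_id = t2.secret_id WHERE t2.secret_id IS NULL.
Use NOT EXISTS instead, or a left anti-join as Daniel Lyons suggested. And show the EXPLAIN ANALYZE output.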
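
The work_mem suggestion in the comments above can be tried for a single session, without touching the server configuration. This is a minimal sketch, assuming the table and column names from the question; the value 256MB is an arbitrary illustration, not a recommendation.

```sql
-- Inspect the current per-operation memory budget.
SHOW work_mem;

-- Raise it for this session only (256MB is an assumed, illustrative value).
SET work_mem = '256MB';

-- Re-check the plan; plain EXPLAIN does not execute the query.
EXPLAIN
SELECT * FROM table_1
WHERE secret_id NOT IN (SELECT secret_id FROM table_2);

-- Go back to the server default afterwards.
RESET work_mem;
```

When the subquery result fits in work_mem, PostgreSQL can plan NOT IN as a hashed SubPlan instead of rescanning the subquery result for every outer row. Whether that happens depends on the data and the server version, so compare the EXPLAIN output before and after the change.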

Tags: sql postgresql query-optimization

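【Solution 1】:

FWIW, I tested the suggestions made by Daniel Lyons and Craig Ringer in the comments above. Here are the results for my particular case (~500k rows per table), ordered by efficiency (most efficient first).

Anti-join: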

> EXPLAIN ANALYZE SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.secret_id=t2.secret_id WHERE t2.secret_id IS NULL;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=19720.19..56129.91 rows=21 width=28) (actual time=139.868..459.991 rows=142993 loops=1)
   Hash Cond: (t1.secret_id = t2.secret_id)
   ->  Seq Scan on table1 t1  (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..61.913 rows=622338 loops=1)
   ->  Hash  (cost=10849.75..10849.75 rows=510275 width=14) (actual time=138.176..138.176 rows=510275 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 777kB
         ->  Seq Scan on table2 t2  (cost=0.00..10849.75 rows=510275 width=14) (actual time=0.018..47.005 rows=510275 loops=1)
 Total runtime: 466.748 ms
(7 lignes)

NOT EXISTS:

> EXPLAIN ANALYZE SELECT * FROM table1 t1 WHERE NOT EXISTS (SELECT secret_id FROM table2 t2 WHERE t2.secret_id=t1.secret_id);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=19222.19..55133.91 rows=21 width=14) (actual time=181.881..517.632 rows=142993 loops=1)
   Hash Cond: (t1.secret_id = t2.secret_id)
   ->  Seq Scan on table1 t1  (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..70.478 rows=622338 loops=1)
   ->  Hash  (cost=10849.75..10849.75 rows=510275 width=4) (actual time=179.665..179.665 rows=510275 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 592kB
         ->  Seq Scan on table2 t2  (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.019..78.074 rows=510275 loops=1)
 Total runtime: 524.300 ms
(7 lignes)

EXCEPT:

> EXPLAIN ANALYZE SELECT * FROM table1 EXCEPT (SELECT t1.* FROM table1 t1 join table2 t2 ON t1.secret_id=t2.secret_id);
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 SetOp Except  (cost=1524985.53..1619119.03 rows=62261 width=14) (actual time=16926.056..19850.915 rows=142993 loops=1)
   ->  Sort  (cost=1524985.53..1543812.23 rows=7530680 width=14) (actual time=16925.010..18596.860 rows=6491408 loops=1)
         Sort Key: "*SELECT* 1".secret_id, "*SELECT* 1".jeu, "*SELECT* 1".combinaison, "*SELECT* 1".gains
         Sort Method: external merge  Disk: 185232kB
         ->  Append  (cost=0.00..278722.63 rows=7530680 width=14) (actual time=0.007..2951.920 rows=6491408 loops=1)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..19275.12 rows=622606 width=14) (actual time=0.007..176.892 rows=622338 loops=1)
                     ->  Seq Scan on table1  (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..69.842 rows=622338 loops=1)
               ->  Subquery Scan on "*SELECT* 2"  (cost=19222.19..259447.51 rows=6908074 width=14) (actual time=168.529..2228.335 rows=5869070 loops=1)
                     ->  Hash Join  (cost=19222.19..190366.77 rows=6908074 width=14) (actual time=168.528..1450.663 rows=5869070 loops=1)
                           Hash Cond: (t1.secret_id = t2.secret_id)
                           ->  Seq Scan on table1 t1  (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.002..64.554 rows=622338 loops=1)
                           ->  Hash  (cost=10849.75..10849.75 rows=510275 width=4) (actual time=168.329..168.329 rows=510275 loops=1)
                                 Buckets: 4096  Batches: 32  Memory Usage: 592kB
                                 ->  Seq Scan on table2 t2  (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.017..72.702 rows=510275 loops=1)
 Total runtime: 19896.445 ms
(15 lignes)

NOT IN:

> EXPLAIN SELECT * FROM table1 WHERE secret_id NOT IN (SELECT secret_id FROM table2);
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Seq Scan on table1  (cost=0.00..5189688549.26 rows=311303 width=14)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..15395.12 rows=510275 width=4)
           ->  Seq Scan on table2  (cost=0.00..10849.75 rows=510275 width=4)
(5 lignes)

I did not run EXPLAIN ANALYZE on the last one, because it takes far too long.
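
One caveat when replacing NOT IN with the faster forms: they return the same rows only if the subquery's secret_id column contains no NULLs. The following is a minimal sketch of the difference, using hypothetical temporary tables demo_t1 and demo_t2 that are not part of the original post.

```sql
-- Hypothetical demo tables, not part of the original schema.
CREATE TEMP TABLE demo_t1 (secret_id int);
CREATE TEMP TABLE demo_t2 (secret_id int);
INSERT INTO demo_t1 VALUES (1), (2), (3);
INSERT INTO demo_t2 VALUES (1), (NULL);

-- NOT IN: comparing against the NULL yields "unknown",
-- so no row of demo_t1 passes the filter.
SELECT * FROM demo_t1
WHERE secret_id NOT IN (SELECT secret_id FROM demo_t2);

-- NOT EXISTS: the NULL never matches anything,
-- so the rows with secret_id 2 and 3 are returned.
SELECT * FROM demo_t1 t1
WHERE NOT EXISTS (
    SELECT 1 FROM demo_t2 t2 WHERE t2.secret_id = t1.secret_id
);
```

This difference in NULL handling is one reason the planner does not treat NOT IN like the LEFT JOIN ... IS NULL and NOT EXISTS forms, both of which were planned as a Hash Anti Join in the results above.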

【Comments】: