【Question title】: Postgres multiple predicates on multiple columns
【Posted】: 2019-10-04 09:31:23
【Question】:

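Edited:

I thought I should explain what I am trying to do, so that others may have a better idea of how to write the query than what I originally asked.

I have one table with about 500 million rows and another with about 50 million rows.

The table definitions are as follows: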

CREATE TABLE NGRAM_CONTENT
(
    id  BIGINT NOT NULL PRIMARY KEY,
    ref TEXT   NOT NULL,
    data      TEXT
);

CREATE INDEX idx_reference_ngram_content ON NGRAM_CONTENT (ref);
CREATE INDEX idx_id_ngram_content ON NGRAM_CONTENT (id);


CREATE TABLE NGRAMS
(
    id  BIGINT NOT NULL,
    ngram   TEXT   NOT NULL,
    ref TEXT   NOT NULL,
    name_length INT NOT NULL
);

CREATE INDEX combined_index ON NGRAMS (name_length, ngram, ref, id);
CREATE INDEX namelength_idx ON NGRAMS (name_length);
CREATE INDEX id_idx ON NGRAMS (id);
CREATE INDEX ref_idx ON NGRAMS (ref);
CREATE INDEX ngram_idx ON NGRAMS (ngram);

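To allow fast bulk inserts, upstream events that have been marked as deleted are inserted with null in the data column of the NGRAM_CONTENT table, and no foreign key constraints are set up. Logically, both id and ref in the NGRAMS table are foreign keys to the NGRAM_CONTENT table.

Some sample data: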

Ngram_Content:

| id | ref  | data          |
| 1  | 'P1' | some_json     |
| 2  | 'P1' | some_new_json | # P1 comes again as an update
| 3  | 'P2' | P3            |
| 4  | 'P1' | null          |

Ngrams:

| name_length | ngram | ref  | id |
| 12          | CH    | 'P1' | 1  |
| 12          | AN    | 'P1' | 1  |
| 14          | NEW   | 'P1' | 2  |
| 20          | CH    | 'P2' | 3  |
| 20          | CHAI  | 'P2' | 3  |
...

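For the above data, if I search up to id 1, the content of P1 is some_json; if I search up to id=2, the latest version of P1 has been updated (its ngram is now NEW); and if I search past the row where P1's data is null (id 4 in the sample), P1 has been deleted.

All searches must be restricted to a name_length range, given as a from value and a to value.

In other words: for a given ref, find only the latest ngram content that falls within the name_length range, has not been deleted, and is no newer than a given id.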
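To make the "latest, not deleted, up to some id" rule concrete, here is a minimal sketch against the sample NGRAM_CONTENT rows above. It deliberately ignores the ngram and name_length filters, and the cutoff value 3 is only an illustrative assumption. It relies on PostgreSQL's DISTINCT ON, so it is not portable to Oracle as written.

```sql
-- Sketch only: pick the newest row per ref below a cutoff id,
-- then drop refs whose newest row is a deletion marker (data IS NULL).
SELECT latest.*
FROM (
    SELECT DISTINCT ON (ref) id, ref, data
    FROM NGRAM_CONTENT
    WHERE id < 3              -- hypothetical cutoff ("an_event_id")
    ORDER BY ref, id DESC     -- newest row first within each ref
) AS latest
WHERE latest.data IS NOT NULL;
```

With the sample rows and a cutoff of 3, this returns the id 2 row for P1. Raising the cutoff to 5 makes the id 4 row the newest one for P1; since its data is null, P1 drops out and only the id 3 row for P2 remains.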

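I need to support two cases: 1. with an event id (for historical runs); 2. without an event id, using the latest data.

So I came up with two variants like this.

With event_id: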

select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
            select max(id) as max_event_id, ref from NGRAMS where
                name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* , etc. */) AND id < an_event_id group by ref having count(ref) >= a_threshold) i
            on w.ref = i.ref where w.id >= i.max_event_id AND w.id < an_event_id group by w.ref) wi
    on w.ref = wi.ref and w.id = wi.w_max_event_id where w.data is not null;

Without event_id:

select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
            select max(id) as max_event_id, ref from NGRAMS where
                name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* , etc. */) group by ref having count(ref) >= a_threshold) i
            on w.ref = i.ref where w.id >= i.max_event_id group by w.ref) wi
    on w.ref = wi.ref and w.id = wi.w_max_event_id where w.data is not null;

Both queries take a long time to run, and when I run EXPLAIN on them, Postgres shows a full (sequential) scan.

SEQ_SCAN (Seq Scan)  table: NGRAMS;  121494200   3358896.0   0.0 Node Type = Seq Scan;
Parent Relationship = Outer;
Parallel Aware = true;
Relation Name = NGRAMS;
Alias = w_1;
Startup Cost = 0.0;
Total Cost = 3358896.0;
Plan Rows = 121494200;
Plan Width = 16;

Detailed execution plan, generated with explain (analyze, buffers):

 Nested Loop  (cost=5032852.92..6943974.42 rows=1 width=381) (actual time=50787.356..52095.938 rows=9437 loops=1)
   Buffers: shared hit=149882 read=769965, temp read=732 written=736
   ->  Finalize GroupAggregate  (cost=5032852.35..5125447.71 rows=265783 width=16) (actual time=50785.079..50808.811 rows=9437 loops=1)
         Group Key: w_1.ref
         Buffers: shared hit=114072 read=758535, temp read=732 written=736
         ->  Gather Merge  (cost=5032852.35..5120132.05 rows=531566 width=16) (actual time=50785.072..50801.624 rows=10261 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
               ->  Partial GroupAggregate  (cost=5031852.33..5057776.12 rows=265783 width=16) (actual time=50766.172..50777.757 rows=3420 loops=3)
                     Group Key: w_1.ref
                     Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
                     ->  Sort  (cost=5031852.33..5039607.65 rows=3102128 width=16) (actual time=50766.163..50769.734 rows=41777 loops=3)
                           Sort Key: w_1.ref
                           Sort Method: quicksort  Memory: 3251kB
                           Worker 0:  Sort Method: quicksort  Memory: 3326kB
                           Worker 1:  Sort Method: quicksort  Memory: 3396kB
                           Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
                           ->  Hash Join  (cost=787482.50..4591332.06 rows=3102128 width=16) (actual time=14787.585..50749.022 rows=41777 loops=3)
                                 Hash Cond: (w_1.ref = i.ref)
                                 Join Filter: (w_1.id >= i.max_event_id)
                                 Buffers: shared hit=343708 read=2276169, temp read=2196 written=2208
                                 ->  Parallel Seq Scan on NGRAMS w_1  (cost=0.00..3662631.50 rows=53797008 width=16) (actual time=0.147..30898.313 rows=38518899 loops=3)
                                       Filter: (id < 45000000)
                                       Rows Removed by Filter: 58676466
                                       Buffers: shared hit=15819 read=2128135
                                 ->  Hash  (cost=786907.78..786907.78 rows=45978 width=16) (actual time=14767.179..14767.180 rows=9437 loops=3)
                                       Buckets: 65536  Batches: 1  Memory Usage: 955kB
                                       Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                       ->  Subquery Scan on i  (cost=782779.42..786907.78 rows=45978 width=16) (actual time=14669.187..14764.701 rows=9437 loops=3)
                                             Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                             ->  GroupAggregate  (cost=782779.42..786448.00 rows=45978 width=16) (actual time=14669.186..14763.369 rows=9437 loops=3)
                                                   Group Key: NGRAMS.ref
                                                   Filter: (count(NGRAMS.ref) >= 2)
                                                   Rows Removed by Filter: 210038
                                                   Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                                   ->  Sort  (cost=782779.42..783265.52 rows=194442 width=16) (actual time=14669.164..14708.948 rows=229489 loops=3)
                                                         Sort Key: NGRAMS.ref
                                                         Sort Method: external merge  Disk: 5856kB
                                                         Worker 0:  Sort Method: external merge  Disk: 5856kB
                                                         Worker 1:  Sort Method: external merge  Disk: 5856kB
                                                         Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                                         ->  Index Only Scan using combined_index on NGRAMS  (cost=0.57..762373.68 rows=194442 width=16) (actual time=0.336..14507.098 rows=229489 loops=3)
                                                               Index Cond: ((indexed = ANY ('{YU,CA,SAN,LT,TO}'::text[])) AND (name_length >= 15) AND (name_length <= 20) AND (event_id < 45000000))
                                                               Heap Fetches: 688467
                                                               Buffers: shared hit=327861 read=148034
   ->  Index Scan using idx_id_ngram_content on NGRAM_CONTENT w  (cost=0.56..6.82 rows=1 width=381) (actual time=0.135..0.136 rows=1 loops=9437)
         Index Cond: (id = (max(w_1.id)))
         Filter: ((data IS NOT NULL) AND (w_1.ref = ref))
         Buffers: shared hit=35810 read=11430
 Planning Time: 12.075 ms
 Execution Time: 52100.064 ms

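Is there any way to make these queries faster?

I tried splitting the query into smaller pieces and analyzing each one, and found that the full scan happens in this join: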

select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
            select max(id) as max_event_id, ref from NGRAMS where
                name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* , etc. */) AND id < an_event_id group by ref having count(ref) >= a_threshold) i
            on w.ref = i.ref where w.id >= i.max_event_id AND w.id < an_event_id group by w.ref

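But I don't know why, and I'm not sure which indexes are missing.

An answer for Postgres is preferred, but failing that, an answer for Oracle would help too.

I know this is long, but please help if you can. Thanks.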

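【Comments】:

The real (unedited) execution plan generated with explain (analyze, buffers, format text) would be more helpful than your "summary" of the plan.
Thanks, I have edited the question to add the detailed execution plan.
That is a "simple" explain plan, not one generated with explain (analyze, buffers).
Sorry @a_horse_with_no_name, I get it now and have updated it. Thanks.
@a_horse_with_no_name I have asked this question again here, with an exact example and scenario: stackoverflow.com/questions/58246864/…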

Tags: sql postgresql performance

【Solution 1】:

For queries this varied, the best you can do is create three indexes:

CREATE INDEX ON ngrams (id);
CREATE INDEX ON ngrams (name_length);
CREATE INDEX ON ngrams (ngram);

If one of the conditions is not selective enough on its own, PostgreSQL can hopefully combine the indexes with a Bitmap And.
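One way to check whether the planner actually combines these indexes is to run the filtering part of the query on its own under EXPLAIN. The sketch below assumes the literal values visible in the question's plan (name_length 15 to 20, id below 45000000, a threshold of 2); they stand in for a_number, b_number, an_event_id and a_threshold and are not fixed values.

```sql
-- Sketch only: literals are copied from the plan in the question
-- and are assumptions about a typical search.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ref, max(id) AS max_event_id
FROM ngrams
WHERE name_length BETWEEN 15 AND 20
  AND ngram IN ('YU', 'CA', 'SAN', 'LT', 'TO')
  AND id < 45000000
GROUP BY ref
HAVING count(ref) >= 2;
```

A BitmapAnd node sitting above several Bitmap Index Scan nodes in the output would indicate that the single-column indexes are being combined. The planner is free to pick something else, for instance the index-only scan on combined_index that appears in the question's plan.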

【Comments】:

Thanks @Laurenz Albe, I have updated the question to include more details.
As @a_horse_with_no_name commented above, the missing detail is the execution plan.