Force GIN index scan in postgresql 9.4

【Question title】: Force GIN index scan in postgresql 9.4
【Posted】: 2018-03-06 11:33:58
【Question】:

I have a locations table (about 29 million rows):

                         Table "public.locations"
 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 id     | integer | not null default nextval('locations_id_seq'::regclass)
 dl     | text    |
Indexes:
    "locations_pkey" PRIMARY KEY, btree (id)
    "locations_test_idx" gin (to_tsvector('english'::regconfig, dl))

I expect the following query to perform well:

EXPLAIN (ANALYZE,BUFFERS) SELECT id  FROM locations WHERE  to_tsvector('english'::regconfig, dl)  @@ to_tsquery('Lymps') LIMIT 10;

However, the resulting query plan shows that a sequential scan is used:

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..65.18 rows=10 width=4) (actual time=62217.569..62217.569 rows=0 loops=1)
  Buffers: shared hit=262 read=447808
  I/O Timings: read=861.370
  ->  Seq Scan on locations  (cost=0.00..967615.99 rows=148442 width=2) (actual time=62217.567..62217.567 rows=0 loops=1)
         Filter: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
         Rows Removed by Filter: 29688342
         Buffers: shared hit=262 read=447808
         I/O Timings: read=861.370
Planning time: 0.109 ms
Execution time: 62217.584 ms

When I forcibly disable sequential scans with

set enable_seqscan to off;

the query plan uses the GIN index:

                                                                  QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1382.43..1403.20 rows=10 width=2) (actual time=0.043..0.043 rows=0 loops=1)
   Buffers: shared hit=1 read=3
   ->  Bitmap Heap Scan on locations  (cost=1382.43..309697.73 rows=148442 width=2) (actual time=0.043..0.043 rows=0 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
         Buffers: shared hit=1 read=3
         ->  Bitmap Index Scan on locations_test_idx  (cost=0.00..1345.32 rows=148442 width=0) (actual time=0.041..0.041 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
               Buffers: shared hit=1 read=3
 Planning time: 0.089 ms
 Execution time: 0.069 ms
(10 rows)

The cost settings are pasted below.

select name,setting from pg_settings where name like '%cost';                       
         name         | setting 
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1
(5 rows)

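I am looking for a solution that avoids the sequential scan for the query above without resorting to tricks such as turning sequential scans off.

I tried raising seq_page_cost to 20, but the query plan stayed the same.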
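A rough back-of-the-envelope check suggests why raising seq_page_cost alone may not be enough. The sketch below assumes the table occupies 448,070 pages (the buffers touched by the full scan, 262 + 447,808) and that the sequential scan's cost is pages times seq_page_cost plus a per-row CPU charge. Both are inferred from the plans above rather than read from the catalog, and the names are illustrative, not PostgreSQL identifiers.

```python
PAGES = 262 + 447_808      # assumed table size: buffers hit + read by the full scan
ROWS = 29_688_342          # rows examined by the seq scan ("Rows Removed by Filter")
EST_MATCHES = 148_442      # planner's row estimate for the filter
LIMIT = 10
BITMAP_STARTUP = 1_382.43  # startup cost of the bitmap plan, copied from EXPLAIN

# Per-row CPU charge inferred from the reported total cost at seq_page_cost = 1.
cpu_per_row = (967_615.99 - PAGES * 1.0) / ROWS


def seq_limit_cost(seq_page_cost):
    """Seq scan total cost, scaled to the share needed to return LIMIT rows."""
    total = PAGES * seq_page_cost + ROWS * cpu_per_row
    return total * LIMIT / EST_MATCHES


for spc in (1, 20):
    print(f"seq_page_cost={spc:>2}: {seq_limit_cost(spc):8.2f}")
print(f"bitmap plan startup cost: {BITMAP_STARTUP:.2f}")
```

With these inferred inputs, the estimate rises from about 65 to roughly 639 at seq_page_cost = 20. That is still below the bitmap plan's startup cost of 1382.43, which is consistent with the plan not changing. Under this reading, the decision is dominated by the row estimate (148442 expected matches) rather than by the page cost.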

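【Comments】:

As a first attempt, run ANALYZE to update the statistics. If that doesn't help, increase default_statistics_target and try again.
I ran VACUUM FULL ANALYZE locations, but it didn't help. SET default_statistics_target to 10000; didn't help either.
Run ANALYZE (not VACUUM FULL) after changing default_statistics_target. Still no improvement?
Actually, I had run VACUUM FULL ANALYZE locations before setting default_statistics_target. After setting default_statistics_target and then running VACUUM FULL ANALYZE, the query plan changed to use the GIN index. I guess ANALYZE is the key here. Thanks a lot for the suggestion.

Tags: postgresql performance query-optimization postgresql-9.4 query-planner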

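【Solution 1】:

The problem here is that PostgreSQL estimates that many rows satisfy the condition, so it expects that reading the table sequentially until it has found 10 matches will be cheaper than using the index.

In reality no row satisfies the condition, so the query ends up scanning the entire table, whereas an index scan would have established that almost immediately.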
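The planner's reasoning can be reproduced roughly from the numbers in the two plans in the question. The sketch below assumes the usual costing of a LIMIT node: the child's startup cost plus the fraction of its run cost needed to produce the requested rows. The function and variable names are illustrative, not PostgreSQL identifiers.

```python
# Numbers copied from the EXPLAIN output in the question.
TABLE_ROWS = 29_688_342   # "Rows Removed by Filter" in the seq scan
EST_MATCHES = 148_442     # planner's row estimate for the filter
LIMIT = 10


def limit_cost(startup, total, est_rows, limit):
    """Cost of LIMIT over a child plan: startup cost plus the share of the
    run cost needed to produce `limit` of the estimated rows."""
    fraction = min(1.0, limit / est_rows)
    return startup + (total - startup) * fraction


seq_cost = limit_cost(0.00, 967_615.99, EST_MATCHES, LIMIT)
bitmap_cost = limit_cost(1_382.43, 309_697.73, EST_MATCHES, LIMIT)
selectivity = EST_MATCHES / TABLE_ROWS

print(f"estimated selectivity: {selectivity:.4f}")
print(f"seq scan + LIMIT:      {seq_cost:.2f}")
print(f"bitmap scan + LIMIT:   {bitmap_cost:.2f}")
```

Under these assumptions the sequential plan is priced at about 65.18 and the bitmap plan at about 1403.20, matching the Limit costs shown in the two plans. The estimated selectivity comes out at 0.5%, a round figure that suggests a fallback estimate rather than a frequency measured from the data. Because the bitmap plan has to pay its startup cost of 1382.43 before returning any row, the sequential plan looks cheaper for as long as the planner believes that many rows match.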

You can improve the quality of the statistics collected for that column like this:

ALTER TABLE locations_test_idx
   ALTER to_tsvector SET STATISTICS 10000;

Then run ANALYZE, and PostgreSQL will collect better statistics for that column, which will hopefully improve the query plan.
