Why does PostgreSQL use a (parallel) sequential scan instead of an index scan in this query? Answers

【Question title】: Why does Postgresql use a (parallel) sequential scan instead of index scan in this query?
【Posted】: 2021-05-23 23:30:05
【Question】:

I have the following table:

create table images
(
    image_id       bigint,
    image          text,
    url            text,
    post_id        bigint,
    checksum       text,
    path           varchar
);


create index images_postid_idx
    on images (post_id);

create index image_2020_idx
    on images (image);

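(Yes, there is no primary key! There are also a few other fields that are mostly empty and irrelevant to the query.)

This very simple query: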

SELECT 1 FROM images where image = 'foo';

produces the following execution plan:

Gather  (cost=1000.00..40080349.00 rows=93 width=4) (actual time=339826.750..339933.048 rows=0 loops=1)
  Workers Planned: 10
  Workers Launched: 10
  Buffers: shared hit=527195 read=39504582 dirtied=23
  ->  Parallel Seq Scan on images  (cost=0.00..40079339.70 rows=9 width=4) (actual time=339800.607..339800.607 rows=0 loops=11)
        Filter: (image = 'foo'::text)
        Rows Removed by Filter: 3459138
        Buffers: shared hit=527195 read=39504582 dirtied=23
Planning Time: 3.684 ms
JIT:
  Functions: 34
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 4.039 ms, Inlining 382.076 ms, Optimization 121.531 ms, Emission 71.917 ms, Total 579.563 ms
Execution Time: 339978.002 ms

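What could be causing this? I copied the table, created the same indexes on the copy and copied a few thousand rows into it => everything works fine there. I also ran ANALYSE images to update the statistics.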

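I am not sure why the planner does not use the index on this table. The image column has 28,091,491 distinct values out of 380,000,000 rows in total. Since my query does not actually select anything from the table, why would the planner choose anything other than an index scan?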
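One way to see how the planner costs the index path is to discourage the sequential scan for a single transaction and look at the plan it falls back to. This is only a diagnostic sketch, assuming a psql session against the same database; it uses plain EXPLAIN (without ANALYZE) so that nothing is executed if the planner still insists on scanning the whole table.

```sql
-- Diagnostic only: the settings are local to this transaction and are
-- discarded by the ROLLBACK. Do not use enable_seqscan = off in production code.
BEGIN;
SET LOCAL enable_seqscan = off;                 -- make sequential scans look very expensive
SET LOCAL max_parallel_workers_per_gather = 0;  -- rule out parallel plans as well
EXPLAIN
SELECT 1 FROM images WHERE image = 'foo';       -- shows estimated costs only, runs nothing
ROLLBACK;
```

If the resulting plan uses image_2020_idx, its estimated cost can be compared with the 40080349.00 of the Gather plan above. If it still shows a sequential scan, the planner is not considering the index usable at all, which points at the index itself rather than at the cost estimates.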

I am using PG 12.5.

Update: output of select * from pg_stats where tablename = 'images' and attname = 'image'; : https://pastebin.com/Xeg7DjQd

Update 2: output of \d+ images:

                                      Table "public.images"
     Column     |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
----------------+-------------------+-----------+----------+---------+----------+--------------+-------------
 image_id       | bigint            |           |          |         | plain    |              | 
 image          | text              |           |          |         | extended |              | 
 url            | text              |           |          |         | extended |              | 
 post_id        | bigint            |           |          |         | plain    |              | 
 checksum       | text              |           |          |         | extended |              | 
 path           | character varying |           |          |         | extended |              | 
 field1         | numeric           |           |          |         | main     |              | 
 field2         | numeric           |           |          |         | main     |              | 
 field3         | integer           |           |          |         | plain    |              | 
 field4         | numeric           |           |          |         | main     |              | 
 field5         | double precision  |           |          |         | plain    |              | 
 field6         | double precision  |           |          |         | plain    |              | 
Indexes:
    "image_2020_idx" btree (image)
    "images_postid_idx" btree (post_id)
Replica Identity: FULL
Access method: heap

【Comments】:

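The index may be in an invalid state. Try rebuilding it with reindex.
@Dai: no, not with Postgres. There is absolutely no difference in speed or storage requirements between text and varchar(n) (if both store the same strings).
@user1068464: no, 255 is not a magic number for varchar length.
@steve No, the condition is very selective, PostgreSQL should definitely use the index. What is the output of \d images in psql?
A table without a primary key makes no sense. Fix your data model before trying to optimize (what is the cardinality of image, url or path? Do they depend on image_id and/or post_id?).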
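The suggestion that the index might be invalid can be checked from the system catalogs before rebuilding anything. The query below is a minimal sketch, assuming the table lives in a schema on the current search_path; the REINDEX statement is left commented out because rebuilding an index on a table of this size is expensive.

```sql
-- List every index on images with its validity flags and on-disk size.
SELECT c.relname                               AS index_name,
       i.indisvalid,                           -- false: the planner will not use the index
       i.indisready,                           -- false: the index is not yet maintained on writes
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'images'::regclass;

-- Only if image_2020_idx is reported as invalid:
-- REINDEX INDEX CONCURRENTLY image_2020_idx;  -- CONCURRENTLY is available from PostgreSQL 12
```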

Tags: postgresql indexing

【Solution 1】:

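The problem was that the table was badly bloated. It used almost 300 GB of disk space, when in reality it should only have used about 10 GB. I tried a plain VACUUM, which did not help. VACUUM FULL (which I had avoided running, because I thought it would need 300 GB of space) did help, however. Once the table had shrunk to that size, the index was used properly.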
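Bloat of this kind can be spotted by comparing the on-disk size of the table with its row statistics. The queries below are a sketch using standard size functions and statistics views. The pgstattuple part is optional and assumes the contrib extension can be installed on the server.

```sql
-- On-disk size of the heap, of all indexes, and of everything together.
SELECT pg_size_pretty(pg_relation_size('images'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('images'))        AS indexes_size,
       pg_size_pretty(pg_total_relation_size('images')) AS total_size;

-- Estimated live and dead row counts as tracked by the statistics collector.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'images';

-- Optional: approximate free space and dead tuple percentages.
-- CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- SELECT * FROM pgstattuple_approx('images');

-- The fix used in this answer. VACUUM FULL rewrites the table into a new file
-- and holds an ACCESS EXCLUSIVE lock on it for the whole duration.
-- VACUUM (FULL, VERBOSE) images;
```

Because VACUUM FULL writes a new copy that contains only the live rows, the extra disk space it needs is roughly the size of the compacted table and its rebuilt indexes, not the bloated 300 GB.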

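However, I do not know how the table could accumulate that much dead space with autovacuum turned on. Also, according to the people who use this table, they did not even do any bulk deletes, so that part remains a mystery.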
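The answer leaves the cause open, but a few standard views show whether autovacuum has been visiting the table and whether something is preventing it from removing dead rows. The queries below are a sketch of common checks, not a diagnosis of this particular case. Note that in PostgreSQL an UPDATE also leaves a dead row version behind, so bloat does not require bulk deletes.

```sql
-- When was the table last vacuumed, and how many dead rows are tracked now?
SELECT last_vacuum, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'images';

-- Sessions holding an old snapshot: VACUUM cannot remove rows they may still see.
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots can hold back cleanup in the same way.
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- So can forgotten prepared (two-phase) transactions.
SELECT gid, prepared, owner
FROM pg_prepared_xacts;
```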

【Comments】: