PostgreSQL is not using a straightforward index

【Question title】: PostgreSQL is not using a straightforward index
【Posted】: 2019-03-06 17:50:08
【Question】:

I have a PostgreSQL 10.6 database on Amazon RDS. My table looks like this:

CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country varchar(100) NOT NULL,
    product_group varchar(100) NOT NULL,
    distribution_type varchar(100) NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    category varchar(100) NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    data_status bpchar(1) NOT NULL,
    panel_market_units numeric(38,6) NOT NULL,
    panel_market_sales_value_eur numeric(38,6) NOT NULL,
    panel_market_sales_value_usd numeric(38,6) NOT NULL,
    panel_market_sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
    CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);

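I understand that a primary key implies a unique index.

If I simply count the rows while filtering on a value that does not exist (release_key = 1 returns nothing), I can see that it uses the index: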

EXPLAIN
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
  WHERE release_key = 1

Aggregate  (cost=6.32..6.33 rows=1 width=8)
  ->  Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter  (cost=0.55..6.32 rows=1 width=0)
        Index Cond: (release_key = 1)

But if I run the same query for a value that does return data, it scans the table, which is bound to be more expensive...

EXPLAIN
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
  WHERE release_key = 2

Finalize Aggregate  (cost=47611.07..47611.08 rows=1 width=8)
  ->  Gather  (cost=47610.86..47611.07 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=46610.86..46610.87 rows=1 width=8)
              ->  Parallel Seq Scan on dfo_by_quarter  (cost=0.00..46307.29 rows=121428 width=0)
                    Filter: (release_key = 2)

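I understand that using the index when there is no data makes sense, and that this is driven by the statistics on the table (I ran ANALYZE before testing).

But why does it not use my index when there is data?

Surely scanning part of the index (release_key is its first column) must be faster than scanning the whole table?

I must be missing something...?

Update 2019-03-07

Thank you for your comments, they are very useful.

This simple query was just my attempt to understand why the index is not used...

But I should have known better (I am new to PostgreSQL but have many years of experience with SQL Server) and, as you commented, it makes sense that it is not used.

Poor selectivity, because my condition only filters down to about 20% of the rows
Bad table design (too fat; we know this and are addressing it)
The index does not "cover" the query, etc.

So, if I may, let me change my question "a little"...

Our tables will be normalised into facts/dimensions (no more varchars in the wrong places).

We only ever insert, never update, and deletes are rare enough that we can ignore them.

The table will not be huge (on the order of tens of millions of rows).

Our queries will always specify an exact release_key value.

The new version of our table looks like this: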

CREATE TABLE dfo_by_quarter (
    release_key int4 NOT NULL,
    country_key int2 NOT NULL,
    product_group_key int2 NOT NULL,
    distribution_type_key int2 NOT NULL,
    category_key int2 NOT NULL,
    "year" int2 NOT NULL,
    "date" date NULL,
    quarter int2 NOT NULL,
    units numeric(38,6) NOT NULL,
    sales_value_eur numeric(38,6) NOT NULL,
    sales_value_usd numeric(38,6) NOT NULL,
    sales_value_local numeric(38,6) NOT NULL,
    CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
    CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
    CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
    CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
    CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
    CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);

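With this in mind, in a SQL Server environment I could solve this either with a "clustered" primary key (which sorts the entire table) or with an index on the primary key that uses the INCLUDE option for the other columns needed to cover the query (units, values, etc.).

Question 1)

In PostgreSQL, is there an equivalent of the SQL Server clustered index? A way to actually sort the entire table? I suppose it might be difficult, because PostgreSQL does not perform updates "in place", which could make the sorting expensive...

Or is there a way to create something like a SQL Server index WITH INCLUDE(units, values)?

Update: I came across the SQL CLUSTER command, which I suppose is the closest thing. It would suit us well.

Question 2)

With the query below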

EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
  SELECT
    ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
    "year",
    "main"."product_group_key" AS "productgroupkey",
    SUM("main"."units") AS "salesunits",
    SUM("main"."sales_value_eur") AS "salesvalue",
    SUM("sales_value_eur")/SUM("units") AS "asp"
  FROM "milly"."dfo_by_quarter" AS "main"

  WHERE
    "release_key" = 17 AND
    "main"."year" >= 2010
  GROUP BY
    "year",
    "main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
  "year" AS date,
  SUM("salesunits") AS "salesunits",
  SUM("salesvalue") AS "salesvalue",
  SUM("salesvalue")/SUM("salesunits") AS "asp",
  CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
  "rank_query"
GROUP BY
  "year",
  CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
  FROM BeforeLookup AS BL
  INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;

I get this plan:

Hash Join  (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
  Hash Cond: (bl.productgroupkey = pg.product_group_key)
  Buffers: shared hit=483 read=22719
  CTE rank_query
    ->  WindowAgg  (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
          Buffers: shared hit=480 read=22719
          ->  Sort  (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
                Sort Key: main.year, (sum(main.units)) DESC
                Sort Method: quicksort  Memory: 616kB
                Buffers: shared hit=480 read=22719
                ->  Finalize HashAggregate  (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
                      Group Key: main.year, main.product_group_key
                      Buffers: shared hit=480 read=22719
                      ->  Gather  (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            Buffers: shared hit=480 read=22719
                            ->  Partial HashAggregate  (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
                                  Group Key: main.year, main.product_group_key
                                  Buffers: shared hit=480 read=22719
                                  ->  Parallel Seq Scan on dfo_by_quarter main  (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
                                        Filter: ((year >= 2010) AND (release_key = 17))
                                        Rows Removed by Filter: 546602
                                        Buffers: shared hit=480 read=22719
  CTE beforelookup
    ->  HashAggregate  (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
          Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
          Buffers: shared hit=480 read=22719
          ->  CTE Scan on rank_query  (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
                Buffers: shared hit=480 read=22719
  ->  CTE Scan on beforelookup bl  (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
        Buffers: shared hit=480 read=22719
  ->  Hash  (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=3
        ->  Seq Scan on dim_dfo_product_group pg  (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
              Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms

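Does anything come to mind?

If I read it correctly, by far my biggest cost is the initial scan of the table... but I have not managed to make it use an index...

I created an index that I hoped would help, but it was ignored...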

CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);

ANALYZE milly.dfo_by_quarter;

I also tried clustering the table, with no visible effect either:

CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)

ANALYZE milly.dfo_by_quarter;

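Thanks a lot

Eric

【Comments】:

How many rows does your table have? How many of them have release_key = 2? Is there (concurrent) write load on the table? Do you see an index-only scan after VACUUM dpretailer.dfo_by_quarter?
Also, could you replace the EXPLAINs with the output of EXPLAIN (ANALYZE, BUFFERS)? That would give us actual timings and shared buffer cache hits/misses.

Tags: postgresql indexing amazon-rds postgresql-performance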

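【Solution 1】:

Because release_key is not actually a unique column, it is impossible to tell from the information you have provided whether the index should be used. If a high percentage of the rows have release_key = 2, or even if a smaller percentage of the rows match on a large table, using the index may not be efficient.

In part this is because Postgres indexes are indirect: the index contains a pointer to the location on disk, in the heap, where the real tuple lives. So walking an index means reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it is often cheaper to scan the heap directly and avoid the penalty of indirect disk access.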
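The trade-off described above can be inspected from the planner's own inputs. The following is a minimal sketch, assuming the dpretailer.dfo_by_quarter table from the question and a recent ANALYZE; it only reads catalog views and settings, and the figures it returns depend entirely on your data.

```sql
-- How common is each release_key, according to the statistics gathered by ANALYZE?
SELECT most_common_vals, most_common_freqs, n_distinct
  FROM pg_stats
 WHERE schemaname = 'dpretailer'
   AND tablename  = 'dfo_by_quarter'
   AND attname    = 'release_key';

-- Table size as the planner sees it: heap pages and estimated row count.
SELECT relpages, reltuples
  FROM pg_class
 WHERE oid = 'dpretailer.dfo_by_quarter'::regclass;

-- Cost parameters that weigh sequential page reads against random ones.
SHOW seq_page_cost;
SHOW random_page_cost;
```

If the frequency reported for the filtered value is a sizeable fraction of the table, a plan that reads the heap sequentially will usually be estimated as cheaper than one that follows index pointers row by row.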

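Edit: You generally do not want to use CLUSTER in PostgreSQL; it is not how indexes are maintained, which is why you rarely see it in the wild.

Your updated query, run with no data, gives this plan: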

                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on beforelookup bl  (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
   Buffers: shared hit=4
   CTE rank_query
     ->  WindowAgg  (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
           Buffers: shared hit=4
           ->  Sort  (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
                 Sort Key: main.year, (sum(main.units)) DESC
                 Sort Method: quicksort  Memory: 25kB
                 Buffers: shared hit=4
                 ->  GroupAggregate  (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
                       Group Key: main.year, main.product_group_key
                       Buffers: shared hit=1
                       ->  Sort  (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
                             Sort Key: main.year, main.product_group_key
                             Sort Method: quicksort  Memory: 25kB
                             Buffers: shared hit=1
                             ->  Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main  (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
                                   Index Cond: ((release_key = 17) AND (year >= 2010))
                                   Buffers: shared hit=1
   CTE beforelookup
     ->  HashAggregate  (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
           Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
           Buffers: shared hit=4
           ->  CTE Scan on rank_query  (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
                 Buffers: shared hit=4
 Planning Time: 0.723 ms
 Execution Time: 0.485 ms
(27 rows)

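So PostgreSQL is perfectly capable of using the index for your query, but the planner decides that it is not worth it (i.e., the estimated cost of using the index directly is higher than the cost of the parallel sequential scan).

If you run SET enable_indexscan = off; with no data, you get a bitmap index scan (as I would expect). If you also run SET enable_bitmapscan = off; with no data, you get a (non-parallel) sequential scan.

If you run SET max_parallel_workers = 0;, you should see the plan change (with the large amount of data).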
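One way to run this kind of experiment without changing the session permanently is to scope the settings to a transaction. This is a sketch only: it uses enable_seqscan and max_parallel_workers_per_gather, which are not the settings named above, on the assumption that you want to see what the index-based plan would cost on the populated table from the question.

```sql
BEGIN;

-- Discourage (not forbid) sequential scans, for this transaction only.
SET LOCAL enable_seqscan = off;

-- Setting this to 0 disables parallel query plans, which makes the
-- index plan and the sequential plan easier to compare side by side.
SET LOCAL max_parallel_workers_per_gather = 0;

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
 WHERE release_key = 2;

-- SET LOCAL values are discarded when the transaction ends.
ROLLBACK;
```

Comparing the estimated cost and the actual time of this plan with the default one shows whether the planner's preference for the sequential scan is justified on your hardware.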

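But looking at the EXPLAIN output for your query, I would very much expect using the index to be more expensive and to take longer than the parallel sequential scan. In your updated query you are still scanning a very high percentage of the table and a large number of rows, and you are also forcing heap access by reading fields that are not in the index. Postgres 11 (I believe) adds covering indexes, which would in theory let this query be driven by the index alone, but I am not at all convinced that it would actually be worth it in this example.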
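For reference, a covering index of the kind mentioned above could look roughly like the following. This is a sketch under assumptions: it needs PostgreSQL 11 or later (the server in the question is 10.6), the index name is made up, and the choice of key versus included columns is only one plausible layout for the ranking query in the question.

```sql
-- Requires PostgreSQL 11+. Key columns are used for searching and ordering;
-- INCLUDE columns are stored in the index only so the query can be answered
-- from it without visiting the heap.
CREATE INDEX dfo_by_quarter_release_year_covering   -- hypothetical name
    ON milly.dfo_by_quarter (release_key, "year", product_group_key)
    INCLUDE (units, sales_value_eur);

-- Index-only scans depend on the visibility map, which VACUUM maintains.
VACUUM (ANALYZE) milly.dfo_by_quarter;
```

Even with such an index in place, the planner may still prefer the parallel sequential scan when the filter matches a large share of the table, so check the resulting plan with EXPLAIN (ANALYZE, BUFFERS) before relying on it.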

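【Comments】:

I do not have access to the database right now, so I cannot provide more detail until tonight. Coming from a SQL Server background, I understand the "indirect" aspect of the index; the same applies in SQL Server if I reference any column not covered by the index. But in my example I pointed out that I only do a COUNT(*), which SQL Server can resolve without touching the underlying table, because the row count can be computed from the index.
@EricMamet I have updated my answer for your updated question.
Thanks a lot. Makes sense.

【Solution 2】:

Generally, while possible, a PK spanning 7 columns, several of them varchar(100), is not optimized for performance, to say the least.

Such an index is large to begin with and tends to bloat quickly if you have updates on the columns involved.

I would use a surrogate PK, a serial (or bigserial, if you have that many rows). Or IDENTITY. See:

Auto increment table column

And add a UNIQUE constraint on all 7 columns to enforce uniqueness (all are NOT NULL anyway).

If you have lots of counting queries whose only predicate is on release_key, consider an additional plain btree index on just that column.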
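The layout suggested above could be sketched as follows. The table, constraint, and index names are hypothetical, and only one measure column is kept for brevity; the IDENTITY syntax is available from PostgreSQL 10.

```sql
CREATE TABLE dfo_by_quarter_sketch (            -- hypothetical table name
    -- Surrogate primary key instead of the 7-column natural key.
    dfo_by_quarter_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    release_key       int4          NOT NULL,
    country           varchar(100)  NOT NULL,
    category          varchar(100)  NOT NULL,
    product_group     varchar(100)  NOT NULL,
    distribution_type varchar(100)  NOT NULL,
    "year"            int2          NOT NULL,
    quarter           int2          NOT NULL,
    units             numeric(38,6) NOT NULL,
    -- The natural key is still enforced, just not as the primary key.
    CONSTRAINT uq_dfo_by_quarter_sketch
        UNIQUE (release_key, country, category, product_group,
                distribution_type, "year", quarter)
);

-- Small single-column btree index for queries that filter on release_key only.
CREATE INDEX ix_dfo_by_quarter_sketch_release
    ON dfo_by_quarter_sketch (release_key);
```

The single-column index is much narrower than the 7-column unique index, so scanning it for one release_key reads far fewer pages.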

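The data type varchar(100) for so many columns may not be optimal. Some normalization might help.

More advice depends on missing information...

【Comments】:

+1 for this answer, because adding an index on the filter column increases the chance of getting an index-only scan. Once that is in place (keep in mind that creating a foreign key does not create an index), the primary key should be irrelevant.
* I ran this particular query purely to try to get an index scan. The original query is much more complex. * In a "real scenario" I would need other columns. * No other operations are running concurrently. * There are probably about 300,000 rows for this release_key value, out of a few million in total. * At this stage my table is "fat" (varchar columns) because it is a quick and dirty implementation, but I will soon make it look more like fact/dimension (hence surrogate keys instead of varchars).
@EricMamet: Then the question may be misleading. All details matter. Postgres decides on a query plan based on estimated costs. "A few million" is too vague. It could be 2 or 9 million, which makes a big difference.
@ErwinBrandstetter I understand that, and I will come back tomorrow with more details. In SQL Server, such a request would always use the index, because it does not need to look at the table at all and would only scan a subset of the index sequentially (regardless of the number of rows). I wondered whether PostgreSQL behaves the same way. Obviously not!
@EricMamet: It depends. For index-only scans, like the one we see in your first query plan, some preconditions have to be met.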

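【Solution 3】:

The answer to my original question, why PostgreSQL does not use my index for something like SELECT COUNT(*)..., can be found in the documentation...

Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT

In particular: this means that every time a row is read from an index, the engine also has to read the actual row in the table to make sure that the row has not been deleted.

This goes a long way towards explaining why I did not manage to get PostgreSQL to use my index when, from a SQL Server point of view, it obviously "should".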
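PostgreSQL can skip that heap visit for pages that the visibility map marks as all-visible, and VACUUM is what maintains that map. The sketch below, which assumes the dpretailer.dfo_by_quarter table from the question, shows one way to check how much of the table is currently marked all-visible and whether an index-only scan still has to fetch heap rows.

```sql
-- Refresh the visibility map and the planner statistics.
VACUUM (ANALYZE) dpretailer.dfo_by_quarter;

-- relallvisible close to relpages means most heap pages need no visibility check.
SELECT relpages, relallvisible
  FROM pg_class
 WHERE oid = 'dpretailer.dfo_by_quarter'::regclass;

-- If the plan shows an Index Only Scan, its "Heap Fetches" line reports
-- how many rows still required a visit to the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
  FROM dpretailer.dfo_by_quarter
 WHERE release_key = 2;
```

On an insert-only table such as the one described in the question, the visibility map tends to stay accurate after a VACUUM, which is the favourable case for index-only scans.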

【Comments】: