Postgresql: multicolumn indexes vs single column index (answers)

【Question title】: Postgresql: multicolumn indexes vs single column index
【Posted】: 2018-04-11 13:48:03
【Question】:

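I have a table that will grow by 10 million rows per year.

The table has 10 columns; call them c1, c2, c3, ..., c10.

I will use WHERE clauses on possibly 8 of them.

More specifically: every time I query the table, there will always be a WHERE clause on column c10 (it is a date, and I may search by equality or by range).

The other 7 searchable columns follow no fixed pattern. I may search on:

c10, c1, c2, c5
c10, c5
c10, c3
c10, c2, c6
c10, c2, c3, c5, c6

...and every other possible combination.

So, in the WHERE clause, c10 will always be present, while the others may appear in any combination (or not at all).

In this scenario, which indexing strategy would improve performance? I think the right approach is to create one index per column. Would multicolumn indexes improve performance?

As far as I know, a multicolumn index on (c1, c2, c3) only helps queries that use c1, c2, c3 or c1, c2 or c1, in that order. But as I said, the only thing I can assume in my scenario is that c10 will always appear in the WHERE clause (it can also be the first clause, if that helps).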

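【Comments】:

Postgres is quite efficient at combining multiple indexes in a single query. So you definitely want one on c10; the others depend entirely on your queries. If the condition on c10 already reduces the number of rows substantially, additional indexes might not help. This is hard to say without seeing the real queries, the real table definition, and execution plans with real-world data.
In the best case, the condition on c10 will reduce the number of rows to about 100k. I believe that 100k+ rows need an additional index in queries where c10 is not the only clause.
Does the table have a primary key? Does any set {cx, cy, ...} form a candidate key? Are the cx columns all truly independent? What is the cardinality of the cx columns and of their combinations?
The primary key is a simple auto-increment id. No set forms a candidate key. The cardinality of the cx columns (if you mean the columns used in the WHERE clause) ranges from 1 to 8.

Tags: sql postgresql indexing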

【Solution 1】:

To answer the question of which kind of index to use, we can build a simple test. First, we create a database, tables, and indexes.

CREATE DATABASE index_test;
-- In psql, switch to the new database before creating the tables.
\c index_test

CREATE TABLE single_column(a int, b int, c int);
CREATE TABLE multi_column(a int, b int, c int);

CREATE INDEX single_column_a_idx ON single_column (a);
CREATE INDEX single_column_b_idx ON single_column (b);
CREATE INDEX single_column_c_idx ON single_column (c);

CREATE INDEX multi_column_idx ON multi_column (a, b, c);

Fill the tables with random data.

-- this function will be used for random number generation
CREATE OR REPLACE FUNCTION random_in_range(INTEGER, INTEGER) RETURNS INTEGER AS $$
SELECT floor(($1 + ($2 - $1 + 1) * random()))::INTEGER;
$$ LANGUAGE SQL;

INSERT INTO single_column(a, b, c)
SELECT random_in_range(1, 100),
    random_in_range(1, 100),
    random_in_range(1, 100)
FROM generate_series(1, 1000000);

INSERT INTO multi_column(a, b, c)
SELECT random_in_range(1, 100),
    random_in_range(1, 100),
    random_in_range(1, 100)
FROM generate_series(1, 1000000);
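
Several of the plans in this answer estimate rows=367608 for predicates that actually return about 20,000 rows, which suggests the planner had no fresh statistics when those tests ran. A minimal sketch, assuming you want estimates based on the loaded data, is to collect statistics before running the tests. This step is not part of the original answer.

```sql
-- Assumption: run once after the bulk INSERTs so the planner
-- sees the real data distribution instead of default estimates.
ANALYZE single_column;
ANALYZE multi_column;
```

With statistics in place the estimated row counts should sit much closer to the actual ones, which can change which plan the planner picks.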

Run the tests.

EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3;
EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3;
EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3;

EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3;

EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3 AND b > 10 AND c <= 11;

Results

index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3;
                                                               QUERY PLAN                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on single_column  (cost=3925.39..13926.49 rows=367608 width=12) (actual time=5.802..44.904 rows=20070 loops=1)
   Recheck Cond: (a < 3)
   Heap Blocks: exact=5269
   ->  Bitmap Index Scan on single_column_a_idx  (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.018..4.019 rows=20070 loops=1)
         Index Cond: (a < 3)
 Planning Time: 0.325 ms
 Execution Time: 46.589 ms
(7 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3;
                                                               QUERY PLAN                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on single_column  (cost=3925.39..13926.49 rows=367608 width=12) (actual time=6.630..26.814 rows=19902 loops=1)
   Recheck Cond: (b < 3)
   Heap Blocks: exact=5296
   ->  Bitmap Index Scan on single_column_b_idx  (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.852..4.853 rows=19902 loops=1)
         Index Cond: (b < 3)
 Planning Time: 0.270 ms
 Execution Time: 28.762 ms
(7 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3;
                                                               QUERY PLAN                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on single_column  (cost=3925.39..13926.49 rows=367608 width=12) (actual time=5.896..25.304 rows=19946 loops=1)
   Recheck Cond: (c < 3)
   Heap Blocks: exact=5274
   ->  Bitmap Index Scan on single_column_c_idx  (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.125..4.126 rows=19946 loops=1)
         Index Cond: (c < 3)
 Planning Time: 0.270 ms
 Execution Time: 27.136 ms
(7 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3;
                                                             QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on multi_column  (cost=8569.39..18570.49 rows=367608 width=12) (actual time=7.760..67.173 rows=19938 loops=1)
   Recheck Cond: (a < 3)
   Heap Blocks: exact=5267
   ->  Bitmap Index Scan on multi_column_idx  (cost=0.00..8477.49 rows=367608 width=0) (actual time=6.008..6.008 rows=19938 loops=1)
         Index Cond: (a < 3)
 Planning Time: 0.564 ms
 Execution Time: 68.630 ms
(7 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3;
                                                           QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..13481.03 rows=18667 width=12) (actual time=1.451..135.028 rows=19897 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on multi_column  (cost=0.00..10614.33 rows=7778 width=12) (actual time=0.038..61.993 rows=6632 loops=3)
         Filter: (b < 3)
         Rows Removed by Filter: 326701
 Planning Time: 1.123 ms
 Execution Time: 136.128 ms
(8 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3;
                                                           QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..13627.63 rows=20133 width=12) (actual time=0.957..135.119 rows=19860 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on multi_column  (cost=0.00..10614.33 rows=8389 width=12) (actual time=0.035..66.760 rows=6620 loops=3)
         Filter: (c < 3)
         Rows Removed by Filter: 326713
 Planning Time: 0.225 ms
 Execution Time: 136.239 ms
(8 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11;
                                                                   QUERY PLAN                                           
-------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on single_column  (cost=1424.66..5716.83 rows=2110 width=12) (actual time=21.694..26.123 rows=2000 loops=1)
   Recheck Cond: ((a < 3) AND (c <= 11))
   Filter: (b > 10)
   Rows Removed by Filter: 230
   Heap Blocks: exact=1833
   ->  BitmapAnd  (cost=1424.66..1424.66 rows=2338 width=0) (actual time=20.981..20.983 rows=0 loops=1)
         ->  Bitmap Index Scan on single_column_a_idx  (cost=0.00..230.43 rows=21067 width=0) (actual time=3.932..3.932 rows=20070 loops=1)
               Index Cond: (a < 3)
         ->  Bitmap Index Scan on single_column_c_idx  (cost=0.00..1192.92 rows=111000 width=0) (actual time=16.080..16.080 rows=110276 loops=1)
               Index Cond: (c <= 11)
 Planning Time: 1.812 ms
 Execution Time: 26.742 ms
(12 rows)


index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3 AND b > 10 AND c <= 11;
                                                                 QUERY PLAN                                             
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using multi_column_idx on multi_column  (cost=0.42..642.38 rows=2071 width=12) (actual time=0.329..2.086 rows=1953 loops=1)
   Index Cond: ((a < 3) AND (b > 10) AND (c <= 11))
   Heap Fetches: 0
 Planning Time: 0.176 ms
 Execution Time: 2.165 ms
(5 rows)

Conclusion

In these tests, every query against the single_column table used an index.

EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3; -- index used

EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11; -- index used

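For a single-column WHERE against the multi_column table, the planner used the index only when the queried column was the first column in the index definition.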

EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3; -- index not used
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3; -- index not used
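
To check whether the planner is able to use multi_column_idx for a non-leading column at all, rather than merely preferring a sequential scan, one option is to discourage sequential scans for the session. This is a diagnostic sketch only and not something from the original answer; the resulting plan depends on the PostgreSQL version and on table statistics, so no output is shown.

```sql
-- Diagnostic only: make sequential scans look very expensive to the planner.
SET enable_seqscan = off;

EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3;

-- Restore the default planner behaviour for this session.
RESET enable_seqscan;
```

If an index scan does appear here, it typically has to read most of the index, which would explain why the planner chose the parallel sequential scan in the results above.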

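Although the single_column table can use its indexes for the multi-column WHERE, the multi_column table was faster (about 2.2 ms versus 26.7 ms in the results above).
Although the multi_column table can use its index for a single-column WHERE on a, the single_column table was faster (about 46.6 ms versus 68.6 ms in the results above).

【Comments】: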

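【Solution 2】:

Multicolumn indexes are very versatile, more so than single-column indexes. A multicolumn index on (c1, c2) also works for queries where an index on (c1) would work.

Assuming all of your conditions are equality conditions, the order of the columns in the index does not matter for a query that constrains every indexed column. For the cases you describe, the following indexes would fully optimize all of the queries: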

(c10, c5, c1, c2)
(c10, c3)
(c10, c2, c6)
(c10, c2, c3, c5, c6)
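
As a sketch, the index list above could be written as the following statements. The table name t and the index names are hypothetical, since the question does not give them; the columns are the ones the question names, with the last index matching its c10, c2, c3, c5, c6 combination.

```sql
-- Hypothetical table name "t"; column names are taken from the question.
-- The first index puts c5 right after c10, so it also serves the (c10, c5) query.
CREATE INDEX t_c10_c5_c1_c2_idx    ON t (c10, c5, c1, c2);
CREATE INDEX t_c10_c3_idx          ON t (c10, c3);
CREATE INDEX t_c10_c2_c6_idx       ON t (c10, c2, c6);
CREATE INDEX t_c10_c2_c3_c5_c6_idx ON t (c10, c2, c3, c5, c6);
```

Every index leads with c10 because that column appears in every query.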

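Whether you need all of these indexes is another matter. It depends on the selectivity of the columns (that is, the proportion of the table's rows that they select). Filtering a few dozen rows by their fetched values is not particularly expensive. So if the condition on c10 returns only a handful of rows, including the other columns in the index may not bring a significant additional performance improvement.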

In addition, more indexes mean that inserts, updates, and deletes take more time. That also affects your indexing strategy.
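
One way to weigh that cost is to look at how large each index is and how often it is actually scanned. The query below is a sketch that reads PostgreSQL's pg_stat_user_indexes statistics view; the table name t is hypothetical.

```sql
-- Hypothetical table name "t".
-- idx_scan counts how many index scans have been started on each index.
SELECT indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 't'
ORDER BY idx_scan;
```

An index that stays at or near zero scans under a realistic workload is a candidate for removal, since it still has to be maintained on every write.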

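Partitioning (as described in another answer) can also be useful. Whether it suits your case depends on what the data and the queries look like.

【Comments】:

My example clauses were incomplete. I can run queries with a clause on c10 (a range), possibly followed by any column, in any order.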

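【Solution 3】:

I would strongly recommend the following strategy:

Create single-column indexes on the other columns;
Partition on c10. Since it is a date, you can partition by range, with yearly or monthly partitions.

I have seen partitioning bring huge performance gains, especially on large tables where one or more columns are always used in the WHERE clause.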
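
A minimal sketch of that strategy using declarative partitioning follows. The table name t, the column types, and the yearly boundaries are assumptions for illustration; declarative partitioning needs PostgreSQL 10 or later, and creating an index on the partitioned parent needs PostgreSQL 11 or later.

```sql
-- Hypothetical table and column types; remaining columns omitted for brevity.
CREATE TABLE t (
    c1  int,
    c2  int,
    c3  int,
    c5  int,
    c6  int,
    c10 date NOT NULL
) PARTITION BY RANGE (c10);

-- Yearly partitions; the upper bound is exclusive.
CREATE TABLE t_2018 PARTITION OF t
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE t_2019 PARTITION OF t
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- Single-column indexes on the other searchable columns.
-- Created on the parent, they are propagated to each partition.
CREATE INDEX t_c1_idx ON t (c1);
CREATE INDEX t_c2_idx ON t (c2);
CREATE INDEX t_c3_idx ON t (c3);
CREATE INDEX t_c5_idx ON t (c5);
CREATE INDEX t_c6_idx ON t (c6);
```

With this layout, a query that filters on c10 can skip partitions whose date range cannot match, and the planner can then combine the per-partition indexes on the remaining columns.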

【Comments】: