【发布时间】:2019-03-06 17:50:08
【问题描述】:
我在 Amazon RDS 上有一个 PostgreSQL 10.6 数据库。我的桌子是这样的:
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country varchar(100) NOT NULL,
product_group varchar(100) NOT NULL,
distribution_type varchar(100) NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
category varchar(100) NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
data_status bpchar(1) NOT NULL,
panel_market_units numeric(38,6) NOT NULL,
panel_market_sales_value_eur numeric(38,6) NOT NULL,
panel_market_sales_value_usd numeric(38,6) NOT NULL,
panel_market_sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);
我了解主键意味着唯一索引
如果我只是问在过滤不存在的数据时我有多少行(release_key = 1 不返回任何内容),我可以看到它使用索引
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1
Aggregate (cost=6.32..6.33 rows=1 width=8)
-> Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter (cost=0.55..6.32 rows=1 width=0)
Index Cond: (release_key = 1)
但是如果我对返回数据的值运行相同的查询,它会扫描表,这必然会更昂贵......
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 2
Finalize Aggregate (cost=47611.07..47611.08 rows=1 width=8)
-> Gather (cost=47610.86..47611.07 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=46610.86..46610.87 rows=1 width=8)
-> Parallel Seq Scan on dfo_by_quarter (cost=0.00..46307.29 rows=121428 width=0)
Filter: (release_key = 2)
我知道在没有数据时使用索引是有意义的,并且由表上的统计数据驱动(我在测试之前运行了 ANALYZE)
但是如果有数据为什么不使用我的索引呢?
当然,扫描索引的一部分(因为 release_key 是第一列)肯定比扫描整个表更快???
我一定是错过了什么……?
2019-03-07 更新
感谢您的 cmets,它们非常有用。
这个简单的查询只是我试图理解为什么不使用索引...
但我应该更了解(我是 postgresql 新手,但在 SQL Server 方面有多年经验),正如您所评论的那样,事实并非如此是有道理的。
- 选择性不好,因为我的条件只过滤了大约 20% 的行
- 糟糕的桌子设计(太胖,我们知道并正在解决这个问题)
- 索引未“覆盖”查询等...
所以,如果可以的话,让我“稍微”改变一下我的问题......
我们的表格将按事实/维度进行规范化(不再有 varchars 在错误的位置)。
我们只进行插入,从不更新,删除很少,我们可以忽略它。
表不会很大(几千万行顺序)。
我们的查询将始终指定确切的 release_key 值。
我们的新版本表格如下所示
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country_key int2 NOT NULL,
product_group_key int2 NOT NULL,
distribution_type_key int2 NOT NULL,
category_key int2 NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);
考虑到这一点,在 SQL Server 环境中,我可以通过使用“集群”主键(对整个表进行排序)或在主键上使用 INCLUDE 选项为所需的其他列设置索引来解决此问题涵盖查询(单位、值等)。
问题 1)
在 postgresql 中,是否有与 SQL Server 聚集索引等效的功能?一种对整个表进行实际排序的方法?我想这可能很困难,因为 postgresql 不会“就地”进行更新,因此它可能会使排序变得昂贵......
或者,有没有办法创建类似于 SQL Server Index WITH INCLUDE(units, values) 的东西?
更新:我遇到了 SQL CLUSTER 命令,这是我认为最接近的命令。 很适合我们
问题 2
下面的查询
EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
"year",
"main"."product_group_key" AS "productgroupkey",
SUM("main"."units") AS "salesunits",
SUM("main"."sales_value_eur") AS "salesvalue",
SUM("sales_value_eur")/SUM("units") AS "asp"
FROM "milly"."dfo_by_quarter" AS "main"
WHERE
"release_key" = 17 AND
"main"."year" >= 2010
GROUP BY
"year",
"main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
"year" AS date,
SUM("salesunits") AS "salesunits",
SUM("salesvalue") AS "salesvalue",
SUM("salesvalue")/SUM("salesunits") AS "asp",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
"rank_query"
GROUP BY
"year",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
FROM BeforeLookup AS BL
INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;
我明白了
Hash Join (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
Hash Cond: (bl.productgroupkey = pg.product_group_key)
Buffers: shared hit=483 read=22719
CTE rank_query
-> WindowAgg (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> Sort (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
Sort Key: main.year, (sum(main.units)) DESC
Sort Method: quicksort Memory: 616kB
Buffers: shared hit=480 read=22719
-> Finalize HashAggregate (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Gather (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=480 read=22719
-> Partial HashAggregate (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Parallel Seq Scan on dfo_by_quarter main (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
Filter: ((year >= 2010) AND (release_key = 17))
Rows Removed by Filter: 546602
Buffers: shared hit=480 read=22719
CTE beforelookup
-> HashAggregate (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
Buffers: shared hit=480 read=22719
-> CTE Scan on rank_query (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> CTE Scan on beforelookup bl (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
Buffers: shared hit=480 read=22719
-> Hash (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 24kB
Buffers: shared hit=3
-> Seq Scan on dim_dfo_product_group pg (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms
有什么想到的吗?
如果我读得正确,这意味着到目前为止我最大的成本是表的初始扫描......但我没有设法让它使用索引......
我创建了一个索引,希望能有所帮助,但它被忽略了......
CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);
ANALYZE milly.dfo_by_quarter;
我也尝试过对表格进行聚类,但也没有明显效果
CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)
ANALYZE milly.dfo_by_quarter;
非常感谢
埃里克
【问题讨论】:
-
你的表有多少行?有多少
release_key = 2。表上是否存在(并发)写入负载?在VACUUM dpretailer.dfo_by_quarter之后是否看到仅索引扫描? -
另外,你能用
EXPLAIN (ANALYZE, BUFFERS)的输出替换EXPLAINs吗?这将为我们提供具体的时间安排和共享缓存命中/未命中。
标签: postgresql indexing amazon-rds postgresql-performance