[Posted]: 2021-07-03 05:50:22
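[Question]:
I have a very large table that I need to partition by date (in my case via a trigger). The problem is that I can fetch data very quickly with a timestamp filter, but I cannot get good performance when fetching a specific row by its primary key.
The parent table is: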
CREATE TABLE parent_table (
    guid uuid NOT NULL DEFAULT uuid_generate_v4(), -- This will be the primary key
    tm timestamptz NOT NULL, -- Timestamp on which the partitions are based
    value int4 NOT NULL DEFAULT -1, -- Just a value
    CONSTRAINT z_detections_pk PRIMARY KEY (guid)
);
CREATE INDEX parent_table_tm_idx ON parent_table USING btree (tm DESC);
I wrote a simple trigger function that creates a new partition whenever a row with a new date arrives:
CREATE OR REPLACE FUNCTION parent_table_insert_fn()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
DECLARE
    schema_name varchar(255) := 'public';
    table_master varchar(255) := 'parent_table';
    table_part varchar(255) := '';
    table_date_underscore varchar(255) := '';
    constraint_tm_start timestamp with time zone;
    constraint_tm_end timestamp with time zone;
BEGIN
    table_part := table_master || '_' || to_char(timezone('utc', new.tm), 'YYYY_MM_DD');
    table_date_underscore := '' || to_char(timezone('utc', new.tm), 'YYYY_MM_DD');
    PERFORM 1
    FROM information_schema.tables
    WHERE table_schema = schema_name
      AND table_name = table_part
    LIMIT 1;
    IF NOT FOUND THEN
        constraint_tm_start := to_char(timezone('utc', new.tm), 'YYYY-MM-DD')::timestamp at time zone 'utc';
        constraint_tm_end := constraint_tm_start + interval '1 day';
        EXECUTE '
            CREATE TABLE ' || schema_name || '.' || table_part || ' (
                CONSTRAINT parent_table_' || table_date_underscore || '_pk PRIMARY KEY (guid),
                CONSTRAINT parent_table_' || table_date_underscore || '_ck CHECK ( tm >= ' || QUOTE_LITERAL(constraint_tm_start) || ' and tm < ' || QUOTE_LITERAL(constraint_tm_end) || ' )
            ) INHERITS (' || schema_name || '.' || table_master || ');
            CREATE INDEX parent_table_' || table_date_underscore || '_tidx ON ' || schema_name || '.' || table_part || ' USING btree (tm desc);
        ';
    END IF;
    EXECUTE '
        INSERT INTO ' || schema_name || '.' || table_part || '
        SELECT ( (' || QUOTE_LITERAL(NEW) || ')::' || schema_name || '.' || TG_RELNAME || ' ).*;';
    RETURN NULL;
END;
$function$
;
Enable the trigger on the parent table:
create trigger parent_table_insert_fn_trigger before insert
on parent_table for each row execute function parent_table_insert_fn();
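As a side note on the routing step, the string-concatenated INSERT at the end of the function can probably be expressed with format() and a USING parameter, which avoids serializing NEW to text and casting it back. The following is only a minimal sketch: the function name parent_table_route_fn is hypothetical, and it assumes the day's partition already exists (the partition-creation branch is left out).

```sql
-- Hypothetical variant of the routing step only; partition creation is omitted.
CREATE OR REPLACE FUNCTION parent_table_route_fn()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
DECLARE
    table_part text;
BEGIN
    -- Same naming scheme as the original function: <parent>_YYYY_MM_DD in UTC.
    table_part := TG_TABLE_NAME || '_' || to_char(timezone('utc', NEW.tm), 'YYYY_MM_DD');

    -- %I quotes identifiers safely; $1 carries the row as a typed parameter.
    EXECUTE format('INSERT INTO %I.%I SELECT ($1).*', TG_TABLE_SCHEMA, table_part)
    USING NEW;

    -- Returning NULL suppresses the insert into the parent table itself.
    RETURN NULL;
END;
$function$;
```

TG_TABLE_NAME and TG_TABLE_SCHEMA are the built-in trigger variables; TG_TABLE_NAME is the documented replacement for the deprecated TG_RELNAME.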
And insert some data into it:
insert into parent_table(guid, tm, value)
values
('1f4835c0-2b22-4cfc-ab3c-940af679ace6', '2021-04-06 14:00:00+03:00', 1),
('5ca37d57-e79e-4e1f-ace7-91eb671f3a82', '2021-04-07 15:30:00+03:00', 2),
('b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808', '2021-04-07 17:10:00+03:00', 3),
('ad69cd35-5b20-466f-9d5c-61fa5d41bc5f', '2021-04-08 16:50:00+03:00', 66),
('bb0ec87a-72bb-438e-8f4c-2cdc3ae7d525', '2021-03-21 19:00:00+03:00', -10);
After these operations I have the parent table plus 4 partitions:
parent_table
parent_table_2021_03_21
parent_table_2021_04_06
parent_table_2021_04_07
parent_table_2021_04_08
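One way to double-check this list, and to see where each row was actually stored, is to query the catalog and the tableoid system column. This is a small sketch that assumes the objects were created exactly as shown above:

```sql
-- List the child tables attached to parent_table through inheritance.
SELECT c.relname AS partition_name
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'parent_table'::regclass
ORDER BY c.relname;

-- Show which physical table each row landed in.
SELECT tableoid::regclass AS stored_in, guid, tm, value
FROM parent_table
ORDER BY tm;
```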
Check that the index is used for a timestamp filter:
explain analyze
select * from parent_table where tm > '2021-04-07 10:00:00+03:00' and tm <= '2021-04-07 16:30:00+03:00';
> > >
Append (cost=0.00..14.43 rows=8 width=28) (actual time=0.017..0.020 rows=1 loops=1)
-> Seq Scan on parent_table parent_table_1 (cost=0.00..0.00 rows=1 width=28) (actual time=0.002..0.002 rows=0 loops=1)
Filter: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
-> Bitmap Heap Scan on parent_table_2021_04_07 parent_table_2 (cost=4.22..14.39 rows=7 width=28) (actual time=0.013..0.015 rows=1 loops=1)
Recheck Cond: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
Heap Blocks: exact=1
-> Bitmap Index Scan on parent_table_2021_04_07_tidx (cost=0.00..4.22 rows=7 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: ((tm > '2021-04-07 10:00:00+03'::timestamp with time zone) AND (tm <= '2021-04-07 16:30:00+03'::timestamp with time zone))
Planning Time: 0.381 ms
Execution Time: 0.053 ms
This is fine and works as I expected.
But selecting by a specific primary key gives me the following EXPLAIN ANALYZE output:
explain analyze
select * from parent_table where guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808';
> > >
Append (cost=0.00..32.70 rows=5 width=28) (actual time=0.021..0.035 rows=1 loops=1)
-> Seq Scan on parent_table parent_table_1 (cost=0.00..0.00 rows=1 width=28) (actual time=0.003..0.004 rows=0 loops=1)
Filter: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_06_pk on parent_table_2021_04_06 parent_table_2 (cost=0.15..8.17 rows=1 width=28) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_07_pk on parent_table_2021_04_07 parent_table_3 (cost=0.15..8.17 rows=1 width=28) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_04_08_pk on parent_table_2021_04_08 parent_table_4 (cost=0.15..8.17 rows=1 width=28) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
-> Index Scan using parent_table_2021_03_21_pk on parent_table_2021_03_21 parent_table_5 (cost=0.15..8.17 rows=1 width=28) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: (guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'::uuid)
Planning Time: 0.345 ms
Execution Time: 0.076 ms
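And this query performs poorly (I guess?), especially on very large partitioned tables, e.g. with 10M+ rows per partition.
So my question is: what should I do to avoid scanning every partition for a simple primary key lookup?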
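For comparison, here is a sketch of the same lookup with a range on tm added. It assumes the caller knows the (UTC) day the row belongs to, which may not hold in practice. With the CHECK constraints created by the trigger, and constraint_exclusion left at its default of partition, the planner should be able to skip the children whose constraint contradicts the tm range:

```sql
-- constraint_exclusion defaults to 'partition', which also covers inheritance children.
SHOW constraint_exclusion;

-- Assumption: the application knows the row belongs to 2021-04-07 (UTC).
EXPLAIN ANALYZE
SELECT *
FROM parent_table
WHERE guid = 'b57bfbf6-7ed0-4dde-a40b-9fa2e6f24808'
  AND tm >= '2021-04-07 00:00:00+00'
  AND tm <  '2021-04-08 00:00:00+00';
```

Without some predicate on tm, nothing tells the planner which child can contain a given guid, so every child has to be probed.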
Note: I am using PostgreSQL 13.1
Update 2021-04-07 15:22+03:00: on a semi-production table I get these results:
- Timestamp filter
Append (cost=0.00..809.35 rows=16616 width=32) (actual time=0.037..5.612 rows=16865 loops=1)
-> Seq Scan on wifi_logs t_1 (cost=0.00..0.00 rows=1 width=32) (actual time=0.010..0.011 rows=0 loops=1)
Filter: ((tm >= '2020-04-07 14:00:00+03'::timestamp with time zone) AND (tm <= '2020-04-07 17:00:00+03'::timestamp with time zone))
-> Index Scan using wifi_logs_tm_idx_2020_04_07 on wifi_logs_2020_04_07 t_2 (cost=0.29..726.27 rows=16615 width=32) (actual time=0.026..4.655 rows=16865 loops=1)
Index Cond: ((tm >= '2020-04-07 14:00:00+03'::timestamp with time zone) AND (tm <= '2020-04-07 17:00:00+03'::timestamp with time zone))
Planning Time: 14.869 ms
Execution Time: 6.151 ms
- GUID (primary key filter)
-> Seq Scan on wifi_logs t_1 (cost=0.00..0.00 rows=1 width=32) (actual time=0.015..0.016 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
-> Seq Scan on wifi_logs_2014_12_04 t_4 (cost=0.00..1.01 rows=1 width=32) (actual time=0.006..0.006 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Rows Removed by Filter: 1
--
-- TONS OF PARTITION TABLE SCANS
---
-> Index Scan using wifi_logs_2021_03_18_pk on wifi_logs_2021_03_18 t_387 (cost=0.42..8.44 rows=1 width=32) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
-> Seq Scan on wifi_logs_1970_01_01 t_388 (cost=0.00..3.60 rows=1 width=32) (actual time=0.020..0.020 rows=0 loops=1)
Filter: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Rows Removed by Filter: 119
-> Index Scan using wifi_logs_2021_03_19_pk on wifi_logs_2021_03_19 t_389 (cost=0.42..8.44 rows=1 width=32) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
--
-- ANOTHER TONS OF PARTITION TABLE SCANS
---
-> Index Scan using wifi_logs_2021_04_07_pk on wifi_logs_2021_04_07 t_408 (cost=0.42..8.44 rows=1 width=32) (actual time=0.010..0.010 rows=0 loops=1)
Index Cond: (guid = '78bc5537-4f2f-4e83-8abd-4241ac3f9f27'::uuid)
Planning Time: 97.662 ms
Execution Time: 36.756 ms
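[Comments]:
-
You should use native partitioning in Postgres, which is much faster than inheritance-based partitioning. But either way: if your query does not include the partition key, it will always be slower than doing the same thing on a non-partitioned table.
-
Execution time: 0.076 ms. What kind of performance do you want?
-
@FrankHeikens If I have 1500+ partitions, the actual query gets really slow (not as fast as querying a single big table without those partitions). Upd: and 0.076 ms is still slower than the 0.053 ms for the more complex condition (the timestamp filter).
-
@a_horse_with_no_name I will update the question
-
@a_horse_with_no_name Question updated: I just added sample output from a semi-production database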
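The native (declarative) partitioning suggested in the comments could look roughly like the sketch below. The table names are hypothetical, and only one daily partition is shown. Note that a primary key on a declaratively partitioned table has to include the partition key, so it becomes (guid, tm) rather than (guid) alone.

```sql
-- uuid_generate_v4() comes from the uuid-ossp extension, as in the original table.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Hypothetical name; the primary key must contain the partition key column tm.
CREATE TABLE parent_table_native (
    guid  uuid        NOT NULL DEFAULT uuid_generate_v4(),
    tm    timestamptz NOT NULL,
    value int4        NOT NULL DEFAULT -1,
    PRIMARY KEY (guid, tm)
) PARTITION BY RANGE (tm);

-- One daily partition; the upper bound is exclusive.
CREATE TABLE parent_table_native_2021_04_07
    PARTITION OF parent_table_native
    FOR VALUES FROM ('2021-04-07 00:00:00+00') TO ('2021-04-08 00:00:00+00');

-- An index created on the partitioned table is created on each partition as well.
CREATE INDEX ON parent_table_native (tm DESC);
```

Rows inserted into parent_table_native are routed to the matching partition without a trigger. A lookup by guid alone still has to probe every partition, as the comment points out, unless the query also restricts tm.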
Tags: sql postgresql indexing plpgsql