【问题标题】:Postgresql OVERLAPS performance on big datasetsPostgresql 在大数据集上的 OVERLAPS 性能
【发布时间】:2021-06-26 14:30:50
【问题描述】:

我正在努力使用 OVERLAPS 构建高性能查询。

表定义:

CREATE TABLE date_range_table
(
  id int NOT NULL,
  item_id int NOT NULL,
  item1_id int NOT NULL,
  item2_id int NOT NULL,
  item3_id int NOT NULL,
  item4_id int NULL,
  item5_id int NULL,
  date_from date NOT NULL,
  date_to date NOT NULL
  CONSTRAINT pk_date_range_table PRIMARY KEY (id)
  -- Some other constraints
)

-- Unique constraints(partial)

CREATE UNIQUE INDEX ix_date_range_table_items_null
ON date_range_table USING btree
(item_id, item1_id, item2_id, item3_id, date_from, date_to)
WHERE item4_id IS NULL AND item5_id IS NULL;

CREATE UNIQUE INDEX ix_date_range_table_items_not_null
ON date_range_table USING btree
(item_id, item1_id, item2_id, item3_id, item4_id, item5_id, date_from, date_to)
WHERE item4_id IS NOT NULL AND item5_id IS NOT NULL;

我也有与临时表 (ETL) 相同的表。

staging_date_range_table

现在我需要检查 staging_date_range_table 与 date_range_table 中可能存在的任何跨接(OVERLAPS)。

到目前为止我有什么:

SELECT count(t.*)
FROM staging_date_range_table t
JOIN date_range_table td
  on
      t.item_id = td.item_id
  AND t.item1_id = td.item1_id
  AND t.item2_id = td.item2_id
  AND t.item3_id = td.item3_id
  AND COALESCE(t.item4_id, 0) = COALESCE(td.item4_id, 0)
  AND COALESCE(t.item5_id, 0) = COALESCE(td.item5_id, 0)

WHERE (t.date_from, t.date_to) OVERLAPS (td.date_from, td.date_to)

工作集:

staging_date_range_table: 100k rows

date_range_table: 20mil rows

这将运行 10 小时 +

知道如何加快速度吗?

========== 更新==========

从一个较小的暂存集(COLUMN NAMES REPLACED)解释:

"Update on _2103301527_2fd34e_date_range_table t  (cost=95785.89..1923251.02 rows=1 width=3560)"
"  ->  Merge Join  (cost=95785.89..1923251.02 rows=1 width=3560)"
"        Merge Cond: ((td.item_id = t._fcd195352fb4bd386b496c42e58904bb) AND (td.item1_id = t._2c7d5721c3def81d253271f0c2065421))"
"        Join Filter: ((COALESCE(t._842c0c2670edb9fe4ede4cc9e4bac082, '0'::bigint) = COALESCE(td.item4_id, '0'::bigint)) AND (COALESCE(t._522ffbc4b23b13d84bad22e151f4c9df, '0'::bigint) = COALESCE(td.item5_id, '0'::bigint)) AND (t._9afea17b49bfb167b72276a824712179 = td.item2_id) AND (t._452088c89804e1b5d34a6d266ca6c51a = td.item3_id) AND ((t._3027783fd3d10afad84a9a15552b3445 <> td.date_from) OR (t._0dc00dd4dfdbf2864a0cdf57034916c2 <> td.date_to)) AND ""overlaps""((t._3027783fd3d10afad84a9a15552b3445)::timestamp with time zone, (t._0dc00dd4dfdbf2864a0cdf57034916c2)::timestamp with time zone, (td.date_from)::timestamp with time zone, (td.date_to)::timestamp with time zone))"
"        ->  Index Scan using index_iip_on_... on date_range_table td  (cost=0.56..1594851.20 rows=8202145 width=62)"
"        ->  Materialize  (cost=95785.34..95889.13 rows=20759 width=3550)"
"              ->  Sort  (cost=95785.34..95837.23 rows=20759 width=3550)"
"                    Sort Key: t._fcd195352fb4bd386b496c42e58904bb, t._2c7d5721c3def81d253271f0c2065421"
"                    ->  Bitmap Heap Scan on _2103301527_2fd34e_date_range_table t  (cost=1353.30..30862.77 rows=20759 width=3550)"
"                          Recheck Cond: (_rs < 100)"
"                          ->  Bitmap Index Scan on idx_2f01fb51327eb8ab144f717aad1c80487b711093f6efd7af3a  (cost=0.00..1348.11 rows=20759 width=0)"
"                                Index Cond: (_rs < 100)"

实际解释(列名已替换):

"Update on _2103301527_2fd34e_staging_date_range_table t  (cost=2329696.92..2475637.46 rows=188 width=2742) (actual time=167057.275..167057.277 rows=0 loops=1)"
"  Buffers: shared hit=417 read=370671, temp read=269876 written=270422"
"  ->  Merge Join  (cost=2329696.92..2475637.46 rows=188 width=2742) (actual time=167057.274..167057.275 rows=0 loops=1)"
"        Merge Cond: ((td.item1_id = t._9afea17b49bfb167b72276a824712179) AND (td.item_id = t._fcd195352fb4bd386b496c42e58904bb) AND ((COALESCE(td.item4_id, '0'::bigint)) = (COALESCE(t._842c0c2670edb9fe4ede4cc9e4bac082, '0'::bigint))) AND ((COALESCE(td.item5_id, '0'::bigint)) = (COALESCE(t._522ffbc4b23b13d84bad22e151f4c9df, '0'::bigint))) AND (td.item2_id = t._2c7d5721c3def81d253271f0c2065421) AND (td.item3_id = t._452088c89804e1b5d34a6d266ca6c51a))"
"        Join Filter: (((t._3027783fd3d10afad84a9a15552b3445 <> td.date_from) OR (t._0dc00dd4dfdbf2864a0cdf57034916c2 <> td.date_to)) AND ""overlaps""((t._3027783fd3d10afad84a9a15552b3445)::timestamp with time zone, (t._0dc00dd4dfdbf2864a0cdf57034916c2)::timestamp with time zone, (td.date_from)::timestamp with time zone, (td.date_to)::timestamp with time zone))"
"        Rows Removed by Join Filter: 187204665"
"        Buffers: shared hit=417 read=370671, temp read=269876 written=270422"
"        ->  Sort  (cost=1980647.14..2001152.50 rows=8202145 width=62) (actual time=19793.734..24438.091 rows=8176251 loops=1)"
"              Sort Key: td.item1_id, td.item_id, (COALESCE(td.item4_id, '0'::bigint)), (COALESCE(td.item5_id, '0'::bigint)), td.item2_id, td.item3_id"
"              Sort Method: external merge  Disk: 658152kB"
"              Buffers: shared hit=382 read=339568, temp read=264238 written=264767"
"              ->  Seq Scan on date_range_table td  (cost=0.00..421967.45 rows=8202145 width=62) (actual time=0.021..4518.051 rows=8202145 loops=1)"
"                    Buffers: shared hit=378 read=339568"
"        ->  Materialize  (cost=348994.45..349641.59 rows=129428 width=2732) (actual time=614.072..9890.352 rows=187202992 loops=1)"
"              Buffers: shared hit=35 read=31103, temp read=5638 written=5655"
"              ->  Sort  (cost=348994.45..349318.02 rows=129428 width=2732) (actual time=614.069..920.578 rows=129777 loops=1)"
"                    Sort Key: t._9afea17b49bfb167b72276a824712179, t._fcd195352fb4bd386b496c42e58904bb, (COALESCE(t._842c0c2670edb9fe4ede4cc9e4bac082, '0'::bigint)), (COALESCE(t._522ffbc4b23b13d84bad22e151f4c9df, '0'::bigint)), t._2c7d5721c3def81d253271f0c2065421, t._452088c89804e1b5d34a6d266ca6c51a"
"                    Sort Method: external merge  Disk: 36632kB"
"                    Buffers: shared hit=35 read=31103, temp read=5638 written=5655"
"                    ->  Seq Scan on _2103301527_2fd34e_staging_date_range_table t  (cost=0.00..32755.85 rows=129428 width=2732) (actual time=0.574..168.827 rows=129777 loops=1)"
"                          Filter: (_rs < 100)"
"                          Buffers: shared hit=35 read=31103"
"Planning Time: 4.976 ms"
"Execution Time: 167135.877 ms"

========== 更新==========

增加左下角的generate_series

Fiddle

【问题讨论】:

  • 很难说没有EXPLAIN (ANALYZE, BUFFERS)
  • 我建议在db-fiddle.com 设置一个示例(包括解释计划),以便人们可以快速复制和尝试事物。
  • @Dominik Fiddle 添加了
  • WHERE item5_id IS NOT NULL AND item5_id IS NOT NULL;
  • @Willem 我的建议也是在所有项目列上都有一个非唯一索引,如果您不需要它,请尝试摆脱合并并尝试 SET work_mem TO 'xGB'; 如果您的条目适合进入 RAM 例如对于排序,您将获得内存快速排序,而不是大量使用 IO。

标签: sql postgresql performance query-optimization


【解决方案1】:

您应该尝试daterange,而不是单独的“从”和“到”日期,并使用“重叠”运算符&amp;&amp;。这允许您使用 GiST 索引,并且您可能可以获得快速的嵌套循环连接。其背后的原因是合并连接中的大多数行都被OVERLAPS 条件过滤掉了。

WHERE daterange(t.date_from,  t.date_to,  '[]')
   && daterange(td.date_from, td.date_to, '[]')

建议索引:

CREATE INDEX ON date_range_table USING gist
   (daterange(date_from, date_to, '[]'));

【讨论】:

  • 有没有办法重写查询以提高性能,或者这只是野兽的本性,我必须等待查询完成?
  • 也许吧。我概述了一些您可以在我的回答中尝试的内容。
  • 我终于有时间运行一些基准测试,daterange 的索引成功了。太感谢了。赏金来袭^^
  • @Willem 出于好奇:你从这个优化中大概得到了多少好处?
  • @Dominik 最后我使用了一个 btree-gist 作为我的唯一列和日期范围。 ETL 表中的 100k。数据库中的 1000 万。从 10 小时到 6 分钟。
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 2012-08-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2013-04-20
  • 2018-12-14
相关资源
最近更新 更多