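Posted: 2018-06-21 15:16:15
Question:
I have a log table purchase_history that tracks customers' purchase history. For a given customer_id, I want to get the most recent purchase of each product, ordered by date_purchased.
The table has millions of records. My solution is very slow (20+ seconds) for customer_id values that account for a large share of the table (for example, 25% of all records for a single customer_id), while for a customer_id with only a few rows it is very fast (1 second).
Table definition: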
create table purchase_history (
    id int PRIMARY KEY,
    product_name VARCHAR(100),
    date_purchased date,
    customer_id int
);
Some dummy data:
INSERT into purchase_history VALUES
  (1, 'A', '2017-10-10', 123)
, (2, 'A', '2017-10-11', 123)
, (3, 'B', '2017-10-12', 123)
, (4, 'C', '2017-10-09', 123)
, (5, 'B', '2017-11-10', 123);
I have a multicolumn index on (customer_id, product_name, date_purchased).
The result I want to get:
5,B,2017-11-10
2,A,2017-10-11
4,C,2017-10-09
The solution I have come up with so far:
SELECT *
FROM (
    SELECT DISTINCT ON (product_name) *
    FROM purchase_history
    WHERE customer_id = 123
    ORDER BY product_name, date_purchased DESC
) t
ORDER BY date_purchased DESC;
Is there a better or faster solution?
Update: January 14, 2018
Thanks for the comments and answers so far, and sorry for the confusion. I would like to add a few details:
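- All columns are not null, including date_purchased.
- My index matches the sort order (date_purchased DESC):
  create index purchase_history_idx on purchase_history(customer_id, product_name, date_purchased DESC)
- It would be better to use a product_id that references another table, but unfortunately product_name does not exist in any other table. It is a name specified by the customer. Suppose I have a UI where customers type in what they want to buy; whatever the customer types, verbatim, is the product_name. So purchase_history tracks every "wish list" of every customer.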
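Number of records:
- The table has 20M records in total.
- customer_id=123 is our biggest customer, with 8573491 records, or 42%.
- customer_id=124 is our second biggest customer, with 3062464 records, or 15%.
Here is the explain analyze for my original distinct on solution: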
Sort (cost=2081285.86..2081607.09 rows=128492 width=106) (actual time=11771.444..12012.732 rows=623680 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 69448kB
-> Unique (cost=0.56..2061628.55 rows=128492 width=106) (actual time=0.021..11043.910 rows=623680 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=106) (actual time=0.019..8506.109 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Planning time: 0.098 ms
Execution time: 12133.664 ms
Here is the explain analyze for Erwin's CTE solution:
Sort (cost=125.62..125.87 rows=101 width=532) (actual time=30924.208..31154.908 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
CTE cte
-> Recursive Union (cost=0.56..120.23 rows=101 width=39) (actual time=0.022..29772.944 rows=623680 loops=1)
-> Limit (cost=0.56..0.80 rows=1 width=39) (actual time=0.020..0.020 rows=1 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=39) (actual time=0.019..0.019 rows=1 loops=1)
Index Cond: (customer_id = 123)
-> Nested Loop (cost=0.56..11.74 rows=10 width=39) (actual time=0.046..0.047 rows=1 loops=623680)
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.13 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=623680)
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1616900.83 rows=2828637 width=39) (actual time=0.044..0.044 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=0.024..30269.107 rows=623680 loops=1)
Planning time: 0.207 ms
Execution time: 31273.462 ms
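The CTE query itself is not reproduced in this post. A minimal sketch that is consistent with the plan above (a Recursive Union whose recursive term does a LIMIT 1 index lookup on product_name > c.product_name) might look like the following. The selected column list is an assumption, and LATERAL needs PostgreSQL 9.3 or later.

```sql
-- Sketch only: reconstructed from the plan shape, not the exact query that was run.
WITH RECURSIVE cte AS (
   (  -- anchor: latest row of the first product_name for this customer
   SELECT id, product_name, date_purchased
   FROM   purchase_history
   WHERE  customer_id = 123
   ORDER  BY product_name, date_purchased DESC
   LIMIT  1
   )
   UNION ALL
   -- recursive step: jump to the latest row of the next product_name
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT p.id, p.product_name, p.date_purchased
      FROM   purchase_history p
      WHERE  p.customer_id = 123
      AND    p.product_name > c.product_name
      ORDER  BY p.product_name, p.date_purchased DESC
      LIMIT  1
      ) l
   )
SELECT *
FROM   cte
ORDER  BY date_purchased DESC;
```

Each recursive step costs one index lookup per distinct product_name, which matches loops=623680 in the plan; this pays off only when there are few distinct values per customer.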
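Another thing that surprises me is that my query runs much slower for customer_id=124, which has far fewer records than customer_id=123 (note: it does not use an Index Scan but a Bitmap Index Scan instead, and I don't know why):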
Sort (cost=1323695.21..1323812.68 rows=46988 width=106) (actual time=85739.561..85778.735 rows=109347 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 14560kB
-> Unique (cost=1301329.65..1316845.56 rows=46988 width=106) (actual time=60443.890..85608.347 rows=109347 loops=1)
-> Sort (cost=1301329.65..1309087.61 rows=3103183 width=106) (actual time=60443.888..84727.062 rows=3062464 loops=1)
Sort Key: purchase_history.product_name, purchase_history.date_purchased
Sort Method: external merge Disk: 427240kB
-> Bitmap Heap Scan on purchase_history (cost=203634.23..606098.02 rows=3103183 width=106) (actual time=8340.662..10584.483 rows=3062464 loops=1)
Recheck Cond: (customer_id = 124)
Rows Removed by Index Recheck: 4603902
Heap Blocks: exact=41158 lossy=132301
-> Bitmap Index Scan on purchase_history_idx (cost=0.00..202858.43 rows=3103183 width=0) (actual time=8331.711..8331.711 rows=3062464 loops=1)
Index Cond: (customer_id = 124)
Planning time: 0.102 ms
Execution time: 85872.871 ms
Update: January 15, 2018
Here is the explain (analyze, buffers) that riskop asked for:
GroupAggregate (cost=0.56..683302.46 rows=128492 width=31) (actual time=0.028..5156.113 rows=623680 loops=1)
Group Key: product_name
Buffers: shared hit=1242675
-> Index Only Scan using purchase_history_idx on purchase_history (cost=0.56..639587.99 rows=8485910 width=31) (actual time=0.022..2673.661 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Heap Fetches: 0
Buffers: shared hit=1242675
Planning time: 0.079 ms
Execution time: 5272.877 ms
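Note that even though it is faster, I cannot use this query, for two reasons:
- The query does not specify an ordering, whereas my expected result set is ordered by date_purchased DESC.
- I need to include a few more columns in the result set, so I cannot just use group by.
One way to work around both issues is to use riskop's group by based query as a subquery or CTE, then add the order by and the extra columns as needed.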
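riskop's query is not shown in this post. Assuming it was a max(date_purchased) grouped by product_name (which would fit the GroupAggregate over an Index Only Scan in the plan above), a rough sketch of wrapping it as a subquery could be:

```sql
-- Sketch only: the inner query is an assumed form of riskop's group by query.
SELECT ph.*
FROM (
    SELECT product_name, max(date_purchased) AS date_purchased
    FROM   purchase_history
    WHERE  customer_id = 123
    GROUP  BY product_name
) latest
JOIN purchase_history ph
  ON  ph.customer_id    = 123
  AND ph.product_name   = latest.product_name
  AND ph.date_purchased = latest.date_purchased
ORDER BY ph.date_purchased DESC;
```

Unlike DISTINCT ON, this returns more than one row for a product if it was purchased several times on its latest date, since date_purchased is a date and not a timestamp. The join back to the table also adds heap fetches that the index-only plan above avoids.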
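Update: January 21, 2018
Taking advantage of a "loose index scan" sounds like a good idea, but unfortunately product_name is highly distributed. There are 1810440 unique product_name values and 2565179 unique combinations of product_name and customer_id: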
select count(distinct product_name) from purchase_history; -- 1810440
select count(distinct (customer_id, product_name)) from purchase_history; -- 2565179
As a result, the query that took 313 ms for riskop took 33 seconds for me:
Sort (cost=122.42..122.68 rows=101 width=532) (actual time=33509.943..33748.856 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
Buffers: shared hit=3053791 read=69706, temp read=4244 written=8484
CTE cte
-> Recursive Union (cost=0.56..117.04 rows=101 width=39) (actual time=5.886..32288.212 rows=623680 loops=1)
Buffers: shared hit=3053788 read=69706
-> Limit (cost=0.56..0.77 rows=1 width=39) (actual time=5.885..5.885 rows=1 loops=1)
Buffers: shared hit=5 read=3
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..1809076.40 rows=8543899 width=39) (actual time=5.882..5.882 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=5 read=3
-> Nested Loop (cost=0.56..11.42 rows=10 width=39) (actual time=0.050..0.051 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.10 rows=1 width=39) (actual time=0.049..0.049 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1537840.29 rows=2847966 width=39) (actual time=0.048..0.048 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
Buffers: shared hit=3053783 read=69703
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=5.889..32826.816 rows=623680 loops=1)
Buffers: shared hit=3053788 read=69706, temp written=4240
Planning time: 0.278 ms
Execution time: 33873.798 ms
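Note that riskop's run did an in-memory sort (Sort Method: quicksort Memory: 853kB), while mine did an external disk sort (Sort Method: external merge Disk: 33880kB).
If this is not a solvable problem with a relational database, I wonder whether there is any non-relational database or big-data-based solution, as long as it meets 2 requirements:
- Reasonable response time (2 seconds, for example).
- Real time, without delay.
Comments: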
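- If there are a lot of rows, it will sometimes choose a sequential scan. Can you post an explain analyze? I don't think a group by, i.e. "select product_name, date_purchased from purchase_history where customer_id = 123 group by product_name, date_purchased", will help, but it is worth a try.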
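- {product_name, date_purchased} could be a natural key (if it were unique, which it is not). The same goes for {customer_id, date_purchased}, so you end up with all three columns as the natural key (if date_purchased were unique enough... -->> it should be a timestamp).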
- So do you have an answer yet?
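- You could create a "helper" table with the columns (customer_id, product_id, last_purchase_date, id). In that table, customer_id and product_id would form the composite key. Going by your January 21 update, that table would hold about 2.5M records, far fewer than the original. You could also create an index on (customer_id, last_purchase_date) in this table. I expect queries that search on customer_id + last_purchase_date to be fast. The price is that every time a record is inserted into the 20M-row table, you have to maintain the new table and its index.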
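A rough sketch of the helper-table idea from the last comment, under several assumptions: the table and index names are hypothetical, product_name stands in for the commenter's product_id (the question has no product_id column), and the upsert relies on ON CONFLICT, which needs PostgreSQL 9.5 or later.

```sql
-- Hypothetical helper table: one row per (customer_id, product_name).
CREATE TABLE latest_purchase (
    customer_id        int          NOT NULL,
    product_name       varchar(100) NOT NULL,
    last_purchase_date date         NOT NULL,
    id                 int          NOT NULL,  -- id of the matching purchase_history row
    PRIMARY KEY (customer_id, product_name)
);

CREATE INDEX latest_purchase_customer_date_idx
    ON latest_purchase (customer_id, last_purchase_date DESC);

-- Maintenance step, run together with each insert into purchase_history.
-- The WHERE clause keeps an older purchase from overwriting a newer one.
INSERT INTO latest_purchase (customer_id, product_name, last_purchase_date, id)
VALUES (123, 'B', '2017-11-10', 5)
ON CONFLICT (customer_id, product_name) DO UPDATE
SET last_purchase_date = EXCLUDED.last_purchase_date,
    id                 = EXCLUDED.id
WHERE latest_purchase.last_purchase_date <= EXCLUDED.last_purchase_date;

-- The read query then touches one row per product instead of every purchase.
SELECT id, product_name, last_purchase_date
FROM   latest_purchase
WHERE  customer_id = 123
ORDER  BY last_purchase_date DESC;
```

The maintenance step could also live in a trigger on purchase_history; either way, writes get slower in exchange for a read that can follow the (customer_id, last_purchase_date DESC) index in order.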
Tags: sql postgresql greatest-n-per-group query-performance postgresql-performance