When I do an ORDER BY on the count value, the query takes a long time to execute

【Question title】: When I do an ORDER BY on the count value, the query takes a long time to execute
【Posted】: 2022-01-20 22:51:10
【Question description】:

Django: Performance issues with query sets using m2m

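I asked the question above before but got no answer, so I am reposting it with more detail.

When I use ORDER BY on a Count aggregate value, the indexes are not used for some reason and the query takes a long time to execute.

The videos_video_tags table has about 1.3 million rows.

The following query takes about 500-800 ms.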

SELECT "videos_tag"."id",
       "videos_tag"."name",
       COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;

Removing ORDER BY "count" DESC from this SQL statement brings it down to about 2-10 ms.

Looking at the execution plan with EXPLAIN shows that the query with ORDER BY does not use the indexes.

                                                                         QUERY PLAN                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35198.66..35198.91 rows=100 width=37) (actual time=770.355..770.376 rows=100 loops=1)
   Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
   Buffers: shared hit=6928 read=4311
   ->  Sort  (cost=35198.66..35212.53 rows=5548 width=37) (actual time=770.354..770.366 rows=100 loops=1)
         Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
         Sort Key: (count(videos_video_tags.video_id)) DESC
         Sort Method: top-N heapsort  Memory: 37kB
         Buffers: shared hit=6928 read=4311
         ->  HashAggregate  (cost=34931.14..34986.62 rows=5548 width=37) (actual time=766.050..768.090 rows=5548 loops=1)
               Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
               Group Key: videos_tag.id
               Batches: 1  Memory Usage: 977kB
               Buffers: shared hit=6928 read=4311
               ->  Hash Right Join  (cost=221.83..28246.14 rows=1337000 width=45) (actual time=2.840..497.697 rows=1337000 loops=1)
                     Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
                     Inner Unique: true
                     Hash Cond: (videos_video_tags.tag_id = videos_tag.id)
                     Buffers: shared hit=6928 read=4311
                     ->  Seq Scan on public.videos_video_tags  (cost=0.00..24512.00 rows=1337000 width=32) (actual time=0.008..109.061 rows=1337000 loops=1)
                           Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
                           Buffers: shared hit=6831 read=4311
                     ->  Hash  (cost=152.48..152.48 rows=5548 width=29) (actual time=2.795..2.796 rows=5548 loops=1)
                           Output: videos_tag.id, videos_tag.name
                           Buckets: 8192  Batches: 1  Memory Usage: 399kB
                           Buffers: shared hit=97
                           ->  Seq Scan on public.videos_tag  (cost=0.00..152.48 rows=5548 width=29) (actual time=0.008..1.048 rows=5548 loops=1)
                                 Output: videos_tag.id, videos_tag.name
                                 Buffers: shared hit=97
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.497 ms
 Execution Time: 770.812 ms
(32 rows)

Time: 772.336 ms
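
The plan above contains actual times, Output lines and Buffers lines, which is what EXPLAIN prints when the ANALYZE, VERBOSE and BUFFERS options are enabled. The exact command is not shown in the question, so the following is an assumption about how such a plan can be reproduced:

```sql
-- Assumed command; the question does not show which EXPLAIN options were used.
-- ANALYZE actually runs the query and reports real timings,
-- VERBOSE adds the "Output:" lines, BUFFERS adds the "Buffers:" lines.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT "videos_tag"."id",
       "videos_tag"."name",
       COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;
```

Because ANALYZE executes the statement, the reported Execution Time is a real measurement rather than an estimate.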

Without ORDER BY, the plan looks like this:

                                                                                         QUERY PLAN                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.71..1689.61 rows=100 width=37) (actual time=0.069..9.664 rows=100 loops=1)
   Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
   Buffers: shared hit=7761
   ->  GroupAggregate  (cost=0.71..93700.72 rows=5548 width=37) (actual time=0.069..9.647 rows=100 loops=1)
         Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
         Group Key: videos_tag.id
         Buffers: shared hit=7761
         ->  Merge Left Join  (cost=0.71..86960.24 rows=1337000 width=45) (actual time=0.060..8.222 rows=11375 loops=1)
               Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
               Merge Cond: (videos_tag.id = videos_video_tags.tag_id)
               Buffers: shared hit=7761
               ->  Index Scan using videos_tag_pkey on public.videos_tag  (cost=0.28..635.50 rows=5548 width=29) (actual time=0.011..0.066 rows=101 loops=1)
                     Output: videos_tag.id, videos_tag.name, videos_tag.is_actress, videos_tag.created_at
                     Buffers: shared hit=102
               ->  Index Scan using videos_video_tags_tag_id_2673cfc8 on public.videos_video_tags  (cost=0.43..69598.37 rows=1337000 width=32) (actual time=0.012..5.928 rows=11375 loops=1)
                     Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
                     Buffers: shared hit=7659
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.364 ms
 Execution Time: 9.734 ms
(21 rows)

Time: 10.639 ms

The indexes exist as well, so I don't think they are the problem.

public | videos_tag_name_key                                          | index | postgres | videos_tag
public | videos_tag_pkey                                              | index | postgres | videos_tag
public | videos_video_tags_pkey                                       | index | postgres | videos_video_tags
public | videos_video_tags_tag_id_2673cfc8                            | index | postgres | videos_video_tags
public | videos_video_tags_video_id_8220dbb8                          | index | postgres | videos_video_tags
public | videos_video_tags_video_id_tag_id_f8d6ba70_uniq              | index | postgres | videos_video_tags

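I have spent quite a lot of time on this problem and still cannot solve it. What do you think the cause might be?

【Comments】:

The line HashAggregate (... rows=5548 ...) (... rows=5548 ...) shows that your query produces 5548 result rows. With ORDER BY, all of them have to be sorted before the top 100 (from the LIMIT) can be returned. Without ORDER BY, an arbitrary 100 rows are returned, which is faster but useless, because you cannot tell whether they are the top 100.
So what should I do? We are implementing pagination, so we need to order by count and get the top 100 or so.
Maybe a MATERIALIZED VIEW could help, as in this answer: stackoverflow.com/a/12925639/724039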

Tags: sql postgresql

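【Solution 1】:

In the fast plan the indexes are not used for selectivity. They are used to produce rows already sorted by the key that serves both the join and the grouping. Once 100 groups have been emitted in whatever order is convenient for the system, it can stop, which happens very early.

With the ORDER BY, however, the groups cannot be ordered by count until the count of every group is known. There is no opportunity to stop early. Since early stopping was the main advantage of using the indexes, once that opportunity is gone there is no reason to use them. A hash join is probably more efficient anyway when the join has to run to completion.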

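"So what should I do? We are implementing pagination, so we need to order by count and get the top 100 or so."

Don't implement the pagination in the database. 5548 rows is not many: compute them once, send them all to the client or the application server, and let it handle the pagination. The counts will not change quickly either, so use a materialized view to store the summary and recompute it every hour or so.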
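
The materialized view suggested above could look roughly like the following. This is a minimal sketch, not something taken from the question: the view name videos_tag_counts, the column name tag_count and the index names are hypothetical, and the hourly refresh has to be scheduled by something outside this snippet (for example a cron job).

```sql
-- Hypothetical summary view: one row per tag with its video count.
-- GROUP BY on the primary key lets videos_tag.name be selected as well.
CREATE MATERIALIZED VIEW videos_tag_counts AS
SELECT t.id,
       t.name,
       COUNT(vt.video_id) AS tag_count
FROM videos_tag AS t
LEFT OUTER JOIN videos_video_tags AS vt ON t.id = vt.tag_id
GROUP BY t.id;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX videos_tag_counts_id_idx
    ON videos_tag_counts (id);

-- Optional index matching the sort order used for the top-N query.
CREATE INDEX videos_tag_counts_count_idx
    ON videos_tag_counts (tag_count DESC, id);

-- Run periodically (e.g. every hour) to recompute the counts.
-- CONCURRENTLY keeps the view readable while it is being refreshed.
REFRESH MATERIALIZED VIEW CONCURRENTLY videos_tag_counts;

-- Reading the top 100 now touches about 5548 precomputed rows
-- instead of aggregating 1.3 million join rows on every request.
SELECT id, name, tag_count
FROM videos_tag_counts
ORDER BY tag_count DESC, id
LIMIT 100;
```

The id column is added to the ORDER BY as a tie-breaker so that tags with equal counts come back in a stable order. With only a few thousand rows, the application can also fetch the whole view once and paginate in memory, as the answer recommends.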
