【Question title】:When I do an ORDER BY on the count value, the query takes a long time to execute
【Posted】:2022-01-20 22:51:10
【Question】:

I asked this earlier in "Django: Performance issues with query sets using m2m" but did not get an answer, so I am reposting it as a more detailed question.

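When I use ORDER BY together with a Count aggregate value, for some reason the index is not used and the query takes a long time to execute.

The videos_video_tags table has about 1.3 million rows.

The following query takes about 500-800 ms.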

SELECT "videos_tag"."id",
       "videos_tag"."name",
       COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;

Removing ORDER BY "count" DESC from this SQL statement brings the time down to about 2-10 ms.

Looking at the execution plan with EXPLAIN shows that the query with ORDER BY does not use the indexes.

                                                                         QUERY PLAN                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35198.66..35198.91 rows=100 width=37) (actual time=770.355..770.376 rows=100 loops=1)
   Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
   Buffers: shared hit=6928 read=4311
   ->  Sort  (cost=35198.66..35212.53 rows=5548 width=37) (actual time=770.354..770.366 rows=100 loops=1)
         Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
         Sort Key: (count(videos_video_tags.video_id)) DESC
         Sort Method: top-N heapsort  Memory: 37kB
         Buffers: shared hit=6928 read=4311
         ->  HashAggregate  (cost=34931.14..34986.62 rows=5548 width=37) (actual time=766.050..768.090 rows=5548 loops=1)
               Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
               Group Key: videos_tag.id
               Batches: 1  Memory Usage: 977kB
               Buffers: shared hit=6928 read=4311
               ->  Hash Right Join  (cost=221.83..28246.14 rows=1337000 width=45) (actual time=2.840..497.697 rows=1337000 loops=1)
                     Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
                     Inner Unique: true
                     Hash Cond: (videos_video_tags.tag_id = videos_tag.id)
                     Buffers: shared hit=6928 read=4311
                     ->  Seq Scan on public.videos_video_tags  (cost=0.00..24512.00 rows=1337000 width=32) (actual time=0.008..109.061 rows=1337000 loops=1)
                           Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
                           Buffers: shared hit=6831 read=4311
                     ->  Hash  (cost=152.48..152.48 rows=5548 width=29) (actual time=2.795..2.796 rows=5548 loops=1)
                           Output: videos_tag.id, videos_tag.name
                           Buckets: 8192  Batches: 1  Memory Usage: 399kB
                           Buffers: shared hit=97
                           ->  Seq Scan on public.videos_tag  (cost=0.00..152.48 rows=5548 width=29) (actual time=0.008..1.048 rows=5548 loops=1)
                                 Output: videos_tag.id, videos_tag.name
                                 Buffers: shared hit=97
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.497 ms
 Execution Time: 770.812 ms
(32 rows)

Time: 772.336 ms

Without ORDER BY, the plan looks like this:

                                                                                         QUERY PLAN                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.71..1689.61 rows=100 width=37) (actual time=0.069..9.664 rows=100 loops=1)
   Output: videos_tag.id, videos_tag.name, (count(videos_video_tags.video_id))
   Buffers: shared hit=7761
   ->  GroupAggregate  (cost=0.71..93700.72 rows=5548 width=37) (actual time=0.069..9.647 rows=100 loops=1)
         Output: videos_tag.id, videos_tag.name, count(videos_video_tags.video_id)
         Group Key: videos_tag.id
         Buffers: shared hit=7761
         ->  Merge Left Join  (cost=0.71..86960.24 rows=1337000 width=45) (actual time=0.060..8.222 rows=11375 loops=1)
               Output: videos_tag.id, videos_tag.name, videos_video_tags.video_id
               Merge Cond: (videos_tag.id = videos_video_tags.tag_id)
               Buffers: shared hit=7761
               ->  Index Scan using videos_tag_pkey on public.videos_tag  (cost=0.28..635.50 rows=5548 width=29) (actual time=0.011..0.066 rows=101 loops=1)
                     Output: videos_tag.id, videos_tag.name, videos_tag.is_actress, videos_tag.created_at
                     Buffers: shared hit=102
               ->  Index Scan using videos_video_tags_tag_id_2673cfc8 on public.videos_video_tags  (cost=0.43..69598.37 rows=1337000 width=32) (actual time=0.012..5.928 rows=11375 loops=1)
                     Output: videos_video_tags.id, videos_video_tags.video_id, videos_video_tags.tag_id
                     Buffers: shared hit=7659
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.364 ms
 Execution Time: 9.734 ms
(21 rows)

Time: 10.639 ms
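
For anyone who wants to reproduce plans like the two above: the Output and Buffers lines and the "actual time" figures suggest the VERBOSE, BUFFERS and ANALYZE options were enabled. The exact command is not given in the post, so the following is an assumption about how it was run, not a quote.

```sql
-- Assumed invocation; the original post does not show the EXPLAIN command itself.
-- ANALYZE actually executes the query, so run it against a safe environment.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT "videos_tag"."id",
       "videos_tag"."name",
       COUNT("videos_video_tags"."video_id") AS "count"
FROM "videos_tag"
LEFT OUTER JOIN "videos_video_tags" ON ("videos_tag"."id" = "videos_video_tags"."tag_id")
GROUP BY "videos_tag"."id"
ORDER BY "count" DESC
LIMIT 100;
```

Dropping the ORDER BY line from the statement gives the second plan.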

The indexes exist, so I do not think anything is wrong there.

public | videos_tag_name_key                                          | index | postgres | videos_tag
public | videos_tag_pkey                                              | index | postgres | videos_tag
public | videos_video_tags_pkey                                       | index | postgres | videos_video_tags
public | videos_video_tags_tag_id_2673cfc8                            | index | postgres | videos_video_tags
public | videos_video_tags_video_id_8220dbb8                          | index | postgres | videos_video_tags
public | videos_video_tags_video_id_tag_id_f8d6ba70_uniq              | index | postgres | videos_video_tags

I have spent quite a lot of time on this problem and still cannot solve it. What do you think the cause might be?

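【Comments】:

  • The HashAggregate (... rows=5548 ...) (... rows=5548 ...) line shows that the aggregation produces 5548 result rows. With ORDER BY, all of them must be sorted before the top 100 (from LIMIT) can be returned. Without ORDER BY, 100 arbitrary rows are returned, which is faster but useless, because you cannot tell whether they are the top 100.
  • So what should I do? We are implementing pagination, so we need to sort by count and get roughly the top 100.
  • Maybe a MATERIALIZED VIEW could help, as in this answer: stackoverflow.com/a/12925639/724039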

Tags: sql postgresql


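【Answer 1】:

The index is not being used for selectivity, at least not directly. It is used to produce rows already ordered by a field that is useful for both the join and the grouping. Once 100 groups have been emitted in whatever order is convenient for the system, execution can stop early. In the plan without ORDER BY, the merge join read only 11375 of the 1337000 join rows before stopping.

With ORDER BY, however, you cannot order the groups by count until you know the count of every group, so there is no opportunity to stop early. Since stopping early was the main advantage of using the index, once that opportunity is gone there is no reason to use the index anymore. A hash join is probably more efficient anyway when the query has to run to completion.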
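
One way to see this two-phase shape is to write the query so that the full aggregation is explicit. This is only an illustrative sketch, not something from the original thread: the aliases c and cnt are made up here, the inner query still has to count every group across all of videos_video_tags, and unlike the LEFT OUTER JOIN version it omits tags that have no videos. Whether it performs any differently would need to be checked with EXPLAIN.

```sql
-- Sketch: count every tag first (no early stop is possible here),
-- keep the 100 largest counts, then look up only those tags.
-- Aliases c and cnt are hypothetical names for illustration.
SELECT t.id,
       t.name,
       c.cnt AS "count"
FROM (
    SELECT tag_id,
           COUNT(video_id) AS cnt
    FROM videos_video_tags
    GROUP BY tag_id
    ORDER BY cnt DESC
    LIMIT 100
) AS c
JOIN videos_tag AS t ON t.id = c.tag_id
ORDER BY c.cnt DESC;
```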

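"So what should I do? We are implementing pagination, so we need to sort by count and get roughly the top 100."

Don't implement the pagination in the database. 5548 rows is not many: compute them once, send them all to the client or the application server, and let it handle the pagination. The counts are also not going to change quickly, so use a materialized view to store the summary and recompute it every hour or so.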
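
A minimal sketch of that materialized-view approach in PostgreSQL follows. The names videos_tag_counts and videos_tag_counts_id_idx are hypothetical, and the hourly schedule is assumed to come from an external scheduler such as cron, since a materialized view does not refresh itself.

```sql
-- Hypothetical view name; stores one precomputed count per tag.
CREATE MATERIALIZED VIEW videos_tag_counts AS
SELECT t.id,
       t.name,
       COUNT(vt.video_id) AS "count"
FROM videos_tag AS t
LEFT OUTER JOIN videos_video_tags AS vt ON t.id = vt.tag_id
GROUP BY t.id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX videos_tag_counts_id_idx
    ON videos_tag_counts (id);

-- Run this periodically (for example hourly) from a scheduler.
-- CONCURRENTLY lets readers keep querying the old data during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY videos_tag_counts;

-- Read side: fetch the whole summary, already sorted, and let the
-- application slice it into pages. The id tiebreaker keeps the order stable.
SELECT id, name, "count"
FROM videos_tag_counts
ORDER BY "count" DESC, id;
```

The expensive aggregation then runs once per refresh instead of once per page request, at the cost of counts that can be up to one refresh interval out of date.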
