Answers: postgres query performance issue when getting the top 10 results from a sorted list

【Question title】: postgres query performance issue when getting top 10 result from sorted list
【Posted】: 2014-12-09 16:17:37
【Question description】:

Tools: Rails and PostgreSQL

Data structure:

Feed: has many messages

Message: has one author

messages:
  feed_id
  author_id
  posted_at

Author: contains an hstore

authors:
  account_stats->'likes_count'
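
For readers who want to reproduce the problem, the structure above could look roughly like the following DDL. This is only a sketch: the column types are assumptions inferred from the query plan further down (integer feed_id, timestamp posted_at, a primary key on authors.id_str), and the real tables clearly have more columns than shown here.

```sql
-- Hypothetical DDL; types are assumptions inferred from the EXPLAIN output,
-- and all other columns of the real tables are omitted.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE authors (
    id_str        varchar PRIMARY KEY,  -- the plan probes authors_pkey by id_str
    account_stats hstore                -- 'likes_count' is stored as text in here
);

CREATE TABLE messages (
    feed_id   integer,
    author_id varchar,                  -- matches authors.id_str
    posted_at timestamp without time zone
);
```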

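I have a set of messages filtered by feed_ids and by the posted_at timestamp.

Within this set of messages, I want to get the top 10 author IDs ordered by likes_count.

Both the authors and the messages sets are large, around 1-2M records each.

Initially I tried two separate queries: first find the author_ids of the messages, then find all authors in that list of author_ids. But the list of author_ids was too large, so I tried combining them into a single query with a CTE.

This is what I did: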

WITH "filtered_messages" AS (SELECT "messages".* FROM "messages"
WHERE "messages"."feed_id" IN (1, 2, 3, 7)
AND (messages.posted_at >= '2014-11-24 00:00:00.000000')
AND (messages.posted_at < '2014-12-09 05:00:00.000000'))
SELECT "authors"."id_str" FROM "authors"
WHERE (authors.id_str in (select distinct filtered_messages.author_id from filtered_messages))
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10

But this query is slow.

Here is the query plan from EXPLAIN ANALYZE: http://explain.depesz.com/s/xCg

or

      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48170.47..48170.49 rows=10 width=88) (actual time=7352.169..7352.171 rows=10 loops=1)
   CTE filtered_messages
     ->  Seq Scan on messages  (cost=0.00..37190.11 rows=414468 width=2024) (actual time=0.179..271.049 rows=416269 loops=1)
           Filter: ((posted_at >= '2014-11-24 00:00:00'::timestamp without time zone) AND (posted_at < '2014-12-09 05:00:00'::timestamp without time zone) AND (feed_id = ANY ('{1,2,3,7}'::integer[])))
           Rows Removed by Filter: 21420
   ->  Sort  (cost=10980.35..10980.85 rows=200 width=88) (actual time=7352.169..7352.171 rows=10 loops=1)
         Sort Key: (((authors.account_stats -> 'likes_count'::text))::integer)
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop  (cost=9330.45..10976.03 rows=200 width=88) (actual time=1555.397..7268.009 rows=304363 loops=1)
               ->  HashAggregate  (cost=9330.03..9332.03 rows=200 width=516) (actual time=1555.342..1610.825 rows=304363 loops=1)
                     ->  Subquery Scan on "ANY_subquery"  (cost=9325.53..9329.53 rows=200 width=516) (actual time=1360.939..1469.422 rows=304363 loops=1)
                           ->  HashAggregate  (cost=9325.53..9327.53 rows=200 width=516) (actual time=1360.938..1434.832 rows=304363 loops=1)
                                 ->  CTE Scan on filtered_messages  (cost=0.00..8289.36 rows=414468 width=516) (actual time=0.183..1181.150 rows=416269 loops=1)
               ->  Index Scan using authors_pkey on authors  (cost=0.42..8.20 rows=1 width=88) (actual time=0.017..0.018 rows=1 loops=304363)
                     Index Cond: ((id_str)::text = ("ANY_subquery".author_id)::text)
 Total runtime: 7418.278 ms

The total runtime is very slow. I tried adding an ordered index on likes_count, but the plan doesn't use it. Are there other indexes I should add?

CREATE INDEX index_authors_on_likes_count ON authors (((account_stats -> 'likes_count')::INT) DESC NULLS LAST) WHERE (account_stats ? 'likes_count');

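One likely reason the likes_count index is ignored: it is a partial index, and PostgreSQL only considers a partial index when the query's WHERE clause implies the index predicate. The query above never mentions account_stats ? 'likes_count'. A minimal sketch of a query that could walk the index in likes order and stop after 10 matches is shown below. The EXISTS form and the repeated predicate are assumptions added for illustration, not something from the original post, and whether the planner actually picks this plan depends on statistics and on a usable index on messages.

```sql
-- Sketch: repeat the partial-index predicate so the index becomes eligible,
-- and use EXISTS so each candidate author needs only one probe into messages.
SELECT a.id_str
FROM authors a
WHERE a.account_stats ? 'likes_count'          -- matches the index predicate
  AND EXISTS (
        SELECT 1
        FROM messages m
        WHERE m.author_id = a.id_str
          AND m.feed_id IN (1, 2, 3, 7)
          AND m.posted_at >= '2014-11-24 00:00:00'
          AND m.posted_at <  '2014-12-09 05:00:00')
ORDER BY (a.account_stats -> 'likes_count')::int DESC NULLS LAST
LIMIT 10;
```

Note that the extra predicate drops authors that have no likes_count key at all. With NULLS LAST those authors sort at the end anyway, so the top 10 only differs when fewer than 10 matching authors have the key.
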
Edit: using a join:

SELECT  DISTINCT "authors"."id_str", (account_stats->'likes_count')::INT FROM "authors" 
INNER JOIN "messages" ON "messages"."author_id" = "authors"."id_str" 
WHERE ( "messages"."feed_id" IN (6, 4, 5, 1, 2, 3, 7) AND (messages.posted_at >= '2014-10-26 00:00:00.000000') AND (messages.posted_at < '2015-12-11 05:00:00.000000'))  
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10;

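Query plans for the different inputs:

feed_ids = 1,2,3,7: explain.depesz.com/s/iaR

feed_ids = 6,4,5,1,2,3,7, large date range: explain.depesz.com/s/cGm

feed_ids = 6,4,5,1,2,3,7, same date range: http://explain.depesz.com/s/UbPg

The query plans are roughly the same, but once the tables grow, the join seems to be slow. Maybe some indexes would help? I already have indexes on authors.id_str and messages.author_id.

Thanks in advance for your help and explanations.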

【Question discussion】:

Tags: sql ruby-on-rails performance postgresql

【Solution 1】:

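I think you are comparing id_str with author_id, and these fields don't have the same type, right? Presumably varchar and bigint?

postgres needs to compare values of the same type, so when you use different types, postgres has to cast for every row of your table! If you have a million rows, postgres does this a million times.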
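
Before changing anything, it is worth confirming whether the two join columns really differ in type. A quick check against the standard information_schema views, assuming the tables are named authors and messages as in the question:

```sql
-- List the declared types of the two columns used in the join.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE (table_name = 'authors'  AND column_name = 'id_str')
   OR (table_name = 'messages' AND column_name = 'author_id');
```

If both rows report the same data_type, no per-row cast is needed for the comparison and the slowdown has to come from somewhere else.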

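So you need to review your model and convert id_str to bigint (the same type as author_id), or try creating an index on id_str such as

create index idx_id_str on mytable ((id_str::bigint));

Check the syntax, because it has been a long time since I did this ;)

In this example you create the index on id_str, but the index actually contains bigint values, so postgres doesn't need to cast every row: the index has already done that work for you ;)

In fact, try to remove all your casts, because they are very slow ;)

After that, please post a new explain so we can see whether it is OK now ;)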
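
A new plan can be captured by prefixing the query with EXPLAIN. The sketch below uses the ANALYZE and BUFFERS options, which execute the query and report real timings and buffer usage; the query body is the join variant from the question's edit, used here only as an example.

```sql
-- Runs the query for real and prints the plan with actual timings and I/O.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT authors.id_str, (account_stats -> 'likes_count')::int
FROM authors
INNER JOIN messages ON messages.author_id = authors.id_str
WHERE messages.feed_id IN (1, 2, 3, 7)
  AND messages.posted_at >= '2014-11-24 00:00:00'
  AND messages.posted_at <  '2014-12-09 05:00:00'
ORDER BY (account_stats -> 'likes_count')::int DESC NULLS LAST
LIMIT 10;
```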

PS: In your explain I can't see the index you created, so postgres decided not to use it... you can drop it and try another one ;)

New comment:

WITH "filtered_messages" AS (SELECT "messages".* FROM "messages"  
WHERE "messages"."feed_id" IN (1, 2, 3, 7) 
AND (messages.posted_at >= '2014-11-24 00:00:00.000000')
AND (messages.posted_at < '2014-12-09 05:00:00.000000')) 
SELECT  "authors"."id_str" FROM "authors"  
WHERE (authors.id_str in (select distinct filtered_messages.author_id from filtered_messages)) 
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10

Try this instead:

WITH "filtered_messages" AS (SELECT distinct "messages".author_id  FROM "messages"  
WHERE "messages"."feed_id" IN (1, 2, 3, 7) 
AND (messages.posted_at >= '2014-11-24 00:00:00.000000')
AND (messages.posted_at < '2014-12-09 05:00:00.000000')) 
SELECT  "authors"."id_str" FROM "authors"  
WHERE (authors.id_str in (select filtered_messages.author_id from filtered_messages)) 
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10

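I just added distinct inside your filtered_messages CTE and selected only author_id instead of all columns, so that filtered_messages holds the minimum amount of data.

Then, since the values are already distinct at this point, you don't need distinct in the where clause:

WHERE (authors.id_str in (select filtered_messages.author_id from filtered_messages))

Can you try this and post a new explain?

Thanks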

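【Discussion】:

Thanks for the reply, I learned something new. My authors.id_str and messages.author_id were both strings before, but I changed them both to text so that no cast happens during the comparison. The resulting time is no different, though. Here is the new explain: link or explain.depesz.com/s/PuD
If you can, please try this syntax: SELECT authors.id_str FROM authors join messages on messages.author_id = authors.id_str AND messages.feed_id IN (1, 2, 3, 7) AND (messages.posted_at >= '2014-11-24 00:00:00.000000') AND (messages.posted_at < '2014-12-09 05:00:00.000000') ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10 and give me the result ;)
I tried both. The first one, selecting only the author_ids first, didn't help; the query plan did not change. With the second approach, joining the two tables instead of using ids IN ([]), the query plan improved a lot for this data set: explain.depesz.com/s/iaR. But after I added more feed_ids, the number of messages grew to 2M and the number of authors to 1M, and this approach became slow again. It is still much better than before, though: explain.depesz.com/s/cGm
OK, now you can see in the explain that you have a "seq scan" on messages and on authors. A seq scan is a sequential scan, so every row of the table is read and compared against your condition. To fix this you need to add an index on the table, for example create index idx_messages_authorid_feedid on messages(author_id, feed_id), and then try create index idx_messages_authorid_feedid_postedat on messages(author_id, feed_id, posted_at). Then run a new explain and see whether you get an index scan instead of a seq scan. Then try the same thing on the authors table ;) Also try commenting out your ORDER BY and run a new explain ;)
Is your problem solved? If so, please accept my answer ;) Thanks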
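
The index suggestions from the comments above can be written out as follows. The first two statements are the ones proposed in the thread. The third is an assumption added here for comparison, since the WHERE clause filters on feed_id and posted_at rather than on author_id, and its name is hypothetical.

```sql
-- Indexes proposed in the comments.
CREATE INDEX idx_messages_authorid_feedid
    ON messages (author_id, feed_id);

CREATE INDEX idx_messages_authorid_feedid_postedat
    ON messages (author_id, feed_id, posted_at);

-- Assumption, not from the thread: lead with the columns used in the filter.
CREATE INDEX idx_messages_feedid_postedat
    ON messages (feed_id, posted_at);

-- Refresh planner statistics so the new indexes are costed realistically.
ANALYZE messages;
```

Keep in mind that in the first plan posted in the question the filter removes only 21420 rows and keeps 416269, so for that input a sequential scan on messages may remain the cheapest choice regardless of which indexes exist.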