【Posted】: 2014-12-09 16:17:37
【Question】:
Tools: Rails and PostgreSQL
Data structure:
Feed: has many messages
Message: has one author
messages:
feed_id
author_id
posted_at
Author: contains an hstore
authors:
account_stats->'likes_count'
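For reference, the structure above corresponds roughly to the DDL below. This is only a sketch: the column types are inferred from the query plan further down (integer feed_id, timestamp without time zone for posted_at, text-like id_str with authors_pkey on it), and the surrogate id column and the foreign key are assumptions. The real tables have more columns than shown.

```sql
-- Sketch only: types are inferred from the query plan, not copied from the real schema.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE authors (
    id_str        varchar PRIMARY KEY,  -- the plan scans authors_pkey on id_str
    account_stats hstore                -- holds 'likes_count' among other keys
);

CREATE TABLE messages (
    id        serial PRIMARY KEY,       -- assumed surrogate key
    feed_id   integer,
    author_id varchar REFERENCES authors (id_str),  -- foreign key is an assumption
    posted_at timestamp without time zone
);
```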
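I have a set of messages filtered by feed_ids and posted_at timestamps.
From that set of messages, I want the top 10 author IDs ordered by likes_count.
Both the authors and messages tables are very large, roughly 1-2M records each.
Initially I tried two separate queries: first find the author_ids of the messages, then find all authors whose id is in that list. But the author_ids list was too large, so I tried combining them into a single query with a CTE.
This is what I ended up with: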
WITH "filtered_messages" AS (
  SELECT "messages".*
  FROM "messages"
  WHERE "messages"."feed_id" IN (1, 2, 3, 7)
    AND (messages.posted_at >= '2014-11-24 00:00:00.000000')
    AND (messages.posted_at < '2014-12-09 05:00:00.000000')
)
SELECT "authors"."id_str"
FROM "authors"
WHERE (authors.id_str IN (SELECT DISTINCT filtered_messages.author_id FROM filtered_messages))
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST
LIMIT 10
But this query is slow.
Here is the query plan from EXPLAIN ANALYZE: http://explain.depesz.com/s/xCg
or
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit (cost=48170.47..48170.49 rows=10 width=88) (actual time=7352.169..7352.171 rows=10 loops=1)
  CTE filtered_messages
    -> Seq Scan on messages (cost=0.00..37190.11 rows=414468 width=2024) (actual time=0.179..271.049 rows=416269 loops=1)
         Filter: ((posted_at >= '2014-11-24 00:00:00'::timestamp without time zone) AND (posted_at < '2014-12-09 05:00:00'::timestamp without time zone) AND (feed_id = ANY ('{1,2,3,7}'::integer[])))
         Rows Removed by Filter: 21420
  -> Sort (cost=10980.35..10980.85 rows=200 width=88) (actual time=7352.169..7352.171 rows=10 loops=1)
       Sort Key: (((authors.account_stats -> 'likes_count'::text))::integer)
       Sort Method: top-N heapsort Memory: 25kB
       -> Nested Loop (cost=9330.45..10976.03 rows=200 width=88) (actual time=1555.397..7268.009 rows=304363 loops=1)
            -> HashAggregate (cost=9330.03..9332.03 rows=200 width=516) (actual time=1555.342..1610.825 rows=304363 loops=1)
                 -> Subquery Scan on "ANY_subquery" (cost=9325.53..9329.53 rows=200 width=516) (actual time=1360.939..1469.422 rows=304363 loops=1)
                      -> HashAggregate (cost=9325.53..9327.53 rows=200 width=516) (actual time=1360.938..1434.832 rows=304363 loops=1)
                           -> CTE Scan on filtered_messages (cost=0.00..8289.36 rows=414468 width=516) (actual time=0.183..1181.150 rows=416269 loops=1)
            -> Index Scan using authors_pkey on authors (cost=0.42..8.20 rows=1 width=88) (actual time=0.017..0.018 rows=1 loops=304363)
                 Index Cond: ((id_str)::text = ("ANY_subquery".author_id)::text)
Total runtime: 7418.278 ms
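The total runtime is far too slow (about 7.4 seconds). I tried adding an ordered index on likes_count, but the plan does not use it. Are there other indexes I should add?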
"CREATE INDEX index_authors_on_likes_count ON authors (((account_stats -> 'likes_count')::INT) DESC NULLS LAST) where (account_stats ? 'likes_count')"
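One thing that may be worth checking here: PostgreSQL only considers a partial index when the query's WHERE clause implies the index predicate, and the CTE query never mentions `account_stats ? 'likes_count'`. The sketch below repeats that predicate and replaces the CTE with an EXISTS subquery, so the planner is at least allowed to walk the index in likes_count order and stop after 10 matches. It is untested against this data, and whether the planner actually chooses the index is an assumption.

```sql
-- Sketch: same filter as the CTE query, restated so the partial index is eligible.
SELECT a.id_str
FROM authors a
WHERE a.account_stats ? 'likes_count'          -- matches the index predicate
  AND EXISTS (
        SELECT 1
        FROM messages m
        WHERE m.author_id = a.id_str
          AND m.feed_id IN (1, 2, 3, 7)
          AND m.posted_at >= '2014-11-24 00:00:00'
          AND m.posted_at <  '2014-12-09 05:00:00'
      )
ORDER BY (a.account_stats -> 'likes_count')::INT DESC NULLS LAST
LIMIT 10;
```

Note that this version drops authors that have no likes_count key at all. In the CTE query those rows sort last because of NULLS LAST, so the top 10 differs only if fewer than 10 matching authors have the key.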
Edit: using a join instead:
SELECT DISTINCT "authors"."id_str", (account_stats->'likes_count')::INT FROM "authors"
INNER JOIN "messages" ON "messages"."author_id" = "authors"."id_str"
WHERE ( "messages"."feed_id" IN (6, 4, 5, 1, 2, 3, 7) AND (messages.posted_at >= '2014-10-26 00:00:00.000000') AND (messages.posted_at < '2015-12-11 05:00:00.000000'))
ORDER BY (account_stats->'likes_count')::INT DESC NULLS LAST LIMIT 10;
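Query plans for different inputs:
feed_ids = 1,2,3,7: explain.depesz.com/s/iaR
feed_ids = 6,4,5,1,2,3,7, large date range: explain.depesz.com/s/cGm
feed_ids = 6,4,5,1,2,3,7, same date range: http://explain.depesz.com/s/UbPg
The query plans are roughly the same, but the join seems to get slow as the tables grow. Maybe some indexes would help? I already have indexes on authors.id_str and messages.author_id.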
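To make the index question concrete, one candidate would be a composite index on messages that covers the join column together with the two filter columns. This is a sketch only: the index name is hypothetical, the column order is a guess, and it is not established that the planner would prefer it over a sequential scan when the filter matches most of the table.

```sql
-- Hypothetical index: lets a per-author lookup also check feed_id and posted_at
-- from the index instead of fetching every message row for that author.
CREATE INDEX index_messages_on_author_feed_posted_at
    ON messages (author_id, feed_id, posted_at);
```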
Thanks in advance for your help and explanations.
【Discussion】:
Tags: sql ruby-on-rails performance postgresql