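【Posted】: 2016-05-10 12:20:06
【Question】:
I have a large table, crumbs (roughly 100M+ rows, 100 GB). It is just a collection of JSON stored as text. It has an index on the column run_id, which has about 10K unique values, so each run is fairly small (1K to 1M rows).
A simple query: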
explain analyze verbose select * from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z' limit 10
The plan is good:
Limit (cost=0.56..36.89 rows=10 width=2262) (actual time=1.978..2.016 rows=10 loops=1)
  Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
  -> Index Scan using index_crumbs_on_run_id on public.crumbs c (cost=0.56..5533685.73 rows=1523397 width=2262) (actual time=1.975..1.996 rows=10 loops=1)
        Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
        Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
Planning time: 0.117 ms
Execution time: 2.048 ms
However, if I try to look inside the JSON stored in one of the columns, the planner wants to do a full scan:
explain verbose select x from crumbs c,
lateral json_array_elements(c.content::json) x
where c.run_id='2016-04-26T19_02_01_015Z'
limit 10
Plan:
Limit (cost=0.01..0.69 rows=10 width=32)
  Output: x.value
  -> Nested Loop (cost=0.01..10332878.67 rows=152343800 width=32)
        Output: x.value
        -> Seq Scan on public.crumbs c (cost=0.00..7286002.66 rows=1523438 width=895)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Filter: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        -> Function Scan on pg_catalog.json_array_elements x (cost=0.01..1.01 rows=100 width=32)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)
I tried:
analyze crumbs
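but it made no difference.
Update 1: Disabling sequential scans for the whole database works, but that is not an option in our application. In many other places seq scans should stay enabled: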
set enable_seqscan=false;
Plan:
Limit (cost=0.57..1.14 rows=10 width=32) (actual time=0.120..0.294 rows=10 loops=1)
  Output: x.value
  -> Nested Loop (cost=0.57..8580698.45 rows=152343400 width=32) (actual time=0.118..0.273 rows=10 loops=1)
        Output: x.value
        -> Index Scan using index_crumbs_on_run_id on public.crumbs c (cost=0.56..5533830.45 rows=1523434 width=895) (actual time=0.087..0.107 rows=10 loops=1)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        -> Function Scan on pg_catalog.json_array_elements x (cost=0.01..1.01 rows=100 width=32) (actual time=0.011..0.011 rows=1 loops=10)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)
Planning time: 0.124 ms
Execution time: 0.337 ms
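For what it is worth, enable_seqscan does not have to be changed for the whole database. A minimal sketch, assuming the application can wrap this one query in its own transaction: SET LOCAL limits the setting to the current transaction, so other queries keep their sequential scans. This is a workaround and does not explain the planner's choice.

```sql
-- Sketch only: scope the planner setting to a single transaction.
BEGIN;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK,
-- so no other session or later transaction is affected.
SET LOCAL enable_seqscan = off;

SELECT x
FROM crumbs c,
     LATERAL json_array_elements(c.content::json) x
WHERE c.run_id = '2016-04-26T19_02_01_015Z'
LIMIT 10;
COMMIT;
```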
Update 2:
The schema is:
CREATE TABLE crumbs
(
  id serial NOT NULL,
  run_id character varying(255),
  content text,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  CONSTRAINT crumbs_pkey PRIMARY KEY (id)
);
CREATE INDEX index_crumbs_on_run_id
  ON crumbs
  USING btree
  (run_id COLLATE pg_catalog."default");
Update 3:
Rewriting the query like this:
select json_array_elements(c.content::json) x
from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z'
limit 10
produces the right plan. It is still unclear why the planner chose the wrong plan for the second query.
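One possible reading of the numbers in the two LATERAL plans, offered as an assumption rather than a confirmed diagnosis: the planner appears to price a Limit node as startup cost plus the fraction of the child's run cost needed to produce the requested rows. The Function Scan estimate of rows=100 per input row inflates the join estimate to about 152M rows, so LIMIT 10 is a tiny fraction of either plan, and the sequential scan's near-zero startup cost wins. The arithmetic below uses only figures copied from the plans above.

```sql
-- Assumed formula: limit_cost = startup + (total - startup) * limit_rows / estimated_rows
SELECT
  -- Nested Loop over Seq Scan (the plan chosen by default)
  round(0.01 + (10332878.67 - 0.01) * 10 / 152343800, 2) AS lateral_seq_scan,   -- 0.69
  -- Nested Loop over Index Scan (the plan seen with enable_seqscan = false)
  round(0.57 + (8580698.45 - 0.57) * 10 / 152343400, 2)  AS lateral_index_scan; -- 1.13
```

The results are close to the Limit costs shown in the plans (0.69 and 1.14; the small gap presumably comes from rounding in the displayed startup cost). On these estimates the sequential scan looks cheaper, even though in practice it may have to read a large part of the table before finding ten matching rows.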
【Comments】:
-
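((run_id)::text = '2016-04-26T19_02_01_015Z'::text) run_id looks like a timestamp to me. Why store it as a text field? Also: please add the table definition, including indexes.
-
Yes, run_id is a timestamp with a text prefix. I left the prefix out of the question to avoid introducing irrelevant complexity. I have now updated the question with the explain analyze verbose output.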
-
Sounds like a scenario tailor-made for jsonb
-
@e4c5 Or maybe MongoDB? ;-)
-
@asgs Benchmarks actually show that PostgreSQL 9.5 with JSON outperforms Mongo :))
Tags: json postgresql postgres-9.4