【Question title】: Why is PostgreSQL (9.1) not using index for simple equality select?
【Posted】: 2013-04-12 07:45:45
【Question】:

My table lead has an index on email:

\d lead
...
Indexes:
    "lead_pkey" PRIMARY KEY, btree (id)
    "lead_account__c" btree (account__c)
    ...
    "lead_email" btree (email)
    "lead_id_prefix" btree (id text_pattern_ops)

Why does PG (9.1) not use the index for this simple equality select? The emails are almost all unique...

db=> explain select * from lead where email = 'blah';
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on lead  (cost=0.00..319599.38 rows=1 width=5108)
   Filter: (email = 'blah'::text)
(2 rows)

Other queries that hit an index seem fine (though I don't know why this one doesn't just use the pkey index):

db=> explain select * from lead where id = '';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using lead_id_prefix on lead  (cost=0.00..8.57 rows=1 width=5108)
   Index Cond: (id = ''::text)
(2 rows)

db=> explain select * from lead where account__c = '';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using lead_account__c on lead  (cost=0.00..201.05 rows=49 width=5108)
   Index Cond: (account__c = ''::text)
(2 rows)

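At first I thought this might be because email did not have enough distinct values. For example, if the statistics claimed that email was blah for most of the table, then a seq scan would be faster. But that is not the case: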

db=> select count(*), count(distinct email) from lead;
 count  | count
--------+--------
 749148 | 733416
(1 row)

Even when I force seq scans off, the planner behaves as if it has no other choice:

db=> set enable_seqscan = off;
SET
db=> show enable_seqscan;
 enable_seqscan
----------------
 off
(1 row)

db=> explain select * from lead where email = 'foo@blah.com';
                            QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on lead  (cost=10000000000.00..10000319599.38 rows=1 width=5108)
   Filter: (email = 'foo@blah.com'::text)
(2 rows)

I also tried EXPLAIN ANALYZE:

db=> explain analyze select * from lead where email = 'foo@blah.com';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on lead  (cost=10000000000.00..10000319732.76 rows=1 width=5102) (actual time=77845.244..77845.244 rows=0 loops=1)
   Filter: (email = 'foo@blah.com'::text)
 Total runtime: 77857.215 ms
(3 rows)

Here is the \d output (sorry, I had to obscure the column names and crop it to fit SO's limits; see http://pastebin.com/ve3gzJpY for the uncropped version):

                                 Table "lead"
                   Column                   |            Type             | Modifiers 
--------------------------------------------+-----------------------------+-----------
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | boolean                     | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 email                                      | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | boolean                     | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 account__c                                 | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | text                        | 
 id                                         | text                        | not null
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
 ...
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | timestamp without time zone | 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | real                        | 
Indexes:
    "lead_pkey" PRIMARY KEY, btree (id)
    "lead_account__c" btree (account__c)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_XXXXXXXXXXXXXXXXXXXXXX" btree (XXXXXXXXXXXXXXXXXXXXXX)
    "lead_email" btree (email)
    "lead_id_prefix" btree (id text_pattern_ops)

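Here is the output of pg_dump --schema-only -t lead (again, see the uncropped version at http://pastebin.com/ve3gzJpY, which also has unique column names in case that helps reproducibility):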

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: lead; Type: TABLE; Schema: public; Owner: pod; Tablespace: 
--

CREATE TABLE lead (
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX boolean,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX date,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    account__c text,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX text,
    id text NOT NULL,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real,
    ...
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX timestamp without time zone,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX real
);


ALTER TABLE lead OWNER TO pod;

--
-- Name: lead_pkey; Type: CONSTRAINT; Schema: public; Owner: pod; Tablespace: 
--

ALTER TABLE ONLY lead
    ADD CONSTRAINT lead_pkey PRIMARY KEY (id);


--
-- Name: lead_account__c; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_account__c ON lead USING btree (account__c);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_XXXXXXXXXXXXXXXXXXXX; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_XXXXXXXXXXXXXXXXXXXX ON lead USING btree (XXXXXXXXXXXXXXXXXXXX);


--
-- Name: lead_email; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_email ON lead USING btree (email);


--
-- Name: lead_id_prefix; Type: INDEX; Schema: public; Owner: pod; Tablespace: 
--

CREATE INDEX lead_id_prefix ON lead USING btree (id text_pattern_ops);


--
-- PostgreSQL database dump complete
--

Some PG catalog incantations:

db=> select * from pg_index where indexrelid = 'lead_email'::regclass;
 indexrelid | indrelid  | indnatts | indisunique | indisprimary | indisexclusion | indimmediate | indisclustered | indisvalid | indcheckxmin | indisready | indkey | indcollation | indclass | indoption | indexprs | indpred
------------+-----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+--------+--------------+----------+-----------+----------+---------
  215251995 | 101034456 |        1 | f           | f            | f              | t            | f              | t          | t            | t          | 101    | 100          | 10043    | 0         | ¤        | ¤
(1 row)
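The wide pg_index row above is easier to read when narrowed to the flags that govern whether queries may use an index. The query below is a minimal sketch that assumes the same index name. According to the pg_index catalog documentation, indcheckxmin = t means queries must not use the index until the xmin of that pg_index row is below their TransactionXmin horizon; the row above does show indcheckxmin = t. Whether that accounts for the seq scan in this case is an assumption, not something confirmed in this thread.

```sql
-- Show only the index-usability flags for lead_email.
-- xmin is the system column holding the transaction ID that wrote this pg_index row.
SELECT xmin,
       indexrelid::regclass AS index_name,
       indisvalid,    -- false while a concurrent build is unfinished or has failed
       indisready,    -- false means the index is not yet maintained on writes
       indcheckxmin   -- true means older snapshots must not use the index yet
FROM pg_index
WHERE indexrelid = 'lead_email'::regclass;
```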

Some locale info:

db=> show lc_collate;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

db=> show lc_ctype;
  lc_ctype   
-------------
 en_US.UTF-8
(1 row)

I searched through a lot of past SO questions, but none of them involved a simple equality query like this one.

【Comments】:

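  • Odd... simple equality shouldn't need a text_pattern_ops index, so this is hard to explain. Can you reproduce this on a small sample? If so, please post it to sqlfiddle.com and link it here.
  • Please show the complete table definition (preferably via pg_dump).
  • @PeterEisentraut Updated the question with the \d and pg_dump schema output.
  • @CraigRinger I'll do my best to reproduce it, but it may take a while: this table holds a lot of data containing sensitive customer information.
  • Your column-name redaction was not done well: every column ended up with the same name. That makes your scenario hard to reproduce, because the CREATE INDEX statements refer to ambiguous names (not to mention that the CREATE TABLE itself fails because of the duplicate column names). It would be better to use a distinct name for each column. Also, what are your lc_collate and lc_ctype settings? Those may matter to anyone trying to reproduce this (the index's indcollation=100 means "default collation"). In any case, a non-default collation would show up under "Modifiers".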

Tags: postgresql indexing postgresql-9.1


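【Answer 1】:

To troubleshoot problems like this, you have to run VACUUM ANALYZE on the table between troubleshooting steps to see what actually works. Otherwise you may not know exactly what changed or where. So try that first, then run the query again and see whether it fixes the problem.

The next step to try (running VACUUM ANALYZE and the test case between each step) is: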

ALTER TABLE lead ALTER COLUMN email SET STATISTICS 1000;

Maybe that will fix it. Maybe it won't.
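One round of the change, re-analyze, re-test cycle described above might look like the following sketch. It assumes the statements are run from psql in autocommit mode (VACUUM cannot run inside a transaction block) and reuses the test query from the question.

```sql
-- Step 1: raise the statistics target for the column (value taken from the answer).
ALTER TABLE lead ALTER COLUMN email SET STATISTICS 1000;

-- Step 2: refresh the statistics so the planner sees the effect of the change.
VACUUM ANALYZE lead;

-- Step 3: re-run the test case and compare the plan with the previous one.
EXPLAIN SELECT * FROM lead WHERE email = 'foo@blah.com';
```

Changing one thing per round keeps it clear which step, if any, altered the plan.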

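If that doesn't fix the problem, take a close look at the pg_stats view:

SELECT * FROM pg_stats WHERE tablename = 'lead';

Read through the following carefully and see whether anything in pg_stats looks wrong: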

http://www.postgresql.org/docs/9.0/static/planner-stats.html
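To focus on the column in question, the statistics can be narrowed to a single row. This is a sketch, assuming the table lives in the public schema (as the pg_dump output in the question indicates); the standard view is named pg_stats and its table column is tablename.

```sql
-- Planner statistics for lead.email only.
SELECT attname,
       null_frac,          -- fraction of rows where email is NULL
       n_distinct,         -- negative values are a ratio of distinct values to rows
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'lead'
  AND attname    = 'email';
```

With 733416 distinct emails in 749148 rows, an n_distinct close to -1 would be consistent with the counts shown in the question, and would suggest the statistics are not what is steering the planner away from the index.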

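Edit: To be clear, VACUUM ANALYZE is not the whole of the troubleshooting. However, it has to be run between troubleshooting steps; otherwise you can't be sure the planner is taking the right data into account.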

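【Comments】:

  • The problem has mysteriously gone away, but it's worth noting that I had already tried running VACUUM ANALYZE, many times.
  • The point is to run it between troubleshooting steps, to make sure the analyzer is using the current settings, etc.
【Answer 2】: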

CREATE INDEX lead_id_prefix ON lead USING btree (id text_pattern_ops);

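The use of text_pattern_ops looks odd here. If your IDs are some kind of integer, I would try dropping this index as a test. (I wouldn't hesitate to drop this index on a development server.) Since you have another btree index on "lead.id", I'd expect dropping this index to induce the optimizer to use the other index on "lead.id".

If that turns out to be true, then I'd try to dig deeper into why.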
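The drop-and-check test can be tried without permanently losing the index by wrapping it in a transaction, since DROP INDEX is transactional in PostgreSQL. This is a sketch rather than the answerer's exact procedure. Note that the drop holds an exclusive lock on the table until the ROLLBACK, so it is best run on a development server, as the answer suggests.

```sql
BEGIN;

-- Temporarily remove the text_pattern_ops index on id.
DROP INDEX lead_id_prefix;

-- See whether the planner now picks lead_pkey for the id lookup.
EXPLAIN SELECT * FROM lead WHERE id = '';

-- Undo the drop; the index is restored exactly as it was.
ROLLBACK;
```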

【Comments】:

  • Ha! I read from "where email = 'blah'" on to "where id = ''", and then "id" got stuck in my head!