Designing SQL, indexes for count(*) query performance: answers

【Question title】: Designing sql, indexes for count(*) query performance
【Posted】: 2013-05-31 02:08:03
【Question description】:

Hello everyone :) I'm building a tool to do some volume sampling on our Oracle 10g database. This is the query:

SELECT count(*) 
FROM product
JOIN customer ON product.CUSTOMER_ID = customer.ID
WHERE 
 (    product.CATEGORY = 'some first category criteria'
  AND customer.REGION = 'some first region criteria'
  AND ...)
 OR
 (    product.CATEGORY = 'some second category criteria'
  AND customer.REGION = 'some second region criteria'
  AND ...)
 OR ...

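I need to get counts from this query. The problem is that the data volume is high: about 30 million rows in each table, and I'd like this query to be responsive.

So far, a composite index on customer (<search criteria column>, CUSTOMER_ID) has helped a lot. I think it lets Oracle filter on the index first and only then move on to the JOIN.

Each (... AND ... AND ...) block is expected to match about 50,000 rows. Each column used in the search criteria takes its values from a set of about 1,000 values.

I'd like to know what approaches I could take, given that I will only ever run count(*)s, especially since Oracle has a built-in OLAP module (and a CUBE operation?). I'm also sure that well-thought-out indexes and hints could improve things a lot.

How would you design this?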

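【Question discussion】:

Indexes aren't free. I wouldn't add new indexes on these large tables just to support your counting application. Also, how fresh do these counts need to be?
@tbone The data in both columns is refreshed at most once a day, so some precomputation can be done overnight.
Then that may be your answer. Precalculate the counts you need and store them in a simple materialized view. Then point your app at the materialized view and refresh it daily during off hours.
@tbone The problem is that each criterion has about 1000 possible values. With 5 search criteria, that's 1000^5 different cases to compute :/
You can still precalculate it; you're really doing DW / analytics work. You probably don't want to keep running live queries against the production tables all day. Post your table structures and a sample query.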
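
A minimal sketch of the precomputation suggested in the comments above, for illustration only. It assumes the table and column names from the question's query; the view name `product_region_counts_mv` and the column `row_count` are hypothetical. Only two filter columns are shown because the question elides the others; any further criteria columns would be added to both the select list and the GROUP BY. Note that the grouped view stores one row per combination that actually occurs in the data, so it can never hold more rows than the join itself returns, regardless of how many combinations are theoretically possible.

```sql
-- Precompute one count per (category, region) pair that exists in the data.
-- Hypothetical names: product_region_counts_mv, row_count.
create materialized view product_region_counts_mv
    build immediate
    refresh complete on demand
as
select p.category, c.region, count(*) as row_count
from product p, customer c
where p.customer_id = c.id
group by p.category, c.region;

-- Off-hours job: rebuild the counts ('C' requests a complete refresh).
begin
    dbms_mview.refresh('PRODUCT_REGION_COUNTS_MV', 'C');
end;
/

-- The application then sums precomputed counts instead of joining the base tables.
select nvl(sum(row_count), 0) as total
from product_region_counts_mv
where (category = 'some first category criteria' and region = 'some first region criteria')
   or (category = 'some second category criteria' and region = 'some second region criteria');
```

Because the predicate refers only to the grouped columns, each stored group is either counted in full or not at all, so the sum matches what count(*) over the joined tables would have returned at the time of the last refresh.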

Tags: oracle count indexing oracle10g olap

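【Solution 1】:

This looks like a good fit for bitmap indexes:

Bitmap indexes are primarily designed for data warehousing or environments in which queries reference many columns in an ad hoc fashion. Situations that may call for a bitmap index include:

The indexed columns have low cardinality, that is, the number of distinct values is small compared to the number of table rows.

The indexed table is either read-only or not subject to significant modification by DML statements.

Specifically, a bitmap join index may be ideal here. The example in the manual even matches your data model. I tried to recreate your model and data below, and the bitmap join index appears to run orders of magnitude faster than the other solutions.

Sample data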

--Create tables
create table customer
(
    customer_id number,
    region      varchar2(100) not null
) nologging;

create table product
(
    product_id  number,
    customer_id number not null,
    category    varchar2(100) not null
) nologging;


--Load 30M rows, 1M rows at a time.  Takes about 6 minutes.
begin
    for i in 1 .. 30 loop
        insert /*+ append */ into customer
        select (1000000*i)+level, 'Region '||trunc(dbms_random.value(1, 1000))
        from dual connect by level <= 1000000;
        commit;

        insert /*+ append */ into product
        select (1000000*i)+level, (1000000*i)+level
            ,'Category '||trunc(dbms_random.value(1, 1000))
        from dual connect by level <= 1000000;
        commit;
    end loop;
end;
/

--Add primary keys and foreign key constraints.
alter table customer add constraint customer_pk primary key (customer_id);
alter table product add constraint product_pk primary key (product_id);
alter table product add constraint product_customer_fk
    foreign key (customer_id) references customer(customer_id);

--Gather stats
begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
    dbms_stats.gather_table_stats(user, 'PRODUCT');
end;
/

Unindexed - slow

As expected, performance is poor. This sample query takes about 75 seconds on my machine.

SELECT count(*) 
FROM product
JOIN customer ON product.CUSTOMER_ID = customer.customer_id
WHERE (product.CATEGORY = 'Category 1' AND customer.REGION = 'Region 1')
 OR   (product.CATEGORY = 'Category 2' AND customer.REGION = 'Region 2')
 OR   (product.CATEGORY = 'Category 888' AND customer.REGION = 'Region 888');

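B-tree indexes - still slow

The plan changes, but performance stays the same. I think this may be because my example is a worst-case scenario for indexes, where the data is truly random.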

create index customer_idx on customer(region);
create index product_idx on product(category);

begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
    dbms_stats.gather_table_stats(user, 'PRODUCT');
end;
/
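
One way to see how the plan changes between the indexing variants is EXPLAIN PLAN together with DBMS_XPLAN.DISPLAY. The sketch below reuses the sample query from the unindexed test and assumes a PLAN_TABLE is available to the current user; it only shows the optimizer's estimated plan and does not execute the query.

```sql
-- Store the optimizer's plan for the sample query without running it.
explain plan for
SELECT count(*)
FROM product
JOIN customer ON product.CUSTOMER_ID = customer.customer_id
WHERE (product.CATEGORY = 'Category 1' AND customer.REGION = 'Region 1')
 OR   (product.CATEGORY = 'Category 2' AND customer.REGION = 'Region 2')
 OR   (product.CATEGORY = 'Category 888' AND customer.REGION = 'Region 888');

-- Display the plan that was just stored.
select * from table(dbms_xplan.display);
```

Running the same pair of statements after each index change makes it easy to check which indexes the optimizer actually chooses.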

Bitmap indexes - a little better

This improves performance slightly, to about 61 seconds.

drop index customer_idx;
drop index product_idx;

create bitmap index customer_bidx on customer(region);
create bitmap index product_bidx on product(category);

begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
    dbms_stats.gather_table_stats(user, 'PRODUCT');
end;
/

Bitmap join index - very fast

Now the query returns results almost immediately; my IDE reports it as 0 seconds.

drop index customer_bidx;
drop index product_bidx;

create bitmap index customer_product_bjix
on product(product.category, customer.region)
FROM product, customer
where product.CUSTOMER_ID = customer.customer_id;

begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
    dbms_stats.gather_table_stats(user, 'PRODUCT');
end;
/

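Index costs

The bitmap join index takes somewhat longer to create than the B-tree or bitmap indexes. The B-tree indexes are very large compared to the bitmap and bitmap join indexes.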

select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('CUSTOMER_IDX', 'PRODUCT_IDX'
    ,'CUSTOMER_BIDX', 'PRODUCT_BIDX',  'CUSTOMER_PRODUCT_BJIX');


SEGMENT_NAME            MB
------------            --
CUSTOMER_IDX            726
PRODUCT_IDX             792
CUSTOMER_BIDX            88
PRODUCT_BIDX             96
CUSTOMER_PRODUCT_BJIX   184

Query style

This won't affect performance, but you can shorten the query like this:

SELECT count(*) 
FROM product
JOIN customer ON product.CUSTOMER_ID = customer.customer_id
WHERE (product.category, customer.region)
    in (('Category 1', 'Region 1'),
        ('Category 2', 'Region 2'),
        ('Category 888', 'Region 888'));

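【Discussion】:

I think you're only considering the performance of the query. Bitmaps are usually bad news for tables with even moderate DML activity. What the poster hasn't revealed is how the company uses these tables (beyond this particular need). I've seen too many tables with a ton of indexes (bitmap and otherwise) because most developers think only about their own immediate needs (and the company does little sanity checking before adding them). Something to consider, anyway.
@tbone You're right, bitmap indexes and DML don't mix well. Based on the comment "the data in both columns is refreshed at most once a day", it should be possible to set up a process that avoids those problems. It may be as simple as dropping the index, modifying the tables, and recreating the index.
I think he was referring to the freshness of the counts. I suspect the tables he's hitting are heavily used key tables with high DML activity. Anyway, I think I care too much about this now ;-)
@tbone Ah, in that case your materialized view idea may work best, perhaps with a regular bitmap index on top of it. BenoitParis - can you clarify whether the tables are updated once a day, or whether you only want the counts updated once a day?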
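
The drop, modify, recreate process mentioned in the discussion could look roughly like the sketch below. It assumes the index name and sample schema from the answer, and that all changes arrive in a single daily batch; the load step itself is left as a placeholder because the thread does not describe it.

```sql
-- 1. Drop the bitmap join index before the daily batch so that the DML
--    does not have to maintain it.
drop index customer_product_bjix;

-- 2. Run the daily load here (placeholder: the actual statements are
--    not described in the thread).

-- 3. Recreate the bitmap join index exactly as in the answer.
create bitmap index customer_product_bjix
on product(product.category, customer.region)
from product, customer
where product.customer_id = customer.customer_id;

-- 4. Refresh optimizer statistics after the load.
begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
    dbms_stats.gather_table_stats(user, 'PRODUCT');
end;
/
```

While the index is missing, the count query falls back to the slow plans measured earlier, so this only fits if the tool is not used during the load window.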