postgres: is GROUP BY on integer type columns faster than on character type columns? (Answers)

【Question title】: postgres group by integer type columns faster than character type columns?
【Posted】: 2014-04-18 07:00:58
【Question】:

I have 4 tables:

create table web_content_3 ( content integer, hits bigint, bytes bigint, appid varchar(32)  );
create table web_content_4 ( content character varying (128 ), hits bigint, bytes bigint, appid varchar(32)  );
create table web_content_5 ( content character varying (128 ), hits bigint, bytes bigint, appid integer );
create table web_content_6 ( content integer, hits bigint, bytes bigint, appid integer );

I run the same GROUP BY query against each table, with about 2 million records per table: SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_{3,4,5,6} GROUP BY content,appid; The results are:

Table Name    | content   | appid     | Time Taken [ms]
--------------+-----------+-----------+----------------
web_content_3 | integer   | character | 27277.931
web_content_4 | character | character | 151219.388
web_content_5 | character | integer   | 127252.023
web_content_6 | integer   | integer   | 5412.096

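The web_content_6 query takes only about 5 seconds, far less than the other three combinations. From these numbers we can say that grouping by the integer, integer combination is much faster, but the question is why?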

I also have the EXPLAIN results, but they do not really explain to me the drastic difference between the web_content_4 and web_content_6 queries.

Here they are:

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_4 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=482173.36..507552.31 rows=17680 width=63) (actual time=138099.612..151565.655 rows=17680 loops=1)
   ->  Sort  (cost=482173.36..487196.11 rows=2009100 width=63) (actual time=138099.202..149256.707 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 152488kB
         ->  Seq Scan on web_content_4  (cost=0.00..45218.00 rows=2009100 width=63) (actual time=0.010..349.144 rows=2009100 loops=1)
 Total runtime: 151613.569 ms
(6 rows)

Time: 151614.106 ms

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_6 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=368814.36..394194.51 rows=17760 width=24) (actual time=3282.333..5840.953 rows=17760 loops=1)
   ->  Sort  (cost=368814.36..373837.11 rows=2009100 width=24) (actual time=3282.176..3946.025 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 74632kB
         ->  Seq Scan on web_content_6  (cost=0.00..34864.00 rows=2009100 width=24) (actual time=0.011..297.235 rows=2009100 loops=1)
 Total runtime: 6172.960 ms

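【Comments】:

Because of the comparisons. Comparing integers is faster than comparing "strings".
With strings it is probably comparing character by character, so the sort takes time too. You can see that in the explain plan as well.
Are there any indexes on these tables?

Tags: sql postgresql group-by explain sql-execution-plan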

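【Solution 1】:

Gordon Linoff is right, of course. Spilling to disk is expensive.

If you can spare the memory, you can tell PostgreSQL to use more of it for sorting and similar operations. I built a table, filled it with random data, and analyzed it before running this query.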

EXPLAIN ANALYSE 
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
from web_content_4 
GROUP BY content,appid;

"GroupAggregate  (cost=364323.43..398360.86 rows=903791 width=96) (actual time=25059.086..29789.234 rows=1998067 loops=1)"
"  ->  Sort  (cost=364323.43..369323.34 rows=1999961 width=96) (actual time=25057.540..27907.143 rows=2000000 loops=1)"
"        Sort Key: content, appid"
"        Sort Method: external merge  Disk: 216016kB"
"        ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.010..475.187 rows=2000000 loops=1)"
"Total runtime: 30012.427 ms"

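I got the same execution plan as you. In my case, the query performs an external merge sort that needs about 216MB of disk. I can tell PostgreSQL to allow more memory for this query by setting work_mem. (Setting work_mem this way affects only my current connection.)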

set work_mem = '250MB';
EXPLAIN ANALYSE 
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
from web_content_4 
GROUP BY content,appid;

"HashAggregate  (cost=72472.22..81510.13 rows=903791 width=96) (actual time=3196.777..4505.290 rows=1998067 loops=1)"
"  ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.019..437.252 rows=2000000 loops=1)"
"Total runtime: 4726.401 ms"

Now PostgreSQL uses a hash aggregate, and execution time dropped by a factor of 6, from about 30 seconds to about 5 seconds.
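
If raising work_mem for the whole connection is more than you want, the setting can also be limited to a single transaction. A minimal sketch, assuming the same web_content_4 table and that this one backend can really afford 250MB (the value is taken from the run above, not a general recommendation):

```sql
-- Show the current per-operation memory limit for sorts and hash tables.
SHOW work_mem;

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL work_mem = '250MB';

EXPLAIN ANALYZE
SELECT content, sum(hits) AS hits, sum(bytes) AS bytes, appid
FROM web_content_4
GROUP BY content, appid;
COMMIT;

-- After COMMIT the previous value is in effect again.
SHOW work_mem;
```

In the plan output, check whether the "Sort Method: external merge  Disk: ..." line is gone. Keep in mind that work_mem applies to each sort or hash operation separately, so several concurrent queries can each use that much memory.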

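I did not test web_content_6, because replacing text with integers usually means you need a couple of joins to recover the text. So I am not sure we would be comparing apples to apples there.

【Discussion】: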

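【Solution 2】:

The performance of this kind of aggregation is driven by the speed of the sort. All else being equal, larger data takes longer to sort than smaller data. The "fast" case sorts about 74 MB; the "slow" case, about 152 MB.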
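
To see how much larger the character data really is, you can compare the table sizes and the stored width of the grouping columns. A rough sketch, assuming the table names from the question; the sizes it reports are on-disk sizes, which are related to but not the same as the sort's temporary file size:

```sql
-- Total on-disk size of each table (heap plus any indexes and TOAST).
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
  AND relname IN ('web_content_4', 'web_content_6');

-- Average stored size, in bytes, of the two grouping columns.
SELECT avg(pg_column_size(content)) AS avg_content_bytes,
       avg(pg_column_size(appid))   AS avg_appid_bytes
FROM web_content_4;
```

Running the second query against web_content_6 as well gives the per-row difference in sort key width between the two tables.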

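That causes some difference in performance, but in most cases not a 30-fold difference. One case where you would see a huge difference is when the smaller data fits in memory and the larger data does not. Spilling to disk is expensive.

One suspicion is that the data in web_content_6 is already sorted, or almost sorted, by (content, appid). That could shorten the time the sort needs. If you compare the actual times with the "costs" for the two queries, you will see that the "fast" version runs much faster than expected (assuming the costs are comparable).

【Discussion】: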
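
One way to probe the "already sorted" suspicion is to look at the planner statistics. A minimal sketch, assuming the web_content_6 table from the question; correlation is a per-column statistic, so for a two-column sort key it is only a hint, not proof:

```sql
-- Refresh planner statistics so pg_stats reflects the current data.
ANALYZE web_content_6;

-- correlation close to 1 or -1 means the physical row order closely
-- follows the column's sort order; close to 0 means little relation.
SELECT attname, correlation, n_distinct
FROM pg_stats
WHERE tablename = 'web_content_6'
  AND attname IN ('content', 'appid');
```

A correlation near 1 for content, the leading sort column, would be consistent with the input being nearly sorted.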