【Question title】: MySQL group-by query performance improved by adding aggregation
【Posted】: 2016-11-28 22:00:09
【Question】:

I have the following table in MySQL:

CREATE TABLE `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_bin NOT NULL,
  `time_stamp` bigint(20) unsigned NOT NULL,
  `event_type` varchar(40) COLLATE utf8mb4_bin NOT NULL,
  `has_data` tinyint(1) NOT NULL,
  `data` json DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=COMPRESSED;

ALTER TABLE `events`
 ADD PRIMARY KEY (`pv_name`,`time_stamp`),
 ADD UNIQUE KEY `has_data` (`pv_name`,`has_data`,`time_stamp`);

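I am trying to find the distinct set of pv_names that have at least one row without data between two given times. Both of the following queries appear to return this information: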

mysql> EXPLAIN SELECT pv_name FROM events
         WHERE has_data = 0
           AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
         GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | index | PRIMARY,has_data | has_data | 251     | NULL | 1855281 |     1.11 | Using where; Using index |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+

mysql> EXPLAIN SELECT pv_name, MAX(events.time_stamp) FROM events
         WHERE has_data = 0
           AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
         GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows   | filtered | Extra                                 |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
|  1 | SIMPLE      | events | NULL       | range | PRIMARY,has_data | has_data | 251     | NULL | 203123 |   100.00 | Using where; Using index for group-by |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+

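I do not understand why the second query, which places an extra requirement on what is returned (one I do not need), appears to run faster than the first. Is there a way to improve the first query to match the efficiency of the second without aggregating the time_stamp column?

Edit:

Following Rick James's suggestion, I changed the has_data index: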

ALTER TABLE `events`
 ADD PRIMARY KEY (`pv_name`,`time_stamp`), ADD KEY `has_data` (`has_data`,`pv_name`,`time_stamp`);

This changed the EXPLAIN output to:

mysql> EXPLAIN SELECT pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)

This appears to run faster.

Edit:

Test results requested by Rick James:

mysql> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
.
.
.
114480 rows in set (0.34 sec)

mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 2      |
| Handler_mrr_init           | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 125527 |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
18 rows in set (0.01 sec)

mysql> SELECT COUNT(*) FROM events;
+----------+
| COUNT(*) |
+----------+
|  3683887 |
+----------+
1 row in set (11.66 sec)

Edit:

Run times and query plans, first with the new index and then with the original one:

mysql> SHOW INDEXES FROM events;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| events |          0 | PRIMARY  |            1 | pv_name     | A         |      216061 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | PRIMARY  |            2 | time_stamp  | A         |     4450791 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            1 | has_data    | A         |         258 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            2 | pv_name     | A         |      496542 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            3 | time_stamp  | A         |     4390035 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)


SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (0.37 sec)

SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (0.30 sec)


mysql> SHOW INDEXES FROM events;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| events |          0 | PRIMARY  |            1 | pv_name     | A         |      422951 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | PRIMARY  |            2 | time_stamp  | A         |     4321990 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            1 | pv_name     | A         |      240067 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            2 | has_data    | A         |      436525 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            3 | time_stamp  | A         |     4205163 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | index | PRIMARY,has_data | has_data | 251     | NULL | 4462633 |     1.11 | Using where; Using index |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows   | filtered | Extra                                 |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
|  1 | SIMPLE      | events | NULL       | range | PRIMARY,has_data | has_data | 251     | NULL | 240076 |   100.00 | Using where; Using index for group-by |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (6.79 sec)

SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (2.65 sec)

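【Comments】:

  • Are both queries fast if you repeat them after the table's data has been read into the buffer pool? Typically, the first run of a query is slower than later runs of the same query because it has to populate the buffer pool.

Tags: mysql performance group-by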


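【Answer 1】:

According to the documentation on Loose Index Scan (http://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html):

Any other parts of the index than those from the GROUP BY referenced in the query must be constants (that is, they must be referenced in equalities with constants), except for the argument of MIN() or MAX() functions.

In your first query, time_stamp is referenced but is not a constant. In your second query, time_stamp is also the argument of MAX(). Loose Index Scan is therefore applicable to the second query only.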

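【Comments】:

  • I am still not sure why it does not do something as efficient or more efficient for the first query, since the second query adds a requirement that the first query does not need to satisfy.
  • As I understand it, Loose Index Scan does not really scan the index. It skips between the interesting rows. Based on the current row, it builds a key for looking up the next row. In your first query, there is no specific value for time_stamp, so it cannot build a key to use for the lookup. For the second query, because the MAX value is requested, the end value of the range given in the WHERE clause can be used. The target row is then the first row that is less than this key. The first query could be transformed into the second, but that is currently not done.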
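
The skipping behaviour described in the last comment can be sketched in Python. This is a toy model, not MySQL's implementation: the index is a sorted list of made-up (pv_name, has_data, time_stamp) tuples laid out like the original UNIQUE key, `bisect` stands in for a B-tree seek, and the names `tight_scan` and `loose_scan_max` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for the original index (pv_name, has_data, time_stamp).
# All values are invented for illustration.
INDEX = sorted([
    ("pv_a", 0, 10), ("pv_a", 0, 20), ("pv_a", 1, 15),
    ("pv_b", 1, 5), ("pv_b", 1, 7),
    ("pv_c", 0, 3), ("pv_c", 1, 4), ("pv_c", 1, 9),
])

def tight_scan(index, lo, hi):
    """Visit every entry; keep pv_names with has_data = 0 and lo < ts < hi."""
    names, visited = [], 0
    for pv, has_data, ts in index:
        visited += 1
        if has_data == 0 and lo < ts < hi and (not names or names[-1] != pv):
            names.append(pv)
    return names, visited

def loose_scan_max(index, lo, hi):
    """Per group, seek directly to the largest time_stamp below hi."""
    result, seeks, pos = [], 0, 0
    while pos < len(index):
        pv = index[pos][0]
        # Seek 1: the entry just before the key (pv, 0, hi) is the
        # candidate MAX(time_stamp) for this group, if it belongs to it.
        i = bisect_left(index, (pv, 0, hi)) - 1
        seeks += 1
        if i >= 0 and index[i][:2] == (pv, 0) and index[i][2] > lo:
            result.append((pv, index[i][2]))
        # Seek 2: jump past every remaining entry of this pv_name.
        pos = bisect_right(index, (pv, float("inf"), float("inf")))
        seeks += 1
    return result, seeks

print(tight_scan(INDEX, 0, 100))
print(loose_scan_max(INDEX, 0, 100))
```

In this model the seek key `(pv, 0, hi)` can only be built because the upper end of the time_stamp range is known and the largest value is wanted, which mirrors the comment's point. The work in `loose_scan_max` grows with the number of groups, while `tight_scan` grows with the number of index entries.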
【Answer 2】:

Replace the UNIQUE key with

INDEX(has_data, pv_name, time_stamp) -- in this order

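Unless you need the constraint, it is usually better not to make an index UNIQUE. In this case, you already have a constraint on the subset (pv_name, time_stamp).

When building an index, start with any column compared using = (here, has_data). This lets the rest of the processing focus on the necessary data rather than stumbling over unwanted values of has_data. Put one range column (time_stamp) last, because (usually) nothing after a range can be used. Having all three columns in the index gives you a "covering" index, so EXPLAIN should say "Using index".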
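
The effect of putting the = column first can be pictured with a small Python sketch. It is only an illustration under made-up data: sorted lists of tuples stand in for the two index orderings, `bisect` stands in for a B-tree lookup, and the helper names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rows as (pv_name, has_data, time_stamp); values are invented.
ROWS = [
    ("pv_a", 0, 10), ("pv_a", 1, 15), ("pv_a", 1, 20),
    ("pv_b", 1, 5), ("pv_b", 1, 7),
    ("pv_c", 0, 3), ("pv_c", 1, 4), ("pv_c", 1, 9),
]

# Original ordering: (pv_name, has_data, time_stamp).
by_pv_first = sorted(ROWS)
# Suggested ordering: (has_data, pv_name, time_stamp).
by_flag_first = sorted((h, pv, ts) for pv, h, ts in ROWS)

def flag0_positions_pv_first(index):
    """has_data = 0 entries are scattered, so every entry must be checked."""
    return [i for i, (_, h, _) in enumerate(index) if h == 0]

def flag0_slice_flag_first(index):
    """has_data = 0 entries form one contiguous slice found by two seeks."""
    return bisect_left(index, (0,)), bisect_left(index, (1,))

print(flag0_positions_pv_first(by_pv_first))
start, end = flag0_slice_flag_first(by_flag_first)
print(by_flag_first[start:end])
```

Within the contiguous slice the entries are already ordered by pv_name, so in this model the grouping needs no extra sort and no entry with has_data = 1 is ever touched.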

The index I suggest should help both queries.

See also my index cookbook

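【Comments】:

  • Your proposed index prevents MySQL from using the loose index scan optimization and will therefore slow the query down.
  • Patrick, can you confirm Shadow's claim? Shadow, can you elaborate?
  • @RickJames Your suggestion seems to help. I have updated the question to show the test results.
  • I am still curious why the first query with the original index did not use the 'Using index for group-by' optimization.
  • @Patrick Apparently you lost the loose index scan for the second query and got a tight index scan instead, which is the slower solution.
【Answer 3】:

GROUP BY can be optimized under certain specific conditions. That is what happens in the second query. The optimization is called Loose Index Scan (see the MySQL documentation).

Perhaps this would also work if you used DISTINCT instead of GROUP BY in the first query? Or you could check the documentation to see how to get the GROUP BY optimization for the first query.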

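Loose Index Scan

The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions. This access method considers only a fraction of the keys in an index, so it is called a Loose Index Scan. When there is no WHERE clause, a Loose Index Scan reads as many keys as the number of groups, which may be a much smaller number than that of all keys. If the WHERE clause contains range predicates (see the discussion of the range join type in Section 9.8.1, "Optimizing Queries with EXPLAIN"), a Loose Index Scan looks up the first key of each group that satisfies the range conditions, and again reads the smallest possible number of keys. This is possible under the following conditions:

  • The query is over a single table.
  • The GROUP BY names only columns that form a leftmost prefix of the index and no other columns. (If, instead of GROUP BY, the query has a DISTINCT clause, all distinct attributes refer to columns that form a leftmost prefix of the index.) For example, if a table t1 has an index on (c1,c2,c3), Loose Index Scan is applicable if the query has GROUP BY c1, c2. It is not applicable if the query has GROUP BY c2, c3 (the columns are not a leftmost prefix) or GROUP BY c1, c2, c4 (c4 is not in the index).
  • The only aggregate functions used in the select list (if any) are MIN() and MAX(), and all of them refer to the same column. The column must be in the index and must immediately follow the columns in the GROUP BY.
  • Any other parts of the index than those from the GROUP BY referenced in the query must be constants (that is, they must be referenced in equalities with constants), except for the argument of MIN() or MAX() functions.

For columns in the index, full column values must be indexed, not just a prefix. For example, with c1 VARCHAR(20), INDEX (c1(10)), the index cannot be used for Loose Index Scan. If Loose Index Scan is applicable to a query, the EXPLAIN output shows Using index for group-by in the Extra column.

Hope this helps.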

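【Comments】:

  • The GROUP BY in the first query is equivalent to using DISTINCT: dev.mysql.com/doc/refman/5.7/en/distinct-optimization.html
  • I think I understand that it is using this optimization, but it seems odd that it cannot do something as efficient or more efficient for the first query.
  • I see what you mean, but is one extra column really a problem for you?
  • The second query uses the loose index scan optimization, but this answer does not explain why the first one does not. In my experience (and according to the documentation), it should.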