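[Posted]: 2021-04-20 05:49:40
[Question]:
I'm working with a very large table - my current task is to read the logs of all devices into a database and run SELECTs to compute metrics. The current table definition is as follows: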
mysql> describe device_events;
+-------------+---------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+-------------------+-----------------------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| device_type | varchar(255) | NO | MUL | NULL | |
| device_id | bigint(20) unsigned | NO | MUL | NULL | |
| message | json | NO | | NULL | |
| source | text | NO | MUL | NULL | |
| created_at | timestamp | NO | MUL | CURRENT_TIMESTAMP | |
| updated_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| file_date | date | YES | MUL | NULL | |
+-------------+---------------------+------+-----+-------------------+-----------------------------+
Indexes:
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| device_events | 0 | PRIMARY | 1 | id | A | 40932772 | NULL | NULL | | BTREE | | |
| device_events | 1 | device_events_device_id_index | 1 | device_id | A | 44021 | NULL | NULL | | BTREE | | |
| device_events | 1 | device_events_device_type_index | 1 | device_type | A | 621 | NULL | NULL | | BTREE | | |
| device_events | 1 | device_events_source_index | 1 | source | A | 3085 | 255 | NULL | | BTREE | | |
| device_events | 1 | device_events_created_at_index | 1 | created_at | A | 2846551 | NULL | NULL | | BTREE | | |
| device_events | 1 | device_events_file_date_index | 1 | file_date | A | 25017 | NULL | NULL | YES | BTREE | | |
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
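I'm looking into partitioning and would ideally like one partition per device and per log source, but given MySQL's fixed partition ranges I don't think I can do that. Currently a single SELECT takes about a minute and generating all the metrics takes about 15 minutes, and I'd like to speed this up. Does anyone have other ideas for optimizing large SELECTs? I expect over a trillion records in the database. Most SELECTs will run against events from the past 30 days and use JSON_EXTRACT on message. Note that I already use BETWEEN on the timestamp to avoid a MONTH(created_at) computation, and the queries are most likely as optimized as they can be - I'm mainly looking for structural optimizations for this problem.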
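For illustration, a minimal sketch of date-based RANGE partitioning on `created_at`, which matches the "past 30 days" access pattern described above. The partition names and month boundaries are hypothetical. Widening the primary key is an assumption that follows from MySQL's rule that every unique key on a partitioned table must include the partitioning column. Both statements rebuild the table, so this is something to try on a copy first:

```sql
-- Hypothetical sketch, not a tested migration for this table.
-- Step 1: the partitioning column must be part of every unique key,
-- so the primary key is widened from (id) to (id, created_at).
ALTER TABLE device_events
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, created_at);

-- Step 2: monthly partitions. A TIMESTAMP column has to be wrapped in
-- UNIX_TIMESTAMP() to be used in a RANGE partitioning expression.
ALTER TABLE device_events
  PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p2021_03 VALUES LESS THAN (UNIX_TIMESTAMP('2021-04-01 00:00:00')),
    PARTITION p2021_04 VALUES LESS THAN (UNIX_TIMESTAMP('2021-05-01 00:00:00')),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
  );
```

With a layout like this, a query that filters on `created_at` may be pruned to the matching partitions. The `partitions` column of EXPLAIN shows which ones are actually read.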
--
-- Table structure for table `device_events`
--
DROP TABLE IF EXISTS `device_events`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `device_events` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`device_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`device_id` bigint(20) unsigned NOT NULL,
`message` json NOT NULL,
`source` text COLLATE utf8mb4_unicode_ci NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`file_date` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_events_device_id_index` (`device_id`),
KEY `device_events_device_type_index` (`device_type`),
KEY `device_events_source_index` (`source`(255)),
KEY `device_events_created_at_index` (`created_at`),
KEY `device_events_file_date_index` (`file_date`)
) ENGINE=InnoDB AUTO_INCREMENT=42771939 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2021-04-19 17:55:35
Eloquent ORM query
public function scopeLastMonth($query) {
$lastMonth = Carbon::now()->subMonth();
$start = $lastMonth->firstOfMonth()->startOfDay()->toDateTimeString();
$end = $lastMonth->lastOfMonth()->endOfDay()->toDateTimeString();
return $query->whereBetween("created_at", [$start, $end]);
}
$topIdentitiesQuery = (clone $lastMonthEvents)->selectRaw("JSON_EXTRACT(message, '$.policy_identity') as identity")->selectRaw("count(*) as aggregate")->groupBy("identity")->orderBy("aggregate", "desc");
$topIdentities = [];
foreach($topIdentitiesQuery->take(self::NUM_TOP_IDENTITIES)->get() as $topIdentity) {
array_push($topIdentities, $topIdentity->identity);
}
$topIdentities = array_pad($topIdentities, self::NUM_TOP_IDENTITIES, "");
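As a rough sketch, the builder chain above would be expected to produce SQL along the following lines. This assumes `$lastMonthEvents` is a query on `device_events` with the `lastMonth` scope applied and that `NUM_TOP_IDENTITIES` is 10; neither is shown in the post, and the date literals are example values for March 2021. The actual statement should be captured from the application, for instance with the builder's `toSql()` method or the query log:

```sql
-- Approximation of the generated query; the LIMIT value and the dates
-- are assumptions, not taken from the post.
EXPLAIN
SELECT JSON_EXTRACT(message, '$.policy_identity') AS identity,
       COUNT(*) AS aggregate
FROM device_events
WHERE created_at BETWEEN '2021-03-01 00:00:00' AND '2021-03-31 23:59:59'
GROUP BY identity
ORDER BY aggregate DESC
LIMIT 10;
```

Running it with EXPLAIN shows whether `device_events_created_at_index` is used for the range and how many rows the grouping has to examine.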
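[Comments]:
-
Provide the table definition as a complete CREATE TABLE script - the current information is not useful. Also provide the most common or most critical queries to optimize, including EXPLAIN.
-
"Most SELECTs will run against events from the past 30 days and use JSON_EXTRACT on message" Show examples of the JSON values (2-3 values) and of the most common JSON_EXTRACT expressions (2-3 variants). PS. Partitioning by date looks like an option now..
-
@Akina Added the MySQL dump and a query using Laravel Eloquent ORM. The above is one example. I'm also considering generated columns for each JSON value, but the log messages vary, so I could end up with about 20 extra empty columns on each record. Thank you very much for the constructive response.
-
"I'm using Laravel Eloquent ORM for the queries" Only a plain SQL query can be optimized deliberately rather than by accident. So get the SQL text generated by your code and show it. With EXPLAIN, of course.
-
Please provide the SQL generated for the SELECT.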
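The generated-column idea raised in the comments could look roughly like the sketch below. It assumes MySQL 5.7 or later and that `$.policy_identity` holds a short string; the column name, its `VARCHAR(255)` type and the index name are hypothetical. A VIRTUAL column is computed on read and not stored in the row, so rows whose message lacks the key carry only a NULL in the index:

```sql
-- Hypothetical sketch: expose one JSON key as a virtual column.
ALTER TABLE device_events
  ADD COLUMN policy_identity VARCHAR(255)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(message, '$.policy_identity'))) VIRTUAL;

-- Composite index so the date range and the grouped value can both be
-- read from the index without touching the JSON document.
ALTER TABLE device_events
  ADD INDEX device_events_created_at_policy_identity_index (created_at, policy_identity);

-- The aggregate rewritten against the generated column
-- (dates and LIMIT are example values).
SELECT policy_identity AS identity, COUNT(*) AS aggregate
FROM device_events
WHERE created_at BETWEEN '2021-03-01 00:00:00' AND '2021-03-31 23:59:59'
GROUP BY policy_identity
ORDER BY aggregate DESC
LIMIT 10;
```

Note that `JSON_UNQUOTE` returns the value without the surrounding double quotes, so the identities come back in a slightly different form than from the plain `JSON_EXTRACT` query. Whether the optimizer actually uses the index as a covering index would need to be confirmed with EXPLAIN.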
Tags: mysql optimization