【Question title】:Optimize MySQL Full outer join for massive amount of data
【Posted】:2014-03-09 16:37:43
【Question】:

We have the following MySQL tables (simplified to get straight to the point):

CREATE TABLE `MONTH_RAW_EVENTS` (
  `idEvent` int(11) unsigned NOT NULL,
  `city` varchar(45) NOT NULL,
  `country` varchar(45) NOT NULL,
  `ts` datetime NOT NULL,
  `idClient` varchar(45) NOT NULL,
  `event_category` varchar(45) NOT NULL,
  -- ... bunch of other fields
  PRIMARY KEY (`idEvent`),
  KEY `idx_city` (`city`),
  KEY `idx_country` (`country`),
  KEY `idClient` (`idClient`)
) ENGINE=InnoDB;

CREATE TABLE `compilation_table` (
  `idClient` int(11) unsigned DEFAULT NULL,
  `city` varchar(200) DEFAULT NULL,
  `month` int(2) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `events_profile` int(10) unsigned NOT NULL DEFAULT '0',
  `events_others` int(10) unsigned NOT NULL DEFAULT '0',
  `events_total` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `idx_month` (`month`),
  KEY `idx_year` (`year`),
  KEY `idx_idClient` (`idClient`),
  KEY `idx_city` (`city`)
) ENGINE=InnoDB;

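MONTH_RAW_EVENTS contains nearly 20 million rows of actions that users performed on the website; its size is close to 4GB.

compilation_table holds one client/city summary per month, which we use to display statistics on the website in real time.

We process the statistics once a month (from the first table into the second), and we are trying to optimize the query that does this (until now we have been processing everything in PHP, which takes a very long time).

This is the query we came up with. It seems to do the job on small subsets of the data; the problem is that processing the whole dataset takes more than 6 hours.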

INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)


    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others

    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 LEFT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

 UNION

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others           
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 RIGHT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

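What we want to do is a FULL OUTER JOIN in MySQL, so the basic structure of the query follows the one proposed here.

How can we optimize the query? We have been trying different indexes, and it still had not finished after running for 8 hours.

The MySQL server is a dedicated Percona MySQL 5.5 machine with 2 CPUs, 2GB of RAM and an SSD disk. We tuned the server configuration using the Percona tools.

Any help would be much appreciated.

Thanks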

【Discussion】:

    Tags: mysql sql outer-join query-performance


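    【Solution 1】:

    You are doing a UNION, which results in DISTINCT processing of the combined result.

    It is usually better to rewrite a full join as a left join plus the non-matching rows of a right join, combined with UNION ALL (provided it is a proper 1:n join):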

    OTHERS LEFT JOIN CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
    union all
    OTHERS RIGHT JOIN CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
    WHERE OTHERS.idClient IS NULL 
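
    Applied to the query from the question, the pattern above might look roughly like the following. This is a sketch under assumptions rather than the answerer's exact statement: the alias `e` and the explicit grouping by `e.city, e.country` are additions made here (MySQL resolves an unqualified `city` in GROUP BY to the table column before the select alias, so qualifying the names avoids ambiguity), and the date range is copied unchanged from the question.

```sql
-- Sketch: FULL OUTER JOIN emulated as LEFT JOIN + UNION ALL + unmatched RIGHT JOIN rows.
-- Table and column names come from the question; alias e and the explicit
-- GROUP BY on e.city, e.country are assumptions of this sketch.
INSERT INTO compilation_table
       (idClient, city, month, year, events_profile, events_others)
-- Branch 1: every OTHERS row, with its matching CLIPROFILE count if there is one.
SELECT OTHERS.idClient,
       IF(OTHERS.city = '', 'Others', OTHERS.city),
       1, 2014,
       IFNULL(CLIPROFILE.cnt, 0),
       OTHERS.cnt
FROM (
    SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category != 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS OTHERS
LEFT JOIN (
    SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category = 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS CLIPROFILE
  ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient

UNION ALL

-- Branch 2: only the CLIPROFILE rows that found no OTHERS partner,
-- so no row can appear in both branches and no DISTINCT step is needed.
SELECT CLIPROFILE.idClient,
       IF(CLIPROFILE.city = '', 'Others', CLIPROFILE.city),
       1, 2014,
       CLIPROFILE.cnt,
       0
FROM (
    SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category != 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS OTHERS
RIGHT JOIN (
    SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category = 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS CLIPROFILE
  ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient
WHERE OTHERS.idClient IS NULL;
```

    Because the second branch keeps only rows where the OTHERS side is NULL, the IFNULL wrappers from the original select list can be reduced to the plain column or a literal 0 in each branch.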
    

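    Additionally, you might materialize the results of the derived tables in temporary tables before joining them, so that each aggregation is computed only once (I don't know whether MySQL's optimizer is smart enough to do that automatically).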
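
    One way the materialization step could be sketched is shown below. The table names `tmp_others` and `tmp_profile` and the index name `idx_join` are hypothetical, and MyISAM is chosen only because a comment further down reports that MEMORY tables did not fit in 2GB of RAM. Plain tables are used instead of TEMPORARY ones because MySQL cannot refer to the same TEMPORARY table more than once in a single query, which the UNION ALL here would do.

```sql
-- Hypothetical work tables (names are not from the post).
DROP TABLE IF EXISTS tmp_others, tmp_profile;

-- Aggregate each event class once and keep the result, indexed on the join columns.
CREATE TABLE tmp_others (KEY idx_join (idClient, city)) ENGINE=MyISAM AS
SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS AS e
WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
  AND e.event_category != 'CLIENT PROFILE'
GROUP BY e.idClient, e.city, e.country;

CREATE TABLE tmp_profile (KEY idx_join (idClient, city)) ENGINE=MyISAM AS
SELECT e.idClient, CONCAT(e.city, ', ', e.country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS AS e
WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
  AND e.event_category = 'CLIENT PROFILE'
GROUP BY e.idClient, e.city, e.country;

-- Full outer join over the two small, pre-aggregated tables.
INSERT INTO compilation_table
       (idClient, city, month, year, events_profile, events_others)
SELECT o.idClient, IF(o.city = '', 'Others', o.city), 1, 2014,
       IFNULL(p.cnt, 0), o.cnt
FROM tmp_others AS o
LEFT JOIN tmp_profile AS p
  ON p.idClient = o.idClient AND p.city = o.city
UNION ALL
SELECT p.idClient, IF(p.city = '', 'Others', p.city), 1, 2014,
       p.cnt, 0
FROM tmp_others AS o
RIGHT JOIN tmp_profile AS p
  ON p.idClient = o.idClient AND p.city = o.city
WHERE o.idClient IS NULL;

-- Clean up the work tables once the monthly run is done.
DROP TABLE tmp_others, tmp_profile;
```

    With this layout the 20-million-row table is scanned twice in total (once per work table) instead of four times, and the joins run against the much smaller aggregated tables.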

    It might also be more efficient to group by and join on city and country as separate columns, and to do the CONCAT(city,', ',country) as city in the outer step.
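
    As an illustration of that last suggestion, the left-join half of the query could be shaped as follows. This is only a sketch: the aliases `o`, `p` and `e` are made up here, the `IF(... = '', 'Others', ...)` wrapper from the question is left out to keep the focus on the grouping, and the unmatched right-join half would follow the same pattern.

```sql
-- Group and join on the raw city / country columns; build the display string
-- only once per output row in the outer SELECT.
SELECT o.idClient,
       CONCAT(o.city, ', ', o.country) AS city,
       1 AS month,
       2014 AS year,
       IFNULL(p.cnt, 0) AS events_profile,
       o.cnt AS events_others
FROM (
    SELECT e.idClient, e.city, e.country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category != 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS o
LEFT JOIN (
    SELECT e.idClient, e.city, e.country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS AS e
    WHERE e.ts > '2014-01-01 00:00:00' AND e.ts <= '2014-01-31 23:59:59'
      AND e.event_category = 'CLIENT PROFILE'
    GROUP BY e.idClient, e.city, e.country
) AS p
  ON p.idClient = o.idClient
 AND p.city = o.city
 AND p.country = o.country;
```

    Grouping on both columns also keeps cities that share a name in different countries apart, which grouping on the `city` column alone would not guarantee.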

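    【Comments】:

    • This was awesome, man. We have now been able to bring the whole process down to 10 minutes instead of 10 hours :) I think creating those temporary tables helped too. We first tried MEMORY tables, but they were too big to fit in our 2GB server, so we ended up using MyISAM, and now it works like a charm!
    • I know this post is a bit old, but thanks @dnoeth for the tip: an amazing improvement in performance.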