Brain-Dead MySQL Select Optimization (Using Temporary, Using Filesort): Answers

【Question title】: Brain-Dead MySQL Select Optimization (Using Temporary, Using Filesort)
【Posted】: 2013-06-13 04:33:48
【Question】:

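I am currently working on a project involving patents that were downloaded from the USPTO website. As part of this project I am using a database created by people at the University of Illinois
(paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(schema of the tables I am using, slightly out of date and missing only non-index/key values: http://i.imgur.com/44LHS3L.png)

As the title says, I am trying to optimize this query: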

SELECT 
        PN,
        AN,
        grants.GrantID,
        grants.FileDate,
        grants.IssueDate,
        grants.Kind,
        grants.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    (
    SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
    FROM
        gracit_g, grants, grants as grants2
    WHERE
        grants.GrantID IN (/* a couple thousand keys */)
            and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID 
    LIMIT 500000) tbl1,
             grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
    grants.GrantID = tbl1.CitingID
        and grants.GrantID = assignee_g.GrantID
        and grants.GrantID = inventor_g.GrantID
        and grants.GrantID = usclass_g.GrantID
        and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000

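In short, for each patent that is cited by later patents, I want to record information about the patents that cite it. The problem I seem to be having is that my "GROUP BY PN, GrantID" causes "Using temporary; Using filesort", which is seriously slowing me down.

This is what my EXPLAIN gives me (sorry if the formatting isn't perfect, I couldn't find out how to make tables):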

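id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 8716 | Using temporary; Using filesort
1 | PRIMARY | grants | eq_ref | PRIMARY | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | assignee_g | ref | PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | intclass_g | ref | PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | inventor_g | ref | PRIMARY,fk_PublicationID_Inventor_p | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | usclass_g | ref | PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P | PRIMARY | 62 | tbl1.CitingID | 2 |
2 | DERIVED | grants | range | PRIMARY | PRIMARY | 62 | NULL | 2179 | Using where; Using index
2 | DERIVED | gracit_g | ref | PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID | PRIMARY | 62 | uspto_patents.grants.GrantID | 4 | Using where
2 | DERIVED | grants2 | eq_ref | PRIMARY | PRIMARY | 62 | uspto_patents.gracit_g.CitedID | 1 |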

The SHOW CREATE for gracit_g is:

CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants`   (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The SHOW CREATE for grants is:

 CREATE TABLE `grants` (
 `GrantID` varchar(20) NOT NULL,
 `Title` varchar(500) DEFAULT NULL,
 `IssueDate` date DEFAULT NULL,
 `Kind` varchar(2) DEFAULT NULL,
 `USSeriesCode` varchar(2) DEFAULT NULL,
 `Abstract` text,
 `ClaimsNum` int(11) DEFAULT NULL,
 `DrawingsNum` int(11) DEFAULT NULL,
 `FiguresNum` int(11) DEFAULT NULL,
 `ApplicationID` varchar(20) NOT NULL,
 `Claims` text,
 `FileDate` date DEFAULT NULL,
 `AppType` varchar(45) DEFAULT NULL,
 `AppNoOrig` varchar(10) DEFAULT NULL,
 `SourceName` varchar(100) DEFAULT NULL,
 PRIMARY KEY (`GrantID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8

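Thank you very much for your time. Unfortunately I have to head to bed, as it is far too late (or too early) for me to keep working right now.

One suggestion was to change it to a single query instead of using a subquery: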

  SELECT 
        gracit_g.citedID,
        info_grant.GrantID,
        info_grant.FileDate,
        info_grant.IssueDate,
        info_grant.Kind,
        info_grant.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
        gracit_g.GrantID IN (*KEYS*)
        and info_grant.GrantID = gracit_g.GrantID
        and info_grant.GrantID = assignee_g.GrantID
        and info_grant.GrantID = inventor_g.GrantID
        and info_grant.GrantID = usclass_g.GrantID
        and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000

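This brought it down from 21 seconds duration / 10 seconds fetch time to 13 seconds duration / 8 seconds fetch time. I would still like to improve on this, as I have a lot of keys to go through.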

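【Comments】:

One question: how long does this take? Observation: you are using the nonstandard MySQL extension to GROUP BY: dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html Ordinarily you need to name every non-aggregated column in the GROUP BY clause. MySQL may be giving you something you don't want. Observation: you have told MySQL to create a virtual table of up to half a million rows and then summarize it. It is not surprising that it uses the Using temporary; Using filesort strategy to do so. There is nothing wrong with those strategies.
Each query with 1000-2000 keys takes about 30 seconds in MySQL Workbench, but I have yet to see one actually finish in my Java program. I made some changes to the query at the end of the post.

Tags: mysql database query-optimization database-performance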

【Solution 1】:

Your query has the form:

SELECT some_fields
FROM (
    SELECT other_fields
    FROM table1, table2
    WHERE join_condition_table1_table2 AND some_other_condition
) AS subquery, table3
WHERE join_condition_subquery_table3
GROUP BY another_field

You need to rewrite it as follows:

SELECT some_fields
FROM table1, table2, table3
WHERE
    join_condition_table1_table2
    AND join_condition_subquery_table3 -- actually rewrite this as a join of either table1 and table3, or table2 and table3
    AND some_other_condition
GROUP BY another_field

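As @Ollie Jones pointed out, selecting fields (in the SELECT clause) that are neither part of the GROUP BY criteria nor wrapped in an aggregate function is dangerous. If those fields are not uniquely determined by the fields in the GROUP BY criteria, their values are indeterminate.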
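One possible way to find out which selected columns are affected is to turn on the ONLY_FULL_GROUP_BY SQL mode for the current session and re-run a shortened version of the query. The following is only a sketch: the key value '0000000' is a made-up placeholder, and the exact behaviour depends on the MySQL version (newer versions accept non-aggregated columns that they can prove are functionally dependent on the GROUP BY columns).

```sql
-- Sketch only. Note that this REPLACES the session's current sql_mode
-- with just ONLY_FULL_GROUP_BY; it does not affect other connections.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- Expected to be rejected while the mode is on: OrgName is neither
-- listed in GROUP BY nor wrapped in an aggregate function.
SELECT gracit_g.CitedID, info_grant.GrantID, assignee_g.OrgName
FROM gracit_g
JOIN grants AS info_grant ON info_grant.GrantID = gracit_g.GrantID
JOIN assignee_g ON assignee_g.GrantID = info_grant.GrantID
WHERE gracit_g.GrantID IN ('0000000')   -- hypothetical key
GROUP BY gracit_g.CitedID, info_grant.GrantID;

-- One way to make the result well defined: aggregate the column explicitly.
SELECT gracit_g.CitedID, info_grant.GrantID,
       GROUP_CONCAT(DISTINCT assignee_g.OrgName SEPARATOR ';') AS OrgNames
FROM gracit_g
JOIN grants AS info_grant ON info_grant.GrantID = gracit_g.GrantID
JOIN assignee_g ON assignee_g.GrantID = info_grant.GrantID
WHERE gracit_g.GrantID IN ('0000000')   -- hypothetical key
GROUP BY gracit_g.CitedID, info_grant.GrantID;
```

The same treatment would apply to the intclass_g columns if a grant can have more than one international class row.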

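[Edit]

A few more suggestions:

Add an index on gracit_g(citedID, GrantID), in this order (ALTER TABLE gracit_g ADD INDEX(citedID, GrantID);), and change your GROUP BY clause to GROUP BY gracit_g.citedID, gracit_g.GrantID. The optimizer will probably prefer to use this index to compute the GROUP BY clause.
If your VARCHAR primary keys are actually numbers, change their type to a suitable integer type. If they are not, add a numeric surrogate key and use it as the primary key. Integer comparisons are much faster, and you are doing a lot of comparisons across all these joins.
Precompute concatenated values such as CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country), either in an extra column or in an extra table (the latter would require one additional join per table).
Increase the tmp_table_size and max_heap_table_size server options. If a temporary table grows larger than either of these two values (in bytes), it can no longer be kept in memory and is written to disk. You may benefit from unusually large values here, since you are dealing with an unusually large result set.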
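The suggestions above could be tried out roughly as follows. This is a sketch under assumptions, not a tuned configuration: the index name idx_cited_grant, the Address column and its width, the key '0000000', and the 256 MB limit are all illustrative choices that do not come from the original schema.

```sql
-- 1. Composite index in the same column order as the GROUP BY.
ALTER TABLE gracit_g
  ADD INDEX idx_cited_grant (CitedID, GrantID);   -- index name is hypothetical

-- Check whether the plan still reports "Using temporary; Using filesort".
EXPLAIN
SELECT CitedID, GrantID
FROM gracit_g
WHERE GrantID IN ('0000000')                      -- hypothetical key
GROUP BY CitedID, GrantID;

-- 2. Precompute the concatenated address once instead of per query.
ALTER TABLE assignee_g
  ADD COLUMN Address VARCHAR(320) DEFAULT NULL;   -- hypothetical column, width is a guess
UPDATE assignee_g
  SET Address = CONCAT_WS(', ', City, State, Country);

-- 3. Raise the in-memory temporary table limits for this session only.
--    Both values are in bytes; 268435456 = 256 MB, chosen for illustration.
SET SESSION tmp_table_size      = 268435456;
SET SESSION max_heap_table_size = 268435456;

-- Compare these counters before and after running the query to see
-- whether temporary tables are still being written to disk.
SHOW SESSION STATUS LIKE 'Created_tmp%tables';
```

Setting the variables at session level keeps the experiment local to one connection; a permanent change would go into the server configuration instead.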

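I don't know whether there is much else to do. You may want to consider returning a smaller result set (fewer columns, more filters, or a smaller LIMIT).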

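【Comments】:

Okay, I have made the change at the end of the post and it is faster now, but I would really like it to be faster still.
@JonathanBoisvert Added a few more tips. But the solution will probably be to reduce your result set.
Well, I have now switched from executing 1,000+ statements with 1,000 keys each to 1,000,000+ prepared statements with a single key each, and this seems to work quite well. Your ideas were good, and I will keep them in mind if I need to come back to this later.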
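The single-key approach described in the last comment was presumably implemented through the Java client, which is not shown in the thread. At the SQL level, the same idea can be sketched with a server-side prepared statement. The statement name citing_info, the shortened column list, and the key '0000000' are assumptions made for illustration.

```sql
-- Sketch: prepare the statement once, then execute it once per key.
PREPARE citing_info FROM
  'SELECT gracit_g.CitedID,
          info_grant.GrantID,
          info_grant.FileDate,
          info_grant.IssueDate
   FROM gracit_g
   JOIN grants AS info_grant ON info_grant.GrantID = gracit_g.GrantID
   WHERE gracit_g.GrantID = ?';

SET @grant_id = '0000000';            -- hypothetical key; repeat SET + EXECUTE per key
EXECUTE citing_info USING @grant_id;

DEALLOCATE PREPARE citing_info;       -- release the statement when done
```

With an equality on a single key, each execution touches only the rows for one grant, so the per-query temporary table stays small.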