Querying a large table with 50 million rows: answers

【Question title】: Query large table with 50 million rows
【Posted】: 2016-07-11 22:17:17
【Question】:

I'm trying to query a large table (senddb.order_histories) with close to 50M rows. These are the MySQL queries I'm using:

FIRST APPROACH - inner join:

select a.id, 
    a.order_number, 
    a.sku_id,
    a.fulfillment_status, 
    a.modified_by, 
    a.created_at, 
    a.updated_at 
from senddb.order_line_items a
inner join (
    select order_line_item_id, 
    order_number, 
    order_status, 
    order_status_description, 
    action, 
    modified_by, 
    created_at, 
    max(updated_at) as updated_at
from senddb.order_histories 
where order_status in ('x','y','z')
and fulfillment_location = 'abcd'
group by order_line_item_id) as b
on a.id = b.order_line_item_id
and a.fulfillment_status = '2';

EXPLAIN output: (not preserved)

SECOND APPROACH - nested select:

select a.id, 
    a.order_number, 
    a.sku_id,
    a.fulfillment_status, 
    a.modified_by, 
    a.created_at, 
    a.updated_at 
from senddb.order_line_items a
where a.fulfillment_status = '2'
and a.id in (
select b.order_line_item_id from(
select order_line_item_id, 
    order_number, 
    order_status, 
    order_status_description, 
    action, 
    modified_by, 
    created_at, 
    max(updated_at) as updated_at
from senddb.order_histories 
where
order_status in ('x','y','z')
and fulfillment_location = 'abcd'
group by order_line_item_id) as b);

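I believe a nested select is a poor approach for large data, but I've added it here because it works on my sample set. In any case, both queries time out after 600 seconds with the message: Error Code: 2013. Lost connection to MySQL server during query.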

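I'd like to know whether there is any way to change the queries so they run faster. I've already tried reducing the columns in the inner select/inner join, but IMO that shouldn't be the real problem. I also came across a solution that said to "create a clustered index", but I couldn't really make sense of it. Any help is appreciated.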

Table order_histories:

CREATE TABLE `order_histories` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`order_status_description` varchar(255) DEFAULT NULL,
`datetime_stamp` datetime DEFAULT NULL,
`action` varchar(32) DEFAULT NULL,
`fulfillment_location` int(8) DEFAULT NULL,
`order_status` int(8) DEFAULT NULL,
`user_id` int(8) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`order_line_item_id` int(11) DEFAULT NULL,
`pooled` tinyint(1) DEFAULT '0',
 PRIMARY KEY (`id`),
 KEY `order_histories_ecash_idx` (`order_number`),
 KEY `order_line_item_id` (`order_line_item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=454738178 DEFAULT CHARSET=latin1

Table order_line_items:

CREATE TABLE `order_line_items` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`sku_id` int(8) DEFAULT NULL,
`original_price` float DEFAULT NULL,
`dept_description` varchar(100) DEFAULT NULL,
`description` varchar(100) DEFAULT NULL,
`quantity_ordered` int(8) DEFAULT NULL,
`gift_indicator` char(1) DEFAULT NULL,
`gift_wrap_flag` char(1) DEFAULT NULL,
`shipping_record_flag` char(1) DEFAULT NULL,
`gift_comments` varchar(100) DEFAULT NULL,
`item_status` char(1) DEFAULT NULL,
`tax_amount` float DEFAULT NULL,
`tax_rate` float DEFAULT NULL,
`upc` varchar(20) DEFAULT NULL,
`final_price` float DEFAULT NULL,
`line_number` int(8) DEFAULT NULL,
`master_line_number` int(8) DEFAULT NULL,
`gift_wrap_flag_type` char(1) DEFAULT NULL,
`color_code` varchar(4) DEFAULT NULL,
`size_id` varchar(6) DEFAULT NULL,
`width_id` varchar(6) DEFAULT NULL,
`brand` varchar(15) DEFAULT NULL,
`vpn` varchar(30) DEFAULT NULL,
`dept_number` int(8) DEFAULT NULL,
`class_number` int(8) DEFAULT NULL,
`non_merch_item` char(1) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`chain_id` int(11) DEFAULT NULL,
`fulfillment_location` int(11) DEFAULT NULL,
`fulfillment_date` datetime DEFAULT NULL,
`fulfillment_status` int(11) DEFAULT NULL,
`fulfillment_sales_associate` int(11) DEFAULT NULL,
`gift_wrap_line_number` int(11) DEFAULT NULL,
`shipping_type` int(11) DEFAULT NULL,
`order_track_info_id` int(11) DEFAULT NULL,
`store_tlog_updated` varchar(1) DEFAULT NULL,
`shipping_tlx_code` int(11) DEFAULT NULL,
`store_closed` tinyint(1) DEFAULT NULL,
`flags` int(11) DEFAULT NULL,
`deal_based_index` int(11) DEFAULT NULL,
`tlog_calc_ret_price` float DEFAULT NULL,
`tlog_amount` float DEFAULT NULL,
`tlog_retail_price` float DEFAULT NULL,
`tlog_ext_amount` float DEFAULT NULL,
`tlog_flag_1` int(11) DEFAULT NULL,
`tlog_flag_2` int(11) DEFAULT NULL,
`tlog_flag_3` int(11) DEFAULT NULL,
`time_remaining` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `order_line_items_ecash_idx` (`order_number`),
KEY `order_line_item_fulfillment_location_idx` (`fulfillment_location`),
KEY `order_line_item_fulfillment_status_idx` (`fulfillment_status`),
KEY `upc_idx` (`upc`),
KEY `sku_id_idx` (`sku_id`),
KEY `order_line_items_idx001` (`order_number`,`id`,`fulfillment_status`),
KEY `order_track_info_id` (`order_track_info_id`),
KEY `shipping_type_idx` (`shipping_type`,`non_merch_item`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=11367052 DEFAULT CHARSET=latin1

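【Comments】:

What is your purpose in selecting 50M records? First add a LIMIT and select a limited number of records, then join that to the secondary table.
I'm trying to select all the IDs from table B (the large table) that match a specific WHERE condition, and pull the corresponding data for those IDs from table A.
Indexes. I do these at that size with no problem.
So put the word EXPLAIN in front of either query and run it. Both should take a few seconds at most. Post those results and the schema for all relevant tables.
@Drew Added the EXPLAIN output for the inner join query as well.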

Tags: mysql sql database mysql-workbench

【Solution 1】:

This query can be simplified:

select a.id, 
    a.order_number, 
    a.sku_id,
    a.fulfillment_status, 
    a.modified_by, 
    a.created_at, 
    a.updated_at 
from senddb.order_line_items a
inner join senddb.order_histories b on a.id = b.order_line_item_id
where b.order_status in ('x','y','z')
and b.fulfillment_location = 'abcd'    
and a.fulfillment_status = '2';

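Since you are only selecting values from table a, you don't need to select specific values from table b; you only need to apply your conditions. Beyond that, you need to make sure there is an index on b.order_line_item_id. See the MySQL documentation for more on creating indexes. I'm not an expert on MySQL, but if senddb.order_histories.order_line_item_id is not already a primary key, something like this should work.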

CREATE INDEX IX_order_histories_order_line_item_id ON order_histories (order_line_item_id);
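
Two things may be worth checking against the schema posted in the question. It already lists KEY `order_line_item_id` on order_histories, so the statement above would add a second index on the same column. And because one line item can have several matching history rows, the plain join can return the same line item more than once, which the GROUP BY in the original derived table prevented. The following is only a sketch under stated assumptions: the index name is hypothetical, the column order is a guess that depends on which table the optimizer drives from, and the numeric literals are placeholders for the question's 'x','y','z' and 'abcd' (both columns are declared INT).

```sql
-- Hypothetical composite index covering the two filter columns plus the join column.
-- Assumption: the optimizer filters order_histories first; if it drives from
-- order_line_items instead, putting order_line_item_id first may serve better.
CREATE INDEX ix_oh_location_status_item
    ON senddb.order_histories (fulfillment_location, order_status, order_line_item_id);

-- EXISTS returns each line item at most once, however many history rows match.
SELECT a.id,
       a.order_number,
       a.sku_id,
       a.fulfillment_status,
       a.modified_by,
       a.created_at,
       a.updated_at
FROM senddb.order_line_items AS a
WHERE a.fulfillment_status = 2
  AND EXISTS (
      SELECT 1
      FROM senddb.order_histories AS b
      WHERE b.order_line_item_id = a.id
        AND b.order_status IN (1, 2, 3)      -- placeholders for 'x','y','z'
        AND b.fulfillment_location = 100     -- placeholder for 'abcd'
  );
```

Building an index on a table of this size takes time and disk space, so it is probably best tried on a copy first.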

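【Comments】:

@Nick I tried this simplified query and it worked! I'm trying to verify that it really pulls the required data. Avoiding the "group by" I used in the inner select probably makes the query faster too. I have one question, though: what if I do select those specific values from table B? Ideally that shouldn't have a big impact on performance, right?
Try it first; if it turns out too slow once you add them, that would make a good next question.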
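
On the follow-up about selecting values from table b: one possible shape, offered as a sketch rather than a tested answer, is to aggregate order_histories down to one row per line item and join to that. Unlike the derived table in the question, it selects only the grouping column and an aggregate. MySQL with ONLY_FULL_GROUP_BY enabled (the default since 5.7.5) generally rejects non-aggregated columns that are not in the GROUP BY, and with it disabled returns values from an arbitrary row of each group. The numeric literals are placeholders for the question's values, and the alias last_history_update is made up for illustration.

```sql
SELECT a.id,
       a.order_number,
       a.sku_id,
       a.fulfillment_status,
       b.last_history_update
FROM senddb.order_line_items AS a
INNER JOIN (
    -- One row per line item: the grouping column plus an aggregate only.
    SELECT order_line_item_id,
           MAX(updated_at) AS last_history_update
    FROM senddb.order_histories
    WHERE order_status IN (1, 2, 3)      -- placeholders for 'x','y','z'
      AND fulfillment_location = 100     -- placeholder for 'abcd'
    GROUP BY order_line_item_id
) AS b
    ON b.order_line_item_id = a.id
WHERE a.fulfillment_status = 2;
```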

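【Solution 2】:

You need to read the optimization section of the MySQL documentation. It contains a lot of information on how to optimize queries and data sets. The main idea here is to add indexes on the fields used as conditions in the WHERE clause of your SQL statement.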
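
To see whether an index is actually being used, the query can be prefixed with EXPLAIN, as one of the comments on the question suggested. A minimal sketch using the join from Solution 1, with numeric placeholders standing in for the question's 'x','y','z' and 'abcd':

```sql
EXPLAIN
SELECT a.id
FROM senddb.order_line_items AS a
INNER JOIN senddb.order_histories AS b
        ON b.order_line_item_id = a.id
WHERE b.order_status IN (1, 2, 3)      -- placeholders for 'x','y','z'
  AND b.fulfillment_location = 100     -- placeholder for 'abcd'
  AND a.fulfillment_status = 2;
```

In the output, the key column names the index chosen for each table, rows is the optimizer's estimate of how many rows it will examine, and a type of ALL means a full table scan.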

【Comments】:

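【Solution 3】:

Basically, both of your alternatives rely on a sub-SELECT rather than a plain INNER JOIN between the two tables.

The syntax of a true JOIN is one of the following: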

SELECT ...
FROM X INNER JOIN Y USING (field_list)

...or...

SELECT ...
FROM X INNER JOIN Y ON (x.field1 = y.field2) ...

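But in both cases, the objects being joined are tables or views.

I would assume... admittedly without checking... that Nick Larsen's answer #1 adequately re-expresses your original query using JOINs.

(Notice how, in his answer, the shorthand identifiers a and b are introduced to refer to each of the two table names mentioned in his query.)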

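【Comments】:

【Solution 4】:

First, you need to decide whether a 50-million-row result set is really what you want. Large tables don't exist so that you can select all of their rows. They exist so that you can ask them questions with SQL queries. SQL is a query language, not a data-loading language.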

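What is your purpose? If you want to copy the data, you can do so by loading it in batches, for example 1000 rows per query in a for loop. If you are loading the data for processing, you can do it the same way.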
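
A minimal sketch of the batching idea, assuming the rows are read in primary-key order and the client remembers the last id it saw. This is keyset pagination, one common way to do it and not something the answer specifies. @last_id is a user variable introduced for illustration, and the column list is arbitrary.

```sql
-- Start before the first row.
SET @last_id = 0;

-- Fetch the next batch of 1000 rows in primary-key order.
SELECT id,
       order_line_item_id,
       order_status,
       fulfillment_location,
       updated_at
FROM senddb.order_histories
WHERE id > @last_id
ORDER BY id
LIMIT 1000;

-- After processing the batch, set @last_id to the largest id returned
-- and run the SELECT again; stop when it returns no rows.
```

Seeking on id keeps every batch an index range scan, whereas a growing OFFSET makes the server read and discard all the skipped rows.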

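If you want statistics, you could use an outer join and aggregate functions to return a small number of rows. But you shouldn't do that either. What you "should" do is decide what you want out of the table, and preferably run aggregate functions that store the useful information in a different table (mostly SELECT INTO-style queries; in MySQL that means INSERT ... SELECT). You shouldn't need to join a table with 50 million records in the first place.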
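
As an illustration of storing aggregated information in a separate table, here is a sketch using INSERT ... SELECT. The summary table and its columns are hypothetical, and which aggregates are worth keeping depends on what you actually need from the history.

```sql
-- Hypothetical summary table: one row per line item.
CREATE TABLE senddb.order_history_summary (
    order_line_item_id  INT          NOT NULL PRIMARY KEY,
    history_rows        INT UNSIGNED NOT NULL,
    last_updated_at     DATETIME     NULL
) ENGINE=InnoDB;

-- Populate it once; later queries join to this much smaller table.
INSERT INTO senddb.order_history_summary
    (order_line_item_id, history_rows, last_updated_at)
SELECT order_line_item_id,
       COUNT(*),
       MAX(updated_at)
FROM senddb.order_histories
WHERE order_line_item_id IS NOT NULL   -- the column is nullable in the posted schema
GROUP BY order_line_item_id;
```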

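It would not be right to tell you here how to do the wrong thing with indexes.

【Comments】:

My purpose is basically a cleanup: clean up/update all the orders in table A (order_line_items) if they have a history in table B (order_histories) showing that they were cancelled.
@Maha_Balu2705 There is no basic way to do a cleanup. What kind of information are you trying to extract from the database?
Frankly, this sounds like something that should be done by a small script or program. A properly written inner join query should give you the result set you need to determine which orders need to be modified, but that query probably can't be used to update directly. Instead, you would issue further queries given each order id. And you would probably want to do this using a well-thought-out system of SQL transactions. And so on.
(In addition, such a program can more easily work through the database in segments, by issuing multiple queries that each "touch" fewer rows at a time. Business-rule logic can also be implemented easily.)
Your query is more likely to be: delete from table a where exists (select * from table b where b.somecolumn = a.somecolumn). To be safe, you could create a bit-type column and use an UPDATE ... SET query in a similar way to mark the rows for deletion, then run a delete query on the records marked for deletion.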
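
The mark-then-delete idea from the last comment could look roughly like this. It is a sketch only: the column name marked_for_cleanup is hypothetical, the numeric status values are placeholders for whatever "cancelled" means in this data, and adding a column to a table this size may itself be a slow operation.

```sql
-- Hypothetical flag column used to mark rows before removing them.
ALTER TABLE senddb.order_line_items
    ADD COLUMN marked_for_cleanup TINYINT(1) NOT NULL DEFAULT 0;

-- Step 1: mark the line items that have a matching history row.
UPDATE senddb.order_line_items AS a
SET a.marked_for_cleanup = 1
WHERE a.fulfillment_status = 2
  AND EXISTS (
      SELECT 1
      FROM senddb.order_histories AS b
      WHERE b.order_line_item_id = a.id
        AND b.order_status IN (1, 2, 3)   -- placeholders for the cancelled statuses
  );

-- Step 2: review what was marked before touching anything.
SELECT COUNT(*) FROM senddb.order_line_items WHERE marked_for_cleanup = 1;

-- Step 3: delete only the marked rows.
DELETE FROM senddb.order_line_items
WHERE marked_for_cleanup = 1;
```

Separating the marking from the delete leaves room to inspect the marked rows, or to clear the flag, before anything is removed.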