[Posted]: 2018-07-25 19:15:32
[Question]:
I have a table with 2 columns, email and id. I need to find closely related email addresses. For example:
john.smith12@example.com
and
john.smith12@some.subdomains.example.com
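These should be treated as the same, because the username (john.smith12) and the top-level part of the domain (example.com) are identical. They currently sit in 2 different rows of my table. I wrote the expression below, which should do the comparison, but it takes hours to run (probably because of the regex). Is there a better way to write this?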
select c1.email, c2.email
from table as c1
join table as c2
on (
c1.leadid <> c2.leadid
and
c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))
EXPLAIN for this query returns:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 577532, NULL
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 577532, Using where; Using join buffer (Block Nested Loop)
The CREATE TABLE statement is:
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
My guess is that the index is not being used because of the regex.
The generated regular expression looks like this:
john[.]smith12@[^@]*example[.]com
which should match both addresses.
Update:
I have changed the ON clause to:
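To make the pattern construction concrete, here is a minimal Python sketch that mirrors the nested `replace()` calls from the query. It is only an illustration outside MySQL: `build_pattern` and `matches` are hypothetical helper names, and `re.search` stands in for `REGEXP`, which also succeeds on a match anywhere in the string.

```python
import re

def build_pattern(email: str) -> str:
    """Mirror replace(replace(email, '.', '[.]'), '@', '@[^@]*')."""
    return email.replace(".", "[.]").replace("@", "@[^@]*")

pattern = build_pattern("john.smith12@example.com")

def matches(candidate: str) -> bool:
    # Unanchored search, used here as a stand-in for MySQL's REGEXP operator.
    return re.search(pattern, candidate) is not None

print(pattern)
print(matches("john.smith12@some.subdomains.example.com"))
```

Because the pattern is not anchored with `^` and `$`, it would also accept an address such as `xjohn.smith12@notexample.com`, which is probably not intended.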
on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1.email, 1, (locate('@', c1.email) - 1)) = substr(c2.email, 1, (locate('@', c2.email) - 1))
and
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))
With this approach, EXPLAIN at least shows an index being used:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)
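So far this has been running for 5 minutes; I will update if there is a substantial improvement.
Update 2:
I have split the email so that the username is one column and the domain is another. I stored the domain reversed so that its index can be used with a trailing wildcard.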
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
`domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`),
KEY `domain` (`domain`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
Query to populate the new columns:
update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)),
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
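As a sanity check on what the UPDATE stores, the following Python sketch reproduces the same splitting logic outside the database. `substring_index` is a rough, hypothetical stand-in for MySQL's `SUBSTRING_INDEX` (non-zero counts only), and `split_email` is an illustrative name, not something from the schema.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Rough stand-in for MySQL SUBSTRING_INDEX (non-zero count only)."""
    parts = s.split(delim)
    kept = parts[:count] if count > 0 else parts[count:]
    return delim.join(kept)

def split_email(email: str):
    """Return (username, reversed_domain) the way the UPDATE computes them."""
    email = email.strip()
    username = substring_index(email, "@", 1).strip()
    host = substring_index(email, "@", -1)
    # Keep at most the last three labels, then reverse the string.
    domain = substring_index(host, ".", -3).strip()[::-1]
    return username, domain

u1, d1 = split_email("john.smith12@example.com")
u2, d2 = split_email("john.smith12@some.subdomains.example.com")

# The LIKE concat(c2.domain, '%') test becomes a plain prefix check.
print(d1, d2, d2.startswith(d1))
```

Reversing the domain turns "ends with example.com" into "starts with moc.elpmaxe", which is the form a B-tree index can in principle serve when the prefix is a constant.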
New query:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
New EXPLAIN results:
1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
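From that EXPLAIN, it looks like the domain index is not being used. I also tried forcing it with USE INDEX, but that did not help either; no index is used at all: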
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
EXPLAIN with USE INDEX:
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
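For comparison, here is a small Python sketch of the matching that the last query expresses: rows are bucketed by username first, and the domain prefix test runs only inside each bucket. This is not how MySQL executes the join. The sample rows and the name `related_pairs` are made up for illustration. The table definition above has no index on `username`, which may be one reason the plan falls back to a full scan, though that is only a guess.

```python
from collections import defaultdict

# Hypothetical sample rows: (leadid, username, reversed_domain)
rows = [
    (1, "john.smith12", "moc.elpmaxe"),
    (2, "john.smith12", "moc.elpmaxe.sniamodbus"),
    (3, "jane.doe", "moc.elpmaxe"),
    (4, "john.smith12", "gro.rehto"),
]

def related_pairs(rows):
    """Yield (id_a, id_b) where usernames are equal and a's domain starts with b's."""
    by_username = defaultdict(list)
    for leadid, username, domain in rows:
        by_username[username].append((leadid, domain))
    # Only rows sharing a username are ever compared with each other.
    for bucket in by_username.values():
        for id_a, dom_a in bucket:
            for id_b, dom_b in bucket:
                if id_a != id_b and dom_a.startswith(dom_b):
                    yield id_a, id_b

pairs = sorted(related_pairs(rows))
print(pairs)
```

The equality on username does most of the pruning here; the prefix comparison is only applied to the few rows that survive it.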
[Comments]:
-
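A % at the start of a LIKE pattern prevents it from using an index. You want the pattern to be john.smith@%
-
What makes you think these emails should be "considered the same"? They aren't.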
-
Maybe you could use a generated column to hold a canonical version of the email.
-
Yes, something like
WHERE c1.email LIKE CONCAT(SUBSTR(c2.email, 1, LOCATE('@', c2.email)), '%') AND ...
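The canonical-version idea from the comments can be sketched in Python as follows. This assumes the canonical form is the username plus the last two domain labels, lowercased; `canonical_email` and the sample addresses are hypothetical, and the two-label rule is naive (it would wrongly merge different domains under suffixes such as `co.uk`).

```python
from collections import defaultdict

def canonical_email(email: str, labels: int = 2) -> str:
    """Hypothetical canonical form: username + '@' + last `labels` domain labels."""
    # Lowercasing is an assumption; the post does not say how case should be handled.
    username, _, host = email.strip().lower().rpartition("@")
    return username + "@" + ".".join(host.split(".")[-labels:])

emails = [
    "john.smith12@example.com",
    "john.smith12@some.subdomains.example.com",
    "jane.doe@example.com",
]

# Addresses with the same canonical key end up in the same group.
groups = defaultdict(list)
for e in emails:
    groups[canonical_email(e)].append(e)

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)
```

With a key like this, relatedness becomes an equality test, which is the kind of comparison an ordinary index handles well. It is slightly broader than the prefix approach in the question: two different subdomains of the same domain would also be grouped together.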