MySQL: Matching "Same" Emails (Answers)

【Question title】: MySQL Matching "Same" Emails
【Posted】: 2018-07-25 19:15:32
【Question description】:

I have a table with 2 columns, email and id. I need to find emails that are closely related. For example:

john.smith12@example.com

and

john.smith12@some.subdomains.example.com

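These should be considered the same because the username (john.smith12) and the top-most domain (example.com) are the same. They are currently 2 different rows in my table. ~~I have written the expression below, which should do the comparison, but it takes hours to execute (possibly/probably because of the regex). Is there a better way to write this:~~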

  select c1.email, c2.email 
  from table as c1
  join table as c2
   on (
             c1.leadid <> c2.leadid 
        and 
             c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))

The EXPLAIN for this query returns:

id, select_type, table, type, possible_keys, key, key_len, ref,  rows,   Extra
1,  SIMPLE,      c1,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, NULL
1,  SIMPLE,      c2,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, Using where; Using join buffer (Block Nested Loop)

The CREATE TABLE statement is:

CREATE TABLE `table` (
 `ID` int(11) NOT NULL AUTO_INCREMENT,
 `Email` varchar(100) DEFAULT NULL,
 KEY `Table_Email` (`Email`),
 KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

I guess the index isn't being used because of the regex.

The generated regex comes out as:

john[.]smith12@[^@]*example[.]com

which should match both addresses.

Update:

I have changed the ON clause to:

on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1. email, 1, (locate('@', c1.email) -1)) = substr(c2. email, 1, (locate('@', c2.email) -1))
and    
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))

With this approach the EXPLAIN at least shows an index being used.

id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)

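So far this has been running for 5 minutes; I will update if there is a big improvement.

Update 2:

I have split the email so that the username is one column and the domain is another. I stored the domain in reverse order so that its index can be used with a trailing wildcard.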
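To illustrate why reversing the domain is expected to help, here is a minimal Python sketch that models an index as a sorted list and answers a trailing-wildcard search with a binary search. This is only an analogy for how a B-tree handles `LIKE 'prefix%'`, not MySQL's actual implementation, and `prefix_range` is a hypothetical helper name.

```python
import bisect

def prefix_range(sorted_keys, prefix):
    """Return every key starting with `prefix`, using binary search on a sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        found.append(key)
    return found

domains = [
    "example.com",
    "some.subdomains.example.com",
    "another.domain.eu",
    "some.domain.st",
]

# Store each domain reversed, as in the question's `domain` column.
reversed_sorted = sorted(d[::-1] for d in domains)

# "Ends with example.com" becomes "starts with moc.elpmaxe".
matches = [key[::-1] for key in prefix_range(reversed_sorted, "example.com"[::-1])]
print(matches)
```

A suffix search on the un-reversed strings has no such contiguous range, so every key would have to be inspected.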

CREATE TABLE `table` (
     `ID` int(11) NOT NULL AUTO_INCREMENT,
     `Email` varchar(100) DEFAULT NULL,
     `domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
     `username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
     KEY `Table_Email` (`Email`),
     KEY `Email` (`Email`),
     KEY `domain` (`domain`)
    ) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

Query to populate the new columns:

update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)), 
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
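As a rough check of what this UPDATE stores, the Python sketch below mimics the string expressions. `substring_index` is a hand-written approximation of MySQL's SUBSTRING_INDEX for a non-empty delimiter, and `split_email` is a hypothetical helper, so treat both as assumptions rather than exact equivalents.

```python
def substring_index(s, delim, count):
    """Approximate MySQL SUBSTRING_INDEX(s, delim, count)."""
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    return delim.join(parts[count:])       # everything after the count-th delimiter from the right

def split_email(email):
    """Return (username, reversed_domain) the way the UPDATE above computes them."""
    email = email.strip()
    username = substring_index(email, "@", 1).strip()
    host = substring_index(email, "@", -1)
    domain = substring_index(host, ".", -3).strip()[::-1]
    return username, domain

u1, d1 = split_email("john.smith12@example.com")
u2, d2 = split_email("john.smith12@some.subdomains.example.com")

# Models: c1.username = c2.username AND c1.domain LIKE CONCAT(c2.domain, '%')
same = (u1 == u2) and d2.startswith(d1)

# A plain prefix test also accepts a different registered domain ending in the same letters.
_, d3 = split_email("john.smith12@notexample.com")
false_positive = d3.startswith(d1)
print(d1, d2, same, false_positive)
```

The last check suggests that the prefix comparison may need a trailing dot (or an exact-match alternative) if domains such as notexample.com must not be treated as example.com.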

New query:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

New EXPLAIN result:

1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

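From that EXPLAIN it looks like the domain index is not being used. I also tried to force it with USE INDEX, but that did not help either and resulted in no index being used at all: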

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

EXPLAIN with USE INDEX:

1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

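【Question discussion】:

Related: stackoverflow.com/questions/12318083/…
A % at the start of a LIKE pattern prevents it from using an index. You want the pattern to be john.smith@%
What makes you think those emails are "considered the same"? They are not.
Maybe you could use a generated column to hold a canonical version of the email.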
Yes, something like WHERE c1.email LIKE CONCAT(SUBSTR(c2.email, 1, LOCATE('@', c2.email)), '%') AND ...

Tags: mysql sql regex self-join

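【Solution 1】:

You told us the table has 700K rows.

That is not much, but you are joining the table to itself, so in the worst case the engine has to process 700K * 700K = 490 000 000 000 = 490B rows.

An index can definitely help here.

The best index depends on the data distribution.

What does the following query return?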

SELECT COUNT(DISTINCT username) 
FROM table

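If the result is high (in the hundreds of thousands, approaching 700K), there are many distinct usernames and you are better off focusing on them rather than on domain. If the result is low, say 100, an index on username is unlikely to be useful.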
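To make the selectivity argument concrete, here is a small Python sketch that counts how many row pairs survive an equality join on username, compared with the full self-join. The sample lists and function names are made up for illustration.

```python
from collections import Counter

def all_pairs(row_count):
    """Ordered pairs (c1, c2) with c1 != c2 in an unrestricted self-join."""
    return row_count * (row_count - 1)

def candidate_pairs(usernames):
    """Ordered pairs (c1, c2) with c1 != c2 that share the same username."""
    counts = Counter(usernames)
    return sum(n * (n - 1) for n in counts.values())

selective = ["john", "john", "mary", "alice", "bob", "bob"]  # 4 distinct usernames
unselective = ["info"] * 6                                     # 1 distinct username

print(all_pairs(6))                  # pairs without any username filter
print(candidate_pairs(selective))    # pairs left when usernames are mostly distinct
print(candidate_pairs(unselective))  # pairs left when every username is the same
```

With mostly distinct usernames only a handful of pairs remain for the more expensive domain comparison; with a single shared username the equality filter removes nothing.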

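I expect there are many distinct usernames, so I would create an index on username: the query joins on that column with a simple equality comparison, and that join should benefit greatly from the index.

Another option to consider is a composite index on (username, domain), or even a covering index on (username, domain, leadid, email). The order of the columns in the index definition matters.

I would drop all other indexes so that the optimizer cannot make a different choice, unless other queries need them.

Defining a primary key on the table most likely would not hurt either.

One more, less important, thing to consider: does your data really contain NULLs? If not, define the columns as NOT NULL. Also, in many cases it is better to use empty strings rather than NULLs, unless you have very specific requirements and must distinguish between NULL and ''.

The query then becomes slightly simpler: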

select 
    c1.email, c2.email, 
    c1.domain, c2.domain, 
    c1.username, c2.username, 
    c1.leadid, c2.leadid
from 
    table as c1
    join table as c2
        on  c1.username = c2.username 
        and c1.domain like concat(c2.domain, '%')
        and c1.leadid <> c2.leadid
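The effect of joining on username first can be sketched in Python as follows. Grouping rows in a dictionary stands in for the index lookup on username (an analogy, not how MySQL executes the join), and the row layout `(leadid, username, reversed_domain)` is an assumption based on the columns in the question.

```python
from collections import defaultdict

def related_pairs(rows):
    """Return (c1_leadid, c2_leadid) pairs matching the join conditions above.

    rows: iterable of (leadid, username, reversed_domain) tuples.
    """
    by_username = defaultdict(list)
    for row in rows:
        by_username[row[1]].append(row)          # cheap equality bucket

    pairs = []
    for group in by_username.values():
        for c1 in group:
            for c2 in group:
                # c1.domain LIKE CONCAT(c2.domain, '%') AND c1.leadid <> c2.leadid
                if c1[0] != c2[0] and c1[2].startswith(c2[2]):
                    pairs.append((c1[0], c2[0]))
    return pairs

rows = [
    (1, "john.smith12", "moc.elpmaxe"),
    (2, "john.smith12", "moc.elpmaxe.sniamodbus"),
    (3, "john.doe", "ue.niamod.rehtona"),
]
pairs = related_pairs(rows)
print(pairs)
```

The prefix comparison only runs inside each username bucket, which is why it stays cheap when most usernames are distinct.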

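【Discussion】:

There are 560739 distinct records. Indexing username and dropping the domain index seems to have done the trick. Right now there are NULLs, yes, but I think converting them to empty strings makes more sense, so I will do that as well. Thanks. I will do some more testing to confirm this approach works and then accept.
This worked for me. Why would the index on domain hurt the query? I would have thought that having an index on the wildcard column would help.
@user3783243, it is not that the index on domain hurts the query; it just does not help. When joining with LIKE, MySQL is probably not smart enough to use an index. Using an index is much easier when joining with =. Also, the data distribution tells us this is a good choice: with 560K distinct usernames among 700K rows, only a few usernames have more than one domain, so after the = comparison has found the matching usernames in the self-join there are only a few rows left to check.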

【Solution 2】:

REGEXP_REPLACE is not needed, so this works in all versions of MySQL/MariaDB:

UPDATE tbl
    SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
                       '@',
                       SUBSTRING_INDEX(
                           SUBSTRING_INDEX(email, '@', -1),
                           '.',
                           -2));

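Since no index would be useful here, you may as well leave out the WHERE clause.

【Discussion】:

This makes the table more organized, but the select still takes a very long time and the index does not seem to be used. I killed the latest select (Update 2) after 66,000+ seconds of execution.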

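【Solution 3】:

If you are searching for related data, you should look at data-mining tools or Elasticsearch, for example, which work the way you need.

I have another possible "database-only" solution, but I do not know whether it is feasible or whether it is the best solution. If I had to do this, I would try to build a "word reference" table, populated by splitting all emails on every non-alphanumeric character.

In your example, this table would be populated with: john, smith12, some, subdomains, example and com. Each word gets a unique id. Then a second table, a union table, links each email to its own words. Both tables need indexes.

To search for closely related emails, you would split the source email with a regex and loop over each sub-word like this one in the answer (the one using CONNECT BY), then for each word look it up in the word reference table, and then use the union table to find the emails that match it.

From this request you can then aggregate all matched emails: group by email to count how many words match each email found, and keep only the best-matching emails (excluding the source email, of course).

Sorry for this "uncertain answer", but it was too long for a comment. I will try to give an example.

Here is an example with some data (written for Oracle; the approach should carry over to MySQL, although the syntax will need adapting):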

---------------------------------------------
-- Table containing emails and people info
CREATE TABLE PEOPLE (
     ID NUMBER(11) PRIMARY KEY NOT NULL,
     EMAIL varchar2(100) DEFAULT NULL,
     USERNAME varchar2(500) DEFAULT NULL
);

-- Table containing word references
CREATE TABLE WORD_REF (
     ID number(11) NOT NULL PRIMARY KEY,
     WORD varchar2(20) DEFAULT NULL
);

-- Table containing the ids of both previous tables
CREATE TABLE UNION_TABLE (
     EMAIL_ID number(11) NOT NULL,
     WORD_ID number(11) NOT NULL,
     CONSTRAINT EMAIL_FK FOREIGN KEY (EMAIL_ID) REFERENCES PEOPLE (ID),
     CONSTRAINT WORD_FK FOREIGN KEY (WORD_ID) REFERENCES WORD_REF (ID)
);

-- Here is my oracle sequence to simulate the auto increment
CREATE SEQUENCE MY_SEQ
  MINVALUE 1
  MAXVALUE 999999
  START WITH 1
  INCREMENT BY 1
  CACHE 20;

---------------------------------------------
-- Some data in the people table
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@example.com', 'jsmith12');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@some.subdomains.example.com', 'admin');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.doe@another.domain.eu', 'jdo');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'nathan.smith@example.domain.com', 'nsmith');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'david.cayne@some.domain.st', 'davidcayne');
COMMIT;

-- Word reference data from the people data
INSERT INTO WORD_REF (ID, WORD) 
  (select MY_SEQ.NEXTVAL, WORD FROM
   (select distinct REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
    from PEOPLE
    CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL
  ));
COMMIT;

-- Union table filling
INSERT INTO UNION_TABLE (EMAIL_ID, WORD_ID)
select words.ID EMAIL_ID, word_ref.ID WORD_ID
FROM 
(select distinct ID, REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
 from PEOPLE
 CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL) words
left join WORD_REF on word_ref.word = words.WORD;
COMMIT;    

---------------------------------------------
-- Finally, the request which ranks the emails that match the source email 'john.smith12@example.com'
SELECT COUNT(1) email_match
      ,email
FROM   (SELECT word_ref.id
              ,words.word
              ,uni.email_id
              ,ppl.email
        FROM   (SELECT DISTINCT regexp_substr('john.smith12@example.com'
                                             ,'\w+'
                                             ,1
                                             ,LEVEL) word
                FROM   dual
                CONNECT BY regexp_substr('john.smith12@example.com'
                                        ,'\w+'
                                        ,1
                                        ,LEVEL) IS NOT NULL) words
        LEFT   JOIN word_ref
        ON     word_ref.word = words.word
        LEFT   JOIN union_table uni
        ON     uni.word_id = word_ref.id
        LEFT   JOIN people ppl
        ON     ppl.id = uni.email_id)
WHERE  email <> 'john.smith12@example.com'
GROUP  BY email
ORDER  BY email_match DESC;

Request result:

    4    john.smith12@some.subdomains.example.com
    2    nathan.smith@example.domain.com
    1    john.doe@another.domain.eu
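The same word-matching idea can be sketched outside the database. The Python below builds an in-memory word index from the five sample emails and ranks the other emails by the number of shared words; the function names are hypothetical and the dictionary stands in for the WORD_REF and UNION_TABLE tables.

```python
import re
from collections import Counter, defaultdict

def words(email):
    """Distinct alphanumeric words of an email, like REGEXP_SUBSTR(..., '\\w+') with DISTINCT."""
    return set(re.findall(r"\w+", email))

def build_index(emails):
    """Map each word to the set of emails containing it."""
    index = defaultdict(set)
    for email in emails:
        for word in words(email):
            index[word].add(email)
    return index

def rank_related(source, index):
    """Return (email, shared_word_count) pairs, best match first, excluding the source."""
    scores = Counter()
    for word in words(source):
        for email in index.get(word, ()):
            if email != source:
                scores[email] += 1
    return scores.most_common()

emails = [
    "john.smith12@example.com",
    "john.smith12@some.subdomains.example.com",
    "john.doe@another.domain.eu",
    "nathan.smith@example.domain.com",
    "david.cayne@some.domain.st",
]
ranking = rank_related("john.smith12@example.com", build_index(emails))
for email, score in ranking:
    print(score, email)
```

The scores agree with the request result above (4, 2 and 1 shared words).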

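【Discussion】:

【Solution 4】:

You get the name (i.e. the part before the '@') with

substring_index(email, '@', 1)

You get the domain with

substring_index(replace(email, '@', '.'), '.', -2)

(because if we replace the '@' with a dot, the domain is always the part after the second-to-last dot).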
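A quick way to sanity-check the replace trick is to compare it with the more explicit "take the host, then its last two labels" form. The Python sketch below does that; `last_labels` is a hypothetical stand-in for `substring_index(..., '.', -n)` and both functions are illustrative only.

```python
def last_labels(s, n):
    """Last n dot-separated labels of s (the whole string if it has fewer)."""
    return ".".join(s.split(".")[-n:])

def domain_via_replace(email):
    """Models substring_index(replace(email, '@', '.'), '.', -2)."""
    return last_labels(email.replace("@", "."), 2)

def domain_via_host(email):
    """Models substring_index(substring_index(email, '@', -1), '.', -2)."""
    host = email.split("@")[-1]
    return last_labels(host, 2)

addr = "john.smith12@some.subdomains.example.com"
print(domain_via_replace(addr), domain_via_host(addr))

# The two forms differ only when the host has a single label.
print(domain_via_replace("root@localhost"), domain_via_host("root@localhost"))
```

For ordinary addresses both forms give example.com; for a single-label host such as localhost the replace form also keeps the username, which may or may not matter for your data.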

So you find the duplicates with

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and substring_index(other.email, '@', 1) = 
        substring_index(users.email, '@', 1)
    and substring_index(replace(other.email, '@', '.'), '.', -2) =
        substring_index(replace(users.email, '@', '.'), '.', -2)
);

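If that is too slow, you may want to create a computed (generated) column on the combination of the two and index it. In MySQL (5.7 and later) the column needs a data type and a parenthesized expression:

alter table users add main_email varchar(100) as 
  (concat(substring_index(email, '@', 1), '@', substring_index(replace(email, '@', '.'), '.', -2)));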

create index idx on users(main_email);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.main_email = users.main_email
);
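What the indexed `main_email` column buys you can be sketched in Python: compute one canonical key per row, then group on it instead of comparing every pair of rows. The lower-casing step and the helper names are assumptions added for illustration.

```python
from collections import defaultdict

def main_email(email):
    """Canonical key: name before the first '@' plus the last two labels of the host."""
    name, _, host = email.strip().lower().partition("@")  # lower-casing is an assumption
    domain = ".".join(host.split(".")[-2:])
    return f"{name}@{domain}"

def duplicate_groups(rows):
    """rows: iterable of (id, email). Return {canonical_key: [ids]} for keys seen more than once."""
    groups = defaultdict(list)
    for row_id, email in rows:
        groups[main_email(email)].append(row_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

rows = [
    (1, "john.smith12@example.com"),
    (2, "John.Smith12@some.subdomains.example.com"),
    (3, "john.doe@another.domain.eu"),
]
dups = duplicate_groups(rows)
print(dups)
```

One design caveat: keeping only the last two labels treats every host under a two-label suffix such as co.uk as the same domain, so data with such addresses would need a different rule.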

Of course, you can also keep the two parts separate and index them:

alter table users add email_name varchar(100) as (substring_index(email, '@', 1));
alter table users add email_domain varchar(100) as (substring_index(replace(email, '@', '.'), '.', -2));

create index idx on users(email_name, email_domain);

select *
from users
where exists
(
  select *
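  from users other
  where other.id <> users.id
    and other.email_name = users.email_name
    and other.email_domain = users.email_domain
);

Of course, if you allow both upper and lower case in the email address column, you will also want to apply LOWER in the expressions above, i.e. lower(email).

【Discussion】: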