【Question title】: Find duplicates in multiple email fields
【Posted】: 2020-08-07 02:59:47
【Question description】:

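I have an odd requirement. I need to find duplicate contact records in my database (which should be simple), but the catch is that I have to match on first name, last name, and any email field that matches any other email field.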

Example:
FirstName | LastName | Email               | WorkEmail     | AnotherEmail
John      | Smith    | jh@jh.com           | test@test.com | yougettheIdea.com
John      | Smith    | test@test.com       |               |
John      | Smith    | imAdifferent.jh.com |               |

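In this example I need to determine that the John Smith in rows 1 and 2 is a duplicate record, but the one in row 3 is not. Basically, I need a query that matches FirstName to FirstName, LastName to LastName, and any email field to any other email field... Is this even possible?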

I got this far for matching on first and last name, but the email part is beyond me:

SELECT * FROM
      (SELECT "FirstName","LastName","Email","WorkEmail","AnotherEmail", count(*)
      OVER
        (PARTITION BY
          "FirstName",
          "LastName"
        ) AS count
      FROM Contact) tableWithCount
      WHERE tableWithCount.count > 1 ORDER BY count DESC;

【Question discussion】:

    Tags: postgresql


    【Solution 1】:

    I would build an array of the email columns and then check for duplicates using the containment operator @> in both directions.

    select *
    from contact c1
    where exists (select * 
                  from contact c2
                  where (c1.first_name, c1.last_name) = (c2.first_name, c2.last_name)
                    and (    
                        array_remove(array[c1.email, c1.work_email, c1.another_email],null) @> array_remove(array[c2.email, c2.work_email, c2.another_email], null)
                     or array_remove(array[c1.email, c1.work_email, c1.another_email],null) <@ array_remove(array[c2.email, c2.work_email, c2.another_email], null) 
                    )
                    and c1.ctid <> c2.ctid
                  );
    

    The expression c1.ctid <> c2.ctid is there to avoid comparing a row with itself. If your table has a primary key or unique key, use that column instead.
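
Containment only matches when one row's set of emails is a subset of the other's, and an empty array is contained in every array, so a contact with no emails at all would be reported as a duplicate of every contact with the same name. If the requirement is strictly "any email equals any email", the array overlap operator && may be a closer fit. The following is a minimal sketch, not part of the original answer: the CTE stands in for the real table with the question's sample rows, and the id column is an assumed primary key.

```sql
-- Hypothetical sample data mirroring the question's three rows.
WITH contact (id, first_name, last_name, email, work_email, another_email) AS (
    VALUES
        (1, 'John', 'Smith', 'jh@jh.com',           'test@test.com', 'yougettheIdea.com'),
        (2, 'John', 'Smith', 'test@test.com',       NULL,            NULL),
        (3, 'John', 'Smith', 'imAdifferent.jh.com', NULL,            NULL)
)
SELECT c1.*
FROM contact c1
WHERE EXISTS (
    SELECT 1
    FROM contact c2
    WHERE (c1.first_name, c1.last_name) = (c2.first_name, c2.last_name)
      AND c1.id <> c2.id  -- assumed primary key, used instead of ctid
      -- && is true when the two arrays share at least one element
      AND array_remove(ARRAY[c1.email, c1.work_email, c1.another_email], NULL)
          && array_remove(ARRAY[c2.email, c2.work_email, c2.another_email], NULL)
);
```

With these sample rows the query should return the contacts with id 1 and 2, which share test@test.com, and leave out id 3.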


    【Discussion】:

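      【Solution 2】:

      I would use exists logic here, checking each email column against the others. The following query identifies all records that do not have a duplicate.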

      SELECT *
      FROM Contact c1
      WHERE NOT EXISTS (SELECT 1 FROM Contact c2
                        WHERE c2.LastName = c1.LastName AND c2.FirstName = c1.FirstName AND
                              c2.id <> c1.id AND  -- assuming there is a PK column id
                              (c2.Email = c1.Email OR c2.WorkEmail = c1.Email OR
                               c2.AnotherEmail = c1.Email
                               OR
                               c2.Email = c1.WorkEmail OR c2.WorkEmail = c1.WorkEmail OR
                               c2.AnotherEmail = c1.WorkEmail
                               OR
                               c2.Email = c1.AnotherEmail OR c2.WorkEmail = c1.AnotherEmail OR
                               c2.AnotherEmail = c1.AnotherEmail));
      

      If you want to find all records that do have duplicates, change NOT EXISTS to EXISTS.
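
The nine pairwise comparisons can also be written more compactly with IN, which may be easier to maintain if more email columns are added later. This is only a sketch of the EXISTS form under the same assumptions as the query above (a primary key column id); the CTE is hypothetical sample data standing in for the real Contact table.

```sql
-- Hypothetical sample data mirroring the question's three rows.
WITH Contact (id, FirstName, LastName, Email, WorkEmail, AnotherEmail) AS (
    VALUES
        (1, 'John', 'Smith', 'jh@jh.com',           'test@test.com', 'yougettheIdea.com'),
        (2, 'John', 'Smith', 'test@test.com',       NULL,            NULL),
        (3, 'John', 'Smith', 'imAdifferent.jh.com', NULL,            NULL)
)
SELECT c1.*
FROM Contact c1
WHERE EXISTS (
    SELECT 1
    FROM Contact c2
    WHERE c2.LastName = c1.LastName
      AND c2.FirstName = c1.FirstName
      AND c2.id <> c1.id
      -- A NULL on either side never evaluates to true, so empty fields do not match.
      AND (   c2.Email        IN (c1.Email, c1.WorkEmail, c1.AnotherEmail)
           OR c2.WorkEmail    IN (c1.Email, c1.WorkEmail, c1.AnotherEmail)
           OR c2.AnotherEmail IN (c1.Email, c1.WorkEmail, c1.AnotherEmail))
);
```

Note that the question's table uses quoted mixed-case names such as "FirstName"; in that case the identifiers in the query need the same double quotes, because PostgreSQL folds unquoted names to lowercase.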

      【Discussion】:
