Find approximate duplicates in a list: answers

【Question title】: Find approximate duplicates in a list
【Posted】: 2021-08-31 02:01:15
【Question】:

I have a list of 300k people that contains some duplicates and, more importantly, some approximate duplicates.

Example:

Id  LastName  FirstName        BirthDate
1   Kennedy   John             01/01/2000
2   Kennedy   John Fitzgerald  01/01/2000

I want to find these duplicates and process them separately. I found some examples that use LINQ's GroupBy, but none that handle these two subtleties:

Matching first names with StartsWith
Keeping the whole object intact (not just the last name projected with Select new)

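For now I have the following. It does the job, but it is very, very slow (every person is compared against the whole list), and I am fairly sure it could be done more cleanly: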

var dictionary = new Dictionary<int, List<Person>>();
int key = 1; // the Key could be a string built with LastName, first letters of FirstName... but finally this integer is enough
foreach (var c in ListPersons)
{
    List<Person> doubles = ListPersons
        .Where(x => x.Id != c.Id
        && x.LastName == c.LastName
        && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName)) // cause dupe A could be "John" and B "John F". Or... dupe A could be "John F" and B "John"
        && x.BirthDate == c.BirthDate 
        ).ToList();

    if (doubles.Any())
    {
       doubles.Add(c); // add the current guy
       dictionary.Add(key++, doubles);
    }

    // Ugly hack to remove the doubles already found
    ListPersons = ListPersons.Except(doubles).ToList();
}

// Later I will read my dictionary and treat Value by Value, Person by Person (duplicate by duplicate)

Final version:

With the help of the answer below and an IEqualityComparer:

// Speedo x1000 !
var listDuplicates = ListPersons
.GroupBy(x => x, new PersonComparer())
.Where(g => g.Count() > 1) // I want to keep the duplicates
.ToList();

// Then, I treat the duplicates in my own way using all properties of the Person I need
foreach (var listC in listDuplicates)
{
 foreach (Person c in listC)
 {
   // Some treatment
 }
}

【Comments】:

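See this question for string comparison with a tolerance: Comparing strings with tolerance. You could try to implement that in your solution.
Good idea for a v2 with a deeper search :) ! Thank you :)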
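
To make the "comparison with tolerance" idea from the comment above concrete, here is a minimal sketch of one common approach, the Levenshtein edit distance. It is not necessarily the method used in the linked question, and the names FuzzyText, Levenshtein, IsClose and maxEdits are made up for illustration.

```csharp
using System;

public static class FuzzyText
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    // Uses two rolling rows instead of a full matrix.
    public static int Levenshtein(string a, string b)
    {
        a = a ?? "";
        b = b ?? "";
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            // Swap rows for the next iteration.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Hypothetical helper: treats two strings as "close" within maxEdits edits.
    public static bool IsClose(string a, string b, int maxEdits)
    {
        return Levenshtein(a, b) <= maxEdits;
    }
}
```

A tolerance test like this is not transitive and has no matching hash code, so it probably cannot be dropped into an IEqualityComparer used for hashing. A likelier fit is to run it pairwise inside groups that already share LastName and BirthDate.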

Tags: c# list linq duplicates

【Solution 1】:

You can always build your own IEqualityComparer<T>:

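using System;
using System.Collections.Generic;

public class PersonComparer : IEqualityComparer<Person>
{
    // Equal when LastName and BirthDate match exactly and one FirstName is a prefix of the other.
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        if (x.LastName != y.LastName || x.BirthDate != y.BirthDate) return false;
        // Guard against null: string.StartsWith(null) throws ArgumentNullException.
        if (x.FirstName is null || y.FirstName is null) return false;
        // Ordinal comparison, consistent with the == used for LastName.
        return x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
            || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal);
    }

    // Hash only the fields that are compared exactly. FirstName is left out
    // because "John" and "John F" must end up with the same hash code.
    public int GetHashCode(Person obj)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + (obj?.LastName?.GetHashCode() ?? 0);
            hash = hash * 23 + (obj?.BirthDate.GetHashCode() ?? 0);
            return hash;
        }
    }
}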

If you only want to keep the first person of each group and drop the other duplicates:

ListPersons = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Select(g => g.First())
    .ToList();

You can use this comparer with many other LINQ methods, and even with a Dictionary or a HashSet<T>. For example, you can also remove duplicates this way:

HashSet<Person> persons = new HashSet<Person>(ListPersons, new PersonComparer());

Another way, with pure LINQ:

ListPersons = ListPersons.Distinct(new PersonComparer()).ToList();
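
One caveat about a prefix-based comparer: the relation is not transitive. "John" matches both "John F" and "John G", but those two do not match each other, so which group a person lands in can depend on the order of the input. The sketch below is an alternative that avoids relying on the comparer for the fuzzy part. It groups on the exact fields first, then clusters first names by prefix inside each small bucket. The Person shape, DuplicateFinder and FindApproximateDuplicates are assumptions for illustration; the original post does not show the Person class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal Person type (assumed, not shown in the original post).
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

public static class DuplicateFinder
{
    // Returns clusters of two or more people who share LastName and BirthDate
    // and whose FirstName starts with the shortest FirstName in the cluster.
    public static List<List<Person>> FindApproximateDuplicates(IEnumerable<Person> people)
    {
        var result = new List<List<Person>>();

        // Exact part of the match: a single hash-based pass over the list.
        foreach (var bucket in people.GroupBy(p => (p.LastName, p.BirthDate)))
        {
            var clusters = new List<List<Person>>();

            // Visit shortest first names first, so every cluster is seeded
            // by its shortest name and longer names attach to it.
            foreach (var p in bucket.OrderBy(p => (p.FirstName ?? "").Length))
            {
                var first = p.FirstName ?? "";
                var home = clusters.FirstOrDefault(
                    c => first.StartsWith(c[0].FirstName ?? "", StringComparison.Ordinal));

                if (home == null) clusters.Add(new List<Person> { p });
                else home.Add(p);
            }

            // Keep only real duplicates, with the whole Person objects intact.
            result.AddRange(clusters.Where(c => c.Count > 1));
        }
        return result;
    }
}
```

With this layout the expensive pairwise prefix checks only run among people who already share a last name and a birth date, which should keep the work close to linear for realistic data.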

【Comments】:

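Thank you very much! This is indeed far more efficient! :) I will complete my original post with the final code.
@Grimness: I guess the HashSet approach is the most efficient, or something similar: ListPersons.Distinct(new PersonComparer()). Added it to the answer. The advantage of GroupBy is that you can easily add logic for which duplicate to keep: use g.OrderBy(logic).First().
Maybe you have already seen my edit to the original post: I do in fact want to "keep" the duplicates :). I have no doubt that Distinct is more efficient, but GroupBy is already so fast that it is enough for me :). Anyway, it may help others! Thanks again.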
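
The g.OrderBy(logic).First() idea from the comments can be sketched as a small generic helper. DuplicateResolver, KeepBest and the preference parameter are hypothetical names, and "highest preference wins" is just one possible rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateResolver
{
    // For each group of items that the comparer treats as equal, keeps the
    // item with the highest preference value and drops the others.
    // Groups come out in the order their first member appears in the input.
    public static List<T> KeepBest<T, TKey>(
        IEnumerable<T> items,
        IEqualityComparer<T> comparer,
        Func<T, TKey> preference)
    {
        return items
            .GroupBy(x => x, comparer)
            .Select(g => g.OrderByDescending(preference).First())
            .ToList();
    }
}
```

Assuming the PersonComparer from the answer, keeping the record with the most complete first name might look like: ListPersons = DuplicateResolver.KeepBest(ListPersons, new PersonComparer(), p => p.FirstName?.Length ?? 0);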