【Posted】: 2021-08-31 02:01:15
【Problem description】:
I have a list of 300k persons that contains some duplicates and, more importantly, some near-duplicates.
For example:
    Id  LastName  FirstName        BirthDate
    1   Kennedy   John             01/01/2000
    2   Kennedy   John Fitzgerald  01/01/2000
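I want to find these duplicates and process them separately. I found some examples using LINQ's GroupBy, but none of them handles these two subtleties:
- matching the first name with StartsWith
- keeping the whole object intact (not just the last name projected with Select new)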
For now I have the following. It does the job, but it is very, very slow, and I am fairly sure it could be written more cleanly:
    var dictionary = new Dictionary<int, List<Person>>();
    int key = 1; // the Key could be a string built with LastName, first letters of FirstName... but finally this integer is enough
    foreach (var c in ListPersons)
    {
        List<Person> doubles = ListPersons
            .Where(x => x.Id != c.Id
                && x.LastName == c.LastName
                && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName)) // cause dupe A could be "John" and B "John F". Or... dupe A could be "John F" and B "John"
                && x.BirthDate == c.BirthDate
            ).ToList();
        if (doubles.Any())
        {
            doubles.Add(c); // add the current guy
            dictionary.Add(key++, doubles);
        }
        // Ugly hack to remove the doubles already found
        ListPersons = ListPersons.Except(doubles).ToList();
    }
    // Later I will read my dictionary and treat Value by Value, Person by Person (duplicate by duplicate)
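The loop above rescans the whole list for every person, so its cost grows quadratically with the list size. One possible way to cut that down, sketched here under assumptions, is to bucket on the fields that must match exactly (LastName and BirthDate) and run the prefix test only inside each bucket. The `Person` shape, the `DuplicateFinder` class and `FindNearDuplicates` are hypothetical names, not from the question. The sketch also assumes `FirstName` is never null and uses ordinal comparison, whereas the question's `StartsWith` calls are culture-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Person shape, inferred from the properties used in the question.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Hypothetical helper, not part of the original question.
public static class DuplicateFinder
{
    // Returns groups of two or more persons that share LastName and BirthDate
    // and whose first names start with the first name that seeded the group.
    public static List<List<Person>> FindNearDuplicates(IEnumerable<Person> persons)
    {
        var result = new List<List<Person>>();

        // Exact-match fields form the bucket key, so only small buckets are scanned.
        foreach (var bucket in persons.GroupBy(p => (p.LastName, p.BirthDate)))
        {
            var groups = new List<List<Person>>();

            // Shortest first names come first, so g[0] is always the shortest name
            // in its group and a single StartsWith direction is enough.
            foreach (var p in bucket.OrderBy(p => p.FirstName.Length))
            {
                var home = groups.FirstOrDefault(g =>
                    p.FirstName.StartsWith(g[0].FirstName, StringComparison.Ordinal));

                if (home == null)
                    groups.Add(new List<Person> { p });
                else
                    home.Add(p);
            }

            // Keep only groups that actually contain duplicates.
            result.AddRange(groups.Where(g => g.Count > 1));
        }

        return result;
    }
}
```

Each returned list holds complete `Person` objects, so every property stays available for later processing.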
Finally:
With the help of the answers below and an IEqualityComparer:
    // About 1000x faster!
    var listDuplicates = ListPersons
        .GroupBy(x => x, new PersonComparer())
        .Where(g => g.Count() > 1) // I want to keep the duplicates
        .ToList();
    // Then, I treat the duplicates in my own way using all properties of the Person I need
    foreach (var listC in listDuplicates)
    {
        foreach (Person c in listC)
        {
            // Some treatment
        }
    }
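The snippet above relies on a `PersonComparer` that the question does not show. A minimal sketch of what such a comparer might look like follows. The `Person` shape and the whole implementation are assumptions, and the comparison is ordinal rather than culture-sensitive. The point it illustrates is that `GetHashCode` may use only the exactly-matched fields, because two persons that `Equals` treats as equal must return the same hash code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Person shape, inferred from the properties used in the question.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

// Assumed implementation: the question names PersonComparer but does not show it.
public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person a, Person b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a is null || b is null) return false;

        // Same last name and birth date, and one first name is a prefix of the other.
        // Assumes FirstName is never null.
        return a.LastName == b.LastName
            && a.BirthDate == b.BirthDate
            && (a.FirstName.StartsWith(b.FirstName, StringComparison.Ordinal)
                || b.FirstName.StartsWith(a.FirstName, StringComparison.Ordinal));
    }

    public int GetHashCode(Person p)
    {
        if (p is null) return 0;

        // FirstName is deliberately left out: "John" and "John Fitzgerald"
        // must land in the same hash bucket for Equals to be consulted at all.
        return (p.LastName, p.BirthDate).GetHashCode();
    }
}
```

One caveat with this design: prefix matching is not transitive ("John" matches both "John F" and "John G", which do not match each other), so the groups that GroupBy produces can depend on the order of the input list.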
【Discussion】:
-
See this question for string comparison with tolerance: Comparing strings with tolerance. You could try implementing that in your solution.
-
Good idea for a v2 with a deeper search :) ! Thanks :)
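As an illustration of what "comparison with tolerance" could mean here, one common technique is the Levenshtein edit distance, where two strings count as matching when the number of single-character edits between them stays under a threshold. The sketch below is an assumption about how such a check might look, not the approach of the linked question. `FuzzyMatch`, `EditDistance` and `IsCloseTo` are hypothetical names.

```csharp
using System;

// Hypothetical helper illustrating tolerance-based string matching.
public static class FuzzyMatch
{
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions or substitutions needed to turn s into t.
    public static int EditDistance(string s, string t)
    {
        var prev = new int[t.Length + 1];
        var curr = new int[t.Length + 1];

        // Distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }

            // Reuse the two row buffers instead of allocating a full matrix.
            var tmp = prev;
            prev = curr;
            curr = tmp;
        }

        return prev[t.Length];
    }

    // True when the two strings differ by at most `tolerance` edits.
    public static bool IsCloseTo(string s, string t, int tolerance)
        => EditDistance(s, t) <= tolerance;
}
```

A check such as `FuzzyMatch.IsCloseTo(x.FirstName, c.FirstName, 1)` could stand in for the StartsWith test, at the cost of more work per comparison.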
Tags: c# list linq duplicates