How to find duplicate values in a list and merge them: answers

【Question title】: How to find duplicate values in a list and merge them
【Posted】: 2019-03-17 11:04:27
【Question description】:

Say, for example, you have a list like this:

l = ['a','b','a','b','c','c']

The output should be:

[['a','a'],['b','b'],['c','c']]

So basically, each group of duplicate values should end up in its own list.

I tried:

l = ['a','b','a','b','c','c']
it=iter(sorted(l))
next(it)
new_l=[]
for i in sorted(l):
   new_l.append([])
   if next(it,None)==i:
      new_l[-1].append(i)
   else:
      new_l.append([])

But it doesn't work, and even if it did, it wouldn't be efficient.

【Question discussion】:

Tags: python list merge duplicates nested-lists

【Solution 1】:

Sort the list, then use itertools.groupby:

>>> from itertools import groupby
>>> l = ['a','b','a','b','c','c']
>>> [list(g) for _, g in groupby(sorted(l))]
[['a', 'a'], ['b', 'b'], ['c', 'c']]

Edit: this may not be the fastest approach. Sorting has O(n log n) average-case time complexity, which not every solution requires (see the comments).
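To illustrate why the sort is there at all: groupby only merges runs of adjacent equal items, so on unsorted input the same value can end up in several groups. A minimal sketch (the variable names are just for illustration):

```python
from itertools import groupby

l = ['a', 'b', 'a', 'b', 'c', 'c']

# Without sorting, only adjacent equal items are grouped together.
unsorted_groups = [list(g) for _, g in groupby(l)]
# After sorting, equal items are adjacent, so each value forms one group.
sorted_groups = [list(g) for _, g in groupby(sorted(l))]

print(unsorted_groups)  # [['a'], ['b'], ['a'], ['b'], ['c', 'c']]
print(sorted_groups)    # [['a', 'a'], ['b', 'b'], ['c', 'c']]
```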

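【Discussion】:

However, this takes O(n log n) time on average.
@blhsing Yes, I know. I'm actually not sure this is the best solution; it was just my first thought (a quick fix was needed). I'll defer judgment to a benchmark.
@Chris_Rands It is well known that Python's sorted function has O(n log n) average time complexity.
@blhsing Yes, you just said that, and I agree :)
@U9-Forward Thanks, but I'm not convinced this is the best approach. Austin's or blhsing's solution is probably faster and, with the OrderedCounter recipe added, would preserve order.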
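For reference, the OrderedCounter recipe mentioned in the last comment could look roughly like this. It is a sketch based on the recipe in the collections documentation, trimmed to the essentials. On Python 3.7+ a plain Counter already keeps first-seen order, so the subclass mainly matters on older versions.

```python
from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    """Counter that remembers the order in which elements are first seen."""

l = ['b', 'a', 'b', 'a', 'c', 'c']

# Groups come out in first-seen order rather than sorted order.
grouped = [[k] * n for k, n in OrderedCounter(l).items()]
print(grouped)  # [['b', 'b'], ['a', 'a'], ['c', 'c']]
```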

【Solution 2】:

Use collections.Counter:

from collections import Counter

l = ['a','b','a','b','c','c']
c = Counter(l)

print([[x] * y for x, y in c.items()])
# [['a', 'a'], ['b', 'b'], ['c', 'c']]

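【Discussion】:

Works too, nice.
This is the best solution. It is easy to read and requires no sorting (if you are using a Python version whose dictionaries remember insertion order).
@timgeb Agreed! Though of course sorting and preserving insertion order do not always produce the same output (although they do for this data); I'm not sure what exactly the OP wants.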
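A small example of that difference, assuming Python 3.7+ so that Counter keeps insertion order. The list `data` is made up for illustration and is not from the question:

```python
from collections import Counter
from itertools import groupby

data = ['b', 'a', 'b', 'a']

# Sorting orders the groups by value.
by_sorting = [list(g) for _, g in groupby(sorted(data))]
# Counter orders the groups by first appearance.
by_first_seen = [[k] * n for k, n in Counter(data).items()]

print(by_sorting)     # [['a', 'a'], ['b', 'b']]
print(by_first_seen)  # [['b', 'b'], ['a', 'a']]
```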

【Solution 3】:

You can use collections.Counter:

from collections import Counter

l = ['a','b','a','b','c','c']  # the list from the question
[[k] * c for k, c in Counter(l).items()]

This returns:

[['a', 'a'], ['b', 'b'], ['c', 'c']]

`%%timeit` comparison

Given a sample dataset of 100000 values, this answer is the fastest approach.
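The original timing figures are not reproduced here. Anyone who wants to check the claim on their own machine could use something like the following. It is only a sketch: it uses the standard timeit module instead of the `%%timeit` notebook magic, and the random sample and the function names are assumptions, not the data used in the answer.

```python
import random
import timeit
from collections import Counter
from itertools import groupby

random.seed(0)
# Hypothetical sample: 100000 values drawn from three letters.
sample = [random.choice('abc') for _ in range(100_000)]

def with_groupby(values):
    return [list(g) for _, g in groupby(sorted(values))]

def with_counter(values):
    return [[k] * n for k, n in Counter(values).items()]

for func in (with_groupby, with_counter):
    seconds = timeit.timeit(lambda: func(sample), number=10)
    print(f"{func.__name__}: {seconds / 10:.4f} s per call")
```

Results will vary with the machine, the Python version and the shape of the data, so it is worth running on data that resembles your own.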

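【Discussion】:

Works too, nice.
Note that Counter() has O(n) average time complexity.

【Solution 4】:

Another approach is to use zip. It relies on every value occurring the same number of times.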

l = ['a','b','a','b','c','c','b','c', 'a']
l = sorted(l)
grouped = [list(item) for item in list(zip(*[iter(l)] * l.count(l[0])))]

Output:

[['a', 'a', 'a'], ['b', 'b', 'b'], ['c', 'c', 'c']]
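One thing to be aware of: `zip(*[iter(l)] * n)` simply cuts the sorted list into chunks of n items. A quick sketch of what happens when the counts differ (the helper name `chunk_by_first_count` is made up for illustration):

```python
def chunk_by_first_count(values):
    # Same idea as the answer above: sort, then cut into chunks whose size
    # is the number of occurrences of the first (smallest) value.
    ordered = sorted(values)
    size = ordered.count(ordered[0])
    return [list(chunk) for chunk in zip(*[iter(ordered)] * size)]

equal_counts = chunk_by_first_count(['a', 'b', 'a', 'b', 'c', 'c'])
unequal_counts = chunk_by_first_count(['a', 'a', 'b', 'c', 'c', 'c'])

print(equal_counts)    # [['a', 'a'], ['b', 'b'], ['c', 'c']]
print(unequal_counts)  # [['a', 'a'], ['b', 'c'], ['c', 'c']]
```

With unequal counts the chunks mix different values, and zip silently drops any leftover items that do not fill a whole chunk.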

【Discussion】:

Works too, nice.

【Solution 5】:

Here is a functional solution using itertools.groupby. Since sorting is required, the time complexity is O(n log n).

from itertools import groupby
from operator import itemgetter

L = ['a','b','a','b','c','c']

res = list(map(list, map(itemgetter(1), groupby(sorted(L)))))

[['a', 'a'], ['b', 'b'], ['c', 'c']]

The syntax is cumbersome because Python does not provide native function composition. The third-party library toolz supports this:

from toolz import compose

foo = compose(list, itemgetter(1))
res = list(map(foo, groupby(sorted(L))))
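If installing toolz is not an option, a small compose helper can be written with the standard library. The sketch below is a hand-rolled stand-in, not the toolz implementation; like toolz's compose, it applies functions right to left.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

L = ['a', 'b', 'a', 'b', 'c', 'c']

# Take the group (index 1 of each (key, group) pair) and turn it into a list.
group_to_list = compose(list, itemgetter(1))
res = list(map(group_to_list, groupby(sorted(L))))
print(res)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
```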

【Discussion】:

Works too, not bad.

【Solution 6】:

My solution using a list comprehension (where l is a list):

[l.count(x) * [x] for x in set(l)]

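set(l) retrieves all the elements that appear in l, without duplicates.
l.count(x) returns the number of times a given element x appears in the list l.
The * operator creates a new list in which the elements of a list (here [x]) are repeated the specified number of times (here l.count(x)).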
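Here are the three pieces tried out separately on the list from the question. Because a set has no guaranteed iteration order, this sketch sorts the groups to get a stable result. Also note that every l.count(x) call scans the whole list again.

```python
l = ['a', 'b', 'a', 'b', 'c', 'c']

unique = set(l)         # {'a', 'b', 'c'}, in arbitrary order
count_a = l.count('a')  # 2
repeated = 2 * ['a']    # ['a', 'a']

# Sorting the groups makes the output independent of set iteration order.
groups = sorted(l.count(x) * [x] for x in set(l))
print(groups)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
```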

【Discussion】:

【Solution 7】:

l = ['a','b','a','b','c','c']

want = []
for i in set(l):
    want.append(list(filter(lambda x: x == i, l)))
print(want)

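【Discussion】:

The time complexity is O(n**2).
Works too, nice.
Timgeb, you're right, but maybe size/speed doesn't matter.
While this may answer the author's question, it lacks explanatory text and links to documentation. Raw code snippets are not very helpful without some explanation around them. You may also find how to write a good answer very helpful. Please edit your answer.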
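If the quadratic cost does matter, the same grouping can be done in a single pass with a dictionary of lists. This is a sketch of an alternative, not the answer's method, and it assumes the items are hashable:

```python
from collections import defaultdict

l = ['a', 'b', 'a', 'b', 'c', 'c']

groups = defaultdict(list)
for item in l:
    groups[item].append(item)  # one pass; each item goes into its own bucket

want = list(groups.values())
print(want)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
```

On Python 3.7+ the groups appear in first-seen order, since dictionaries preserve insertion order.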

【Solution 8】:

It may not be the most efficient, but it is easy to understand:

l = ['a','b','a','b','c','c']
dict = {}
for i in l:
    if dict[i]:
        dict[i] += 1
    else:
         dict[i] = 1

new = []
for key in list(dict.keys()):
    new.append([key] * dict[key])

【Discussion】:

This raises a KeyError. Also, don't use the name of a Python built-in (dict) as a variable name.
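A version of the same idea that addresses both points might look like this. It is a sketch: the name `counts` is arbitrary, and dict.get with a default of 0 replaces the failing `dict[i]` lookup.

```python
l = ['a', 'b', 'a', 'b', 'c', 'c']

counts = {}
for item in l:
    # get() returns 0 for unseen keys instead of raising KeyError.
    counts[item] = counts.get(item, 0) + 1

new = [[key] * count for key, count in counts.items()]
print(new)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
```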
