How to optimally count elements in a Python list: answers

【Question title】: How to optimally count elements in a Python list
【Posted】: 2010-12-16 01:35:23
【Question】:

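This is almost the same question as the one here, except that I am asking for the most efficient solution that returns a sorted result.

I have a list (about 10 integers drawn at random between 0 and 12), for example: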

the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]

I want to write a function that returns a list of (item, count) tuples sorted by item, for example:

output = [(4, 3), (5, 4), (6, 1), (7, 2)]

So far I have used:

def dupli(the_list):
    return [(item, the_list.count(item)) for item in sorted(set(the_list))]

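But I call this function almost a million times, and I need it to be as fast as possible (in Python). So my question is: how can I make this function take less time? (And what about memory?)

I played around a bit, but nothing obvious came up: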

from timeit import Timer as T
number=10000
setup = "the_list=[5, 7, 6, 5, 5, 4, 4, 7, 5, 4]"

stmt = "[(item, the_list.count(item)) for item in sorted(set(the_list))]"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[230]: 0.058799982070922852

stmt = "L = []; \nfor item in sorted(set(the_list)): \n    L.append((item, the_list.count(item)))"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[233]: 0.065041065216064453

stmt = "[(item, the_list.count(item)) for item in set(sorted(the_list))]"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[236]: 0.098351955413818359

Thanks,
Christophe

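【Question comments】:

Which Python version are you using?
As a programmer, I would not ask myself "how can I make this take less time?" but rather "how can I avoid doing this a million times?" Are you sure that the algorithm that needs this function is optimal on the larger scale to begin with?
If you call your function "almost a million times", that will take about 5 seconds. Is that really a problem?
To DGH: I am simulating poker hands. Given the complexity of the code inside the loop (which runs almost a million times), I don't think I can vectorize it or avoid calling dupli at least once per iteration.
To Sven Marnach: It is not a problem, since this poker program is just for fun and I am taking the opportunity to learn more Python. However, chances are that I will run these million hands many times, or that I will want an instant answer (for example, if I am playing online in parallel).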
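
The estimate of about 5 seconds in the comment above can be checked against the first timing in the question (roughly 0.0588 s for 10000 calls). A minimal sketch of the arithmetic, assuming the per-call cost stays constant over a million calls; the variable names are illustrative:

```python
# Timing reported in the question for the list-comprehension version.
total_seconds = 0.058799982070922852   # time for `number` calls
number = 10000                         # calls made by timeit

per_call = total_seconds / number      # seconds per call
million_calls = per_call * 1000000     # projected time for one million calls

print("per call: %.2f microseconds" % (per_call * 1e6))
print("one million calls: %.1f seconds" % million_calls)
```

This gives a little under 6 microseconds per call, so a million calls land in the range of 5 to 6 seconds on the asker's machine.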

Tags: python list performance count

【Solution 1】:

Changing where the sort happens saves about 20%.

Change this:

def dupli(the_list):
    return [(item, the_list.count(item)) for item in sorted(set(the_list))]

to this:

def dupli(the_list):
    count = the_list.count # this optimization added courtesy of Sven's comment
    result = [(item, count(item)) for item in set(the_list)]
    result.sort()
    return result

The reason this is faster is that sorted() has to create a temporary list, whereas result.sort() sorts the result in place.
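
The difference between the two calls can be seen in a short sketch (written for Python 3; the variable names are illustrative). sorted() returns a new list and leaves its argument alone, while list.sort() reorders the existing list and returns None:

```python
pairs = [(7, 2), (4, 3), (6, 1), (5, 4)]

# sorted() builds and returns a new list; the input is unchanged.
new_list = sorted(pairs)
assert new_list is not pairs
assert pairs == [(7, 2), (4, 3), (6, 1), (5, 4)]

# list.sort() sorts the existing list in place and returns None.
returned = pairs.sort()
assert returned is None
assert pairs == [(4, 3), (5, 4), (6, 1), (7, 2)]
```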

Edit: Here is another approach that is 35% faster than the original:

def dupli(the_list):
    counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for n in the_list:
        counts[n] += 1
    return [(i, counts[i]) for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) if counts[i]]

Note: You may want to randomize the values in the_list. My final version of dupli tests even faster with other random data sets (import random; the_list=[random.randint(0,12) for i in xrange(10)]).
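
A benchmark on randomized input, as the note suggests, might look like the sketch below. It is written for Python 3 (range instead of xrange). The names dupli_count and dupli_table, the fixed seed, and the sample sizes are illustrative choices, and no timing results are claimed here:

```python
import random
import timeit

def dupli_count(the_list):
    # First version from this answer: one count() call per distinct value.
    count = the_list.count
    result = [(item, count(item)) for item in set(the_list)]
    result.sort()
    return result

def dupli_table(the_list):
    # Second version from this answer: fixed table for the values 0..12.
    counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for n in the_list:
        counts[n] += 1
    return [(i, counts[i]) for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) if counts[i]]

random.seed(0)  # fixed seed so that runs are repeatable
samples = [[random.randint(0, 12) for _ in range(10)] for _ in range(1000)]

# Both versions must agree before their timings are compared.
assert all(dupli_count(s) == dupli_table(s) for s in samples)

for func in (dupli_count, dupli_table):
    seconds = timeit.timeit(lambda: [func(s) for s in samples], number=10)
    print("%s: %.4f s" % (func.__name__, seconds))
```

Timing many different lists avoids tuning for one particular input, such as the four distinct values in the example list.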

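【Comments】:

This is the fastest approach I have seen so far (using plain CPython 2.6.6). It can be improved slightly by looking up .count() outside the list comprehension (i.e. count = the_list.count before result = ..., and then using (item, count(item)) inside the list comprehension).
@Sven Marnach: Nice optimization. I have updated my answer to include it. I also added another approach based on John Machin's answer, but it tests faster because it eliminates enumerate and because it writes out the result of [0] * 13 as a literal.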

【Solution 2】:

I would try:

from collections import defaultdict

def dupli(the_list):
    output = defaultdict(lambda: 0)  # missing keys start at 0
    for item in the_list:
        output[item] += 1
    return sorted(output.items())

【Comments】:

On my machine this takes about twice as long as the dupli() function in the OP.
Using defaultdict(int) instead of a lambda would be faster.
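
The defaultdict(int) variant mentioned in the last comment could be sketched as follows. int() called with no arguments returns 0, so it can serve as the default factory without a Python-level lambda call. The function name dupli_defaultdict is illustrative:

```python
from collections import defaultdict

def dupli_defaultdict(the_list):
    counts = defaultdict(int)  # int() returns 0 for missing keys
    for item in the_list:
        counts[item] += 1
    return sorted(counts.items())

print(dupli_defaultdict([5, 7, 6, 5, 5, 4, 4, 7, 5, 4]))
# [(4, 3), (5, 4), (6, 1), (7, 2)]
```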

【Solution 3】:

Taking advantage of the "between 0 and 12" qualification:

>>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
>>> answer1 = [0] * 13
>>> for i in the_list:
...    answer1[i] += 1
...
>>> answer1
[0, 0, 0, 0, 3, 4, 1, 2, 0, 0, 0, 0, 0]
>>> # You might be able to use that as-is:
...
>>> for i, v in enumerate(answer1):
...     if v: print i, v
...
4 3
5 4
6 1
7 2
>>> # Otherwise you can build the list that you specified:
...
>>> answer2 = [(i, v) for i, v in enumerate(answer1) if v]
>>> answer2
[(4, 3), (5, 4), (6, 1), (7, 2)]
>>>

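【Comments】:

This was the first thing I tried. On my machine, it is on average as fast as the original dupli(), at least if the output is converted to the requested format (answer2).

【Solution 4】:

It may be faster to write your own function that counts the numbers in a single pass over the list. You are calling count for every number in the set, and each call requires a pass over the whole list.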

counts = {}
for n in the_list:
    if n not in counts:
        counts[n] = 0
    counts[n] += 1
sorted(counts.items())
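
For comparison, the standard library's collections.Counter (available from Python 2.7 and 3.1, so newer than the CPython 2.6.6 used in the comments on this page) performs the same single-pass count. A minimal sketch, with the function name dupli_counter chosen for illustration; its speed relative to the other versions is not measured here:

```python
from collections import Counter

def dupli_counter(the_list):
    # Counter tallies the items in one pass over the list.
    return sorted(Counter(the_list).items())

print(dupli_counter([5, 7, 6, 5, 5, 4, 4, 7, 5, 4]))
# [(4, 3), (5, 4), (6, 1), (7, 2)]
```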

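【Comments】:

This is slower than the function in the OP on my machine, but by a smaller margin than all the other suggestions so far.

【Solution 5】:

This seems fairly optimal in terms of both space and speed: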

def dupli2(list_):
    dict_ = {}
    for item in list_:
        dict_[item] = dict_.get(item, 0) + 1
    return sorted(dict_.items())

Or this:

def dupli3(list_):
    last = None
    list_ = sorted(list_)

    i = 0
    for item in list_:
        if item != last and last is not None:
            yield last, i
            i = 0
        i += 1
        last = item

    if last is not None:  # emit the final run; an empty list yields nothing
        yield last, i

But I am not sure about the speed. For that I would suggest using C or Psyco ;)

With Psyco:

In [33]: %timeit list(dupli3(test.the_list))
100000 loops, best of 3: 6.46 us per loop

In [34]: %timeit list(dupli2(test.the_list))
100000 loops, best of 3: 2.37 us per loop

In [35]: %timeit list(dupli(test.the_list))
100000 loops, best of 3: 2.7 us per loop

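【Comments】:

Both functions are much slower than the function in the OP on my machine. Also, the second function returns wrong results.
@Sven Marnach: That depends; if you use psyco, the dupli2 method is faster here. You are right about the dupli3 method: I made a silly mistake there and accidentally posted an earlier version. I will update it :)
OK, I am using CPython 2.6.6. If you move the attribute lookup of .get() out of the loop, it gets slightly faster.
When timing dupli() and dupli2(), you copied the returned list with list(). That seems somewhat pointless.
@gnibbler: What is the point of using setdefault with integers? You cannot increment the value in place anyway, so you would be doing pretty much the same thing. Only defaultdict would make it shorter/nicer.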
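
The suggestion in the comments to move the .get() attribute lookup out of the loop could look like the sketch below; dupli2_hoisted is an illustrative name. The bound method is looked up once and then reused on every iteration:

```python
def dupli2_hoisted(list_):
    dict_ = {}
    get = dict_.get  # look up the bound method once, outside the loop
    for item in list_:
        dict_[item] = get(item, 0) + 1
    return sorted(dict_.items())

print(dupli2_hoisted([5, 7, 6, 5, 5, 4, 4, 7, 5, 4]))
# [(4, 3), (5, 4), (6, 1), (7, 2)]
```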

【Solution 6】:

itertools.groupby is a good fit:

>>> from itertools import groupby
>>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
>>> gb = groupby(sorted(the_list))
>>> print [(i,len(list(j))) for i,j in gb]
[(4, 3), (5, 4), (6, 1), (7, 2)]

【Comments】:

I like your use of iterators. As far as optimization goes, your solution takes 2.5 times as long as the OP's best effort.
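
A Python 3 version of the same groupby idea is sketched below. It counts each group with sum() over a generator instead of building a list for len(). The function name dupli_groupby is illustrative, and no claim is made about how its speed compares with the timings quoted on this page:

```python
from itertools import groupby

def dupli_groupby(the_list):
    # groupby() yields (key, group) for each run of equal items,
    # so the input has to be sorted first.
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted(the_list))]

print(dupli_groupby([5, 7, 6, 5, 5, 4, 4, 7, 5, 4]))
# [(4, 3), (5, 4), (6, 1), (7, 2)]
```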