Generator function (yield) much faster than iterator class (__next__)

[Question title]: Generator function (yield) much faster than iterator class (__next__)
[Posted]: 2017-09-29 10:56:13
[Question]:

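UPDATE (reflecting the current state of knowledge), status: 2017-05-12

The reason for this update is that when I asked this question I was not aware that I had stumbled upon something about how Python 3 works "under the hood".

The conclusion from everything that follows is:

If you write your own Python 3 code for an iterator and care about execution speed, write it as a generator function and not as an iterator class.

Below is a minimal code example demonstrating that the same algorithm (here: a self-made version of Python's range()) runs much faster when expressed as a generator function than when expressed as an iterator class: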

def   gnrtYieldRange(startWith, endAt, step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step
class iterClassRange:
    def __init__(self, startWith, endAt, step=1): 
        self.startWith = startWith - step  # start one step early so the first __next__() returns startWith (also for step != 1)
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration

N = 10000000
print("    Size of created list N = {} elements (ints 1 to N)".format(N))

from time import time as t
from customRange import gnrtYieldRange as cthnYieldRange
from customRange import cintYieldRange
from customRange import iterClassRange as cthnClassRange
from customRange import cdefClassRange

iterPythnRangeObj =          range(1, N+1)
gnrtYieldRangeObj = gnrtYieldRange(1, N)
cthnYieldRangeObj = cthnYieldRange(1, N)
cintYieldRangeObj = cintYieldRange(1, N)
iterClassRangeObj = iterClassRange(1, N)
cthnClassRangeObj = cthnClassRange(1, N)
cdefClassRangeObj = cdefClassRange(1, N)

sEXECs = [ 
    "liPR = list(iterPythnRangeObj)",
    "lgYR = list(gnrtYieldRangeObj)",
    "lcYR = list(cthnYieldRangeObj)",
    "liGR = list(cintYieldRangeObj)",
    "liCR = list(iterClassRangeObj)",
    "lcCR = list(cthnClassRangeObj)",
    "ldCR = list(cdefClassRangeObj)"
 ]

sCOMMENTs = [ 
    "Python3 own range(1, N+1) used here as reference for timings  ",
    "self-made range generator function using yield (run as it is) ",
    "self-made range (with yield) run from module created by Cython",
    "Cython-optimized self-made range (using yield) run from module",
    "self-made range as iterator class using __next__() and return ",
    "self-made range (using __next__) from module created by Cython",
    "Cython-optimized self-made range (using __next__) from module "
 ]

for idx, sEXEC in enumerate(sEXECs): 
    s=t();exec(sEXEC);e=t();print("{} takes: {:3.1f} sec.".format(sCOMMENTs[idx], e-s))
print("All created lists are equal:", all([liPR == lgYR, lgYR == lcYR, lcYR == liGR, liGR == liCR, liCR == lcCR, lcCR == ldCR]) )
print("Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'")

Putting the code above into a file and running it prints the following to stdout:

>python3.6 -u "gnrtFunction-fasterThan-iterClass_runMe.py"
    Size of created list N = 10000000 elements (ints 1 to N)
Python3 own range(1, N+1) used here as reference for timings   takes: 0.2 sec.
self-made range generator function using yield (run as it is)  takes: 1.1 sec.
self-made range (with yield) run from module created by Cython takes: 0.5 sec.
Cython-optimized self-made range (using yield) run from module takes: 0.3 sec.
self-made range as iterator class using __next__() and return  takes: 3.9 sec.
self-made range (using __next__) from module created by Cython takes: 3.3 sec.
Cython-optimized self-made range (using __next__) from module  takes: 0.2 sec.
All created lists are equal: True
Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'
>Exit code: 0

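As the timings above show, the generator-function variant of the self-made range() iterator runs faster than the iterator-class variant, and as long as no optimization of the code is involved, this behavior also carries over to the C-level code generated by Cython.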
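
The pure-Python part of this comparison can be reproduced without Cython. The following is a minimal sketch, not the original benchmark script: the names `gen_range` and `ClassRange` and the smaller `N` are my own choices so that it finishes quickly, and the class starts at `start - step` so that both variants also agree for steps other than 1. Absolute timings will differ by machine and Python version.

```python
from timeit import timeit

def gen_range(start, stop, step=1):
    """Generator-function variant: the state lives in local variables."""
    while start <= stop:
        yield start
        start += step

class ClassRange:
    """Iterator-class variant: the state lives in instance attributes."""
    def __init__(self, start, stop, step=1):
        self.current = start - step   # one step early, see __next__
        self.stop = stop
        self.step = step

    def __iter__(self):
        return self

    def __next__(self):
        self.current += self.step
        if self.current <= self.stop:
            return self.current
        raise StopIteration

N = 200_000  # hypothetical size, much smaller than in the question

# Both variants must produce the same values as the built-in range().
assert list(gen_range(1, N)) == list(ClassRange(1, N)) == list(range(1, N + 1))

t_gen = timeit(lambda: list(gen_range(1, N)), number=5)
t_cls = timeit(lambda: list(ClassRange(1, N)), number=5)
print(f"generator function: {t_gen:.3f} s")
print(f"iterator class:     {t_cls:.3f} s")
```

Using `timeit` instead of two `time()` calls repeats each measurement, which makes the comparison a little less sensitive to noise.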

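If you are curious why this is so in detail, you can read through the provided answers or play with the provided code yourself.

Below are the missing pieces of code required to run the code above:

customRange.pyx - the file from which Cython creates the customRange module: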

def gnrtYieldRange(startWith, endAt, step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step

class iterClassRange:
    def __init__(self, startWith, endAt, step=1): 
        self.startWith = startWith - step  # start one step early so the first __next__() returns startWith (also for step != 1)
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration

def cintYieldRange(int startWith, int endAt, int step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step

cdef class cdefClassRange:
    cdef int startWith
    cdef int endAt
    cdef int step

    def __init__(self, int startWith, int endAt, int step=1): 
        self.startWith = startWith - step  # start one step early so the first __next__() returns startWith (also for step != 1)
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration

And the setup file customRange-setup.py used to create the Python customRange module:

import sys
sys.argv += ['build_ext',  '--inplace']

from distutils.core import setup
from Cython.Build   import cythonize

setup(
  name = 'customRange',
  ext_modules = cythonize("customRange.pyx"),
)

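Now some further information that makes the provided answers easier to understand:

At the time I asked this question I was busy with a fairly complex algorithm for generating unique combinations from a non-unique list, available in the form of a generator function using yield. My goal was to create a Python module written in C from this algorithm to make it run faster. For this purpose I rewrote the generator function using yield as an iterator class using __next__() and return. When I compared the speed of the two variants of the algorithm, I was surprised that the iterator class was about two times slower than the generator function, and I (wrongly) assumed it had something to do with the way I had rewritten the algorithm (you need to know this if you want to better understand what the answers here are about), and hence

ORIGINALLY ASKED how to make the iterator-class version run at the same speed as the generator function, and where the speed difference comes from.

Below is some more information about the HISTORY of the question:

In the Python script code provided below, exactly the same algorithm for creating unique combinations from a non-unique list of elements is implemented once as a Python function with yield and once as a class with __next__. The code is ready to run after copy/paste, so you can see for yourself what I am talking about.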
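
To make concrete what both implementations are expected to produce, here is a brute-force reference. It is not the algorithm from the question: it enumerates all combinations and throws the duplicates away, so it is only usable for small inputs. The function name is hypothetical.

```python
from itertools import combinations

def unique_combinations_reference(items, k):
    """Return every distinct k-combination of a list with repeated items.

    Sorting first makes each combination come out as a sorted tuple, so
    combinations that differ only in which duplicate they picked collapse
    to the same tuple inside the set.
    """
    return sorted(set(combinations(sorted(items), k)))

print(unique_combinations_reference([2, 1, 1], 2))
# [(1, 1), (1, 2)]
```

For a small test list, the set of tuples produced by either variant from the question should match this reference, which gives a quick correctness check before any timing is compared.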

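The same phenomenon observed for pure Python code carries over to the C code of a Python extension module created from the script code by Cython, so it is not limited to Python-level code: it does not vanish at the C-code level.

The question is:

Where does the huge difference in execution speed come from? What can be done to get both code variants to run at comparable speed? Is something wrong with the class/next implementation compared to the function/yield variant? To my knowledge, both are exactly the same code...

Here is the code (tweaking the number in the highlighted line changes the level of uniqueness of the elements in the list; the resulting number of generated combinations has a huge impact on the running time):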

def uniqCmboYieldIter(lstItems, lenCmbo):
    dctCounter = {}
    lenLstItems = len(lstItems)
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    lenUniqs   = len(lstUniqs)
    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
    idxIntoCmbo, idxIntoUniqs = 0, 0

    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1

    if idxIntoCmbo != lenCmbo:
        return

    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)

        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1

            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return

        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)


            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1

            if y == lenUniqs:
                break


class uniqCmboClassIter:
    # ----------------------------------------------------------------------------------------------
    def __iter__(self):
       return self

    # ----------------------------------------------------------------------------------------------
    def __init__(self, lstItems, lenCmbo):
        dctCounter = {}
        lenLstItems = len(lstItems)
        for idx in range(lenLstItems):
            item = lstItems[idx]
            if item in dctCounter.keys(): 
                dctCounter[item] += 1
            else: 
                dctCounter[item]  = 1
            #:if
        #:for     

        self.lstUniqs   = sorted(dctCounter.keys())
        self.lenUniqs   = len(self.lstUniqs)
        self.lstCntRpts = [dctCounter[item] for item in self.lstUniqs]

        self.lenCmbo        = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = [0] * self.lenUniqs
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0

        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo  += count
            self.idxIntoUniqs += 1
            # print("self.multiplicities:", self.multiplicities)
            # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)

        if self.idxIntoCmbo != self.lenCmbo:
            return

        self.stopIteration = False
        self.x = None
        self.y = None

        return

    # ----------------------------------------------------------------------------------------------
    def __next__(self):

        if self.stopIteration is True:
            raise StopIteration
            return

        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)

        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1

            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo

        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)


            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1

            if self.y == self.lenUniqs:
                break

        return nextCmbo

# ============================================================================================================================================
lstSize   = 48 # 48

uniqLevel =  12 # (7 ~60% unique) higher level => more unique items in the generated list

aList = []
from random import randint
for _ in range(lstSize):
    aList.append( ( randint(1,uniqLevel), randint(1,uniqLevel) ) )
lenCmbo = 6
percUnique = 100.0 - 100.0*(lstSize-len(set(aList)))/lstSize
print("========================  lenCmbo:", lenCmbo, 
      "   sizeOfList:", len(aList), 
      "   noOfUniqueInList", len(set(aList)), 
      "   percUnique",  int(percUnique) ) 

import time
from itertools import combinations
# itertools.combinations
# ---
# def   uniqCmboYieldIter(lstItems, lenCmbo):
# class uniqCmboClassIter: def __init__(self, lstItems, lenCmbo):
# ---
start_time = time.time()
print("Combos:%9i"%len(list(combinations(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(     combinations(aList, lenCmbo)))):",  "{:9.5f}".format(duration), "seconds.")

start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboYieldIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):",  "{:9.5f}".format(duration), "seconds.")

start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboClassIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboClassIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")

And the timings on my box:

>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 32    percUnique 66
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.04635 seconds.
Combos:  1296058  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):   3.25447 seconds.
Combos:  1296058  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):   5.97371 seconds.
>Exit code: 0
  [2017-05-02_03:23]  207474 <-Chrs,Keys-> 1277194 OnSave(): '/home/claudio/CgMint18/_Cg.DIR/ClaudioOnline/at-stackoverflow/bySubject/uniqueCombinations/nonRecursiveUniqueCombos_Cg.py'
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 22    percUnique 45
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.05199 seconds.
Combos:   191072  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):   0.47343 seconds.
Combos:   191072  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):   0.89860 seconds.
>Exit code: 0
  [2017-05-02_03:23]  207476 <-Chrs,Keys-> 1277202 OnSave(): '/home/claudio/CgMint18/_Cg.DIR/ClaudioOnline/at-stackoverflow/bySubject/uniqueCombinations/nonRecursiveUniqueCombos_Cg.py'
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 43    percUnique 89
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.17285 seconds.
Combos:  6560701  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):  16.72573 seconds.
Combos:  6560701  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):  31.17714 seconds.
>Exit code: 0

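UPDATE (status 2017-05-07):

At the time of asking the question and offering a bounty, I did not know that there is a way to easily create the C code of an extension module for an iterator object from Python script code using Cython, and that such C code can also be created from an iterator function using yield.

Considering that the faster version of the generated C extension module is still not fast enough to compete with itertools.combinations, it does not make much sense to dig into what exactly causes the slowdown of the iterator class compared with the iterator function and how to overcome it. It makes much more sense to find a way to speed up the faster version using Cython, especially because I am new to writing Python extension modules and my own modifications of the itertools.combinations C code failed with Segmentation Fault errors whose cause I was not able to understand.

Currently I think there is still room for speeding up the Cython code I used, without having to write the C code myself.

Below is the Cython code that runs OK, followed by speed-optimized Cython code that somehow changes (I currently cannot see why) the way the algorithm works and therefore produces wrong results. The idea behind the Cython optimization was to use Python/Cython arrays in the Cython code instead of Python lists. Any hints on how to get a faster-running Python extension module out of the algorithm in a way that is "safe" for a novice are welcome.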

def subbags_by_loops_with_dict_counter(lstItems, int lenCmbo):

    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]

    cdef int lenUniqs   = len(lstUniqs)

    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
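    cdef int idxIntoCmbo = 0   # initialized explicitly, as in the pure Python version
    cdef int idxIntoUniqs = 0  # initialized explicitly, as in the pure Python version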
    cdef int count        
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1

    if idxIntoCmbo != lenCmbo:
        return

    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)

        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1

            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return

        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1

            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1

            if y == lenUniqs:
                break

The following OPTIMIZED CYTHON CODE produces wrong results:

def subbags_loops_dict_cython_optimized(lstItems, int lenCmbo):

    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]

    cdef int lenUniqs   = len(lstUniqs)
    cdef array.array cmboAsIdxUniqs = array.array('i', [])
    array.resize(cmboAsIdxUniqs, lenCmbo)
    # cmboAsIdxUniqs = [None] * lenCmbo 
    cdef array.array multiplicities = array.array('i', [])
    array.resize(multiplicities, lenUniqs)
    # multiplicities = [0] * lenUniqs
    cdef int idxIntoCmbo
    cdef int maxIdxCmbo
    cdef int curIdxCmbo
    cdef int idxIntoUniqs
    cdef int count        

    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        maxIdxCmbo = idxIntoCmbo + count
        curIdxCmbo = idxIntoCmbo
        while curIdxCmbo < maxIdxCmbo: 
            cmboAsIdxUniqs[curIdxCmbo] = idxIntoUniqs
            curIdxCmbo += 1
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1
    # print("multiplicities:", multiplicities)
    # print("cmboAsIdxUniqs:", cmboAsIdxUniqs)

    if idxIntoCmbo != lenCmbo:
        return

    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)

        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1

            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return

        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)


            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1

            if y == lenUniqs:
                break

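[Question comments]:

Tags: performance python-3.x iterator generator yield

[Answer 1]:

I gained some experience when I recast some of the recipes from the itertools documentation as C extensions. I think I may have some insights that could help you.

Generator vs. iterator class

When you are writing pure Python code, it is a trade-off between speed (generator) and features (iterator).

Functions using yield (known as generators) are built for speed, and they can generally be written without bothering about internal state. So they are less effort to write, and they are fast because Python itself manages all the "state".

The main reasons generators are faster (or at least not slower) are:

They implement the __next__ slot (typically tp_iternext) directly, in addition to the __next__ method. In that case Python does not have to look up the __next__ method, which is essentially what makes them faster in the following example: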

from itertools import islice

def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self

    def __next__(self):
        return 1

%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So the generator is almost 3 times faster, simply because generators populate the __next__ slot directly.
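
The extra step taken for a Python-defined class can be made visible without any timing. The sketch below only demonstrates the dynamic dispatch and does not measure its cost: `next()` goes through the type, which then calls whatever `__next__` is currently defined on the class, while an attribute set on the instance is ignored.

```python
class Test:
    def __iter__(self):
        return self

    def __next__(self):
        return 1

it = Test()
first = next(it)                  # dispatches to Test.__next__

# Rebinding the method on the class affects already existing instances,
# because the method is found through the type on every call.
Test.__next__ = lambda self: 2
second = next(it)

# Rebinding on the instance has no effect: special methods are looked up
# on the type, not on the instance.
it.__next__ = lambda: 3
third = next(it)

print(first, second, third)       # 1 2 2
```

A generator object has no such Python-level method to find and call, which is the difference the timings above reflect.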

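Both a yield function and a class have state, but the yield function saves and loads its state much faster than you can with a class and attribute access: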

def test():
    i = 0
    while True:
        yield i
        i += 1

class Test(object):
    def __init__(self):
        self.val = 0

    def __iter__(self):
        return self

    def __next__(self):
        current = self.val
        self.val += 1
        return current

%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

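This time the class is already 4 times slower (compared with almost 3 times when no state was involved). That is a cumulative effect: the more "state" you have, the slower the class variant will be.

So much for the yield vs. class approach. Note that the actual timing will depend on the kind of operations. For example, if the actual code that runs when next is called is slow (e.g. time.sleep(1)), there is almost no difference between generator and class!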
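
One common way to soften this trade-off in pure Python, not taken from the question, is to keep the class for its interface but write `__iter__` as a generator function. The class name below is hypothetical. The caveat is that the object is then an iterable rather than an iterator: `next(obj)` does not work, and every call to `iter(obj)` starts a fresh run.

```python
class Countdown:
    """Hypothetical example class: configurable like a class, iterated like a generator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # The loop state lives in a local variable of the generator frame,
        # not in an instance attribute.
        current = self.start
        while current > 0:
            yield current
            current -= 1

countdown = Countdown(3)
print(list(countdown))   # [3, 2, 1]
print(list(countdown))   # [3, 2, 1] again, each iter() call creates a new generator
```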

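Cython

If you want a Cython iterator class that is fast, it has to be a cdef class. Otherwise you don't get a really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPython's %%cython to compile the code (so I don't have to include the setup):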

%%cython

def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self

    def __next__(self):
        return 1

cdef class Test_cdef(object):
    def __iter__(self):
        return self

    def __next__(self):
        return 1

%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

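The timings already show that the generator and the basic class are faster than their pure Python equivalents, but their relative performance stays roughly the same. The cdef class variant, however, beats both of them, mainly because the tp_iternext slot is used instead of just implementing the __next__ method. (Inspect the C code generated by Cython if you don't trust me :) )

However, it is just 2 times faster than the Python generator; that's not bad, but it's not exactly overwhelming either. To get truly amazing speedups you need to find a way to express your program without Python objects (the fewer Python objects, the greater the speedup). For example, if you use a dictionary to store the items and their multiplicities, you still store Python objects, and any lookup has to be done using Python dictionary methods, even if you can call them through C API functions instead of having to look up the real methods: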

%%cython

cpdef cython_count(items):
    cdef dict res = dict()
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res

import random

def count(items):
    res = {}
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res

l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

There's one catch here: you didn't use collections.Counter, which has optimized C code (at least in Python 3) for this kind of operation:

from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

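A quick note here: don't use something in some_dict.keys(), because in Python 2 keys() returns a list, which only implements an O(n) containment check, while something in some_dict is typically O(1) (in both Pythons)! Dropping .keys() makes things faster in both versions, but especially on Python 2: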

def count2(items):
    res = {}
    for item in items:
        if item in res.keys():  # with "keys()"
            res[item] += 1
        else:
            res[item] = 1
    return res

# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop  <--- WHOOPS!!!

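This shows that you can only hope for something like a 3-4x speedup from Cython (and C extensions) while you are using Python structures, but even a minor mistake such as an unnecessary ".keys()" can cost you far more performance than that.

Optimizing Cython

So what can you do if you want it faster? The answer is relatively simple: create your own data structure based on C types instead of Python types.

That means you have to think about the design:

What types do you want to support in your uniqComb**? Do you want integers (the examples say so, but I suppose you want arbitrary Python objects)?
Do you want introspection from Python (like the current state)? If you do, it makes sense to keep the multiplicities as Python objects, but if you don't care you can save them as integer-like objects instead of Python objects.
Do the objects you pass to the uniqComb** function need to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of by value.

The answers to these questions (these are only the ones I immediately asked myself; there are probably many more!) help you decide which structure you can use internally. For example, with Cython you can interface with C++ and use a map with integer keys and integer values instead of a dictionary. It is sorted by default, so you don't need to sort manually, and you operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in your uniqComb, and you need to know how to work with C++ types in Cython. It could be amazingly fast though!

I won't go down that path because I assume you want to support arbitrary orderable Python types. I stick with Counter as the starting point, but I'll save the multiplicities as integer array.arrays instead of as a list. Let's call it the "least invasive" optimization. It actually doesn't matter much for performance whether you use a list or an array for lstCntRpts and multiplicities, because they aren't the bottleneck, but it is a bit faster, saves a bit of memory and, more importantly, shows how to include homogeneous arrays with Cython: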

%%cython

from cpython.list cimport PyList_Size  # (most) C API functions can be used with cython!

from array import array
from collections import Counter

cdef class uniqCmboClassIter:

    cdef list lstUniqs
    cdef Py_ssize_t lenUniqs
    cdef int[:] lstCntRpts   # memoryview
    cdef Py_ssize_t lenCmbo
    cdef list cmboAsIdxUniqs
    cdef int[:] multiplicities  # memoryview
    cdef Py_ssize_t idxIntoCmbo
    cdef Py_ssize_t idxIntoUniqs
    cdef bint stopIteration
    cdef Py_ssize_t x
    cdef Py_ssize_t y

    def __init__(self, lstItems, lenCmbo):
        dctCounter = Counter(lstItems)

        self.lstUniqs = sorted(dctCounter)
        self.lenUniqs = PyList_Size(self.lstUniqs)
        self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])

        self.lenCmbo        = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = array('i', [0] * self.lenUniqs)
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0

        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo += count
            self.idxIntoUniqs += 1
            # print("self.multiplicities:", self.multiplicities)
            # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)

        if self.idxIntoCmbo != self.lenCmbo:
            return

        self.stopIteration = False
        self.x = 0
        self.y = 0

        return

    def __iter__(self):
        return self

    def __next__(self):
        if self.stopIteration is True:
            raise StopIteration

        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)

        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1

            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo

        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)

            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1

            if self.y == self.lenUniqs:
                break

        return nextCmbo
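
The `array('i', ...)` objects assigned to the `int[:]` attributes above are ordinary typed arrays from the standard library, so their behavior can be tried out in plain Python first. The following is a small sketch with made-up values: it builds a zero-filled array of C ints, updates it in place, and shows that it exposes a typed buffer, which is what a typed memoryview attaches to.

```python
from array import array

n_uniques = 5                                   # hypothetical number of unique items

# Zero-filled array of C ints, the typed counterpart of [0] * n_uniques.
multiplicities = array('i', [0]) * n_uniques
counts = array('i', [3, 1, 2, 1, 1])            # hypothetical repeat counts

multiplicities[0] += 2                          # in-place update, stays a C int
view = memoryview(multiplicities)               # typed buffer over the same memory

print(multiplicities.tolist())                  # [2, 0, 0, 0, 0]
print(view.format, len(view))                   # i 5
```

Building the array from a list of zeros, as the class above does, guarantees that every element starts at 0, which the algorithm relies on for `multiplicities`.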

You didn't actually share the parameters for your timings, but I tried it with some of mine:

from itertools import combinations

import random
import time

def create_values(maximum):

    vals = [random.randint(0, maximum) for _ in range(48)]
    print('length: ', len(vals))
    print('sorted values: ', sorted(vals))
    print('uniques: ', len(set(vals)))
    print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))

    return vals

class Timer(object):
    def __init__(self):
        pass

    def __enter__(self):
        self._time = time.time()

    def __exit__(self, *args, **kwargs):
        print(time.time() -  self._time)

vals = create_values(maximum=50)  # and 22 and 75 and 120
n = 6

with Timer():
    list(combinations(vals, n))

with Timer():
    list(uniqCmboClassIter(vals, n))

with Timer():
    list(uniqCmboClassIterOriginal(vals, n))

with Timer():
    list(uniqCmboYieldIterOriginal(vals, n))

length:  48
sorted values:  [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques:  21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477

length:  48
sorted values:  [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques:  33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897

length:  48
sorted values:  [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques:  39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867

length:  48
sorted values:  [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques:  41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362

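It definitely performs much better than the original approach, actually several times faster with just type declarations. There is probably a lot more that could be optimized (disabling bounds checking, using Python C API function calls, using unsigned or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...). But the fact that it is not much slower than itertools.combinations even for 80% unique items, and much faster than any of the original implementations, is good enough for me. :-)

[Comments]:

Technically, all iterators have a tp_iternext slot, but Python classes and naive non-extension-type Cython classes have a tp_iternext that looks up the __next__ method and calls it, while generators and cdef classes have a tp_iternext that involves no method lookup. As the question was about writing a C extension module, I assumed the asker was familiar with how to do that and knew to use things like tp_iternext, but that was a poor assumption.
@Claudio Knowing about slots explicitly is not necessarily important. Think of them as "fast-access" operations of C extension classes. For example, tp_iternext is explained along with the other slots in "C API: Type objects". It is roughly the equivalent of __next__ in a C extension. But you don't need to set them explicitly with Cython (Cython sets them itself for cdef classes).
The explanation about populating __next__ could be rewritten for clarity and correctness. There is more going on behind the scenes: a generator keeps its state in a frame (the frame's locals), while a class supporting the iterator protocol keeps it in an instance (the instance's attributes).
@Claudio Because you would lose dynamism. For example, you can reassign __next__: Test.__next__ = lambda self: 2. But once you have a C extension class you cannot reassign methods (actually it is possible, just not easily, but I don't think it works for special methods). Also, it is usually just a small constant factor; remember that the overhead of "the slot looking up the method" is generally small compared with the operations done in the __next__ method. So it will rarely be more than 2-3 times slower. Besides, you can use generators, which mitigate this almost entirely.
Yes, it is an IPython command.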

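[Answer 2]:

When you write a generator function with yield, the overhead of saving and restoring state is handled by the CPython internals (implemented in C). With __iter__/__next__, you have to manage saving and restoring state on each call. In CPython, Python-level code is slower than C-level built-ins, so the extra Python-level code involved in the state management (including things as simple as accessing attributes of self through dict lookups rather than loading local variables, which only has array-indexing overhead) ends up costing you a lot.

If you implement your own type supporting the iterator protocol in a C extension module, you bypass this overhead; saving and restoring state should be a matter of a few C-level variable accesses (with similar or lower overhead than what Python generator functions incur, which is to say, very little). Effectively, that is what a generator function is: a C extension type that saves and restores the Python frame on each call to tp_iternext (the C-level equivalent of __next__).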
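
That a suspended generator keeps its state in a frame can be inspected from Python. This is a small sketch using the counter generator from the first answer (renamed here); `gi_frame` is the generator's frame object, and its `f_locals` mapping shows the local variables that were saved when the generator was suspended.

```python
def counter():
    i = 0
    while True:
        yield i
        i += 1

gen = counter()
next(gen)                              # yields 0, the frame is suspended with i == 0
next(gen)                              # resumes, increments i, yields 1

saved_i = gen.gi_frame.f_locals['i']   # state as stored in the suspended frame
print(saved_i)                         # 1

gen.close()
print(gen.gi_frame)                    # None: a finished generator has no frame
```

Nothing in `counter` saves or restores `i` explicitly; the interpreter does it as part of suspending and resuming the frame.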

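[Comments]:

@Claudio: Without seeing the Cython code, I can't help you. Cython without type declarations rarely improves speed much, and even with declared types it often misses simple optimization opportunities. The only way to do it "right" is to actually implement your class directly in C using the Python C API. Or you can look at how generator objects are actually implemented; they do it all through tp_iternext, with no special magic that other extensions cannot mimic.
Please see the question update with the Cython code at the bottom.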

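[Answer 3]:

As the version of the class with __next__ is the one suitable to be implemented as a Python extension module, because there is no equivalent of yield in C, it makes sense to find out how it can be improved in order to perform comparably to the function with yield variant.

It is already written in C. The performance differences you are seeing are due entirely to properties of the Python implementation that do not apply to the C extension module you are planning to write. Optimizations that you can apply to the Python class do not apply to C code.

For example, accessing an instance variable in Python code is more expensive than accessing a local variable, because accessing an instance variable requires multiple dict lookups. Your C implementation would not need such dict lookups.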
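
The difference between the two kinds of access is visible in the bytecode. The sketch below reuses the counter examples from the first answer under new names and lists the attribute-related instructions in each; the exact opcode names are specific to the CPython version in use, so treat the printed names as illustrative.

```python
import dis

def gen_counter():
    i = 0
    while True:
        yield i
        i += 1

class ClassCounter:
    def __init__(self):
        self.val = 0

    def __iter__(self):
        return self

    def __next__(self):
        current = self.val
        self.val += 1
        return current

def opnames(func):
    """Return the opcode names of a function's bytecode, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]

gen_ops = opnames(gen_counter)
cls_ops = opnames(ClassCounter.__next__)

# The generator touches only local variables; __next__ has to load and
# store attributes on self for every value it produces.
print("attribute ops in generator:", [op for op in gen_ops if "ATTR" in op])
print("attribute ops in __next__: ", [op for op in cls_ops if "ATTR" in op])
```

Each attribute instruction stands for a lookup on the instance that the generator version simply does not perform.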

[Comments]:

@Claudio: Did you run Cython directly on the file without any modification, or did you actually generate an extension type?
I ran Cython without any modifications.