Efficient combinations of N colored elements with restriction in the number of colors: answers

【Question title】: Efficient combinations of N colored elements with restriction in the number of colors
【Posted】: 2013-12-31 05:00:47
【Question】:

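Given a set of N elements colored with C colors, how can I find every possible combination of length L that contains no more than M colors?

I tried this algorithm, which uses itertools.combinations to generate all the possible combinations and then filters out those that do not satisfy the maximum-colors condition.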

from itertools import combinations as cb

def allowed_combinations(elements, combination_size=4, max_colors=3):
    # Generate every combination, then drop those spanning too many colors.
    for combination in cb(elements, combination_size):
        colors = {elements[element] for element in combination}
        if len(colors) <= max_colors:
            yield combination


elements = dict()
elements['A'] = 'red'
elements['B'] = 'red'
elements['C'] = 'blue'
elements['D'] = 'blue'
elements['E'] = 'green'
elements['F'] = 'green'
elements['G'] = 'green'
elements['H'] = 'yellow'
elements['I'] = 'white'
elements['J'] = 'white'
elements['K'] = 'black'

combinations = allowed_combinations(elements)

for c in combinations:
    for element in c:
        print("%s-%s" % (element, elements[element]))
    print("\n")

The output looks like this:

A-red
C-blue
B-red
E-green


A-red
C-blue
B-red
D-blue


A-red
C-blue
B-red
G-green


A-red
C-blue
B-red
F-green

...

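The problem is that generating all possible combinations can be computationally very expensive. In my case, for instance, L is often 6 and the number of elements N is around 50, which gives Bin(50,6) = 15890700 possible combinations. If the maximum number of colors allowed in a combination is small, most combinations are "useless" and are discarded in the filter step. My intuition is that I should put the filtering step inside/before the combinatory step, to avoid the explosion of combinations, but I don't see how.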
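
To put numbers on this, here is a minimal sketch (not part of the original post) that compares the total number of combinations with the number that survive the color filter. The helper name count_kept is made up for illustration, and math.comb assumes Python 3.8 or later.

```python
from itertools import combinations
from math import comb  # assumes Python 3.8+

def count_kept(elements, combination_size, max_colors):
    """Return (total combinations, combinations within the color limit).

    Hypothetical helper for illustration; it still enumerates everything,
    so it is only practical for small inputs.
    """
    total = comb(len(elements), combination_size)
    kept = sum(
        1
        for combo in combinations(elements, combination_size)
        if len({elements[e] for e in combo}) <= max_colors
    )
    return total, kept

# The 11-element example from the question.
elements = {'A': 'red', 'B': 'red', 'C': 'blue', 'D': 'blue',
            'E': 'green', 'F': 'green', 'G': 'green', 'H': 'yellow',
            'I': 'white', 'J': 'white', 'K': 'black'}
print(count_kept(elements, 4, 3))  # (330, 188)
print(comb(50, 6))                 # 15890700
```

With only 11 elements, more than half of the combinations survive the filter; the question's concern is the case where N is large and max_colors is small, so that most of the enumerated combinations are thrown away.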

【Comments】:

Tags: python algorithm combinations combinatorics

【Answer 1】:

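A rough outline.

You have C different colors in total. For each k, 1 <= k <= M, choose k colors in Bin(C,k) ways. (I'm using your notation here, assuming Bin means binomial coefficient.)

For each of the above choices, collect all the elements that have one of the chosen colors. Say this gives P distinct elements. Then choose L of these P elements in Bin(P, L) different ways.

All of the above is subject to the obvious checks, M <= C, L <= P, etc.

The advantage of this approach is that it generates only valid combinations, and every valid combination is generated exactly once. (Edit: as pointed out in a comment, this is not true; duplicate combinations can be generated.)

PS. Here is an implementation of the above algorithm, with a fix for the duplicate combinations: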

from itertools import combinations


elts  = { 'A' : 'red', 'B' : 'red', 'C' : 'blue', 'D' : 'blue',
          'E': 'green', 'F' : 'green', 'G' : 'green', 'H' : 'yellow',
          'I' : 'white', 'J' : 'white', 'K' : 'black' }

def combs (elts, size = 4, max_colors = 3):
    # Count different colors
    colors = {}
    for e in elts.values():
        colors [e] = 1
    ncolors = len(colors)

    # for each different number of colors between 1 and 'max_colors' 
    for k in range (1, max_colors + 1):
        # Choose 'k' different colors
        for selected_colors in combinations (colors, k):
            # Select all the elements with these colors
            selected_elts = []
            for e, c in elts.items():
                if c in selected_colors:
                    selected_elts.append (e)
            # Choose 'size' of these elements
            for chosen_elts in combinations (selected_elts, size):
                # Check the chosen elements are of exactly 'k' different colors
                t = {}
                for e in chosen_elts:
                    t[elts[e]] = 1
                if len(t) == k:
                    yield chosen_elts


#for e in combs (elts):
#    print (e)

print (len (list (combs (elts))))

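PS. I also timed Tim's combs2, my own combs and Gareth's constrained_combinations with the program here, with these results (timings first, then the number of combinations each one generated):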

combs2 =  5.214529
constr combs = 5.290079
combs = 4.952063
combs2 = 5165700
constr combs = 5165700
combs = 5165700

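【Comments】:

Alas, it's harder than that. For example, when k=2, in the OP's example {'white', 'black'} is a set of 2 colors. That corresponds to the element set {'I', 'J', 'K'}. If L==2, only 2 of that set's 3 2-combinations actually span two different colors. The other combination - ('I', 'J') - has two elements with color 'white'. So it duplicates one of the results delivered for k=1.
@TimPeters, indeed. This can be corrected by accepting a combination only if it has exactly k different colors, which is unfortunate, as my goal was to avoid "filtering".
Yup, I ran into the same thing the first time I tried to code it ;-) "Custom all the way down" seems to be the best approach to this one.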
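
The white/black example from the first comment can be checked directly. This is a small illustrative sketch, not code from the thread; the variable names are made up here.

```python
from itertools import combinations

# The three elements whose color is 'white' or 'black' in the OP's example.
elts = {'I': 'white', 'J': 'white', 'K': 'black'}

# All 2-combinations of those elements (the L == 2 case from the comment).
pairs = list(combinations(sorted(elts), 2))

# Keep only the pairs that really span two different colors.
spanning_two = [p for p in pairs if len({elts[e] for e in p}) == 2]

print(pairs)         # [('I', 'J'), ('I', 'K'), ('J', 'K')]
print(spanning_two)  # [('I', 'K'), ('J', 'K')]
```

The pair ('I', 'J') uses a single color, so it would already have been produced when the color set {'white'} was chosen for k=1. That is the duplicate the "exactly k colors" check removes.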

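【Answer 2】:

Combinatorial problems are notorious for being easy to state but possibly hard to solve. For this one, I wouldn't use itertools at all, but instead write a custom generator. For example,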

def combs(elt2color, combination_size=4, max_colors=3):

    def inner(needed, index):
        if needed == 0:
            yield result
            return
        if n - index < needed:
            # not enough elements remain to reach
            # combination_size
            return
        # first all results that don't contain elts[index]
        for _ in inner(needed, index + 1):
            yield result
        # and then all results that do contain elts[index]
        needed -= 1
        elt = elts[index]
        color = elt2color[elt]
        color_added = color not in colors_seen
        colors_seen.add(color)
        if len(colors_seen) <= max_colors:
            result[needed] = elt
            for _ in inner(needed, index + 1):
                yield result
        if color_added:
            colors_seen.remove(color)

    elts = tuple(elt2color)
    n = len(elts)
    colors_seen = set()
    result = [None] * combination_size
    for _ in inner(combination_size, 0):
        yield tuple(result)

Then:

elt2color = dict([('A', 'red'), ('B', 'red'), ('C', 'blue'),
                  ('D', 'blue'), ('E', 'green'), ('F', 'green'),
                  ('G', 'green'), ('H', 'yellow'), ('I', 'white'),
                  ('J', 'white'), ('K', 'black')])
for c in combs(elt2color):
    for element in c:
        print("%s-%s" % (element, elt2color[element]))
    print("\n")

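This produces the same 188 combinations as your post-processing code, but internally abandons a partial combination as soon as it would span more than max_colors colors. There's no way to change what the itertools functions do internally, so when you want control over that, you need to roll your own.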
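
The pruning idea can also be seen in a shorter form. The following is a simplified sketch, not the code from this answer: it builds a new tuple at every step instead of reusing one result list, so it gives up the memory savings, and the name pruned_combs is made up for illustration. It needs Python 3.3 or later for yield from.

```python
from itertools import combinations

def pruned_combs(elt2color, size, max_colors):
    """Hypothetical simplified variant: extend a prefix one element at a
    time and stop as soon as it would span more than max_colors colors."""
    elts = tuple(elt2color)

    def extend(prefix, start, colors):
        if len(prefix) == size:
            yield prefix
            return
        remaining = size - len(prefix)
        # Stop early enough that 'remaining' elements are still available.
        for i in range(start, len(elts) - remaining + 1):
            new_colors = colors | {elt2color[elts[i]]}
            if len(new_colors) <= max_colors:  # prune doomed prefixes here
                yield from extend(prefix + (elts[i],), i + 1, new_colors)

    return extend((), 0, frozenset())

elt2color = {'A': 'red', 'B': 'red', 'C': 'blue', 'D': 'blue',
             'E': 'green', 'F': 'green', 'G': 'green', 'H': 'yellow',
             'I': 'white', 'J': 'white', 'K': 'black'}

# Reference result: generate everything, then filter.
brute = [c for c in combinations(elt2color, 4)
         if len({elt2color[e] for e in c}) <= 3]

pruned = list(pruned_combs(elt2color, 4, 3))
print(len(pruned), pruned == brute)  # 188 True
```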

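Using itertools

Here's another approach, generating first all solutions with exactly 1 color, then exactly 2 colors, and so on. itertools can be used directly for much of this, but at the lowest level it still needs a custom generator. I find this harder to understand than a fully custom generator, but it may be clearer to you: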

def combs2(elt2color, combination_size=4, max_colors=3):
    from collections import defaultdict
    from itertools import combinations
    color2elts = defaultdict(list)
    for elt, color in elt2color.items():
        color2elts[color].append(elt)

    def at_least_one_from_each(iterables, n):
        if n < len(iterables):
            return # impossible
        if not n or not iterables:
            if not n and not iterables:
                yield ()
            return
        # Must have n - num_from_first >= len(iterables) - 1,
        # so num_from_first <= n - len(iterables) + 1
        for num_from_first in range(1, min(len(iterables[0]) + 1,
                                           n - len(iterables) + 2)):
            for from_first in combinations(iterables[0],
                                           num_from_first):
                for rest in at_least_one_from_each(iterables[1:],
                                             n - num_from_first):
                    yield from_first + rest

    for numcolors in range(1, max_colors + 1):
        for colors in combinations(color2elts, numcolors):
            # Now this gets tricky.  We need to pick
            # combination_size elements across all the colors, but
            # must pick at least one from each color.
            for elements in at_least_one_from_each(
                    [color2elts[color] for color in colors],
                    combination_size):
                yield elements

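I haven't timed these, because I don't care ;-) The fully custom generator's single result list is reused for building each output, which slashes the rate of dynamic memory turnover. The second way creates a lot of memory churn by pasting together multiple levels of from_first and rest tuples - and that's mostly unavoidable because it uses itertools to generate the from_first tuples at each level.

Internally, itertools functions almost always work in a way more similar to the first code sample, and for the same reasons, reusing an internal buffer as much as possible.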
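
One consequence of reusing a single result list is that each output has to be copied before it reaches the caller, which is why combs() ends with yield tuple(result). A toy sketch of the hazard (the generator below is made up for illustration and is not from the answer):

```python
def reuse_buffer(n):
    """Toy generator that yields the same list object every time."""
    buf = [None]
    for i in range(n):
        buf[0] = i
        yield buf  # no copy: the caller sees the shared buffer

# Collecting the raw buffer keeps n references to one list.
aliased = list(reuse_buffer(3))

# Copying each result as it is produced keeps the individual values.
copied = [tuple(b) for b in reuse_buffer(3)]

print(aliased)  # [[2], [2], [2]]
print(copied)   # [(0,), (1,), (2,)]
```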

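One more

This is more to illustrate some subtleties. I thought about what I'd do if I were to implement this functionality in C as an itertools function. All the itertools functions were first prototyped in Python, but in a semi-low-level way, reduced to working with vectors of little integers (no "inner loop" usage of sets, dicts, sequence slicing, or pasting together partial result sequences - sticking as far as possible to O(1) worst-case time operations on dirt-simple native C types after initialization).

At a higher level, an itertools function for this would accept any iterable as its primary argument, and would almost certainly guarantee to return combinations from it in lexicographic index order. So here's code that does all that. In addition to the iterable argument, it also requires an elt2ec mapping, which maps each element from the iterable to its equivalence class (for you, those are strings naming colors, but any object usable as a dict key could be used as an equivalence class):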

def combs3(iterable, elt2ec, k, maxec):
    # Generate all k-combinations from `iterable` spanning no
    # more than `maxec` equivalence classes.
    elts = tuple(iterable)
    n = len(elts)
    ec = [None] * n  # ec[i] is equiv class ordinal of elts[i]
    ec2j = {} # map equiv class to its ordinal
    for i, elt in enumerate(elts):
        thisec = elt2ec[elt]
        j = ec2j.get(thisec)
        if j is None:
            j = len(ec2j)
            ec2j[thisec] = j
        ec[i] = j
    countec = [0] * len(ec2j)
    del ec2j

    def inner(i, j, totalec):
        if i == k:
            yield result
            return
        for j in range(j, jbound[i]):
            thisec = ec[j]
            thiscount = countec[thisec]
            newtotalec = totalec + (thiscount == 0)
            if newtotalec <= maxec:
                countec[thisec] = thiscount + 1
                result[i] = j
                yield from inner(i+1, j+1, newtotalec)
                countec[thisec] = thiscount

    jbound = list(range(n-k+1, n+1))
    result = [None] * k
    for _ in inner(0, 0, 0):
        # Materialize a tuple now: result is mutated as the search continues.
        yield tuple(elts[i] for i in result)

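(Note that this is Python 3 code.) As advertised, nothing in inner() is fancier than indexing a vector with a little integer. The only thing remaining to make it directly translatable to C is removing the recursive generation. That's tedious, and since it wouldn't illustrate anything particularly interesting here, I'm going to ignore that.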
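
The setup step of combs3(), which turns equivalence classes into small integers, can be looked at in isolation. This is a sketch of the same idea in a separate helper; the name class_ordinals and the sample data are made up for illustration.

```python
def class_ordinals(iterable, elt2ec):
    """Map each element to a small integer naming its equivalence class,
    numbering the classes in order of first appearance."""
    ec2j = {}
    ordinals = []
    for elt in iterable:
        # setdefault hands out the next unused ordinal the first time
        # a class is seen, and returns the existing one afterwards.
        ordinals.append(ec2j.setdefault(elt2ec[elt], len(ec2j)))
    return ordinals

elt2ec = {'A': 'red', 'B': 'red', 'C': 'blue', 'D': 'blue', 'E': 'green'}
print(class_ordinals('ABCDE', elt2ec))  # [0, 0, 1, 1, 2]
print(class_ordinals('ECA', elt2ec))    # [0, 1, 2]

# The jbound vector for n = 5 elements and k = 3: slot i of a combination
# may only hold indices below jbound[i], so enough elements remain after it.
n, k = 5, 3
print(list(range(n - k + 1, n + 1)))    # [3, 4, 5]
```

After this step the inner loop only needs list indexing and integer comparisons, which is the property the answer is after.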

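Anyway, the interesting thing is timing it. As noted in a comment, timing results are strongly influenced by the test cases you use. combs3() here is sometimes fastest, but not often! It's almost always faster than my original combs(), but usually slower than my combs2() or @GarethRees's lovely constrained_combinations().

So how can that be when combs3() has been optimized "almost all the way down to mindless ;-) C-level operations"? Easy! It's still written in Python. combs2() and constrained_combinations() use the C-coded itertools.combinations() to do much of their work, and that makes a world of difference. combs3() would run circles around them if it were coded in C.

Of course any of these can run unboundedly faster than the allowed_combinations() in the original post - but that one can be the fastest too (for example, pick just about any input where max_colors is so large that no combinations are excluded - then allowed_combinations() wastes little effort, while all these others add substantial extra overhead to "optimize" pruning that never occurs).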

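【Comments】:

Thanks @Tim! I timed them: combs = 791 ns; combs2 = 632 ns; on the other hand, @Gareth's solution takes about 14 µs. So I'll sacrifice clarity for speed :)
@AlbertoLumbreras: can you say where these timings come from? In my timing tests, Tim Peters' combs2 and my constrained_combinations take similar times. (See the "Performance" section of my answer.)
I used the %%timeit magic in an IPython notebook and executed constrained_combinations(nodes, 4, 3) in that cell. But if I run your test (even still in the notebook) then you're right, combs and constrained_combinations are very similar.
Timing is tricky, because different parts of the algorithms tend to dominate depending on the inputs. For example, the larger combination_size, the more expensive it is to "paste together" result tuples: overall, that takes time essentially quadratic in combination_size. For tiny sizes like 3, 4, 5, 6 that probably doesn't matter. combs doesn't suffer from that (it reuses a single result list of length combination_size) - but combs doesn't work as hard as the others at cutting off doomed partial tuples early. So, use all 3, and pick the fastest for the current input - LOL ;-)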

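【Answer 3】:

Here's an implementation that's a bit simpler than the other answers posted so far. The basic approach is:

Pick a value ("colour" in your terminology) that has not been picked so far;
Loop over i, the number of keys ("elements") associated with that value that will be included in the output;
Loop over c, the combinations of those keys of length i;
Recurse to pick the next value.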

from collections import defaultdict, deque
from itertools import combinations

def constrained_combinations(elements, r, s):
    """Generate distinct combinations of 'r' keys from the dictionary
    'elements' using at most 's' different values. The values must be
    hashable.

        >>> from collections import OrderedDict
        >>> elements = OrderedDict(enumerate('aabbc'))
        >>> cc = constrained_combinations
        >>> list(cc(elements, 2, 1))
        [(0, 1), (2, 3)]
        >>> list(cc(elements, 3, 2))
        [(0, 1, 2), (0, 1, 3), (0, 1, 4), (0, 2, 3), (1, 2, 3), (2, 3, 4)]
        >>> list(cc(elements, 3, 3)) == list(combinations(range(5), 3))
        True
        >>> sum(1 for _ in cc(OrderedDict(enumerate('aabbcccdeef')), 4, 3))
        188

    """
    # 'value_keys' is a map from value to a list of keys associated
    # with that value; 'values' is a list of values in reverse order of
    # first appearance.
    value_keys = defaultdict(list)
    values = deque()
    for k, v in elements.items():
        if v not in value_keys:
            values.appendleft(v)
        value_keys[v].append(k)

    def helper(current, r, s):
        if r == 0:
            yield current
            return
        if s == 0 or not values:
            return
        value = values.pop()
        keys = value_keys[value]
        for i in range(min(r, len(keys)), -1, -1):
            for c in combinations(keys, i):
                for result in helper(current + c, r - i, s - min(i, 1)):
                    yield result
        values.append(value)

    return helper((), r, s)

Notes

In Python 3.3 or later, you can use the yield from statement to simplify the recursive call:
```
yield from helper(current + c, r - i, s - min(i, 1))
```
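If you're wondering why the doctests use collections.OrderedDict, it's so that the combinations are returned in a predictable order, which is necessary for the tests to work.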
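
On Python 3.7 and later, a plain dict also preserves insertion order (this postdates the answer), so the grouping step at the top of constrained_combinations sees the keys in a predictable order either way. A small sketch of that grouping step on its own, using the doctest's 'aabbc' data:

```python
from collections import defaultdict, deque

# Relies on dict insertion order, guaranteed from Python 3.7 on.
elements = dict(enumerate('aabbc'))

value_keys = defaultdict(list)  # value -> keys that carry it
values = deque()                # values in reverse order of first appearance
for k, v in elements.items():
    if v not in value_keys:
        values.appendleft(v)
    value_keys[v].append(k)

print(dict(value_keys))  # {'a': [0, 1], 'b': [2, 3], 'c': [4]}
print(list(values))      # ['c', 'b', 'a']
```

Because helper() takes values with pop() from the right-hand end, the value that appeared first is handled first.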

Performance

This is broadly similar in speed to Tim Peters' combs2:

>>> from timeit import timeit
>>> elements = dict(enumerate('abcde' * 10))
>>> test = lambda f:timeit(lambda:sum(1 for _ in f(elements, 6, 3)), number=1)
>>> test(combs2)
11.403807007009163
>>> test(constrained_combinations)
11.38378801709041

【Comments】: