Using numpy to build an array of all combinations of two arrays: answers

【Question title】: Using numpy to build an array of all combinations of two arrays
【Posted】: 2010-11-15 12:56:50
【Question description】:

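I am trying to run over the parameter space of a 6-parameter function to study its numerical behavior before attempting anything complex with it, so I am looking for an efficient way to do this.

My function takes float values given in a 6-dim numpy array as input. What I tried to do initially was this:

First, I created a function that takes 2 arrays and generates an array with all combinations of values from the two arrays: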

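import numpy as np
from functools import reduce  # reduce is a builtin only in Python 2

def comb(a, b):
    # pair every element (or row) of a with every element of b
    c = []
    for i in a:
        for j in b:
            c.append(np.r_[i, j])
    return c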

Then, I used reduce() to apply that to m copies of the same array:

def combs(a,m):
    return reduce(comb,[a]*m)

Finally, I evaluate my function like this:

values = combs(np.arange(0, 1, 0.1), 6)
for val in values:
    print(F(val))

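This works, but it is way too slow. I know the parameter space is huge, but it shouldn't be this slow. In this example I am only sampling 10⁶ (one million) points, and it took more than 15 seconds just to create the array values.

Do you know a more efficient way of doing this with numpy?

I can modify the way the function F takes its arguments if necessary.

【Question discussion】:

For the fastest Cartesian product I have found, see this answer. (Since that question is phrased quite differently from this one, I don't consider the two to be duplicates, but the best solution to both questions is the same.)

Tags: python arrays numpy multidimensional-array cartesian-product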

【Solution 1】:

Pandas merge offers a simple, fast solution to the problem:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# build DataFrames that all share the same constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# merging on the shared index yields every combination in a new df
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)

【Discussion】:

【Solution 2】:

Here is a pure numpy implementation. It is about 5x faster than using itertools.

Python 3:

import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # integer division: m is used as a slice bound
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

Python 2:


import numpy as np

def cartesian(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

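【Discussion】:

Ever considered submitting this for inclusion in numpy? This is not the first time I have gone looking for this functionality and found your post.
FYI: it seems to have made it into the scikit-learn package as from sklearn.utils.extmath import cartesian
I just realized: this is slightly different from itertools.combinations, because this function respects the ordering of values whereas combinations does not, so this function returns more values than combinations does. Still very impressive, but unfortunately not what I was looking for :(
For posterity, a performant alternative to just using itertools.combinations can be found here: stackoverflow.com/questions/16003217/…
TypeError: slice indices must be integers or None or have an __index__ method is thrown by cartesian(arrays[1:], out=out[0:m,1:])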
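
The TypeError quoted in the last comment is what one would expect when the Python 2 line m = n / arrays[0].size runs under Python 3, where / always returns a float. The following is a minimal sketch of that behavior, not the commenter's actual session; the names sizes, m_float, and m_int are made up for illustration.

```python
import numpy as np

sizes = [3, 2, 2]
n = np.prod(sizes)            # total number of rows in the product (12)

m_float = n / sizes[0]        # true division: a float under Python 3
m_int = n // sizes[0]         # floor division: stays an integer

out = np.zeros((n, len(sizes)), dtype=int)

try:
    out[0:m_float]            # a float slice bound is rejected
    float_slice_ok = True
except TypeError:
    float_slice_ok = False

block = out[0:m_int]          # an integer slice bound works as intended
```

Using // (or wrapping the quotient in int()) keeps m usable as a slice bound.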

【Solution 3】:

In newer versions of numpy (>1.8.x), numpy.meshgrid() provides a much faster implementation:

@pv's solution

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() used to be 2D only; now it is capable of ND. In this case, 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

Note that the order of the final result is slightly different.
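
To check that only the row order differs, one can sort both results and compare them. This is a small sketch rather than part of the original answer: it uses itertools.product as the reference ordering, and the helper sort_rows is a name invented here.

```python
import itertools
import numpy as np

arrays = ([1, 2, 3], [4, 5], [6, 7])

# meshgrid-based product from the timing above (default 'xy' indexing)
mesh_rows = np.array(np.meshgrid(*arrays)).T.reshape(-1, 3)

# reference rows in lexicographic order
ref_rows = np.array(list(itertools.product(*arrays)))

def sort_rows(a):
    # np.lexsort uses the LAST key as the primary one, so the
    # columns are passed in reverse to sort by column 0 first
    return a[np.lexsort(a.T[::-1])]

same_rows = np.array_equal(sort_rows(mesh_rows), sort_rows(ref_rows))
same_order = np.array_equal(mesh_rows, ref_rows)
```

Here same_rows is True while same_order is False: both approaches produce the same 12 combinations, just in a different sequence.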

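【Discussion】:

np.stack(np.meshgrid([1, 2, 3], [4, 5], [6, 7], indexing='ij'), -1).reshape(-1, 3) will give the right order
@CT Zhu Is there an easy way to transform this so that a matrix holding the different arrays as columns can be used as input?
It should be noted that meshgrid only works for smaller sets of ranges. I have a large one and I get the error: ValueError: maximum supported dimension for an ndarray is 32, found 69
@mikkom, nothing will handle sets bigger than 32. Even if each one has size 2, the number of combinations would be 2**32, 4 Gb.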
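
When the full product is too large to materialize, one option is to walk it lazily in fixed-size chunks instead of building a single array. The sketch below relies only on itertools and is not from the thread; product_chunks and chunk_size are hypothetical names.

```python
import itertools
import numpy as np

def product_chunks(arrays, chunk_size):
    """Yield the Cartesian product as 2-D arrays of at most chunk_size rows."""
    it = itertools.product(*arrays)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield np.array(chunk)

# 40 binary axes: far too many rows to store, but chunks are cheap
first = next(product_chunks([[0, 1]] * 40, chunk_size=4))
```

Each chunk can be evaluated and discarded, so memory use is bounded by chunk_size, although visiting all 2**40 rows would still take a very long time.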

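【Solution 4】:

You can use np.array(list(itertools.product(a, b))). The list() call is needed because np.array() does not consume an iterator; passing the product object directly yields a 0-d object array.

【Discussion】: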

np.array(list(itertools.product(l, l2)))
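
The difference between the two spellings can be seen in a short sketch (the input lists a and b are made up for illustration):

```python
import itertools
import numpy as np

a = [1, 2, 3]
b = [4, 5]

# np.array does not iterate over a generic iterator: it wraps the
# product object itself in a 0-d object array
wrapped = np.array(itertools.product(a, b))

# materializing the iterator first gives the expected 2-D array
pairs = np.array(list(itertools.product(a, b)))
```

wrapped has shape () and dtype object, whereas pairs has shape (6, 2).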

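【Solution 5】:

For a pure numpy implementation of the Cartesian product of 1D arrays (or flat python lists), just use meshgrid(), roll the axes with transpose(), and reshape to the desired output: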

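import numpy as np

def cartprod(*arrays):
    N = len(arrays)
    # stack the N grids, move the stacking axis to the end, flatten to rows
    return np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1)).reshape(-1, N)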

Note that this follows the convention of the last axis changing fastest ("C style" or "row-major").

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

If you want the first axis to change fastest ("FORTRAN style" or "column-major"), just change the order parameter of reshape() like this: reshape((-1, N), order='F')
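
As a small illustration of the order='F' variant, here is a sketch on a two-array input. The function name cartprod_f is made up here so that it does not clash with cartprod.

```python
import numpy as np

def cartprod_f(*arrays):
    N = len(arrays)
    grid = np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1))
    # order='F' makes the FIRST input array vary fastest across rows
    return grid.reshape((-1, N), order='F')

rows = cartprod_f([1, 2, 3], [4, 8])
# rows:
# [[1 4]
#  [2 4]
#  [3 4]
#  [1 8]
#  [2 8]
#  [3 8]]
```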

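【Discussion】:

【Solution 6】:

Here is yet another way, using pure NumPy, with no recursion, no list comprehension, and no explicit for loops. It is about 20% slower than the original answer, and it is based on np.meshgrid.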

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

For example,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

gives

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

【Discussion】:

【Solution 7】:

The following numpy implementation should be approx. 2x the speed of the given answer:

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = (len(x) for x in arrays)

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    for n, arr in enumerate(arrays):
        ix[:, n] = arrays[n][ix[:, n]]

    return ix

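【Discussion】:

Looks good. By my rudimentary tests, this looks faster than the original answer for all pairs, triples, and 4-tuples of {1,2,...,100}. After that, the original answer wins. Also, for future readers looking to generate all k-tuples of {1,...,n}, np.indices((n,...,n)).reshape(k,-1).T will do.
This only works for integers, while the accepted answer also works for floats.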
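
The integer-only limitation arises because cartesian2 writes the looked-up values back into the integer index array ix. One possible way around it, sketched here as an assumption rather than as the answerer's own fix, is to write into a separate output array whose dtype comes from np.result_type. The name cartesian2_any_dtype is hypothetical.

```python
import numpy as np

def cartesian2_any_dtype(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(a) for a in arrays)

    # integer index grid, one row per combination
    ix = np.indices(shape, dtype=int).reshape(len(arrays), -1).T

    # separate output array with a dtype that can hold every input
    out = np.empty(ix.shape, dtype=np.result_type(*arrays))
    for n, arr in enumerate(arrays):
        out[:, n] = arr[ix[:, n]]
    return out

rows = cartesian2_any_dtype([[0.5, 1.5], [1, 2]])
```

With float and integer inputs mixed as above, rows comes out as a float array and 0.5 and 1.5 are not truncated.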

【Solution 8】:

You can do something like this

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

which gives

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

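【Discussion】:

Is there a way to get NumPy to accept more than 32 arrays for meshgrid? This method works for me as long as I do not pass more than 32 arrays.

【Solution 9】:

It looks like you want a grid to evaluate your function, in which case you can use numpy.ogrid (open) or numpy.mgrid (fleshed out):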

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]
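
If the function expects one parameter vector per call, the dense grid can be flattened into a list of points. The sketch below uses 3 axes with a coarser step to keep the output small; those sizes are illustrative choices, not values from the answer.

```python
import numpy as np

axis = slice(0, 1, 0.25)          # 4 samples per axis: 0, 0.25, 0.5, 0.75
grid = np.mgrid[(axis,) * 3]      # shape (3, 4, 4, 4)

# one row per grid point, last coordinate varying fastest
points = grid.reshape(3, -1).T    # shape (64, 3)
```

The same reshape with 6 in place of 3 would apply to the 6-parameter grid above.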

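【Discussion】:

【Solution 10】:

itertools.combinations is in general the fastest way to get combinations from a Python container (if you do in fact want combinations, i.e., arrangements WITHOUT repetitions and independent of order; that is not what your code appears to be doing, but I can't tell whether that's because your code is buggy or because you're using the wrong terminology).

If you want something different from combinations, perhaps other iterators in itertools, such as product or permutations, might serve you better. For example, it looks like your code is roughly the same as: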

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print(F(val))

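All of these iterators yield tuples, not lists or numpy arrays, so if your F is picky about getting specifically a numpy array, you will have to accept the extra overhead of constructing, or clearing and re-filling, one at each step.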
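
A minimal sketch of the per-step conversion, assuming F accepts a 1-D array. The F below is a hypothetical stand-in for the asker's function, and the grid is deliberately coarse so the example stays small.

```python
import itertools
import numpy as np

def F(x):
    # hypothetical stand-in for the asker's 6-parameter function
    return float(np.sum(x ** 2))

grid = np.arange(0, 1, 0.5)       # two samples per axis: 0.0 and 0.5

# each tuple from product is converted to an array before calling F
results = [F(np.asarray(val)) for val in itertools.product(grid, repeat=6)]
```

This evaluates F at 2**6 = 64 points; the np.asarray call on every iteration is the overhead the answer refers to.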

【Discussion】: