numpy是如何实现多维广播的？答案

【问题标题】：How did numpy implement multi-dimensional broadcasting?numpy是如何实现多维广播的？
【发布时间】：2016-09-21 20:42:04
【问题描述】：

内存（行主要顺序）：

[[A(0,0), A(0,1)]
 [A(1,0), A(1,1)]]

has this memory layout: 
[A(0,0), A(0,1), A(1,0), A(1,1)]

我猜算法在以下情况下是这样工作的。

广播维度是最后一个维度：

[[0, 1, 2, 3]         [[1]
                  x
 [4, 5, 6, 7]]         [10]]

   A (2 by 4)            B (2 by 1)

Iterate 0th dimensions of A and B simultaneously {
    Iterate last dimension of A{
        multiply;
    } 
}

广播维度是第0个维度：

[[0, 1, 2, 3]   
                  x    [[1,10,100,1000]]
 [4, 5, 6, 7]]

   A (2 by 4)              B (1 by 4)

Iterate 0th dimension of A{
    Iterate 1st dimensions of A and B simultaneously{
        multiply;
    }
}

问题：

numpy 如何知道哪种乘法顺序最好。（按顺序读取内存比到处读取内存要好。但是 numpy 是如何计算出来的呢？）
如果数组的维度超过二维，numpy 会做什么
如果广播维度不是最后一个维度，numpy 会做什么？

第二次猜测发生了什么：

#include <iostream>
int main(void){
    const int nA = 12;
    const int nB = 3;
    int A[nA];
    int B[nB];
    for(int i = 0; i != nA; ++i) A[i] = i+1;
    for(int i = 0; i != nB; ++i) B[i] = i+1;
    //dimension
    int dA[] = {2,3,2};
    int dB[] = {1,3,1};

    int* pA = A;
    int* pB = B;
    int* pA_end = A + nA;
    //is it possible to make the compiler
    //generate the iA and sA?
    int iB = 0;
    int iB_max = 2;
    int sB[] = {1,0};

    while(pA != pA_end){
        std::cout << "*pA, *pB: " << *pA << ", " << *pB <<std::endl;
        std::cout << "iB: " << iB <<std::endl;
        *(pA) *= *(pB);
        ++pA;
        pB += sB[iB];
        ++iB;
        if (iB == iB_max) {iB = 0; pB = B;}
    }

    for(pA = A; pA != pA_end; ++pA){
        std::cout << *(pA) << ", ";
    }
    std::cout << std::endl;       
}

【问题讨论】：

第二个猜测似乎是错误的。 numpy 似乎使用多索引。

标签： python c numpy

【解决方案1】：

要真正了解广播细节，您需要了解数组形状和步幅。但是现在很多工作都是使用nditer 在c 代码中实现的。您可以在http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html 阅读有关它的信息。 np.nditer 使您可以在 Python 级别访问该工具，但它的真正价值在于与 cython 或您自己的 c 代码一起使用。

np.lib.stride_tricks 具有让您大步前进的功能。它的一项功能有助于可视化数组是如何一起广播的。在实践中，工作是使用nditer 完成的，但这个函数可能有助于理解操作：

In [629]: np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                  np.array([[1],[2]]))
Out[629]: 
[array([[0, 1, 2],
        [3, 4, 5]]), 
 array([[1, 1, 1],
        [2, 2, 2]])]

请注意，实际上第二个数组已被复制以匹配第一个的形状。但是复制是通过跨步技巧完成的，而不是实际的副本。

In [631]: A,B=np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                      np.array([[1],[2]]))
In [632]: A.shape
Out[632]: (2, 3)
In [633]: A.strides
Out[633]: (12, 4)
In [634]: B.shape
Out[634]: (2, 3)
In [635]: B.strides
Out[635]: (4, 0)

正是这个(4,0) strides 进行复制而不复制。

==================

使用python级别nditer，这是它在广播期间的作用。

In [1]: A=np.arange(6).reshape(2,3)
In [2]: B=np.array([[1],[2]])

普通的 nditer 一次提供一组元素 http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#using-an-external-loop

In [5]: it =np.nditer((A,B))
In [6]: for a,b in it:
   ...:     print(a,b)

0 1
1 1
2 1
3 2
4 2
5 2

但是当我打开extenal_loop 时，它会以块的形式迭代，这里是广播数组的相应行：

In [7]: it =np.nditer((A,B), flags=['external_loop'])
In [8]: for a,b in it:
   ...:     print(a,b)

[0 1 2] [1 1 1]
[3 4 5] [2 2 2]

通过更复杂的广播，external_loop 仍会生成一维数组，允许简单的c 样式迭代：

In [13]: A1=np.arange(24).reshape(3,2,4)
In [18]: it =np.nditer((A1,np.arange(3)[:,None,None]), flags=['external_loop'])
In [19]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([0, 0, 0, 0, 0, 0, 0, 0]))
(array([ 8,  9, 10, 11, 12, 13, 14, 15]), array([1, 1, 1, 1, 1, 1, 1, 1]))
(array([16, 17, 18, 19, 20, 21, 22, 23]), array([2, 2, 2, 2, 2, 2, 2, 2]))

请注意，虽然A1 是 (3,2,4)，但 nditer 循环会产生 3 个步骤（第一个轴）和 2*4 长度的元素。

我在另一个cython/nditer SO 问题中发现，第一种方法并没有产生很大的速度提升，但第二种方法帮助很大。在c 或cython 中，external_loop 的情况会进行简单的低级迭代。

================

如果我在第 1 和第 3 轴上广播，迭代器需要 2*3 步（有效地压平第 2 轴，并馈送第 3 轴）：

In [20]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop'])
In [21]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([0, 1, 2, 3]), array([0, 0, 0, 0]))
(array([4, 5, 6, 7]), array([1, 1, 1, 1]))
(array([ 8,  9, 10, 11]), array([0, 0, 0, 0]))
(array([12, 13, 14, 15]), array([1, 1, 1, 1]))
(array([16, 17, 18, 19]), array([0, 0, 0, 0]))
(array([20, 21, 22, 23]), array([1, 1, 1, 1]))

但是使用buffered，它会迭代一次，为我提供 2 个一维数组：

In [22]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop','buffered'])
In [23]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]), 
 array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]))

Does Cython offer any reasonably easy and efficient way to iterate Numpy arrays as if they were flat? 有一些速度测试，表明缓冲的外部循环是最快的

cython 将其转换为快速简单的c 迭代：

for xarr in it:
   x = xarr
   size = x.shape[0]
   for i in range(size):
       x[i] = x[i]+1.0

【讨论】：

np.lib.stride_tricks.broadcast_arrays 拼写更好np.broadcast_arrays
如果您需要更深入地挖掘，虽然整个 stride_tricks.py 文件是很好的阅读材料。
我尝试阅读 nditer c 代码，但还不太明白。 nditer 会牺牲性能吗？ niter 是否将嵌套循环视为具有可变步幅的单个循环？
nditer 代码我还没有研究过；有了所有选项，这很复杂。但是我添加了一些示例来展示它是如何迭代的。需要external_loop 才能获得最佳的速度提升。