Armadillo (C++) 中的快速数组置换（广义张量转置）答案

【问题标题】：Fast array permutation (generalised tensor transpose) in Armadillo (C++)Armadillo (C++) 中的快速数组置换（广义张量转置）
【发布时间】：2018-08-14 13:09:59
【问题描述】：

我有一个项目涉及 3D 数组 (arma::Cube<cx_double>) 上的大量排列。特别是，所需的排列是按切片交换列。在 Matlab 中，这由permute(cube,[1,3,2]) 有效计算，在 Python 中由numpy.transpose(cube,axis=[0,2,1]) 有效计算。

不幸的是，犰狳本身没有permute 函数。我尝试了不同的方法，但与 Matlab 相比，它们都相当慢。 我想知道在犰狳中置换（相当大）立方体的更快方法是什么。使用gprof 分析代码，大部分时间花在我在下面尝试过的置换函数上，而在 Matlab 中，对于同一个移植项目，大部分时间花在 SVD 或 QR 矩阵分解（重塑和置换在 matlab 中速度很快）。

我想了解在犰狳中执行这种排列的最快方法是什么，以及为什么某些方法比其他方法效果更好。

选项 1：原始排列（最快的选项）（有更快的方法吗？）

输入多维数据集到输出多维数据集的元素分配。

template <typename T>
static Cube<T> permute (Cube<T>& cube){

uword D1=cube.n_rows;
uword D2=cube.n_cols;
uword D3=cube.n_slices;

Cube<T> output(D1,D3,D2);

for (uword s = 0; s < D3; ++s){
for (uword c = 0; c < D2; ++c){
for (uword r = 0; r < D1; ++r){
    output.at(r, s, c) = cube.at(r, c, s);
    // output[ D1*D3*c + D1*s+ r ] = cube[ D1*D2*s + D1*c + r ]; 

}
}
}

return output;
}

选项 2：填充切片（非常慢）

用非连续的subcube 视图填充输出立方体的切片。

template <typename T>
static Cube<T> permute (Cube<T>& cube_in){

    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    Cube<T> output; 
    output.zeros(D1, D3, D2);

    for (uword c=0; c<D2; ++c) {
        output.slice(c) = cube_in.subcube( span(0,D1-1),span(c),span(0,D3-1) );
    }

    return output;
}

选项 3：转置层（比原始排列慢但可比）

我们可以遍历输入立方体的层（固定行）并转置它们。

template <typename T>
static Cube<T> permute (Cube<T>& cube_in){
    // in a cube, permute {1,3,2} (permute slices by columns)

    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    if(D3 > D2){
        cube_in.resize(D1,D3,D3);
    } else if (D2 > D3) {
        cube_in.resize(D1,D2,D2);
    }

    for (uword r=0; r<D1; ++r) {        
        static cmat layer = cmat(cube_in.rows(r,r)); 
        inplace_strans(layer);
        cube_in.rows(r,r)=layer;               
    }

    cube_in.resize(D1,D3,D2);

    return cube_in;
}

选项 4：查找表 通过读取向量中的索引来获取非连续访问。

template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in){
    // in a cube, permute {1,3,2} (permute slices by columns)
    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    cx_vec onedcube = cube_in.elem(gen_trans_idx(cube_in));            

    return arma::Cube<T>(onedcube.memptr(), D1, D3, D2, true ) ;
}

其中gen_trans_idx 是一个生成置换多维数据集索引的函数：

template <typename T>
uvec gen_trans_idx(Cube<T>& cube){

    uword D1 = cube.n_rows;
    uword D2 = cube.n_cols;
    uword D3 = cube.n_slices;

    uvec perm132(D1*D2*D3);    
    uword ii = 0;                
    for (int c = 0; c < D2; ++c){
    for (int s = 0; s < D3; ++s){
    for (int r = 0; r < D1; ++r){
        perm132.at(ii) = sub2ind(size(cube), r, c, s);
        ii=ii+1;
    }}} 

    return perm132;
}

理想情况下，如果事先确定了多维数据集的尺寸，则可以预先计算这些查找表。

选项 5（就地转置）非常慢，内存高效

// Option: In-place transpose
template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in, uvec permlist ){    

    T* Qpoint = cube_in.memptr(); // pointer to first element of cube_in

    uvec updateidx = find(permlist - arma::linspace<uvec>(0,cube_in.n_elem-1,cube_in.n_elem)); // index of elements that change position in memory           
    uvec skiplist(updateidx.n_elem,fill::zeros);

    uword rr = 0; // aux index for updatelix
    for(uword jj=0;jj<updateidx.n_elem;++jj){                

        if(any(updateidx[jj] == skiplist)){ // if element jj has already been updated
            // do nothing
        } else {

            uword scope = updateidx[jj];
            T target = *(Qpoint+permlist[scope]);        // store the value of the target element             

            while(any(scope==skiplist)-1){              // while wareyou has not been updated

                T local  = *(Qpoint+scope);                  // store local value                                                                
                *(Qpoint+scope) = target;       
                skiplist[rr]=scope;
                ++rr;            
                uvec wareyou = find(permlist==scope);        // find where the local value will appear                
                scope = wareyou[0];
                target = local;                            
            }

        }
    }
   cube_in.reshape(cube_in.n_rows,cube_in.n_slices,cube_in.n_cols);

    return cub

e_in;
}

【问题讨论】：

我认为 .t 或 .st 转置了 amadrillo matix。 arma.sourceforge.net/docs.html
.tor .st 是 arma::Mat 的方法，而不是 arma::Cube
有人告诉我，RcppArmadillo 用户经常使用 Cube，也许我可以从他们的经验中获得一些见解。
相关问题：stackoverflow.com/questions/35796248/…
嗨@CrafterKolyan，你对output的初始化是正确的（我编辑了我的答案）。关于使用memcpy 的替代方案，我不确定如何实现此选项。在选项 4 中，我在向量中不连续地访问元素，并使用指向该向量的指针构造一个新的 Cube。您能否将您的评论发展为答案？非常感谢

标签： c++ permutation rcpp transpose armadillo

【解决方案1】：

此代码是我对memcpy hack 的评论的补充。另外不要忘记尝试添加const reference 以防止复制对象。

template <typename T>
static Cube<T> permute(const Cube<T> &cube){
    const uword D1 = cube.n_rows;
    const uword D2 = cube.n_cols;
    const uword D3 = cube.n_slices;
    const uword D1_mul_D3 = D1 * D3;
    const Cube<T> output(D1, D3, D2);

    const T * from = cube.memptr();
    T *to = output.memptr();

    for (uword s = 0; s < D3; ++s){
        T *to_tmp = to + D1 * s;
        for (uword c = 0; c < D2; ++c){
            memcpy(to_tmp, from, D1 * sizeof(*from));
            from += D1;
            to_tmp += D1_mul_D3;
        }
    }

    return output;
}

【讨论】：