Permuting the rows of a matrix (and corresponding vector) using CUDA/Thrust

【Question title】: Permuting the rows of a matrix (and corresponding vector) using CUDA/Thrust
【Posted】: 2016-05-06 05:06:33
【Question】:

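I want to permute the rows of a matrix that is stored as a single contiguous array (that is, backed by one vector in row-major, C-style layout), and apply the same permutation to the elements of a corresponding vector.

Assume the matrix has dimensions R x C and the corresponding vector has R elements.

My current idea is to generate a permutation of the R indices and then permute the vector using thrust::stable_sort_by_key, as shown here.

I could then create a second permutation vector that repeats each element of the first one C times.

So if R = 4, C = 3 and the original permutation index vector is [4, 2, 3, 1], the permutation vector for the matrix would be [4, 4, 4, 2, 2, 2, 3, 3, 3, 1, 1, 1]. Because the sort is stable, the elements within each matrix row should keep their relative order.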
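To make the key-expansion idea concrete, here is a minimal host-side sketch in plain C++. The helper names `expand_row_keys` and `stable_sort_by_key_host` are hypothetical, and `std::stable_sort` is only a stand-in for what thrust::stable_sort_by_key would do on the device; this is an illustration of the scheme, not a Thrust implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Repeat each entry of the row permutation `cols` times, giving one key per
// matrix element: [4,2,3,1] with cols = 3 -> [4,4,4,2,2,2,3,3,3,1,1,1].
// (Hypothetical helper, not part of Thrust.)
std::vector<unsigned> expand_row_keys(const std::vector<unsigned>& perm,
                                      std::size_t cols) {
    std::vector<unsigned> keys;
    keys.reserve(perm.size() * cols);
    for (unsigned row_key : perm)
        keys.insert(keys.end(), cols, row_key);
    return keys;
}

// Host stand-in for a stable sort-by-key: returns `values` reordered by
// ascending key, with equal-key elements kept in their original order.
template <typename T>
std::vector<T> stable_sort_by_key_host(const std::vector<unsigned>& keys,
                                       const std::vector<T>& values) {
    assert(keys.size() == values.size());
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    std::vector<T> sorted;
    sorted.reserve(values.size());
    for (std::size_t i : order) sorted.push_back(values[i]);
    return sorted;
}
```

One thing this sketch makes visible: sorting by these keys sends row i to position perm[i] (a scatter), which matches "output row i is input row perm[i]" (a gather) only when the permutation is its own inverse, as [4, 2, 3, 1] happens to be. For a random shuffle the distinction probably does not matter.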

My question is whether there is a better/more efficient way to do this, using Thrust or plain CUDA.

Example:

Original matrix:

[ 1 1 1 1 ]
[ 2 2 2 2 ]
[ 3 3 3 3 ]
[ 4 4 4 4 ]
[ 5 5 5 5 ]

Original vector:

[1 2 3 4 5]

Permutation order:

[5 3 1 2 4]

Permuted matrix:

[ 5 5 5 5 ]
[ 3 3 3 3 ]
[ 1 1 1 1 ]
[ 2 2 2 2 ]
[ 4 4 4 4 ]

Permuted vector:

[5 3 1 2 4]

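My use case is that I have a feature matrix with one row per example and a vector of the corresponding labels. I want to permute the matrix and apply the same permutation to the vector, as a shuffling step before each SGD iteration. The reason I want contiguous rows to iterate over is that I plan to use cuBLAS gemv for the matrix-vector operations, which assumes the matrix is laid out in memory in a similar way (although it expects column-major format, which means I would need to call it like this).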

【Comments】:

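Could you add a small, complete example that illustrates what you want to achieve? What is the input matrix, and what output do you want? What do you intend to do after the permutation?
Because reads and writes to global memory are very expensive, it is hard to benefit from CUDA if all you do is move data around. If the matrix is very large (1000 columns) you might eke out some improvement, but it will not be dramatic.
If you are working in Thrust, I suggest not permuting or sorting the matrix at all. Use a computed lookup index to retrieve the correct elements according to the desired row permutation.
@RobertCrovella My thinking (which is also borne out in CPU implementations of SGD) is that, because I have to traverse the entire feature matrix, making the data access contiguous rather than random outweighs the cost of permuting the matrix before each iteration. The shuffling is necessary to ensure convergence.
When coding for a GPU, ensuring that data access is contiguous is a very good goal. It promotes efficient use of the memory subsystem (coalesced access). However, the contiguous nature of each row, permuted or not, should be enough to meet that goal for a matrix of reasonably large width. I am not suggesting that you should not shuffle algorithmically; of course that is necessary. My suggestion is to avoid the data movement associated with the shuffle, and you should be able to do so without sacrificing coalesced access.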
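The suggestion in these comments, to look rows up through the permutation instead of moving them, can be sketched on the host as follows. The function name `shuffled_row_dot` and the per-row dot product are assumptions chosen to resemble one SGD step; the thread itself does not show this code. The only indirection is one lookup per row, after which the row is read contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of logical row `i` of the shuffled matrix with `w`, computed
// without moving any data. The source row is found through `perm` (1-based,
// as in the question) and then read contiguously.
// (Hypothetical helper for illustration.)
float shuffled_row_dot(const std::vector<float>& mat, std::size_t cols,
                       const std::vector<unsigned>& perm, std::size_t i,
                       const std::vector<float>& w) {
    assert(i < perm.size() && w.size() == cols);
    const std::size_t src_row = perm[i] - 1;        // convert to 0-based
    assert((src_row + 1) * cols <= mat.size());
    const float* row = mat.data() + src_row * cols; // start of the source row
    float acc = 0.0f;
    for (std::size_t j = 0; j < cols; ++j)
        acc += row[j] * w[j];                       // contiguous reads within the row
    return acc;
}
```

In a CUDA kernel the same pattern would presumably have adjacent threads read adjacent columns of one row, so accesses within a row stay coalesced even though the rows are visited in shuffled order.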

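Tags: matrix cuda permutation thrust

【Solution 1】:

My question is whether there is a better/more efficient way to do this, using Thrust

I believe there is. The permutation vector gives you all the information needed to copy the contents of the input matrix directly into the permuted matrix, with no sorting required.

A useful Thrust feature here is the permutation_iterator. A permutation iterator lets us reorder, on the fly, the input elements we select for any operation. If we supply an appropriate index-computing functor, we can feed it a linear index (via counting_iterator) and wrap the result (via transform_iterator) to produce the appropriate permuted input index for each element of a copy operation.

Here is a worked example: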

$ cat t1061.cu
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
#include <assert.h>

typedef int mytype;

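// Maps a linear index into the permuted (output) matrix to the linear index
// of the source element in the original matrix. p holds 1-based row numbers.
struct copy_idx_func : public thrust::unary_function<unsigned, unsigned>
{
  size_t c;      // number of columns
  unsigned *p;   // device pointer to the row permutation (1-based)
  copy_idx_func(const size_t _c, unsigned *_p) : c(_c),p(_p) {}
  __host__ __device__
  unsigned operator()(unsigned idx){
    unsigned myrow = idx/c;        // destination row
    unsigned newrow = p[myrow]-1;  // source row, converted to 0-based
    unsigned mycol = idx%c;        // the column is unchanged
    return newrow*c+mycol;
  }
};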


int main(){

  const mytype mat[]   = {1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5};
  const mytype vec[]   = {1,2,3,4,5};
  const unsigned per[] = {5,3,1,2,4};

  const size_t msize = sizeof(mat)/sizeof(mytype);
  const size_t vsize = sizeof(vec)/sizeof(mytype);
  const size_t psize = sizeof(per)/sizeof(unsigned);
  const size_t cols  = msize/vsize;
  // const size_t rows  = vsize;
  assert(msize%vsize == 0);
  assert(vsize == psize);

  thrust::device_vector<mytype>   d_m(mat, mat+msize);
  thrust::device_vector<mytype>   d_v(vec, vec+vsize);
  thrust::device_vector<unsigned> d_p(per, per+psize);
  thrust::device_vector<mytype>   d_rm(msize);
  thrust::device_vector<mytype>   d_rv(vsize);
  std::cout << "Initial Matrix:" << std::endl;
  thrust::copy_n(d_m.begin(), msize, std::ostream_iterator<mytype>(std::cout, ","));

  // permute the matrix: output element i is read from source index copy_idx_func(i)
  thrust::copy_n(
    thrust::make_permutation_iterator(
      d_m.begin(),
      thrust::make_transform_iterator(
        thrust::counting_iterator<unsigned>(0),
        copy_idx_func(cols, thrust::raw_pointer_cast(d_p.data())))),
    msize, d_rm.begin());

  std::cout << std::endl << "Permuted Matrix:" << std::endl;
  thrust::copy_n(d_rm.begin(), msize, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl << "Initial Vector:" << std::endl;
  thrust::copy_n(d_v.begin(), vsize, std::ostream_iterator<mytype>(std::cout, ","));

  // permute the vector: same functor, treating the vector as a one-column matrix
  thrust::copy_n(
    thrust::make_permutation_iterator(
      d_v.begin(),
      thrust::make_transform_iterator(
        thrust::counting_iterator<unsigned>(0),
        copy_idx_func(1, thrust::raw_pointer_cast(d_p.data())))),
    vsize, d_rv.begin());

  std::cout << std::endl << "Permuted Vector:" << std::endl;
  thrust::copy_n(d_rv.begin(), vsize, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;
}

$ nvcc -o t1061 t1061.cu
$ ./t1061
Initial Matrix:
1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,
Permuted Matrix:
5,5,5,5,5,3,3,3,3,3,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,
Initial Vector:
1,2,3,4,5,
Permuted Vector:
5,3,1,2,4,
$
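For checking a device result like the one above, a host-side reference of the same index mapping can be handy. The following is a minimal sketch in plain C++; the function name `permute_rows_reference` is hypothetical, and it assumes the same convention as the example (1-based permutation entries, output row i read from input row perm[i]).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference for the gather performed by the device copy:
// out[row * cols + col] = in[(perm[row] - 1) * cols + col].
// (Hypothetical helper; perm is 1-based, as in the example above.)
template <typename T>
std::vector<T> permute_rows_reference(const std::vector<T>& in, std::size_t cols,
                                      const std::vector<unsigned>& perm) {
    assert(cols > 0 && in.size() == perm.size() * cols);
    std::vector<T> out(in.size());
    for (std::size_t idx = 0; idx < out.size(); ++idx) {
        const std::size_t dst_row = idx / cols;          // destination row
        const std::size_t src_row = perm[dst_row] - 1;   // source row, 0-based
        assert(src_row < perm.size());
        const std::size_t col = idx % cols;              // the column is unchanged
        out[idx] = in[src_row * cols + col];
    }
    return out;
}
```

Calling it with cols = 1 permutes the label vector, mirroring how the device code reuses one functor for both the matrix and the vector.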

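Notes:

Operationally, permuting the vector is the same as permuting the matrix. We simply treat the vector as a matrix with one column.
As discussed in the comments, if the use case stays entirely within Thrust, it may not be necessary to copy the elements at all. permutation_iterator lets us select elements from the original matrix in any permuted order, and we can simply pass that construct to any Thrust operation that needs the original matrix in permuted order.

【Discussion】: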

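Could swap_ranges be used instead of copy_n in this case? I know this is premature optimization, but in theory, if my matrix were so large that a copy could not fit in global memory, I could use it to swap rows within the matrix instead of copying them into a new one, right?
I think you should be able to do that if you replace your permutation vector with a sequence of swaps. I expect it would be slower than a single copy operation. Incidentally, if the matrix is so large that a copy will not fit in global memory, your sorting approach is unlikely to work either. Sorting requires O(n) temporary storage, and when you call sort, Thrust will attempt to allocate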
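One way to turn a permutation into a sequence of row swaps is to follow its cycles. The sketch below is a sequential, host-side illustration in plain C++ under that assumption; the function name `permute_rows_in_place` is hypothetical, and the thread does not say how the swaps would be derived or parallelised on the GPU. It needs only one flag per row as extra storage, instead of a second copy of the matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Apply "output row i = input row perm[i]" in place by walking each cycle of
// the permutation and swapping whole rows. perm is 1-based and must be a valid
// permutation of 1..rows. (Hypothetical helper for illustration.)
template <typename T>
void permute_rows_in_place(std::vector<T>& mat, std::size_t cols,
                           const std::vector<unsigned>& perm) {
    const std::size_t rows = perm.size();
    assert(mat.size() == rows * cols);
    std::vector<bool> visited(rows, false);      // one flag per row
    for (std::size_t start = 0; start < rows; ++start) {
        if (visited[start]) continue;            // row already placed by an earlier cycle
        std::size_t cur = start;
        while (true) {
            visited[cur] = true;
            const std::size_t next = perm[cur] - 1;  // row whose data belongs at `cur`
            assert(next < rows);
            if (next == start) break;            // cycle closed; `cur` already holds its data
            assert(!visited[next]);              // would fail for an invalid permutation
            std::swap_ranges(mat.begin() + cur * cols,
                             mat.begin() + (cur + 1) * cols,
                             mat.begin() + next * cols);
            cur = next;
        }
    }
}
```

Each cycle of length k costs k - 1 row swaps, so every element may be read and written more often than in a single out-of-place copy, which is consistent with the expectation above that swapping is slower.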