CUDA Sort Z-Axis 3D Array C++/Thrust (answers)

【Question title】: CUDA Sort Z-Axis 3D Array C++/Thrust
【Posted】: 2021-05-27 07:48:13
【Question】:

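I want to sort a large 3D array along its z-axis.

An example array has dimensions X x Y x Z (1000x1000x5).

Since I want to sort along the z-axis, that means performing 1000x1000 independent sorts of 5 elements each.

Edit/update: my attempt with Thrust is below. It works, and I would store the output back into the cube, but it is very slow because it sorts only 5 elements at a time, once for each (x,y) position: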

#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>

int main(){
int x = 1000, y = 1000, z = 5;
float*** unsorted_cube = new float** [x];

for (int i = 0; i < x; i++) 
{
    // Allocate memory blocks for 
    // rows of each 2D array 
    unsorted_cube[i] = new float* [y];

    for (int j = 0; j < y; j++) 
    {
        // Allocate memory blocks for 
        // columns of each 2D array 
        unsorted_cube[i][j] = new float[z];
    }
}


for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
    {
        unsorted_cube[i][j][0] = 4.0f;
        unsorted_cube[i][j][1] = 3.0f;
        unsorted_cube[i][j][2] = 1.0f;
        unsorted_cube[i][j][3] = 5.0f;
        unsorted_cube[i][j][4] = 2.0f;
    }
}

for (int i = 0; i < 5; i++)
{
    printf("unsorted_cube first 5 elements to sort at (0,0): %f\n", unsorted_cube[0][0][i]);
}

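float temp_input[5];   // host staging buffer for one z-column
float temp_output[5];  // host copy of the sorted column, used for printing
float* raw_ptr;        // device buffer holding the column being sorted
float raw_ptr_out[5];
cudaMalloc((void**)&raw_ptr, z * sizeof(float));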
for (int i = 0; i < x; i++)
{ 
    for (int j = 0; j < y; j++)
    {
        temp_input[0] = unsorted_cube[i][j][0];
        temp_input[1] = unsorted_cube[i][j][1];
        temp_input[2] = unsorted_cube[i][j][2];
        temp_input[3] = unsorted_cube[i][j][3];
        temp_input[4] = unsorted_cube[i][j][4];

        cudaMemcpy(raw_ptr, temp_input, 5 * sizeof(float), cudaMemcpyHostToDevice);
        thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(raw_ptr);
        thrust::sort(dev_ptr, dev_ptr + 5);
        thrust::host_vector<float> host_vec(5);
        thrust::copy(dev_ptr, dev_ptr + 5, raw_ptr_out);

        if (i == 0 && j == 0)
        {
            for (int i = 0; i < 5; i++)
            {
                temp_output[i] = raw_ptr_out[i];
            }
            printf("sorted_cube[0,0,0] : %f\n", temp_output[0]);
            printf("sorted_cube[0,0,1] : %f\n", temp_output[1]);
            printf("sorted_cube[0,0,2] : %f\n", temp_output[2]);
            printf("sorted_cube[0,0,3] : %f\n", temp_output[3]);
            printf("sorted_cube[0,0,4] : %f\n", temp_output[4]);
        }
    }
}
}

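【Comments】:

Hi, interesting question. Not sure whether this helps: stackoverflow.com/questions/49818185/…
Hi! Thanks. That is a good reference for restructuring the array. However, I would like to stick with C++ for the implementation if possible.
Generate a unique key for each (x,y) coordinate to label its z-axis sub-array, then use sort-by-key. If you are clever, you can probably generate the keys with a fancy iterator so that they never need to be stored in memory.
I just updated my code; the attempt works but is very slow. I will look into keys. I don't know how to do that yet, but I will try to find an example. Any reference would be great... I'm not sure how keys work.
Whatever your solution ends up being, I don't think it will use thrust::sort, because that is meant for sorting very large lists. It can be done by zipping all 5 values into a tuple with thrust::zip_iterator and then using thrust::for_each to visit every coordinate in the xy-plane. For sorting such a small number of elements, you could look at this answer, for example: stackoverflow.com/a/2748811/10107454 (assuming you don't plan on having more slices in the z-direction in the future). The sorting algorithm would be the unary function you pass to for_each.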
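
The key-based idea from the comments can be illustrated on the host. The sketch below is a plain C++ analogue, not Thrust code. It assumes the z-fastest layout data[(x * y_length + y) * z_length + z], so the key (column id) of element i is i / z_length and is computed on demand instead of being stored, which is the role a fancy iterator would play. The function name segmented_sort_by_key is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sort every consecutive run of z_length values independently.
// Assumption: data[(x * y_length + y) * z_length + z], z_length > 0.
std::vector<float> segmented_sort_by_key(const std::vector<float> &data,
                                         std::size_t z_length) {
  // Permutation of element indices; keys are derived from these indices.
  std::vector<std::size_t> order(data.size());
  std::iota(order.begin(), order.end(), std::size_t{0});

  // Pass 1: order all elements by value, ignoring which column they belong to.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return data[a] < data[b]; });

  // Pass 2: a stable sort by key regroups the columns while preserving
  // the value order established inside each column by pass 1.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return a / z_length < b / z_length;
                   });

  // Gather the values in their final order.
  std::vector<float> sorted(data.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    sorted[i] = data[order[i]];
  }
  return sorted;
}
```

In Thrust the two passes would presumably map to thrust::sort_by_key followed by thrust::stable_sort_by_key. As the last comment points out, sorting the whole array twice is likely more work than necessary for columns of only five elements.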

Tags: c++ sorting cuda thrust

【Solution 1】:

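Assuming the data is laid out so that the values within each xy-plane are contiguous in memory, i.e. data[((z * y_length) + y) * x_length + x] (which is also the best layout for coalesced memory access on the GPU):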

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

void sort_in_z_dir(thrust::device_vector<float> &data,
                   int x_length, int y_length) { // z_length == 5
  auto z_stride = x_length * y_length;
  
  thrust::for_each(
      thrust::make_zip_iterator(thrust::make_tuple(
          data.begin(),
          data.begin() + z_stride,
          data.begin() + 2 * z_stride,
          data.begin() + 3 * z_stride,
          data.begin() + 4 * z_stride)),
      thrust::make_zip_iterator(thrust::make_tuple(
          data.begin() + z_stride,
          data.begin() + 2 * z_stride,
          data.begin() + 3 * z_stride,
          data.begin() + 4 * z_stride,
          data.begin() + 5 * z_stride)),
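      // Device lambda: compile with nvcc --extended-lambda. The zip iterator
      // yields a tuple of references, so the tuple itself is taken by value.
      [] __device__ (thrust::tuple<float&, float&, float&, float&, float&> values) {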
        float local_data[5] = {thrust::get<0>(values),
                               thrust::get<1>(values),
                               thrust::get<2>(values),
                               thrust::get<3>(values),
                               thrust::get<4>(values)};
        thrust::sort(thrust::seq, local_data, local_data + 5);
        thrust::get<0>(values) = local_data[0];
        thrust::get<1>(values) = local_data[1];
        thrust::get<2>(values) = local_data[2];
        thrust::get<3>(values) = local_data[3];
        thrust::get<4>(values) = local_data[4];
      });
}
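
The function above expects the plane-major layout described before the listing, whereas the question allocates a nested float*** cube. As a minimal sketch of how the two could be connected, the host-side helper below copies such a cube into one flat buffer in plane-major order. The names plane_major_index and flatten_plane_major are hypothetical and not part of the original answer.

```cpp
#include <cstddef>
#include <vector>

// Flat index for the plane-major layout: all values of one xy-plane
// (fixed z) are contiguous, i.e. data[((z * y_length) + y) * x_length + x].
inline std::size_t plane_major_index(std::size_t x, std::size_t y, std::size_t z,
                                     std::size_t x_length, std::size_t y_length) {
  return (z * y_length + y) * x_length + x;
}

// Copy a nested cube[x][y][z] (allocated as in the question) into a single
// flat host buffer in plane-major order.
std::vector<float> flatten_plane_major(float ***cube, std::size_t x_length,
                                       std::size_t y_length, std::size_t z_length) {
  std::vector<float> flat(x_length * y_length * z_length);
  for (std::size_t x = 0; x < x_length; ++x) {
    for (std::size_t y = 0; y < y_length; ++y) {
      for (std::size_t z = 0; z < z_length; ++z) {
        flat[plane_major_index(x, y, z, x_length, y_length)] = cube[x][y][z];
      }
    }
  }
  return flat;
}
```

The flat buffer could then be moved to the GPU in a single transfer, for example by constructing a thrust::device_vector<float> from flat.begin() and flat.end(), rather than issuing one cudaMemcpy per column as in the question.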

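The sort_in_z_dir solution is admittedly very ugly in that it hard-codes z_length. One could use some C++ template "magic" to turn z_length into a template parameter, but that seemed like overkill for an answer about Thrust.

See Convert std::tuple to std::array C++11 and How to convert std::array to std::tuple? for examples of interfacing between arrays and tuples.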

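The nice thing about this solution is that, up to the sorting algorithm itself, it should be optimal in terms of performance. I don't know whether thrust::sort is optimized for such small input arrays, but you can replace it with any self-written sorting algorithm, as I suggested in the comments.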
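
As an illustration of such a self-written replacement, here is a minimal sketch of a fixed compare-exchange sequence for a small array. It is plain host C++ with hypothetical names (compare_exchange, sort_small). The comparator order is the simple insertion-sort pattern, not a comparator-minimal network, and using it inside the device lambda would additionally require __host__ __device__ annotations, which are not shown here.

```cpp
#include <cstddef>

// Compare-exchange: afterwards a <= b holds.
inline void compare_exchange(float &a, float &b) {
  if (b < a) {
    float tmp = a;
    a = b;
    b = tmp;
  }
}

// Sort a small fixed-size array with a data-independent sequence of
// compare-exchange steps. Because N is a compile-time constant, a compiler
// can fully unroll both loops.
template <std::size_t N>
void sort_small(float (&v)[N]) {
  for (std::size_t i = 1; i < N; ++i) {
    // Move element i down into the already sorted prefix v[0..i-1].
    for (std::size_t j = i; j > 0; --j) {
      compare_exchange(v[j - 1], v[j]);
    }
  }
}
```

Whether this beats thrust::sort with thrust::seq for five elements would have to be measured. The sequence of comparisons does not depend on the data, which tends to suit GPU threads that execute in lockstep.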

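If you want to be able to use different values of z_length without all that hassle, you might prefer the following solution. It sorts in global memory, which is far from optimal, and it feels a bit hacky because it uses Thrust for little more than launching a kernel. Here you want the data laid out the other way around: data[((x * y_length) + y) * z_length + z]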

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>

void sort_in_z_dir_alternative(thrust::device_vector<float> &data,
                               int x_length, int y_length, int z_length) {
  int n_threads = x_length * y_length;
  
  thrust::for_each(
      thrust::make_counting_iterator(0),
      thrust::make_counting_iterator(n_threads),
      // Device lambda: compile with nvcc --extended-lambda.
      [ddata = thrust::raw_pointer_cast(data.data()), z_length] __device__ (int idx) {
        thrust::sort(thrust::seq, 
                     ddata + z_length * idx,
                     ddata + z_length * (idx + 1));
      });
}

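If you are fine with z_length being a template parameter, this might be a solution that combines the best of both worlds (with the data layout of the first example):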

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>

template <int z_length>
void sort_in_z_dir_middle_ground(thrust::device_vector<float> &data,
                                 int x_length, int y_length) {
  int n_threads = x_length * y_length; // == z_stride
  
  thrust::for_each(
      thrust::make_counting_iterator(0),
      thrust::make_counting_iterator(n_threads),
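      // z_length is a template parameter, so it is not captured.
      // Device lambda: compile with nvcc --extended-lambda.
      [ddata = thrust::raw_pointer_cast(data.data()),
       n_threads] __device__ (int idx) {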
        float local_data[z_length];
        #pragma unroll
        for (int i = 0; i < z_length; ++i) {
          local_data[i] = ddata[idx + i * n_threads];
        }
        thrust::sort(thrust::seq, 
                     local_data,
                     local_data + z_length);
        #pragma unroll
        for (int i = 0; i < z_length; ++i) {
          ddata[idx + i * n_threads] = local_data[i];
        }
      });
}
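
To check the output of the templated version, a host-side reference can mirror its gather, sort, scatter pattern. The sketch below is an assumption-laden illustration rather than part of the original answer: sort_strided_reference is a hypothetical name, it operates on a std::vector<float> on the host, and it expects the plane-major layout in which element i of column idx sits at data[idx + i * n_columns].

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host reference for the strided (plane-major) layout:
// column idx consists of data[idx + i * n_columns] for i in [0, z_length).
template <std::size_t z_length>
void sort_strided_reference(std::vector<float> &data, std::size_t n_columns) {
  for (std::size_t idx = 0; idx < n_columns; ++idx) {
    float local_data[z_length];
    // Gather the strided column into a small contiguous buffer.
    for (std::size_t i = 0; i < z_length; ++i) {
      local_data[i] = data[idx + i * n_columns];
    }
    std::sort(local_data, local_data + z_length);
    // Scatter the sorted values back to their strided positions.
    for (std::size_t i = 0; i < z_length; ++i) {
      data[idx + i * n_columns] = local_data[i];
    }
  }
}
```

The GPU version would be instantiated the same way, with the column length as an explicit template argument, e.g. sort_in_z_dir_middle_ground<5>(data, x_length, y_length), and its result could be copied back to the host and compared against this reference.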

【Comments】: