How to Pass Vector of int into CUDA global function [duplicate]: answers

【Question title】: How to Pass Vector of int into CUDA global function [duplicate]
【Posted】: 2022-02-19 01:12:54
【Question】:

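I am writing my first CUDA program and am running into a lot of problems, because my main programming language is not C++.

In my console application I have a vector of int that holds a constant list of numbers. My code should create new vectors and check them for matches against the original constant vector.

I don't know how to pass/copy the pointer of the vector into the GPU device. After I tried converting my code from C# to C++ and using a kernel, I got this error message:

"Error calling a host function("std::vector<int, ::std::allocator > ::vector()") from a global function("MagicSeedCUDA::bigCUDAJob") is not allowed"

Here is part of my code: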

std::vector<int> selectedList;
FillA1(A1, "0152793281263155465283127699107744880041");
selectedList = A1;
bigCUDAJob<< <640, 640, 640>> >(i, j, selectedList);

__global__ void bigCUDAJob(int i, int j, std::vector<int> selectedList)
    {    
        std::vector<int> tempList;
        // here comes code that adds numbers to tempList
        // code to find matches between tempList and the 
        // parameter selectedList 
    }

How can I modify my code so that I no longer get the compiler error? I could also work with an array of int.

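【Comments】:

You can neither use std::vector on the GPU nor simply use normally allocated memory there. If you want to use C++ containers, take a look at the Thrust library. It provides device_vector and universal_vector (and host_vector, but that one is less important). Even then, you do not want to pass the whole vector to the kernel, only the raw CUDA pointer (i.e. thrust::raw_pointer_cast(my_vector.data())). A more elegant solution is to wrap the raw pointer in the span implementation from gsl-lite, which can be used in device code (kernels).
For tempList, even Thrust is not a solution (see here). Thrust vectors give you device memory, but their member functions are still only usable on the host.

Tags: c++ cuda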

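【Solution 1】:

"I don't know how to pass/copy the pointer of the vector into the GPU device"

First, remind yourself how to pass memory that is not in an std::vector to a CUDA kernel. (Re)read the vectorAdd example program, which is part of NVIDIA's CUDA samples. With a vector, the host side looks like this: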

cudaError_t status;
std::vector<int> selectedList;

// ... etc. ...

int *selectedListOnDevice = NULL;
std::size_t selectedListSizeInBytes = sizeof(int) * selectedList.size();
status = cudaMalloc((void **)&selectedListOnDevice, selectedListSizeInBytes);
if (status != cudaSuccess) { /* handle error */ }
status = cudaMemcpy(selectedListOnDevice, selectedList.data(), selectedListSizeInBytes, cudaMemcpyHostToDevice);
if (status != cudaSuccess) { /* handle error */ }

// ... etc. ...

// eventually:
cudaFree(selectedListOnDevice);

This uses the official CUDA runtime API. If, however, you use my CUDA API wrappers (which you are by no means obliged to do), the above becomes:

auto selectedListOnDevice = cuda::memory::make_unique<int[]>(selectedList.size());
cuda::memory::copy(selectedListOnDevice.get(), selectedList.data());

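And you don't need to handle errors yourself: on error, an exception is thrown.

Another option is NVIDIA's thrust library, which provides an std::vector-like class called a "device vector". This lets you write: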

thrust::device_vector<int> selectedListOnDevice = selectedList;

and it should "just work".
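As a minimal sketch of how the device vector then reaches a kernel: the kernel still takes a raw pointer plus a length, obtained with thrust::raw_pointer_cast as suggested in the comments. The kernel name countMatches, the sample data and the value being searched for are made up for illustration and are not from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical kernel: counts how many entries of `list` equal `needle`.
// It receives a raw device pointer and a length, never the vector itself.
__global__ void countMatches(const int* list, int n, int needle, int* count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && list[idx] == needle) {
        atomicAdd(count, 1);
    }
}

int main()
{
    // Illustrative data only.
    std::vector<int> selectedList = {0, 1, 5, 2, 7, 9, 3, 2, 8, 1};

    // Allocates device memory and copies the host data into it;
    // the memory is released when the device_vector goes out of scope.
    thrust::device_vector<int> selectedListOnDevice = selectedList;
    thrust::device_vector<int> countOnDevice(1, 0);

    const int n = static_cast<int>(selectedListOnDevice.size());
    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    countMatches<<<blocks, threadsPerBlock>>>(
        thrust::raw_pointer_cast(selectedListOnDevice.data()), n, /*needle=*/2,
        thrust::raw_pointer_cast(countOnDevice.data()));

    cudaError_t status = cudaGetLastError();           // launch errors
    if (status == cudaSuccess) {
        status = cudaDeviceSynchronize();              // execution errors
    }
    if (status != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // Reading one element through operator[] copies it back to the host.
    const int count = countOnDevice[0];
    std::printf("matches: %d\n", count);
    return 0;
}
```

The device_vector only manages the memory from the host side; inside the kernel you are working with a plain int pointer.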

"I got this error message:"

Error calling a host function("std::vector<int, ::std::allocator >
::vector()") from a global function("MagicSeedCUDA::bigCUDAJob") is
not allowed

As @paleonix mentioned, this issue is covered in Using std::vector in CUDA device code. In short: no matter how you try to write it, you cannot have std::vector appear in your __device__ or __global__ functions.
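One way the kernel from the question could be restructured without std::vector is sketched below. It is only an illustration under assumptions: the parameter list, the fixed per-thread capacity kMaxTemp, the placeholder rule for filling tempList and the sample data are all invented here, since the question does not show that part of the code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kMaxTemp = 8;  // hypothetical fixed capacity replacing std::vector

// Each thread fills a small local array and counts how many of its
// entries also occur in selectedList (passed as raw pointer + length).
__global__ void bigCUDAJob(const int* selectedList, int n,
                           int* matchCounts, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;

    int tempList[kMaxTemp];                 // per-thread storage, no allocation
    for (int k = 0; k < kMaxTemp; ++k) {
        tempList[k] = (tid + k) % 10;       // placeholder for the real logic
    }

    int matches = 0;
    for (int k = 0; k < kMaxTemp; ++k) {
        for (int m = 0; m < n; ++m) {
            if (tempList[k] == selectedList[m]) { ++matches; break; }
        }
    }
    matchCounts[tid] = matches;
}

// Small helper so every runtime call is checked.
static bool ok(cudaError_t status, const char* what)
{
    if (status == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
    return false;
}

int main()
{
    std::vector<int> selectedList = {1, 5, 2, 7, 9};   // illustrative data
    const int n = static_cast<int>(selectedList.size());
    const int numThreads = 256;
    const int threadsPerBlock = 128;
    const int blocks = (numThreads + threadsPerBlock - 1) / threadsPerBlock;

    int* selectedListOnDevice = nullptr;
    int* matchCountsOnDevice = nullptr;
    std::vector<int> matchCounts(numThreads);

    // For brevity, early returns rely on process exit to release device memory.
    if (!ok(cudaMalloc((void**)&selectedListOnDevice, sizeof(int) * n), "cudaMalloc")) return 1;
    if (!ok(cudaMalloc((void**)&matchCountsOnDevice, sizeof(int) * numThreads), "cudaMalloc")) return 1;
    if (!ok(cudaMemcpy(selectedListOnDevice, selectedList.data(), sizeof(int) * n,
                       cudaMemcpyHostToDevice), "cudaMemcpy to device")) return 1;

    bigCUDAJob<<<blocks, threadsPerBlock>>>(selectedListOnDevice, n,
                                            matchCountsOnDevice, numThreads);
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;

    // This blocking copy also waits for the kernel to finish.
    if (!ok(cudaMemcpy(matchCounts.data(), matchCountsOnDevice, sizeof(int) * numThreads,
                       cudaMemcpyDeviceToHost), "cudaMemcpy to host")) return 1;

    std::printf("thread 0 found %d matches\n", matchCounts[0]);

    cudaFree(selectedListOnDevice);
    cudaFree(matchCountsOnDevice);
    return 0;
}
```

A fixed-size local array only works if an upper bound on the length of tempList is known at compile time; otherwise the scratch space would have to be allocated on the host with cudaMalloc and divided among the threads.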

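"I am writing my first CUDA program and am running into a lot of problems, because my main programming language is not C++."

In that case, whatever your specific problem with std::vector is, you should spend some time learning C++ programming. Alternatively, you could brush up on C programming, since you can write CUDA kernels that are C'ish rather than C++'ish; but C++ features are actually very useful when writing kernels, not just on the host side.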

【Comments】: