Keeping host data intact while transferring to CUDA GPU (answers)

【Question title】: Keeping host data intact while transferring to CUDA GPU
【Posted】: 2015-08-05 00:58:29
【Question】:

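So I have a problem that has had me stuck for a while. I'm programming a GT 630 (Kepler version) GPU using NSight Eclipse Edition (CUDA 7.0).

Basically, I have an array of a class (Static_Box) whose data I modify on the host (CPU). I then want to send that data to the GPU for computation, but my code doesn't do that. Here is some of my code: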

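#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for the CUDA_CHECK_RETURN macro from NVIDIA's sample code:
// prints the CUDA error and exits if the call fails.
#define CUDA_CHECK_RETURN(value) do {                                  \
        cudaError_t _err = (value);                                     \
        if (_err != cudaSuccess) {                                      \
            fprintf(stderr, "Error %s at line %d in file %s\n",         \
                    cudaGetErrorString(_err), __LINE__, __FILE__);      \
            exit(1);                                                    \
        }                                                               \
    } while (0)

#define SIZE_OF_BOX_ARRAY 3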

class Edge {
public:
    int x1, y1, x2, y2;
};

class Static_Box {
public:
    Static_Box(int x, int y, int width, int height);
    Edge e1, e2, e3, e4;
};

Static_Box::Static_Box(int x, int y, int width, int height) {
    e1.x1 = x;
    e1.y1 = y;
    e1.x2 = x+width;
    e1.y2 = y;
    // e2.x1 = x+width;  Continuing in this manner (no other calculations)
}

// Storage of the scene. d_* indicates GPU memory
// Static_Box is a class I have defined in another file, it contains a
// few other classes that I wrote as well.
Static_Box *static_boxes;
Static_Box *d_static_boxes;

int main(int argc, char **argv) {
    // Create the host data storage
    static_boxes = (Static_Box*)malloc(SIZE_OF_BOX_ARRAY*sizeof(Static_Box));

    // I then set a few of the indexes of static_boxes here, which is
    // the data I need written while on the CPU.
    // Example:
    // static_boxes[0] = Static_Box(   <- the rest of this example call is cut off in the original post

    // Allocate the memory on the GPU
    // CUDA_CHECK_RETURN is from NVIDIA's bit reverse example (exits the application if the GPU fails)
    CUDA_CHECK_RETURN(cudaMalloc((void**)&d_static_boxes, SIZE_OF_BOX_ARRAY * sizeof(Static_Box)));

    int j = 0;
    for (; j < SIZE_OF_BOX_ARRAY; j++) {
    //  Removed per Mai Longdong's suggestion:
    //    CUDA_CHECK_RETURN(cudaMalloc((void**)&(static_boxes[j]), sizeof(Static_Box)));
        CUDA_CHECK_RETURN(cudaMemcpy(&(d_static_boxes[j]), &(static_boxes[j]), sizeof(Static_Box), cudaMemcpyHostToDevice));
    }
}
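The loop above issues one copy per element, starting at `&d_static_boxes[j]`. Because the array is contiguous, these copies cover exactly the same bytes as one bulk copy of `SIZE_OF_BOX_ARRAY * sizeof(Static_Box)` bytes.

The host-only sketch below illustrates this under a few assumptions:

- `std::memcpy` stands in for `cudaMemcpy(..., cudaMemcpyHostToDevice)`.
- The destination buffer stands in for `d_static_boxes`.
- `copy_per_element` and `copy_bulk` are hypothetical helper names, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Same layout as the question's Edge and Static_Box (public members, constructor omitted).
struct Edge { int x1, y1, x2, y2; };
struct Static_Box { Edge e1, e2, e3, e4; };

// Mirrors the question's loop. std::memcpy stands in for
// cudaMemcpy(..., cudaMemcpyHostToDevice) and dst stands in for d_static_boxes.
void copy_per_element(Static_Box* dst, const Static_Box* src, std::size_t n) {
    for (std::size_t j = 0; j < n; j++) {
        // &dst[j] == dst + j: element j starts j * sizeof(Static_Box) bytes into the buffer.
        std::memcpy(&dst[j], &src[j], sizeof(Static_Box));
    }
}

// The same bytes moved in a single call, as in the accepted answer.
void copy_bulk(Static_Box* dst, const Static_Box* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(Static_Box));
}
```

Both functions leave identical contents in the destination. In practice, the single call is preferable on a GPU because each `cudaMemcpy` has fixed per-call overhead.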

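I've searched here for quite a while and found some useful information from Robert Crovella. Using his hints I made some progress, but the answers he gave don't quite apply to my problem. Does anyone know how to keep host data intact while transferring it to the GPU?

Thanks a lot for your help!

Edit: included the change to the first cudaMalloc from MaiLongdong.

Edit 2: included the second change from Mai Longdong, and provided a complete example.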

【Discussion】:

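Don't use malloc in C++. If you really need dynamic allocation, use new; in this example you don't, so use std::array. Also, your cudaMalloc allocates sizeof(static_boxes) bytes, which is the size of a pointer, not what you want. Finally, the second cudaMalloc stores its result in static_boxes, not d_static_boxes.
OK, got it. Thanks for pointing out sizeof(static_boxes); I've swapped it for SIZE_OF_BOX_ARRAY * sizeof(Static_Box). I just tried changing the second cudaMalloc to use d_static_boxes, but it gives me a SIGBUS: Bus error. I'll now work on copying the data back from the GPU and see how that goes. Thanks for your input @MaiLongdong!
Come to think of it, you can't cudaMalloc into a device pointer. I don't even know why I said that, and it isn't even Monday morning. Remove the second cudaMalloc entirely. Also, maybe you should get a book on C++, since you seem confused about basic semantics.
Unless Static_Box contains pointers (you haven't shown its definition), you're done after the first cudaMalloc. A question whose entire problem description amounts to "I'm having trouble doing this" is very unclear, especially when you haven't provided an MCVE. (I've voted to close this question for lack of an MCVE.) If Static_Box does contain pointers, the code gets somewhat more complicated. Try this.
Putting "solved" in the question title isn't appropriate on SO. Instead, upvote or mark one of the answers as accepted, or post your own answer and accept it. That's the SO way of marking a question as "solved". By the way, I've removed my close vote, since you now provide something resembling an MCVE (although it still has uncompilable junk in it.)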
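The comments note that the copy gets more complicated if Static_Box contains pointers. The reason is that a bytewise copy, which is all `cudaMemcpy` performs, copies the pointer value and not the data it points to. On the GPU, that copied value would still be a host address.

The host-only sketch below shows the difference between a shallow and a deep copy. It rests on these assumptions:

- `Box_With_Edges`, `shallow_copy` and `deep_copy` are hypothetical names.
- `new[]` and `std::memcpy` stand in for `cudaMalloc` and `cudaMemcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Edge { int x1, y1, x2, y2; };

// Hypothetical variant of Static_Box whose edges live in a separately allocated buffer.
struct Box_With_Edges {
    Edge* edges;            // points to num_edges Edge objects stored elsewhere
    std::size_t num_edges;
};

// Bytewise copy (what cudaMemcpy does): the pointer value is duplicated,
// so the copy still refers to the original Edge buffer.
Box_With_Edges shallow_copy(const Box_With_Edges& src) {
    Box_With_Edges dst;
    std::memcpy(&dst, &src, sizeof(Box_With_Edges));
    return dst;
}

// Deep copy: allocate a new buffer for the pointee (cudaMalloc on a GPU),
// copy the edges into it (cudaMemcpy), then patch the pointer in the copy.
// The caller owns dst.edges and must delete[] it.
Box_With_Edges deep_copy(const Box_With_Edges& src) {
    Box_With_Edges dst = shallow_copy(src);
    dst.edges = new Edge[src.num_edges];
    std::memcpy(dst.edges, src.edges, src.num_edges * sizeof(Edge));
    return dst;
}
```

On a real device, the usual order follows the same idea:

1. Allocate and fill the device buffer for the edges.
2. Store that device pointer in a host-side copy of the struct.
3. Copy the patched struct to the device.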

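Tags: c++ cuda

【Solution 1】:

If Static_Box doesn't contain pointers (that is, member data referenced through pointers that must be allocated separately), then copying an array of them is really no different from copying an array of a POD type such as int. This should be all you need: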

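#include <cstdlib>
#include <cuda_runtime.h>

#define SIZE_OF_BOX_ARRAY 3

// Static_Box and CUDA_CHECK_RETURN are assumed to be defined as in the question.
Static_Box *static_boxes;
Static_Box *d_static_boxes;

int main(int argc, char **argv) {

    static_boxes = (Static_Box*)malloc(SIZE_OF_BOX_ARRAY*sizeof(Static_Box));
    // Fill static_boxes on the host here, as in the question.
    CUDA_CHECK_RETURN(cudaMalloc((void**)&d_static_boxes, SIZE_OF_BOX_ARRAY * sizeof(Static_Box)));
    // One copy of the whole contiguous array: element count times element size.
    CUDA_CHECK_RETURN(cudaMemcpy(d_static_boxes, static_boxes, SIZE_OF_BOX_ARRAY*sizeof(Static_Box), cudaMemcpyHostToDevice));

    CUDA_CHECK_RETURN(cudaFree(d_static_boxes));
    free(static_boxes);
    return 0;
}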
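The condition above ("no pointers, so copy it like an int array") can be partly checked at compile time with `std::is_trivially_copyable`. The sketch below assumes a cleaned-up version of the question's classes. Only `e1` is filled in the constructor, as in the question's snippet. The `Has_Pointer` struct is a hypothetical example.

The trait is necessary but not sufficient. A struct with a raw pointer member is still trivially copyable, so the trait cannot detect the "contains pointers" case the answer warns about.

```cpp
#include <cassert>
#include <type_traits>

struct Edge { int x1, y1, x2, y2; };

class Static_Box {
public:
    // A user-provided constructor does not stop the class from being trivially copyable.
    // Only e1 is filled here, matching the question's snippet; the other edges start at zero.
    Static_Box(int x, int y, int width, int /*height*/)
        : e1{x, y, x + width, y}, e2{}, e3{}, e4{} {}
    Edge e1, e2, e3, e4;
};

// Both types can be copied bytewise, so one memcpy/cudaMemcpy of the array is valid.
static_assert(std::is_trivially_copyable<Edge>::value, "Edge must be bytewise copyable");
static_assert(std::is_trivially_copyable<Static_Box>::value, "Static_Box must be bytewise copyable");

// A raw pointer member does NOT change the trait, so this check cannot catch
// the case where a bytewise copy would carry a host address to the GPU.
struct Has_Pointer { Edge* edges; int num_edges; };
static_assert(std::is_trivially_copyable<Has_Pointer>::value, "pointer members are not detected");
```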

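If you think this doesn't work, give a concrete example of what you're doing and exactly what makes you believe it doesn't work (a data mismatch, a CUDA runtime error, etc.). Your example should be complete, so that others can compile it, run it, and see the problem you're reporting. If the code you post in the question doesn't compile, it isn't an MCVE (in my opinion, and this affects how I vote.)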

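【Discussion】:

Wow, it looks like this all traces back to my thinking that the pointer was the actual size of the array. Switching back to the old way of copying (without the for loop) works as you described. Thanks for all your help! I'll mark this as the accepted answer.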
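The root cause was that `sizeof` applied to a pointer gives the size of the pointer itself, not of the array it points to. The host-only sketch below contrasts three byte counts:

- `sizeof` on the pointer, which is wrong for this purpose.
- The element count times `sizeof(Static_Box)`, which is what `cudaMalloc` and `cudaMemcpy` need.
- The `std::array` alternative suggested in the comments, where the element count is part of the type.

`bytes_wrong`, `bytes_right`, `bytes_of` and `Box_Array` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct Edge { int x1, y1, x2, y2; };
struct Static_Box { Edge e1, e2, e3, e4; };

constexpr std::size_t SIZE_OF_BOX_ARRAY = 3;

// Wrong: sizeof on a pointer yields the pointer's own size (typically 8 bytes
// on a 64-bit host), regardless of how many elements it points to.
std::size_t bytes_wrong(const Static_Box* boxes) { return sizeof(boxes); }

// Right: element count times element size, the value to pass to cudaMalloc/cudaMemcpy.
std::size_t bytes_right(std::size_t count) { return count * sizeof(Static_Box); }

// With std::array the element count is part of the type, so it cannot be lost.
using Box_Array = std::array<Static_Box, SIZE_OF_BOX_ARRAY>;
std::size_t bytes_of(const Box_Array& boxes) { return boxes.size() * sizeof(Static_Box); }
```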