【问题标题】:Array of cudaArray for multi-GPU texture code用于多 GPU 纹理代码的 cudaArray 数组
【发布时间】:2018-12-01 00:42:33
【问题描述】:

我有一些代码正在尝试为一般的多 GPU 案例工作,对于 n 数量相等的设备,其中 n 在编译时是未知的。

对于这段代码,我需要将一些数组绑定到纹理内存,并且我需要将完全相同的数据绑定到不同的 GPU。

我用于 3D 纹理绑定的单个 GPU 内存代码如下所示:

cudaArray *d_imagedata = 0;
const cudaExtent extent = make_cudaExtent(geo.nVoxelX, geo.nVoxelY, geo.nVoxelZ);
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMalloc3DArray(&d_imagedata, &channelDesc, extent);
cudaCheckErrors("cudaMalloc3D error 3D tex");

cudaMemcpy3DParms copyParams = { 0 };
copyParams.srcPtr = make_cudaPitchedPtr((void*)img, extent.width*sizeof(float), extent.width, extent.height);
copyParams.dstArray = d_imagedata;
copyParams.extent = extent;
copyParams.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);

cudaCheckErrors("cudaMemcpy3D fail");

// Configure texture options
tex.normalized = false;
tex.filterMode = cudaFilterModePoint; 
tex.addressMode[0] = cudaAddressModeBorder;
tex.addressMode[1] = cudaAddressModeBorder;
tex.addressMode[2] = cudaAddressModeBorder;

cudaBindTextureToArray(tex, d_imagedata, channelDesc);

它的标准副本到cudaArray,然后绑定和设置过程,这里没有什么新鲜事。

要将此代码转换为多 GPU,我知道我不需要更改 tex 全局纹理参考,因为 CUDA 会知道不同的 GPU 有不同的 tex,但我确实需要 n @987654328 @ 实例,每个 GPU 一个。

如何创建(和分配)cudaArrays 的数组?

如果是全局内存指针会更容易,只需在双指针上使用 CPU malloc,然后在每个指针上使用 cudaMalloc 即可,但由于 cudaArray 不是标准类型,我没有不知道如何从中创建一个灵活的数组。

【问题讨论】:

    标签: cuda


    【解决方案1】:

    我建议使用纹理对象,而不是纹理引用。

    使用texture objects,对here 提供的代码进行微不足道的修改似乎对我来说可以正常工作:

    $ cat t341.cu
    #include <helper_cuda.h>
    #include <curand.h>
    #define NUM_TEX 4
    
    const int SizeNoiseTest = 32;
    const int cubeSizeNoiseTest = SizeNoiseTest*SizeNoiseTest*SizeNoiseTest;
    static cudaTextureObject_t texNoise[NUM_TEX];
    
    __global__ void AccesTexture(cudaTextureObject_t my_tex)
    {
            float test = tex3D<float>(my_tex,(float)threadIdx.x,(float)threadIdx.y,(float)threadIdx.z);//by using this the error occurs
            printf("thread: %d,%d,%d, value: %f\n", threadIdx.x, threadIdx.y, threadIdx.z, test);
    }
    
    void CreateTexture()
    {
    
        for (int i = 0; i < NUM_TEX; i++){
            cudaSetDevice(i);
            float *d_NoiseTest;//Device Array with random floats
            cudaMalloc((void **)&d_NoiseTest, cubeSizeNoiseTest*sizeof(float));//Allocation of device Array
            //curand Random Generator (needs compiler link -lcurand)
            curandGenerator_t gen;
            curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT);
            curandSetPseudoRandomGeneratorSeed(gen,1235ULL+i);
            curandGenerateUniform(gen, d_NoiseTest, cubeSizeNoiseTest);//writing data to d_NoiseTest
            curandDestroyGenerator(gen);
    
            //cudaArray Descriptor
            cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
            //cuda Array
            cudaArray *d_cuArr;
            checkCudaErrors(cudaMalloc3DArray(&d_cuArr, &channelDesc, make_cudaExtent(SizeNoiseTest*sizeof(float),SizeNoiseTest,SizeNoiseTest), 0));
            cudaMemcpy3DParms copyParams = {0};
    
    
            //Array creation
            copyParams.srcPtr   = make_cudaPitchedPtr(d_NoiseTest, SizeNoiseTest*sizeof(float), SizeNoiseTest, SizeNoiseTest);
            copyParams.dstArray = d_cuArr;
            copyParams.extent   = make_cudaExtent(SizeNoiseTest,SizeNoiseTest,SizeNoiseTest);
            copyParams.kind     = cudaMemcpyDeviceToDevice;
            checkCudaErrors(cudaMemcpy3D(&copyParams));
            //Array creation End
    
            cudaResourceDesc    texRes;
            memset(&texRes, 0, sizeof(cudaResourceDesc));
            texRes.resType = cudaResourceTypeArray;
            texRes.res.array.array  = d_cuArr;
            cudaTextureDesc     texDescr;
            memset(&texDescr, 0, sizeof(cudaTextureDesc));
            texDescr.normalizedCoords = false;
            texDescr.filterMode = cudaFilterModeLinear;
            texDescr.addressMode[0] = cudaAddressModeClamp;   // clamp
            texDescr.addressMode[1] = cudaAddressModeClamp;
            texDescr.addressMode[2] = cudaAddressModeClamp;
            texDescr.readMode = cudaReadModeElementType;
            checkCudaErrors(cudaCreateTextureObject(&texNoise[i], &texRes, &texDescr, NULL));}
    }
    
    int main(int argc, char **argv)
    {
            CreateTexture();
            cudaSetDevice(0);
            AccesTexture<<<1,dim3(2,2,2)>>>(texNoise[0]);
            cudaSetDevice(1);
            AccesTexture<<<1,dim3(2,2,2)>>>(texNoise[1]);
            cudaSetDevice(2);
            AccesTexture<<<1,dim3(2,2,2)>>>(texNoise[2]);
            checkCudaErrors(cudaPeekAtLastError());
            cudaSetDevice(0);
            checkCudaErrors(cudaDeviceSynchronize());
            cudaSetDevice(1);
            checkCudaErrors(cudaDeviceSynchronize());
            cudaSetDevice(2);
            checkCudaErrors(cudaDeviceSynchronize());
            return 0;
    }
    $ nvcc -arch=sm_30 -I/usr/local/cuda/samples/common/inc -lcurand -o t341 t341.cu
    $ cuda-memcheck ./t341
    ========= CUDA-MEMCHECK
    thread: 0,0,0, value: 0.310691
    thread: 1,0,0, value: 0.627906
    thread: 0,1,0, value: 0.638900
    thread: 1,1,0, value: 0.665186
    thread: 0,0,1, value: 0.167465
    thread: 1,0,1, value: 0.565227
    thread: 0,1,1, value: 0.397606
    thread: 1,1,1, value: 0.503013
    thread: 0,0,0, value: 0.809163
    thread: 1,0,0, value: 0.795669
    thread: 0,1,0, value: 0.808565
    thread: 1,1,0, value: 0.847564
    thread: 0,0,1, value: 0.853998
    thread: 1,0,1, value: 0.688446
    thread: 0,1,1, value: 0.733255
    thread: 1,1,1, value: 0.649379
    thread: 0,0,0, value: 0.040824
    thread: 1,0,0, value: 0.087417
    thread: 0,1,0, value: 0.301392
    thread: 1,1,0, value: 0.298669
    thread: 0,0,1, value: 0.161962
    thread: 1,0,1, value: 0.316443
    thread: 0,1,1, value: 0.452077
    thread: 1,1,1, value: 0.477722
    ========= ERROR SUMMARY: 0 errors
    $
    

    请注意,为简单起见,此CreateTexture() 函数在循环处理期间覆盖先前分配的设备指针,例如d_NoiseTestd_cuArr。这不是非法的或功能性问题,但它增加了内存泄漏的可能性。 (但请参阅下面的示例,了解如何避免这种情况。)

    编辑:根据 cmets 中的一个问题,这些都不应该依赖于编译时间。以下是对上述代码的修改:

    $ cat t342.cu
    #include <helper_cuda.h>
    #include <curand.h>
    
    const int SizeNoiseTest = 32;
    const int cubeSizeNoiseTest = SizeNoiseTest*SizeNoiseTest*SizeNoiseTest;
    
    __global__ void AccesTexture(cudaTextureObject_t my_tex)
    {
            float test = tex3D<float>(my_tex,(float)threadIdx.x,(float)threadIdx.y,(float)threadIdx.z);//by using this the error occurs
            printf("thread: %d,%d,%d, value: %f\n", threadIdx.x, threadIdx.y, threadIdx.z, test);
    }
    
    void CreateTexture(int num, cudaTextureObject_t *texNoise, cudaArray **d_cuArr, float **d_NoiseTest)
    {
    
        for (int i = 0; i < num; i++){
            cudaSetDevice(i);
            cudaMalloc((void **)&d_NoiseTest[i], cubeSizeNoiseTest*sizeof(float));//Allocation of device Array
            //curand Random Generator (needs compiler link -lcurand)
            curandGenerator_t gen;
            curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT);
            curandSetPseudoRandomGeneratorSeed(gen,1235ULL+i);
            curandGenerateUniform(gen, d_NoiseTest[i], cubeSizeNoiseTest);//writing data to d_NoiseTest
            curandDestroyGenerator(gen);
    
            //cudaArray Descriptor
            cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
            //cuda Array
            checkCudaErrors(cudaMalloc3DArray(&d_cuArr[i], &channelDesc, make_cudaExtent(SizeNoiseTest*sizeof(float),SizeNoiseTest,SizeNoiseTest), 0));
            cudaMemcpy3DParms copyParams = {0};
    
    
            //Array creation
            copyParams.srcPtr   = make_cudaPitchedPtr(d_NoiseTest[i], SizeNoiseTest*sizeof(float), SizeNoiseTest, SizeNoiseTest);
            copyParams.dstArray = d_cuArr[i];
            copyParams.extent   = make_cudaExtent(SizeNoiseTest,SizeNoiseTest,SizeNoiseTest);
            copyParams.kind     = cudaMemcpyDeviceToDevice;
            checkCudaErrors(cudaMemcpy3D(&copyParams));
            //Array creation End
    
            cudaResourceDesc    texRes;
            memset(&texRes, 0, sizeof(cudaResourceDesc));
            texRes.resType = cudaResourceTypeArray;
            texRes.res.array.array  = d_cuArr[i];
            cudaTextureDesc     texDescr;
            memset(&texDescr, 0, sizeof(cudaTextureDesc));
            texDescr.normalizedCoords = false;
            texDescr.filterMode = cudaFilterModeLinear;
            texDescr.addressMode[0] = cudaAddressModeClamp;   // clamp
            texDescr.addressMode[1] = cudaAddressModeClamp;
            texDescr.addressMode[2] = cudaAddressModeClamp;
            texDescr.readMode = cudaReadModeElementType;
            checkCudaErrors(cudaCreateTextureObject(&texNoise[i], &texRes, &texDescr, NULL));}
    }
    void FreeTexture(int num, cudaTextureObject_t *texNoise, cudaArray **d_cuArr, float **d_NoiseTest)
    {
       for (int i = 0; i < num; i++){
         cudaFree(d_NoiseTest[i]);
         cudaDestroyTextureObject(texNoise[i]);
         cudaFreeArray(d_cuArr[i]);}
    }
    
    int main(int argc, char **argv)
    {
            int num_dev = 1;
            if (argc > 1) num_dev = atoi(argv[1]);
            cudaTextureObject_t *texNoise = new cudaTextureObject_t[num_dev];
            cudaArray **d_cuArr = new cudaArray*[num_dev];
            float **d_NoiseTest = new float*[num_dev];
            CreateTexture(num_dev, texNoise, d_cuArr, d_NoiseTest);
            for (int i = 0; i < num_dev; i++){
              cudaSetDevice(i);
              AccesTexture<<<1,dim3(2,2,2)>>>(texNoise[i]);}
            checkCudaErrors(cudaPeekAtLastError());
            for (int i = 0; i < num_dev; i++){
              cudaSetDevice(i);
              checkCudaErrors(cudaDeviceSynchronize());}
            FreeTexture(num_dev, texNoise, d_cuArr, d_NoiseTest);
            delete[] d_cuArr;
            delete[] d_NoiseTest;
            delete[] texNoise;
            return 0;
    }
    $ nvcc -I/usr/local/cuda/samples/common/inc -lcurand -o t342 t342.cu
    $ cuda-memcheck ./t342
    ========= CUDA-MEMCHECK
    thread: 0,0,0, value: 0.310691
    thread: 1,0,0, value: 0.627906
    thread: 0,1,0, value: 0.638900
    thread: 1,1,0, value: 0.665186
    thread: 0,0,1, value: 0.167465
    thread: 1,0,1, value: 0.565227
    thread: 0,1,1, value: 0.397606
    thread: 1,1,1, value: 0.503013
    ========= ERROR SUMMARY: 0 errors
    $ cuda-memcheck ./t342 2
    ========= CUDA-MEMCHECK
    thread: 0,0,0, value: 0.310691
    thread: 1,0,0, value: 0.627906
    thread: 0,1,0, value: 0.638900
    thread: 1,1,0, value: 0.665186
    thread: 0,0,1, value: 0.167465
    thread: 1,0,1, value: 0.565227
    thread: 0,1,1, value: 0.397606
    thread: 1,1,1, value: 0.503013
    thread: 0,0,0, value: 0.809163
    thread: 1,0,0, value: 0.795669
    thread: 0,1,0, value: 0.808565
    thread: 1,1,0, value: 0.847564
    thread: 0,0,1, value: 0.853998
    thread: 1,0,1, value: 0.688446
    thread: 0,1,1, value: 0.733255
    thread: 1,1,1, value: 0.649379
    ========= ERROR SUMMARY: 0 errors
    $ cuda-memcheck ./t342 3
    ========= CUDA-MEMCHECK
    thread: 0,0,0, value: 0.310691
    thread: 1,0,0, value: 0.627906
    thread: 0,1,0, value: 0.638900
    thread: 1,1,0, value: 0.665186
    thread: 0,0,1, value: 0.167465
    thread: 1,0,1, value: 0.565227
    thread: 0,1,1, value: 0.397606
    thread: 1,1,1, value: 0.503013
    thread: 0,0,0, value: 0.809163
    thread: 1,0,0, value: 0.795669
    thread: 0,1,0, value: 0.808565
    thread: 1,1,0, value: 0.847564
    thread: 0,0,1, value: 0.853998
    thread: 1,0,1, value: 0.688446
    thread: 0,1,1, value: 0.733255
    thread: 1,1,1, value: 0.649379
    thread: 0,0,0, value: 0.040824
    thread: 1,0,0, value: 0.087417
    thread: 0,1,0, value: 0.301392
    thread: 1,1,0, value: 0.298669
    thread: 0,0,1, value: 0.161962
    thread: 1,0,1, value: 0.316443
    thread: 0,1,1, value: 0.452077
    thread: 1,1,1, value: 0.477722
    ========= ERROR SUMMARY: 0 errors
    $
    

    此代码在具有(至少)3 个 GPU 的系统上运行。我还更新了上面的示例,因此它演示了如何创建指向 cudaArray 类型的指针数组,并演示了如何避免内存泄漏。

    【讨论】:

    • 谢谢,这似乎是要走的路。只是一个问题:使用这个系统,您需要在编译时知道系统上的 GPU 数量吗?您有 4 个NUM_TEX,但在您的测试中只使用了 3 个设备。这是因为您没有安装 4 个设备,因此纹理的第 4 次创建没有发生吗?我只是不确定我是否遗漏了什么
    • 这一切都不应该依赖于编译时间。我在答案中添加了一个变体来证明这一点。
    • 啊,我现在明白了,我对那个全局定义有点困惑。谢谢,一如既往的精彩回答!
    • 嗨罗伯特,只是一个与我原来的问题有某种关联的后续问题。在这里,您不会存储d_cuArr 的每个实例,正如您明确提到的,但为了安全编码,我需要存储这些实例并在不需要时正确释放它们。这再次出现了标题中的问题,我如何创建一个数组以便以后正确释放它们?我知道如何为d_NoiseTest 执行此操作,但不适用于d_cuArr,因为它的cudaArray 类型。
    • 没问题。在某种程度上,这些事情迫使我写出更好的答案。现在,将来不太可能有人说“您知道您在此处发布的这段代码存在内存泄漏......”而且可以说d_NoiseTest 数组可以封装在 create 函数中。它实际上不需要来回传递给main,除非您希望该数据用于其他用途。
    猜你喜欢
    • 2015-06-19
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2014-09-18
    • 2012-02-14
    • 2021-03-19
    • 2014-06-13
    • 1970-01-01
    相关资源
    最近更新 更多