CUDA program not working: answers

【Question title】: Cuda program not working
【Posted】: 2016-04-16 19:34:42
【Question description】:

I am a beginner in CUDA programming. I am trying out a simple program of my own, but it does not work and I don't know what else to try.

My code:

#include <mpi.h>
#include <cuda.h>
#include <stdio.h>
#include <sys/wait.h>
// Prototypes
__global__ void helloWorld(char*);
__device__ int  getGlobalIdx_2D_2D();

// Host function

int main(int argc, char** argv)
{
    unsigned int i, N, gridX, gridY, blockX, blockY;
    N = 4096000;

    char *str = (char *) malloc(N*sizeof(char));
    for(i=0; i < N; i++) str[i]='c';

    MPI_Init (&argc, &argv);

    char *d_str;
    size_t size = (size_t) N*sizeof(char);
    cudaMalloc((void**)&d_str, size);
    cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);

    gridX = 100;
    gridY = 10;
    blockX = blockY = 64;
    dim3 dimGrid(gridX, gridY);  // 4096 chars per block
    dim3 dimBlock(blockX, blockY); // one thread per character, 2D
    printf("dimGrid(%d, %d)\t", gridX, gridY);
    printf("dimBlock(%d, %d)\t", blockX, blockY);
    helloWorld<<< dimGrid, dimBlock >>>(d_str);

    cudaMemcpy(str, d_str, size, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    MPI_Barrier (MPI_COMM_WORLD);

    cudaFree(d_str);

    printf("\nRes:\n");
    for(i = 0; i < N; i++) printf("\t[%u] %c\n", i, str[i]);

    MPI_Finalize ();

    free(str);
    return 0;
}

// Device kernel
__global__ void helloWorld(char* str)
{
    // determine where in the thread grid we are
    int pos = getGlobalIdx_2D_2D();
    if (pos % 2 == 0) str[pos] -= 2;
    else str[pos] += 8;
}

__device__ int getGlobalIdx_2D_2D()
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = blockId * (blockDim.x * blockDim.y) +
                     (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}

The output I want is: jajajajajajaja... x4096000

I have read that the '%' operation is not efficient, but I don't think that is the problem here.

Thanks!

【Comments】:

Ah, the output printed is 'ccccccc...' x4096000, i.e. the initial values, so the char array is never modified.
Just curious, why 4096000 times?
Please put the program's current (incorrect) output into the question text.
I use 4096000 just to reach 4MB. And using cudaDeviceSynchronize() does not solve the problem.
Output: dimGrid(100, 10) dimBlock(64, 64) Res: [0] c [1] c [2] c [3] c [4] c [5] c [6] c [7] c [8] c [9] c [10] c [11] c [12] c [13] c [14] c [15] c [16] c [17] c

Tags: cuda mpi

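【Solution 1】:

You have no CUDA error checking at all, and adding it would really pay off. With it in place, you would find that a block size of 64 x 64 is invalid: it asks for 4096 threads in a single block, which exceeds the limit of 1024 threads per block on current GPUs and is therefore not a valid launch configuration. The kernel never runs, which is why the array comes back unchanged.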
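A minimal sketch of what such error checking could look like, assuming the CUDA runtime API. The helper name `cudaOk` is hypothetical and not taken from the question; the kernel and launch dimensions are copied from it. Since a kernel launch returns no status, the launch error has to be fetched with `cudaGetLastError()`. With the 64 x 64 block, this check should report an invalid configuration instead of failing silently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): print a message when a
// CUDA runtime call fails and tell the caller whether it succeeded.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Same indexing and kernel body as in the question.
__device__ int getGlobalIdx_2D_2D()
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    return blockId * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x + threadIdx.x;
}

__global__ void helloWorld(char *str)
{
    int pos = getGlobalIdx_2D_2D();
    if (pos % 2 == 0) str[pos] -= 2;
    else str[pos] += 8;
}

int main()
{
    const size_t N = 4096000;
    char *d_str = NULL;
    if (!cudaOk(cudaMalloc((void **)&d_str, N), "cudaMalloc")) return 1;

    dim3 dimGrid(100, 10);
    dim3 dimBlock(64, 64);  // 4096 threads per block, as in the question
    helloWorld<<<dimGrid, dimBlock>>>(d_str);

    // Launch errors (such as a bad block size) are picked up here.
    bool launched = cudaOk(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs surface at synchronization.
    bool finished = cudaOk(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(d_str);
    return (launched && finished) ? 0 : 1;
}
```

Wrapping `cudaMalloc`, `cudaMemcpy` and the other runtime calls in the same helper gives every step of the original program a visible failure path.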

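【Comments】:

I took cudaThreadSynchronize from a HelloWorld example, so I had no particular reason for using it. I have started using CUDA error checking, and the problem was indeed that 64x64 threads per block is not a valid configuration, as explained in this thread (stackoverflow.com/questions/16125389/…)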
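For reference, a sketch of one launch configuration that stays within the limit, assuming a device that allows 1024 threads per block (checked at run time through `cudaGetDeviceProperties`). The helper `blocksNeeded`, the extra `n` parameter, and the bounds guard are additions for illustration and not part of the original code. Note that with the kernel as posted, 'c' becomes 'a' at even positions and 'k' at odd ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: smallest block count with count * perBlock >= n.
static unsigned int blocksNeeded(unsigned int n, unsigned int perBlock)
{
    return (n + perBlock - 1) / perBlock;
}

__device__ unsigned int getGlobalIdx_2D_2D()
{
    unsigned int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    return blockId * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x + threadIdx.x;
}

__global__ void helloWorld(char *str, unsigned int n)
{
    unsigned int pos = getGlobalIdx_2D_2D();
    if (pos >= n) return;             // guard threads past the end of the array
    if (pos % 2 == 0) str[pos] -= 2;  // 'c' becomes 'a'
    else str[pos] += 8;               // 'c' becomes 'k'
}

int main()
{
    const unsigned int N = 4096000;
    char *str = (char *)malloc(N);
    if (!str) return 1;
    for (unsigned int i = 0; i < N; i++) str[i] = 'c';

    dim3 dimBlock(32, 32);  // 1024 threads per block
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        dimBlock.x * dimBlock.y > (unsigned int)prop.maxThreadsPerBlock) {
        free(str);
        return 1;
    }

    // 4096000 / 1024 = 4000 blocks, arranged as a 100 x 40 grid.
    unsigned int blocks = blocksNeeded(N, dimBlock.x * dimBlock.y);
    dim3 dimGrid(100, blocksNeeded(blocks, 100));

    char *d_str = NULL;
    int status = 1;
    if (cudaMalloc((void **)&d_str, N) == cudaSuccess) {
        cudaMemcpy(d_str, str, N, cudaMemcpyHostToDevice);
        helloWorld<<<dimGrid, dimBlock>>>(d_str, N);
        // The blocking copy back also waits for the kernel to finish.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(str, d_str, N, cudaMemcpyDeviceToHost) == cudaSuccess) {
            printf("%.8s...\n", str);  // first eight characters only
            status = 0;
        }
        cudaFree(d_str);
    }
    free(str);
    return status;
}
```

The grid is derived from N instead of being hard-coded, so a different array size or block shape only changes the two `blocksNeeded` calls, and the bounds guard covers the case where the grid overshoots.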