Answers to: a question about the details of block-to-SM distribution in CUDA

【Question title】: A question about the details of the distribution of blocks to SMs in CUDA
【Posted】: 2011-11-02 02:44:39
【Question】:

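I will take hardware of compute capability 1.3 as an example.

30 SMs are available, and each SM can hold at most 8 resident blocks, so at most 240 blocks can run at the same time (register and shared memory limits may make the actual number much lower). Blocks beyond the first 240 have to wait for hardware resources to become available.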
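The 240 figure is simply the SM count times the per-SM block limit. As a rough sketch, the same upper bound can be estimated at run time on newer toolkits. This assumes CUDA 6.5 or later for `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; `probeKernel` and `maxResidentBlocks` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel whose residency we want to estimate.
__global__ void probeKernel(volatile int* mark) {
    if (threadIdx.x == 0) mark[blockIdx.x] = 1;
}

// Upper bound on simultaneously resident blocks: SM count times blocks per SM.
int maxResidentBlocks(int smCount, int blocksPerSm) {
    return smCount * blocksPerSm;
}

int main() {
    // The numbers from the question: 30 SMs, 8 resident blocks per SM.
    printf("CC 1.3 example: %d blocks\n", maxResidentBlocks(30, 8));

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device available\n");
        return 0;
    }

    // Takes the register and shared memory use of probeKernel into account,
    // here for a block size of 1 thread and no dynamic shared memory.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, probeKernel, 1, 0);
    printf("This device: %d SMs x %d blocks = %d resident blocks\n",
           prop.multiProcessorCount, blocksPerSm,
           maxResidentBlocks(prop.multiProcessorCount, blocksPerSm));
    return 0;
}
```

The value reported for the current device is only an upper bound for this particular kernel and block size; it says nothing about the order in which waiting blocks are dispatched.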

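My question is when the blocks beyond the first 240 are assigned to SMs. As soon as some of the first 240 blocks finish? Or only when all of the first 240 blocks have finished?

I wrote the following piece of code.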

#include <stdio.h>
#include <cuda_runtime.h>

const int BLOCKNUM = 1024;
const int N = 240;

__global__ void kernel(volatile int* mark) {
    // Block 0 spins until block N has set its flag.
    if (blockIdx.x == 0) while (mark[N] == 0);
    // Every block sets its own flag (block 0 only after its wait ends).
    if (threadIdx.x == 0) mark[blockIdx.x] = 1;
}

int main() {
    int* mark;
    cudaMalloc((void**)&mark, sizeof(int) * BLOCKNUM);
    cudaMemset(mark, 0, sizeof(int) * BLOCKNUM);
    kernel<<<BLOCKNUM, 1>>>(mark);
    cudaFree(mark);  // blocks until the kernel has finished
    return 0;
}

This code deadlocks and never terminates. But if I change N from 240 to 239, the code terminates. So I would like to know some details about how blocks are scheduled.

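【Comments】:

Several answers offer ways to determine the scheduling order and policy. But think carefully before writing code that depends on it. Block scheduling is undefined in the CUDA programming model and may change.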
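One way to avoid depending on block scheduling is to replace the inter-block spin-wait with a kernel boundary: kernels launched into the same stream execute in order, so the second kernel sees everything the first one wrote. The following is a minimal sketch, assuming the work can be split into two phases; `produce`, `consume` and `runTwoPhase` are hypothetical names, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Phase 1: every block publishes its flag; no block waits on another block.
__global__ void produce(int* mark) {
    if (threadIdx.x == 0) mark[blockIdx.x] = 1;
}

// Phase 2: runs only after phase 1 has completed, so all flags are visible.
__global__ void consume(const int* mark, int* missing) {
    if (threadIdx.x == 0 && mark[blockIdx.x] == 0) atomicAdd(missing, 1);
}

// Returns the number of flags that were not set, or -1 on a CUDA error.
int runTwoPhase(int blocks) {
    int *mark = 0, *missing = 0;
    if (cudaMalloc((void**)&mark, sizeof(int) * blocks) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&missing, sizeof(int)) != cudaSuccess) {
        cudaFree(mark);
        return -1;
    }
    cudaMemset(mark, 0, sizeof(int) * blocks);
    cudaMemset(missing, 0, sizeof(int));

    // Both launches go to the default stream: consume starts after produce ends.
    produce<<<blocks, 1>>>(mark);
    consume<<<blocks, 1>>>(mark, missing);

    int hostMissing = -1;
    cudaError_t err = cudaMemcpy(&hostMissing, missing, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(mark);
    cudaFree(missing);
    return err == cudaSuccess ? hostMissing : -1;
}

int main() {
    printf("flags not set: %d\n", runTwoPhase(1024));
    return 0;
}
```

Because no block ever waits for another block inside a kernel, this version terminates regardless of how many blocks the device can keep resident.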

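Tags: gpgpu nvidia gpu cuda

【Answer 1】:

On GT200, microbenchmarks have shown that new blocks are scheduled whenever an SM has retired all of its currently active blocks. So the answer is: when some blocks finish, with a scheduling granularity at the SM level. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.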
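To check a claim like this on a particular GPU, it helps to record which SM each block ran on and where it fell in the global start order. The sketch below is an illustration under stated assumptions, not the microbenchmark referred to above: it reads the PTX special register `%smid` through inline assembly, uses `clock64()` (compute capability 2.0 or later) for a busy-wait, and all kernel, function and variable names are made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the PTX special register %smid: the SM the calling thread runs on.
__device__ unsigned int currentSm() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void traceBlocks(unsigned int* ticket, unsigned int* sm,
                            unsigned int* order) {
    if (threadIdx.x == 0) {
        order[blockIdx.x] = atomicAdd(ticket, 1u);  // position in start order
        sm[blockIdx.x] = currentSm();
        // Busy-wait for a while so that blocks stay resident long enough
        // to force later blocks to queue.
        long long t0 = clock64();
        while (clock64() - t0 < 1000000) { }
    }
}

// Host helper: how many blocks ran on the given SM.
int countBlocksOnSm(const unsigned int* sm, int n, unsigned int target) {
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (sm[i] == target) ++count;
    return count;
}

int main() {
    const int blocks = 64;  // raise this above the device's resident capacity
    unsigned int *dTicket, *dSm, *dOrder;
    cudaMalloc((void**)&dTicket, sizeof(unsigned int));
    cudaMalloc((void**)&dSm, blocks * sizeof(unsigned int));
    cudaMalloc((void**)&dOrder, blocks * sizeof(unsigned int));
    cudaMemset(dTicket, 0, sizeof(unsigned int));

    traceBlocks<<<blocks, 32>>>(dTicket, dSm, dOrder);

    unsigned int sm[blocks], order[blocks];
    cudaError_t e1 = cudaMemcpy(sm, dSm, sizeof(sm), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(order, dOrder, sizeof(order), cudaMemcpyDeviceToHost);
    if (e1 != cudaSuccess || e2 != cudaSuccess) {
        printf("CUDA error\n");
        return 1;
    }
    for (int b = 0; b < blocks; ++b)
        printf("block %d: SM %u, start order %u\n", b, sm[b], order[b]);
    printf("blocks that ran on SM 0: %d\n", countBlocksOnSm(sm, blocks, 0));

    cudaFree(dTicket);
    cudaFree(dSm);
    cudaFree(dOrder);
    return 0;
}
```

If late-starting blocks arrive on an SM in groups, that would be consistent with SM-level granularity; if they trickle in one at a time, that would suggest block-level replacement. Either way, the result describes only the device and driver it was measured on.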

【Comments】:

Is there a reference for this detail? Or a way to demonstrate it?

【Answer 2】:

I could not find anything on this for compute capability ...

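The Fermi architecture introduced a new block scheduler called the GigaThread engine.
GigaThread can replace a block on an SM as soon as that block finishes executing, and it also enables concurrent kernel execution.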

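【Comments】:

【Answer 3】:

Although there is no official answer to this, you can use atomic operations to measure when your blocks start working and when they finish.

Try the following code: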

#include <stdio.h>

const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}


__global__ void myKernel(int *control, int *output) {
        if (threadIdx.x==1) {
                //register that we enter
                int enter=atomicAdd(control,1);
                output[blockIdx.x]=enter;

                //some intensive and long task
                int &var=output[blockIdx.x+gridDim.x]; //var references global memory
                var=1;
                for (int i=0; i<12345678; ++i) {
                        var+=1+tanhf(var);
                }

                //register that we quit
                var=atomicAdd(control,1);
        }
}


int main() {

        int *gpuControl;
        cudaMalloc((void**)&gpuControl, sizeof(int));
        int cpuControl=0;
        cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);


        int *gpuOutput;
        cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
        int cpuOutput[maxBlocks*2];

        for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
                cpuOutput[i]=-1;

        // play with these values
        const int thr=479;
        const int p=13;
        const int q=maxBlocks;

        //I found that this may actually affect the scheduler! Try with and without this call.
        emptyKernel<<<p,thr>>>();

        cudaEvent_t timerStart;
        cudaEvent_t timerStop;
        cudaEventCreate(&timerStart);
        cudaEventCreate(&timerStop);

        cudaDeviceSynchronize();

        cudaEventRecord(timerStart,0);

        myKernel<<<q,512>>>(gpuControl, gpuOutput);

        cudaEventRecord(timerStop,0);
        cudaEventSynchronize(timerStop);

        cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);

        cudaDeviceSynchronize();
        float thisTime;
        cudaEventElapsedTime(&thisTime,timerStart,timerStop);

        cudaEventDestroy(timerStart);
        cudaEventDestroy(timerStop);
        printf("Elapsed time: %f\n",thisTime);

        for (int i=0; i<q; ++i)
                printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}

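For each block, the output shows the block ID followed by its enter "time" and its exit "time". These are values of a shared atomic counter rather than wall-clock times, so what they tell you is the order in which the events occurred.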
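The enter and exit tickets can be summarized on the host. Since every enter and every exit draws a unique value from one counter, the smallest exit ticket equals the number of blocks that had started before any block finished. The helper below is a sketch under the assumption that all blocks completed and wrote both tickets; `firstWaveSize` and `lateBlocks` are hypothetical names, and the code is host-only, so it builds with nvcc or any C++ compiler.

```cuda
#include <cstdio>

// The smallest exit ticket is the number of blocks that entered before the
// first block exited, i.e. the size of the first wave of resident blocks.
int firstWaveSize(const int* exitTicket, int n) {
    int smallest = exitTicket[0];
    for (int i = 1; i < n; ++i)
        if (exitTicket[i] < smallest) smallest = exitTicket[i];
    return smallest;
}

// Counts the blocks that entered only after some other block had exited.
int lateBlocks(const int* enterTicket, const int* exitTicket, int n) {
    int first = firstWaveSize(exitTicket, n);
    int late = 0;
    for (int i = 0; i < n; ++i)
        if (enterTicket[i] > first) ++late;
    return late;
}

int main() {
    // Made-up tickets for 4 blocks on a device that holds 3 at a time:
    // blocks 0..2 enter (0, 1, 2), block 0 exits (3), block 3 enters (4), ...
    int enterTicket[4] = {0, 1, 2, 4};
    int exitTicket[4]  = {3, 5, 6, 7};
    printf("first wave: %d, late blocks: %d\n",
           firstWaveSize(exitTicket, 4),
           lateBlocks(enterTicket, exitTicket, 4));
    // prints "first wave: 3, late blocks: 1"
    return 0;
}
```

With the program above, `cpuOutput` holds the enter tickets in its first `q` entries and the exit tickets in the next `q`, so the two pointers to pass would be `cpuOutput` and `cpuOutput + q`.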

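【Comments】:

【Answer 4】:

On Fermi, I am sure that a block is scheduled onto an SM as soon as there is room for it. That is, whenever an SM finishes executing a block, it starts executing another one if any blocks are left. (The actual order, however, is nondeterministic.)

On older hardware, I don't know. But you can verify it with the built-in clock() function.

For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):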

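   __kernel void test(__global uint* start, __global uint* end, __global float* buffer)
   {
       int id = get_global_id(0);
       // clock() stands for a device cycle counter: CUDA provides one,
       // core OpenCL C does not define it
       start[id] = clock();
       /* the work to be timed goes here */
       end[id] = clock();
   }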

Then dump the output to a file and plot it. The scheduling pattern becomes visible in the graph.
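A possible CUDA conversion of this idea, including the step that dumps the timestamps to a file, might look like the sketch below. The workload loop, the file name `timeline.csv` and the helper `writeTimeline` are assumptions made for illustration. Note also that `clock()` reads a per-multiprocessor counter, so values taken on different SMs are not guaranteed to share a common origin.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void test(unsigned int* start, unsigned int* end, float* buffer) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    start[id] = (unsigned int)clock();
    // Placeholder workload (an assumption, replace with the code to be timed).
    float x = buffer[id];
    for (int i = 0; i < 100000; ++i) x = x * 0.999f + 1.0f;
    buffer[id] = x;  // store the result so the loop is not optimized away
    end[id] = (unsigned int)clock();
}

// Writes one "block,start,end" row per block, taken from thread 0 of each
// block. Returns the number of rows written, or -1 if the file cannot be opened.
int writeTimeline(const char* path, const unsigned int* start,
                  const unsigned int* end, int blocks, int threads) {
    FILE* f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "block,start,end\n");
    for (int b = 0; b < blocks; ++b)
        fprintf(f, "%d,%u,%u\n", b, start[b * threads], end[b * threads]);
    fclose(f);
    return blocks;
}

int main() {
    const int blocks = 256, threads = 32, n = blocks * threads;
    unsigned int *dStart, *dEnd;
    float* dBuf;
    cudaMalloc((void**)&dStart, n * sizeof(unsigned int));
    cudaMalloc((void**)&dEnd, n * sizeof(unsigned int));
    cudaMalloc((void**)&dBuf, n * sizeof(float));
    cudaMemset(dBuf, 0, n * sizeof(float));

    test<<<blocks, threads>>>(dStart, dEnd, dBuf);

    std::vector<unsigned int> hStart(n), hEnd(n);
    cudaError_t e1 = cudaMemcpy(hStart.data(), dStart, n * sizeof(unsigned int),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hEnd.data(), dEnd, n * sizeof(unsigned int),
                                cudaMemcpyDeviceToHost);
    if (e1 != cudaSuccess || e2 != cudaSuccess) {
        printf("CUDA error\n");
        return 1;
    }
    writeTimeline("timeline.csv", hStart.data(), hEnd.data(), blocks, threads);

    cudaFree(dStart);
    cudaFree(dEnd);
    cudaFree(dBuf);
    return 0;
}
```

Plotting each row of the CSV as a horizontal bar from `start` to `end` gives the kind of picture described above. Because of the per-SM counter, comparisons are most meaningful between blocks that ran on the same SM.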

【Comments】: