CUDA reduction - race condition? Answer

【Question title】: CUDA reduction - race condition?
【Posted】: 2018-06-14 15:55:12
【Question】:

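Please consider the following code, which I obtained from a tutorial, and the accompanying explanatory picture. Its purpose is to demonstrate parallel reduction in CUDA.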

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cstdio>   // getchar
#include <iostream>
#include <numeric>
using namespace std;

__global__ void sumSingleBlock(int* d)
{
  int tid = threadIdx.x;

  // Number of participating threads (tc) halves on each iteration
  for (int tc = blockDim.x, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
  {
    // Thread must be allowed to write
    if (tid < tc)
    {
      // We need to do A + B, where B is the element following A, so first we 
      // need to find the position of element A and of element B      
      int posA = tid * stepSize * 2;
      int posB = posA + stepSize;

      // Update the value at posA by adding the value at posB to it
      d[posA] += d[posB];
    }
  }
}

int main()
{
  cudaError_t status;

  const int count = 8;
  const int size = count * sizeof(int);
  int* h = new int[count];
  for (int i = 0; i < count; ++i)
    h[i] = i+1;

  int* d;
  status = cudaMalloc(&d, size);

  status = cudaMemcpy(d,h,size, cudaMemcpyHostToDevice);

  sumSingleBlock<<<1,count/2>>>(d);

  int result;
  status = cudaMemcpy(&result,d,sizeof(int),cudaMemcpyDeviceToHost);

  cout << "Sum is " << result << endl;

  getchar();

  cudaFree(d);
  delete [] h;

  return 0;
}

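Now, I can understand the general principle of reduction outlined in the picture. What I don't get is how there isn't a race condition in the additions (*):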

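It's clear that all four threads run the loop the same number of times; they only do useful work while tid < tc. Thread #0 adds 1 and 2 and stores the result in element 0. Its second iteration then reads element 2. Meanwhile, thread #1's first iteration adds 3 and 4 and stores the result in element 2.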

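What if thread #0 begins iteration 2 before thread #1 has completed iteration 1? Then thread #0 could read 3 instead of 7, or possibly a torn value (?). There is no synchronization here, so is the code simply wrong?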
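The ordering described above can be replayed by hand on the host. The following is a minimal C++ sketch, not CUDA code: `reduceStep` and `sumWithBadInterleaving` are hypothetical names introduced for illustration, and the schedule is just one ordering the unsynchronized kernel would permit, not one the GPU is guaranteed to produce. Iterations are numbered from 1 to match the text.

```cpp
#include <vector>

// Work done by one thread (tid) in one loop iteration (1-based) of
// sumSingleBlock, for a block of blockThreads threads.
void reduceStep(std::vector<int>& d, int tid, int iteration, int blockThreads)
{
  int tc = blockThreads >> (iteration - 1);  // participating threads
  int stepSize = 1 << (iteration - 1);       // distance between A and B
  if (tid < tc)
  {
    int posA = tid * stepSize * 2;
    int posB = posA + stepSize;
    d[posA] += d[posB];
  }
}

// Replays one hypothetical bad ordering: thread 0 finishes iterations 1 and 2
// before thread 1 has run iteration 1.
int sumWithBadInterleaving()
{
  std::vector<int> d = {1, 2, 3, 4, 5, 6, 7, 8};
  const int threads = 4;

  reduceStep(d, 0, 1, threads);  // d[0] = 1 + 2 = 3
  reduceStep(d, 0, 2, threads);  // d[0] += d[2], but d[2] is still 3, not 7
  reduceStep(d, 1, 1, threads);  // d[2] = 3 + 4 = 7, too late for thread 0
  reduceStep(d, 2, 1, threads);  // d[4] = 5 + 6 = 11
  reduceStep(d, 3, 1, threads);  // d[6] = 7 + 8 = 15
  reduceStep(d, 1, 2, threads);  // d[4] = 11 + 15 = 26
  reduceStep(d, 0, 3, threads);  // d[0] = 6 + 26 = 32
  return d[0];
}
```

Under this ordering `sumWithBadInterleaving()` returns 32 rather than the correct 36: the value 4 from element 3 never reaches element 0, because thread #0 read element 2 before thread #1 had updated it.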

(*) Note: I am not certain that there is a race condition; I fully trusted the tutorial to contain safe, correct code.

【Comments】:

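Yes, the code is wrong. Synchronization is needed in the general case. Even if you wanted to posit warp-synchronous behavior, the code as provided is broken. You should probably use a better tutorial. NVIDIA has a reduction sample code, and there is also a tutorial by Mark Harris. However, this version of the tutorial is missing the volatile treatment of the warp-synchronous part.
thrust::reduce
Thank you, Robert Crovella; perhaps you could post that as an answer?
Curious who downvoted this, too...
I should say that I have since done some digging and read all of the comments posted on the tutorial, and someone else had spotted this; the author replied that a __syncthreads(); call should be added after the if statement in the kernel.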

Tags: c++ cuda

【Answer 1】:

The code is wrong and needs a __syncthreads() call, as shown below.

__global__ void sumSingleBlock(int* d)
{
  int tid = threadIdx.x;

  // Number of participating threads (tc) halves on each iteration
  for (int tc = blockDim.x, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
  {
    // Thread must be allowed to write
    if (tid < tc)
    {
      // We need to do A + B, where B is the element following A, so first we 
      // need to find the position of element A and of element B      
      int posA = tid * stepSize * 2;
      int posB = posA + stepSize;

      // Update the value at posA by adding the value at posB to it
      d[posA] += d[posB];
    }
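    // Barrier outside the if: every thread in the block must reach it, and no
    // thread starts the next iteration until all writes from this one are done
    __syncthreads();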
  }
}
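
To see why the barrier is sufficient, the synchronized kernel can be modelled on the host. This is a minimal C++ sketch rather than CUDA code, and `sumWithBarrier` is a hypothetical name introduced for illustration. It assumes that the number of threads is a power of two and that the array holds twice as many elements as there are threads, as in the question. The inner loop runs every thread's work for one iteration before the outer loop advances, which is the ordering __syncthreads() enforces within a block.

```cpp
#include <vector>

// Host-side model of the synchronized kernel. d must hold 2 * blockThreads
// elements, and blockThreads must be a power of two.
int sumWithBarrier(std::vector<int> d, int blockThreads)
{
  for (int tc = blockThreads, stepSize = 1; tc > 0; tc >>= 1, stepSize <<= 1)
  {
    // Every thread finishes this iteration before any thread starts the
    // next one; this stands in for the __syncthreads() barrier.
    for (int tid = 0; tid < blockThreads; ++tid)
    {
      if (tid < tc)
      {
        int posA = tid * stepSize * 2;
        int posB = posA + stepSize;
        d[posA] += d[posB];
      }
    }
  }
  return d[0];
}
```

Within a single iteration each thread touches only its own posA and posB, so the order of threads inside the inner loop does not matter; only the ordering between iterations does. For the input 1..8 with 4 threads the function returns 36.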

【Discussion】: