推力执行策略将内核发布到默认流答案

【问题标题】：Thrust execution policy issues kernel to default stream推力执行策略将内核发布到默认流
【发布时间】：2015-06-13 11:07:19
【问题描述】：

我目前正在设计一个简短的教程，展示 Thrust 模板库的各个方面和功能。

不幸的是，我为了展示如何使用 cuda 流使用复制/计算并发而编写的代码中似乎存在问题。

我的代码可以在这里找到，在 asynchronousLaunch 目录中： https://github.com/gnthibault/Cuda_Thrust_Introduction/tree/master/AsynchronousLaunch

这是产生问题的代码摘要：

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
#include <functional>

//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

//Cuda
#include <cuda_runtime.h>

//Local
#include "AsynchronousLaunch.cu.h"

int main( int argc, char* argv[] )
{
    const size_t fullSize = 1024*1024*64;
    const size_t halfSize = fullSize/2;

    //Declare one host std::vector and initialize it with random values
    std::vector<float> hostVector( fullSize );
    std::generate(hostVector.begin(), hostVector.end(), normalRandomFunctor<float>(0.f,1.f) );

    //And two device vector of Half size
    thrust::device_vector<float> deviceVector0( halfSize );
    thrust::device_vector<float> deviceVector1( halfSize );

    //Declare  and initialize also two cuda stream
    cudaStream_t stream0, stream1;
    cudaStreamCreate( &stream0 );
    cudaStreamCreate( &stream1 );

    //Now, we would like to perform an alternate scheme copy/compute
    for( int i = 0; i < 10; i++ )
    {
        //Wait for the end of the copy to host before starting to copy back to device
        cudaStreamSynchronize(stream0);
        //Warning: thrust::copy does not handle asynchronous behaviour for host/device copy, you must use cudaMemcpyAsync to do so
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector0.data()), thrust::raw_pointer_cast(hostVector.data()), halfSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
        cudaStreamSynchronize(stream1);
        //second copy is most likely to occur sequentially after the first one
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector1.data()), thrust::raw_pointer_cast(hostVector.data())+halfSize, halfSize*sizeof(float), cudaMemcpyHostToDevice, stream1);

        //Compute on device, here inclusive scan, for histogram equalization for instance
        thrust::transform( thrust::cuda::par.on(stream0), deviceVector0.begin(), deviceVector0.end(), deviceVector0.begin(), computeFunctor<float>() );
        thrust::transform( thrust::cuda::par.on(stream1), deviceVector1.begin(), deviceVector1.end(), deviceVector1.begin(), computeFunctor<float>() );

        //Copy back to host
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data()), thrust::raw_pointer_cast(deviceVector0.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data())+halfSize, thrust::raw_pointer_cast(deviceVector1.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    }

    //Full Synchronize before exit
    cudaDeviceSynchronize();

    cudaStreamDestroy( stream0 );
    cudaStreamDestroy( stream1 );

    return EXIT_SUCCESS;
}

这是一个程序实例的结果，通过 nvidia 可视化配置文件观察到：

如你所见，cudamemcopy（棕色）都发送到流 13 和 14，但 Thrust 从推力::transform 生成的内核被发送到默认流（捕获中的蓝色）

顺便说一句，我使用的是 cuda 工具包版本 7.0.28，带有 GTX680 和 gcc 4.8.2。

如果有人能告诉我我的代码有什么问题，我将不胜感激。

提前谢谢你

编辑：这是我认为作为解决方案的代码：

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <functional>
#include <vector>


//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>


//Cuda
#include <cuda_runtime.h>

//Local definitions

template<typename T>
struct computeFunctor
{
    __host__ __device__
    computeFunctor() {}

    __host__ __device__
    T operator()( T in )
    {
        //Naive functor that generates expensive but useless instructions
        T a =  cos(in);
        for(int i = 0; i < 350; i++ )
        {
            a+=cos(in);
        }
        return a;
    }
};

int main( int argc, char* argv[] )
{
    const size_t fullSize =  1024*1024*2;
    const size_t nbOfStrip = 4;
    const size_t stripSize =  fullSize/nbOfStrip;

    //Allocate host pinned memory in order to use asynchronous api and initialize it with random values
    float* hostVector;
    cudaMallocHost(&hostVector,fullSize*sizeof(float));
    std::fill(hostVector, hostVector+fullSize, 1.0f );

    //And one device vector of the same size
    thrust::device_vector<float> deviceVector( fullSize );

    //Declare  and initialize also two cuda stream
    std::vector<cudaStream_t> vStream(nbOfStrip);
    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamCreate( &(*it) );
    }

    //Now, we would like to perform an alternate scheme copy/compute in a loop using the copyToDevice/Compute/CopyToHost for each stream scheme:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            cudaStreamSynchronize(vStream.at(j));
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On devices that do not possess multiple queues copy engine capability, this solution serializes all command even if they have been issued to different streams
    //Why ? Because in the point of view of the copy engine, which is a single ressource in this case, there is a time dependency between HtoD(n) and DtoH(n) which is ok, but there is also
    // a false dependency between DtoH(n) and HtoD(n+1), that preclude any copy/compute overlap

    //Full Synchronize before testing second solution
    cudaDeviceSynchronize();

    //Now, we would like to perform an alternate scheme copy/compute in a loop using the copyToDevice for each stream /Compute for each stream /CopyToHost for each stream scheme:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            cudaStreamSynchronize(vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );

        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On device that do not possess multiple queues in the copy engine, this solution yield better results, on other, it should show nearly identic results

    //Full Synchronize before exit
    cudaDeviceSynchronize();

    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamDestroy( *it );
    }
    cudaFreeHost( hostVector );

    return EXIT_SUCCESS;
}

使用 nvcc ./test.cu -o ./test.exe -std=c++11 编译

【问题讨论】：

链接副本中有一个工作示例
我可能还不够清楚，对我不起作用的不是副本，正如您所见，在我从 nvvp 捕获的数据流中，副本已发布到流 13 和 14。问题是内核（捕获中的蓝色）被发布到默认流，我测试了你从假设的副本链接的代码，它产生了完全相同的问题：内核被发布到默认流。所以我不认为它是重复的
事实上，即使是官方的推力示例也会在我的机器上产生同样的问题：请参阅github.com/thrust/thrust/blob/master/examples/cuda/… 我想知道这是否可能是由于我的编译/执行环境或设置...
CUDA 7 附带的推力版本有一个 issue 与向 CUDA 流发布推力内核相关联。你可能会遇到这个问题。为了帮助确认这一点，我建议 1. 将推力版本更新为 latest development version，或 2. 出于测试目的恢复到 CUDA 6.5。如果其中任何一个“解决”了问题，那么您很可能遇到了该问题。
谢谢 Robert Crovella，我下载了当前的 master (1.8.2)，现在可以使用了。

标签： concurrency cuda thrust cuda-streams

【解决方案1】：

我要指出两点。这两个都（现在）在this related question/answer 中引用，您不妨参考一下。

在这种情况下，将底层内核发布到非默认流的推力失败似乎与 this issue 有关。它可以通过更新到the latest thrust version 来纠正（如问题的 cmets 所述）。未来的 CUDA 版本（超过 7 版）可能也会包括固定推力。这可能是本问题讨论的核心问题。
这个问题似乎也表明目标之一是复制和计算的重叠：
```
in order to show how to use copy/compute concurrency using cuda streams
```
但我认为，即使上面的第 1 项已修复，使用当前编写的代码也是无法实现的。复制与计算操作的重叠需要在复制操作中正确使用 cuda 流 (cudaMemcpyAsync) as well as a pinned host allocation。问题中提出的代码没有使用固定主机分配（std::vector 默认情况下不使用固定分配器，AFAIK），因此我不希望 cudaMemcpyAsync 操作与任何内核活动重叠，即使如果可能的话。为了解决这个问题，应该使用固定分配器，here 就是一个这样的例子。

为了完整起见，该问题缺少MCVE，即expected for questions of this type。这使得其他人更难以尝试测试您的问题，并且显然是 SO 的一个密切原因。是的，您提供了指向外部 github 存储库的链接，但这种行为是不受欢迎的。 MCVE 要求明确指出，必要的部分应包含在问题本身中（而不是外部参考。）由于唯一缺少的部分 AFAICT 是“AsynchronousLaunch.cu.h”，因此看起来相对简单在您的问题中包括这一附加部分。外部链接的问题在于，当它们在未来中断时，这个问题对未来的读者变得不那么有用了。（而且，在我看来，强迫他人浏览外部 github 存储库以查找特定文件不利于获得帮助。）

【讨论】：

我注意到这句话，将来我会在我的 SO 问题中嵌入一个适当的 MCVE。关于复制/计算重叠的问题，我正在调试我的代码，确实缺少固定内存是我稍后整理的一个问题。我在 github 上更正了我的代码，它现在确实表现出计算/复制行为。感谢您的帮助和建设性意见