How to use constant memory with cuBLAS? (Answers)

【Question title】: How to use constant memory with cuBLAS?
【Posted】: 2013-06-10 11:06:17
【Question】:

When I use cublasIsamax with memory from the regular allocator, it works fine.

int FindMaxIndex( const float* pVector, const size_t length )
{
    int result = 0;
    float* pDevVector = nullptr;

    // cudaMalloc and cudaMemcpy return cudaError_t, so compare against cudaSuccess.
    if( cudaSuccess != ::cudaMalloc( (void**)&pDevVector, length * sizeof(float) ) )
    {
        return -1;
    }
    if( cudaSuccess != ::cudaMemcpy( pDevVector, pVector, length * sizeof(float), cudaMemcpyHostToDevice ) )
    {
        ::cudaFree( pDevVector ); // do not leak the buffer on failure
        return -2;
    }
    // cublasIsamax writes a 1-based index to the host variable (default pointer mode).
    ::cublasIsamax_v2( g_handle, (int)length, pDevVector, 1, &result );

    ::cudaFree( pDevVector );
    return result;
}

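But if I try to use constant memory, it fails with unknown error N14. What is wrong? The copy to constant memory succeeds, but the cuBLAS call fails.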

__constant__ float c_pIndex[ 255 ] = {0x00};

// the same function as GetIsMax but using CUBLAS function cublasIsamax_v2
int FindMaxIndexConst( const float* pVector, const size_t length, pfnMsg fnMsg )
{
    int result = 0;
    cudaError_t code = ::cudaMemcpyToSymbol( c_pIndex, pVector, length * sizeof(float), 0, cudaMemcpyHostToDevice );

    if( cudaSuccess != code )
    {
        const char* szMsg = ::cudaGetErrorString ( code );

        LogError3( L"[%d] [%hs] Could not allocate CUDA memory: %I64d pDevA", code, szMsg, (__int64)(length * sizeof(float)));
    }
    cublasStatus_t  status = ::cublasIsamax_v2( g_handle, length, c_pIndex, 1, &result);

    if( CUBLAS_STATUS_SUCCESS != status )
    {
        LogError2( L" [%d] Failed to execute <cublasIsamax_v2> : %I64d", status, (__int64)length );
    }

    return result;
}

【Comments】:

You can't pass constant memory like that.

Tags: cuda cublas

【Solution 1】:

Why not allocate a regular device array and pass that to CUBLAS?
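A minimal sketch of that suggestion, for reference. The function name `FindMaxIndexDevice` and the choice to pass the handle as a parameter are assumptions made for illustration; the question uses a global `g_handle` instead. It also converts the 1-based index that cuBLAS reports into a 0-based one, which may or may not be what you want.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: returns the 0-based index of the element with the
// largest absolute value, or a negative code on failure.
// `handle` is assumed to be a valid handle created with cublasCreate().
int FindMaxIndexDevice(cublasHandle_t handle, const float* hostVec, std::size_t length)
{
    if (length == 0) return -1;

    // Ordinary global memory: the pointer is valid for cuBLAS kernels.
    float* devVec = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&devVec), length * sizeof(float)) != cudaSuccess)
        return -1;

    int result = 0;  // receives a 1-based index (default host pointer mode)
    int rc = 0;
    if (cudaMemcpy(devVec, hostVec, length * sizeof(float),
                   cudaMemcpyHostToDevice) != cudaSuccess)
        rc = -2;
    else if (cublasIsamax(handle, static_cast<int>(length), devVec, 1, &result)
             != CUBLAS_STATUS_SUCCESS)
        rc = -3;

    cudaFree(devVec);  // released on every path after a successful cudaMalloc
    return rc != 0 ? rc : result - 1;
}
```

Allocating and freeing on every call keeps the sketch short; if the function is called often, reusing one device buffer is likely the better design.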

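A __constant__ array is not an ordinary __device__ array. In your code, you take the address of the array in host code and pass it to a host function. The host-side address of the array is not valid on the device, and vice versa, as described in the CUDA Programming Guide: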

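The address obtained by taking the address of a __device__, __shared__ or __constant__ variable can only be used in device code. The address of a __device__ or __constant__ variable obtained through cudaGetSymbolAddress() as described in Device Memory can only be used in host code.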
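To make the quoted rule concrete, here is an untested sketch of obtaining a usable device pointer for the symbol with cudaGetSymbolAddress() and handing that pointer to cuBLAS instead of the bare symbol. The function name `FindMaxIndexViaSymbol` and its error codes are hypothetical. I would expect this to run, since the pointer refers to real device memory, but the reads would go through the ordinary load path rather than the constant cache, so it does not make constant memory a good fit here.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__constant__ float c_pIndex[255];

// Hypothetical helper: copies the vector into the __constant__ array, asks
// the runtime for the array's device address, and passes that to cuBLAS.
// Returns a 0-based index, or a negative code on failure.
int FindMaxIndexViaSymbol(cublasHandle_t handle, const float* hostVec, std::size_t length)
{
    if (length == 0 || length > 255) return -1;  // must fit in c_pIndex

    if (cudaMemcpyToSymbol(c_pIndex, hostVec, length * sizeof(float)) != cudaSuccess)
        return -2;

    // Using c_pIndex directly in host code yields a host-side address.
    // cudaGetSymbolAddress returns the address of the symbol on the device.
    void* devPtr = nullptr;
    if (cudaGetSymbolAddress(&devPtr, c_pIndex) != cudaSuccess)
        return -3;

    int result = 0;  // 1-based index, written to host memory
    if (cublasIsamax(handle, static_cast<int>(length),
                     static_cast<const float*>(devPtr), 1, &result)
        != CUBLAS_STATUS_SUCCESS)
        return -4;

    return result - 1;
}
```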

As for accessing __constant__ memory through a device pointer, see this answer for why such accesses are not cached.

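Finally, using __constant__ memory this way would be inefficient because of the access pattern, even if the data were cached in the constant cache. The constant cache is optimized for uniform access across the threads in a warp. isamax probably reads a different memory location in each thread, so the accesses would be serialized. With 32 threads per warp, that can be up to 32 times slower than uniform access (and probably much slower than regular device memory).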
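The two access patterns can be sketched with a pair of toy kernels. All names here (`c_table`, `uniformRead`, `perThreadRead`, `RunConstantReadDemo`) are hypothetical and nothing below is taken from cuBLAS; the sketch only shows what "uniform" and "per-thread" reads of a __constant__ array look like. Both kernels produce correct results, and the difference is in how the constant cache serves them.

```cuda
#include <cuda_runtime.h>

__constant__ float c_table[256];  // hypothetical lookup table

// Uniform access: every thread in a warp reads the same element, which the
// constant cache can serve as a single broadcast.
__global__ void uniformRead(float* out, int k)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = c_table[k];
}

// Per-thread access: each thread in a warp reads a different element, so the
// distinct addresses are served one after another.
__global__ void perThreadRead(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c_table[tid % 256];
}

// Host helper: fills the table, runs both kernels on one block of 256
// threads, and checks that each produced the expected values.
bool RunConstantReadDemo()
{
    float host[256];
    for (int i = 0; i < 256; ++i) host[i] = static_cast<float>(i);
    if (cudaMemcpyToSymbol(c_table, host, sizeof(host)) != cudaSuccess) return false;

    float* dOut = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&dOut), sizeof(host)) != cudaSuccess) return false;

    float out[256];
    bool ok = true;

    uniformRead<<<1, 256>>>(dOut, 7);
    ok = ok && cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < 256; ++i) ok = (out[i] == 7.0f);

    perThreadRead<<<1, 256>>>(dOut);
    ok = ok && cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < 256; ++i) ok = (out[i] == static_cast<float>(i));

    cudaFree(dOut);
    return ok;
}
```

A reduction such as isamax most likely resembles `perThreadRead`, which is why regular device memory is the better choice for its input.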

【Discussion】:

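Thanks. Now I understand. I simply re-implemented cublasIsamax myself using c_pIndex, and it works well.
If the answer helped, please accept it. But I think you missed the part of my answer that explains why what you are doing is a bad idea. Using constant memory this way will be slow.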