CUDA 5.5 samples compile fine on OS X 10.9 but error out immediately when run

【Question title】: CUDA 5.5 samples compile fine on OS X 10.9 but error out immediately when run
【Posted】: 2014-04-02 01:33:53
【Question】:

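This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, Xcode 4.x, and CUDA 5.0, CUDA code compiled and ran fine.

I then updated to OS X 10.9.2, Xcode 5.1, and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver shipped with CUDA 5.5) does not support compute capability 1.x (sm_10), but that 5.5.43 does. After updating the CUDA driver to the latest 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery does pass, with the following output.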

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors, (  8) CUDA Cores/MP:     48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS

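In addition, I can compile the CUDA 5.5 samples successfully without modification, though I have not tried compiling all of them.

However, samples such as matrixMul, simpleCUFFT, and simpleCUBLAS all fail immediately when run.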

$ ./matrixMul 
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2

MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)

$ ./simpleCUFFT 
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2

CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"

Error code 2 is cudaErrorMemoryAllocation, but I suspect it is somehow masking a failed CUDA initialization.

$ ./simpleCUBLAS 
GPU Device 0: "GeForce 320M" with compute capability 1.2

simpleCUBLAS test running..
!!!! CUBLAS initialization error

The actual error code, returned from the call to cublasCreate(), is CUBLAS_STATUS_NOT_INITIALIZED.

Has anyone run into this problem before and found a fix? Thanks in advance.

【Question comments】:

Tags: xcode macos cuda osx-mavericks nvidia

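【Answer 1】:

My guess is that you are running out of memory. The display manager is using your GPU, which only has 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might leave you with almost no free memory. I would recommend writing and running a small test program like this: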

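#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    size_t mfree, mtotal;

    // Select device 0.
    cudaSetDevice(0);

    // cudaMemGetInfo needs a context, so this also exercises context creation.
    // Report a failure instead of printing uninitialized values.
    cudaError_t err = cudaMemGetInfo(&mfree, &mtotal);
    if (err != cudaSuccess) {
        std::cerr << "cudaMemGetInfo failed: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }

    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;

    return cudaDeviceReset();
}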

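[Disclaimer: written in the browser, never compiled or tested, use at your own risk]

This should give you an idea of how much free memory is available after a context has been established on the device. You might be surprised at how little there is to work with.

EDIT: Here is an even lighter-weight alternative test that does not attempt to establish a context on the device at all. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API on OS X is somehow broken, or there is no memory available on your device to establish a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider filing a bug report with NVIDIA: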

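#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b;

    // Initialize the driver API and look up device 0; no context is created.
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&d, 0) != CUDA_SUCCESS) {
        std::cerr << "Driver API initialization failed" << std::endl;
        return 1;
    }

    // Total device memory can be queried without a context.
    if (cuDeviceTotalMem(&b, d) != CUDA_SUCCESS) {
        std::cerr << "cuDeviceTotalMem failed" << std::endl;
        return 1;
    }

    std::cout << "Total memory = " << b << std::endl;

    return 0;
}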

Note that you will need to explicitly link against the CUDA driver library to make this work (pass -lcuda to nvcc, for example).

【Comments】:

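I just tried your suggestion. Unfortunately, cudaMemGetInfo also returns error code 2 (cudaErrorMemoryAllocation). But thanks for the idea. Perhaps I can try some other diagnostic programs.
@cklin: OK, that is weird. I suggest trying the code I edited into my answer. It does not even establish a context and only uses the driver API. I would consider contacting NVIDIA support about this.
Thanks. I tried your new snippet, and it ran without error. Then I added one line to manually create a context by calling cuCtxCreate. Surprise! That returns error code 2. It seems there may not even be enough memory available to create a context.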