Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES is not equal to 0

【Question title】: Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0
【Posted】: 2017-08-20 11:45:47
【Question description】:

I'm running into a strange problem with how Torque allocates GPUs.

I am running Torque 6.1.0 on a single machine with two NVIDIA GTX Titan X GPUs, and I am using pbs_sched for scheduling. With nothing running, the nvidia-smi output is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have a simple test script, test.sh, to evaluate GPU allocation:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

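deviceQuery is a utility that ships with the CUDA samples. When I run it from the command line, it correctly finds both GPUs. When I restrict it to a single device from the command line like this...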

CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

...it also correctly finds one GPU or the other.

When I submit test.sh to the queue with qsub while no other jobs are running, it again works correctly. Here is the output:

CUDA_VISIBLE_DEVICES: 0 
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12204 MBytes (12796887040 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS

However, if another job is already running on gpu0 (that is, if this job is assigned CUDA_VISIBLE_DEVICES=1), the job cannot find any GPU. Output:

CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Does anyone know what is going on here?
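When a job reports "no CUDA-capable device is detected", it can help to have the job print more than just CUDA_VISIBLE_DEVICES before deviceQuery runs. The following is only a minimal sketch of such a check, not something from the original post. The function name gpu_diag is hypothetical, and the sketch assumes nvidia-smi is on the job's PATH (it falls back to a message if it is not).

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque or CUDA): print what the job sees.
gpu_diag() {
    # The value Torque exported for this job, if any.
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-<unset or empty>}"

    if command -v nvidia-smi >/dev/null 2>&1; then
        # Index, name and compute mode of every GPU the driver reports.
        nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
            || echo "nvidia-smi: GPU query failed"
        # Compute processes currently holding a GPU, with the GPU's bus ID.
        nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id --format=csv,noheader \
            || echo "nvidia-smi: process query failed"
    else
        echo "nvidia-smi: not found on PATH"
    fi
}
```

Calling gpu_diag at the top of test.sh, right before deviceQuery, shows whether the driver-level view inside the job still lists both GPUs when the CUDA runtime reports none.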

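【Comments】:

I'm seeing a lot of recent GPU/CUDA fixes in 6.1-dev. It might be worth trying a checkout of that branch to see how it behaves.
Thanks @clusterdude. I couldn't get 6.1-dev to work, but the same problem exists in 6.1.1.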

Tags: nvidia gpu pbs torque

【Solution 1】:

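I think I have solved my own problem, but unfortunately I tried two things at once, and I would rather not go back to confirm which one fixed it. It was one of the following: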

Removing the --enable-cgroups option from the arguments to Torque's configure script before building.
Running these steps as part of the Torque installation:

make packages

sh torque-package-server-linux-x86_64.sh --install

sh torque-package-mom-linux-x86_64.sh --install

sh torque-package-clients-linux-x86_64.sh --install

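As for the second option, I realize these steps are properly documented in the Torque installation instructions. However, I have a simple setup with only a single node (the compute node and the server are the same machine). I assumed that "make install" would do everything the package installs do for that single node, but perhaps I was wrong.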
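Since both changes were made together, a rebuild that applies both could look roughly like the sketch below. It is a hedged illustration only: it assumes it is run from the top of the Torque source tree with sufficient privileges for the install steps, and the names rebuild_torque, run, DRY_RUN and CONFIGURE_OPTS are hypothetical. CONFIGURE_OPTS stands for whatever options the original build used, minus --enable-cgroups. By default the script only prints the commands.

```shell
#!/bin/bash
# Sketch only: all function and variable names here are hypothetical.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
# Options passed to ./configure; must NOT include --enable-cgroups.
CONFIGURE_OPTS="${CONFIGURE_OPTS:-}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_torque() {
    # Refuse to continue if the cgroups option is still present.
    case " $CONFIGURE_OPTS " in
        *" --enable-cgroups "*)
            echo "remove --enable-cgroups from CONFIGURE_OPTS" >&2
            return 1
            ;;
    esac

    # Word splitting of CONFIGURE_OPTS is intentional here.
    run ./configure $CONFIGURE_OPTS || return 1
    run make || return 1
    run make packages || return 1

    # Install the server, mom and client packages on the single node.
    for role in server mom clients; do
        run sh "torque-package-${role}-linux-x86_64.sh" --install || return 1
    done
}
```

Running rebuild_torque with the default DRY_RUN=1 lists the six commands without touching the system, which makes it easy to review the sequence before executing it with DRY_RUN=0.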
