【Question title】: Torque jobs cannot find a GPU when CUDA_VISIBLE_DEVICES is not equal to 0
【Posted】: 2017-08-20 11:45:47
【Question】:

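I am running into a strange problem with Torque's GPU allocation.

I am running Torque 6.1.0 on a single machine with two NVIDIA GTX Titan X GPUs, and I am using pbs_sched for scheduling. The nvidia-smi output with nothing running is as follows: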

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have a simple test script to assess GPU allocation, as follows:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery 

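deviceQuery is a utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict it to one device from the command line like this...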

CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

...it also correctly finds one GPU or the other.
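The per-device check above can be repeated for every index with a small helper. This is only a sketch: probe_each_device is a hypothetical name, and it assumes the probed command exits non-zero when it finds no usable device.

```shell
#!/bin/bash
# Run a command once per GPU index, with CUDA_VISIBLE_DEVICES set to that
# index, and print the exit status for each run.
# Usage: probe_each_device "<space-separated indices>" command [args...]
# (probe_each_device is a hypothetical helper, not part of CUDA or Torque.)
probe_each_device() {
    local indices="$1"
    shift
    local idx status
    for idx in $indices; do
        status=0
        # The assignment applies only to this one invocation of the command.
        CUDA_VISIBLE_DEVICES="$idx" "$@" >/dev/null 2>&1 || status=$?
        echo "CUDA_VISIBLE_DEVICES=$idx exit=$status"
    done
}

# Example call, using the deviceQuery path from the question:
# probe_each_device "0 1" ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
```

Running the same helper from an interactive shell and from inside a job makes it easy to compare the two environments side by side.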

When I submit test.sh to the queue with qsub while no other jobs are running, it again works correctly. Here is the output:

CUDA_VISIBLE_DEVICES: 0 
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12204 MBytes (12796887040 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X
Result = PASS

However, if a job is already running on gpu0 (i.e., if the new job is assigned CUDA_VISIBLE_DEVICES=1), the job cannot find any GPU. The output:

CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Does anyone know what is going on here?
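To narrow down a failure like this, it can help to have the job print what it was handed and what it can actually see. The following is a minimal diagnostic sketch, not a Torque feature: the function names are hypothetical, and it assumes that Torque exports PBS_GPUFILE for jobs requesting gpus and that nvidia-smi is on the job's PATH. If the job enumerates fewer GPUs than the index in CUDA_VISIBLE_DEVICES implies, that would be consistent with the error above, although it does not by itself establish the cause.

```shell
#!/bin/bash
# Diagnostic sketch for a Torque job script (illustrative helpers only).

# Count the entries in a comma-separated CUDA_VISIBLE_DEVICES-style list.
count_listed_devices() {
    local list="$1"
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    local IFS=','
    set -- $list   # split on commas into the positional parameters
    echo "$#"
}

report_gpu_environment() {
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES-<unset>}"
    echo "Entries listed: $(count_listed_devices "${CUDA_VISIBLE_DEVICES-}")"

    # Assumption: Torque sets PBS_GPUFILE for jobs that request gpus.
    if [ -n "${PBS_GPUFILE-}" ] && [ -r "${PBS_GPUFILE}" ]; then
        echo "PBS_GPUFILE contents:"
        cat "$PBS_GPUFILE"
    fi

    # Device nodes visible to the job.
    ls -l /dev/nvidia* 2>/dev/null || echo "No /dev/nvidia* device nodes visible"

    # GPUs the driver enumerates for this process.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi -L failed"
    fi

    # cgroups this shell belongs to (Linux only).
    if [ -r /proc/self/cgroup ]; then
        cat /proc/self/cgroup
    fi
    return 0
}

report_gpu_environment
```

Adding these lines before the deviceQuery call in test.sh and comparing a passing job with a failing one should show whether the two jobs differ in anything other than the value of CUDA_VISIBLE_DEVICES.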

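【Comments】:

  • I have seen a lot of recent GPU/CUDA fixes in 6.1-dev. It might be worth checking out that branch to see how it behaves.
  • Thanks @clusterdude. I could not get 6.1-dev to work, but the same problem exists in 6.1.1.

Tags: nvidia gpu pbs torque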


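【Solution 1】:

I think I have solved my own problem, but unfortunately I tried two things at once, and I do not want to go back and confirm which one fixed it. It was one of the following:

  1. Removing the --enable-cgroups option from Torque's configure script before building.

  2. Running these steps during the Torque installation:

    make packages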

    sh torque-package-server-linux-x86_64.sh --install

    sh torque-package-mom-linux-x86_64.sh --install

    sh torque-package-clients-linux-x86_64.sh --install

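Regarding the second option, I know these steps are properly documented in the Torque installation instructions. However, I have a simple setup with only a single node (the compute node and the server are the same machine). I thought "make install" would do everything the package installs do for that single node, but maybe I was wrong.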
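Since the answer leaves open which change mattered, one way to see how an existing build tree was configured is to read the invocation that autoconf records near the top of config.log. This is a sketch under that assumption: configured_with_cgroups is a hypothetical helper, and the log path has to be supplied by the reader.

```shell
#!/bin/bash
# Report whether an autoconf build tree was configured with --enable-cgroups.
# Argument: path to the config.log written by ./configure (default: ./config.log).
# Returns 0 if the option was used, 1 if not, 2 if no invocation line was found.
# (configured_with_cgroups is a hypothetical helper, not part of Torque.)
configured_with_cgroups() {
    local log="${1:-config.log}"
    local invocation
    # autoconf records the command line as "  $ ./configure <options>".
    invocation=$(grep -m1 -E '^[[:space:]]*\$ .*configure' "$log" 2>/dev/null) || return 2
    case "$invocation" in
        *--enable-cgroups*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: pass the config.log of the Torque source tree as the first argument.
if [ -n "${1-}" ]; then
    if configured_with_cgroups "$1"; then
        echo "configure was run with --enable-cgroups"
    else
        echo "configure was not run with --enable-cgroups (or no invocation line was found)"
    fi
fi
```

Checking this on both the failing and the working build would help separate the effect of option 1 from that of option 2.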
