JupyterHub single-user server cannot use TensorFlow GPU support with systemdspawner

【Question title】: JupyterHub singleuser not able to use tensorflow gpu support using systemdspawner
【Posted】: 2020-05-27 23:24:01
【Question description】:

(This is a cross-post to SO, the jupyterhub issue tracker, and the jupyterhub/systemdspawner issue tracker.)

I have a private JupyterHub setup that uses SystemdSpawner, and I am trying to run tensorflow with GPU support on it.

I followed the tensorflow instructions, and alternatively tried an already provisioned AWS AMI (Deep Learning Base AMI (Ubuntu 18.04) Version 21.0) with NVIDIA on an AWS EC2 g4 instance.

In both setups I can use tensorflow with GPU support in an (i)python 3.6 shell:

>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
2020-02-12 10:57:13.670937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 10:57:13.698230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.699066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-02-12 10:57:13.699286: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-12 10:57:13.700918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 10:57:13.702512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 10:57:13.702814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 10:57:13.704561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 10:57:13.705586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 10:57:13.709171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 10:57:13.709278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

(There are some warnings about NUMA nodes, but the GPU is found.)

Likewise, nvidia-smi and deviceQuery both show the GPU:

$ nvidia-smi
Wed Feb 12 10:39:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          10.1 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15080 MBytes (15812263936 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = Tesla T4
Result = PASS

Now I start JupyterHub, log in, and open a terminal, where I get:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and

$ /usr/local/cuda/extras/demo_suite/deviceQuery
cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

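I suspect some kind of "sandboxing", missing ENV variables, or similar: the GPU driver is not found inside the single-user environment, and as a result tensorflow GPU support does not work.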

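Any ideas on this? It is probably either a small configuration tweak, or something that cannot be solved at all because of the architecture ;)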

【Comments】:

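Are you using the default configuration of SystemdSpawner? This is a shot in the dark, but you could try setting c.SystemdSpawner.isolate_devices = False in your SystemdSpawner configuration (github.com/jupyterhub/systemdspawner#isolate_devices). It should be False by default, though...
I'm amazed! That solved my problem! I had set it to True because separating users seemed like a smart thing to do. I never ran into problems while using only the CPU... until now.
Now I know the cause. Is there a way to still isolate devices and enable GPU support? Keeping isolation enabled still seems safer. My current use case is a completely non-critical, temporary training setup, so I don't mind, but it may become relevant in the future.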

Tags: python tensorflow gpu jupyter jupyterhub

【Solution 1】:

Set c.SystemdSpawner.isolate_devices = False in your jupyterhub_config.py.
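
As a minimal sketch, the relevant part of jupyterhub_config.py could look like the fragment below. It only runs inside JupyterHub's config loader, which provides get_config(). The spawner_class line is an assumption about how the hub in the question was configured; only the isolate_devices setting comes from this answer.

```python
# jupyterhub_config.py (fragment; loaded by JupyterHub, not run directly)
c = get_config()  # noqa: F821 -- injected by JupyterHub's config loader

# Assumption: the hub is already using SystemdSpawner, as in the question.
c.JupyterHub.spawner_class = "systemdspawner.SystemdSpawner"

# Leave the host's /dev visible to single-user servers so that the
# /dev/nvidia* device nodes can be opened by the NVIDIA driver stack.
c.SystemdSpawner.isolate_devices = False
```

The hub has to be restarted and the user's server respawned before the change takes effect.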

Here is an excerpt from the documentation:

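Setting this to true provides a separate, private /dev for each user. This prevents the user from directly accessing hardware devices, which could be a potential source of security issues. /dev/null, /dev/zero, /dev/random and the ttyp pseudo-devices will be mounted already, so most users should see no change when this is enabled.
c.SystemdSpawner.isolate_devices = True
This requires systemd version > 227. If you enable this in earlier versions, spawning will fail.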

Nvidia uses devices (i.e., files in /dev). Please refer to their documentation for more information. There should be files named /dev/nvidia*. Isolating devices with SystemdSpawner blocks access to these Nvidia devices.
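
One quick way to confirm this from inside a single-user server is to list the NVIDIA device nodes the process can see. The helper below is a small sketch; the function name visible_nvidia_devices is made up for illustration and is not part of JupyterHub or systemdspawner.

```python
import glob


def visible_nvidia_devices(dev_dir="/dev"):
    """Return the sorted NVIDIA device nodes visible under dev_dir."""
    return sorted(glob.glob(f"{dev_dir}/nvidia*"))


if __name__ == "__main__":
    devices = visible_nvidia_devices()
    if devices:
        print("NVIDIA device nodes visible:", ", ".join(devices))
    else:
        # With a private /dev this branch is the expected outcome.
        print("No /dev/nvidia* nodes visible from this process.")
```

Running it once in a plain login shell and once in a JupyterHub terminal should show whether the spawned server has a different view of /dev.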

Is there a way to still isolate devices and enable GPU support?

I'm not sure... but I can give you pointers to the documentation. Setting c.SystemdSpawner.isolate_devices = True sets PrivateDevices=yes in the eventual systemd-run call (source). See the systemd documentation for more information on the PrivateDevices option.
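
To check whether a running single-user unit actually has this property set, systemd can be queried with systemctl show. The sketch below wraps that call; both function names are hypothetical, and the unit name in the usage note assumes systemdspawner's default unit naming, which may differ on a customized hub.

```python
import subprocess


def parse_private_devices(show_output):
    """Return True if `systemctl show` output reports PrivateDevices=yes."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "PrivateDevices":
            return value.strip() == "yes"
    return False


def unit_has_private_devices(unit):
    """Ask systemd whether the given unit runs with a private /dev."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=PrivateDevices"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as used in the question
        check=True,
    )
    return parse_private_devices(result.stdout)
```

For a user named alice, and assuming the default unit name, the call would be unit_has_private_devices("jupyter-alice-singleuser.service"). A True result means the server was started with isolated devices.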

You might be able to keep isolate_devices = True and then explicitly mount the nvidia devices. I don't know how to do that, though...
