【Question title】: How to get Docker to recognize NVIDIA drivers?
【Posted】: 2019-11-25 17:10:34
【Question description】:

I have a container that loads a PyTorch model. Every time I try to start it, I get this error:

Traceback (most recent call last):
  File "server/start.py", line 166, in <module>
    start()
  File "server/start.py", line 94, in start
    app.register_blueprint(create_api(), url_prefix="/api/1")
  File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
    atomic_demo_model = DemoModel(model_filepath, comet_dir)
  File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
    model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
  File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
    model.to(cfg.device)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

I know that nvidia-docker2 is working:

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   44C    P0    72W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
|  0%   44C    P0    66W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   44C    P0    48W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
|  0%   41C    P0    54W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:3F:00.0 Off |                  N/A |
|  0%   42C    P0    48W / 260W |      0MiB / 10989MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P0     1W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, I keep getting the error above.

I have tried the following:

  1. Setting "default-runtime": "nvidia" in /etc/docker/daemon.json

  2. Using docker run --runtime=nvidia <IMAGE_ID>

  3. Adding the following variables to my Dockerfile:

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"

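I expect this container to run: we have a working version in production that has none of these problems. I also know that Docker can find the driver, as the output above shows. Any ideas?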

【Question comments】:

  • Can you show us the corresponding Dockerfile?
  • By the way, is the nvidia-smi command installed? My config JSON: `{ "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } } }`
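
For anyone comparing their own daemon.json against the settings mentioned in this thread, here is a minimal Python sketch. The function name nvidia_runtime_status is made up for illustration, and the sample text just combines the "default-runtime" setting from the question with the "runtimes" entry from the comment above; it is not anyone's actual file.

```python
import json


def nvidia_runtime_status(daemon_json_text):
    """Summarize the NVIDIA-related settings in a Docker daemon.json document.

    Hypothetical helper: it only inspects the "runtimes" and
    "default-runtime" keys discussed in this thread.
    """
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return {
        "nvidia_runtime_registered": "nvidia" in runtimes,
        "nvidia_runtime_path": runtimes.get("nvidia", {}).get("path"),
        "nvidia_is_default": config.get("default-runtime") == "nvidia",
    }


# Sample content assembled from the snippets quoted in this thread.
SAMPLE = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  }
}
"""

print(nvidia_runtime_status(SAMPLE))
```

To check a real host, read the text of /etc/docker/daemon.json and pass it to the function. Changes to that file normally take effect only after the Docker daemon is restarted.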

Tags: docker docker-compose pytorch ubuntu-18.04 nvidia-docker


【Solution 1】:

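For Docker to use the host's GPU driver and GPUs, a few steps are needed:

  • Make sure the NVIDIA driver is installed on the host system
  • Set up the NVIDIA Container Toolkit by following the steps here
  • Make sure CUDA and cuDNN are installed in the image
  • Run the container with the --gpus flag (as described in the link above)

I assume you have already done the first three, since nvidia-docker2 is working. Your run command has no --gpus flag, so that may be the problem.

I usually run my containers with the following command: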

docker run --name <container_name> --gpus all -it <image_name>

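-it only makes the container interactive (it keeps stdin open and allocates a terminal), so you land in a bash session if the image's default command is a shell.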
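
To confirm from inside the container that the runtime actually exposed the driver, a small standard-library check can help. This is only a sketch: the function name gpu_visibility_report is invented for illustration, and it assumes the usual layout where the NVIDIA runtime provides device nodes under /dev (for example /dev/nvidia0 and /dev/nvidiactl) along with the libcuda driver library.

```python
import ctypes.util
import glob
import os
import shutil


def gpu_visibility_report(dev_dir="/dev", environ=None):
    """Collect hints about whether the NVIDIA driver is visible in this environment.

    Hypothetical helper using only the standard library; it does not
    initialize CUDA, it only looks for the pieces the runtime should inject.
    """
    env = os.environ if environ is None else environ
    pattern = os.path.join(glob.escape(dev_dir), "nvidia*")
    return {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl
        "device_nodes": sorted(glob.glob(pattern)),
        # Name of the driver library if the system can locate libcuda, else None
        "libcuda": ctypes.util.find_library("cuda"),
        # Path to nvidia-smi if it is on PATH, else None
        "nvidia_smi": shutil.which("nvidia-smi"),
        "NVIDIA_VISIBLE_DEVICES": env.get("NVIDIA_VISIBLE_DEVICES"),
        "NVIDIA_DRIVER_CAPABILITIES": env.get("NVIDIA_DRIVER_CAPABILITIES"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If device_nodes is empty and libcuda is None inside the container, the container was most likely started without GPU access, which would be consistent with the "Found no NVIDIA driver" assertion in the question.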

【Discussion】:

【Solution 2】:

I ran into the same error. After trying several solutions, I found that the following works:

docker run -ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>
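
If the container is launched from a script, the two invocation styles that appear in this thread (the --gpus flag and the --runtime=nvidia flag with environment variables) can be assembled like this. It is a rough sketch: docker_run_command and its parameters are hypothetical names, and the -ti flags are left out because a scripted launch may not have a terminal attached.

```python
import shlex


def docker_run_command(image, style="gpus", devices="all",
                       capabilities="compute,utility"):
    """Build a `docker run` argument list for GPU access (hypothetical helper).

    style="gpus"    -> uses the --gpus flag
    style="runtime" -> uses --runtime=nvidia plus the NVIDIA_* variables
    """
    if style == "gpus":
        gpu_args = ["--gpus", devices]
    elif style == "runtime":
        gpu_args = [
            "--runtime=nvidia",
            "-e", f"NVIDIA_DRIVER_CAPABILITIES={capabilities}",
            "-e", f"NVIDIA_VISIBLE_DEVICES={devices}",
        ]
    else:
        raise ValueError(f"unknown style: {style!r}")
    return ["docker", "run", *gpu_args, image]


# "my-image" is a placeholder image name.
print(shlex.join(docker_run_command("my-image", style="runtime")))
print(shlex.join(docker_run_command("my-image", style="gpus")))
```

The returned list can be passed to subprocess.run as-is. Note that the --gpus flag needs Docker 19.03 or newer, while --runtime=nvidia relies on the nvidia runtime being registered with the daemon.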
    

【Discussion】:

【Solution 3】:

In my case, I was building from a plain Ubuntu base image, i.e.

FROM ubuntu

Switching to a Docker base image provided by NVIDIA solved the problem for me:

FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
      

【Discussion】:
