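【Posted】: 2020-06-10 13:52:47
【Question】:
I want to create some neural networks in TensorFlow 2.x that train on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming for now that this is actually possible). As far as I know, in order to train a TensorFlow model on a GPU, I need the CUDA Toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between TensorFlow, CUDA and the NVIDIA driver. So, I was trying to find a way to create a docker-compose file that contains a service for each of TensorFlow, CUDA and the NVIDIA driver, but I am getting the following error: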
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks like this:
# version 2.3 is required for NVIDIA runtime
version: '2.3'
services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net
  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service. Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net
  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net
volumes:
  nvidia_driver:
networks:
  net:
    driver: bridge
My /etc/docker/daemon.json file looks like this:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
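So, it seems like the error is related to configuring the nvidia runtime, but more importantly, I am almost certain that I did not set up my docker-compose file correctly. So, my questions are:
- Is it actually possible to do what I am trying to do?
- If yes, did I set up my docker-compose file correctly (see the comments in docker-compose.yml)?
- How do I fix the error message I received above?
Thank you very much for your help, I highly appreciate it.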
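【Comments】:
- I've not done this but... you need to use the -gpu flag on the Docker image, see: hub.docker.com/r/tensorflow/tensorflow and the NVIDIA Container Toolkit (github.com/NVIDIA/nvidia-docker/blob/master/README.md)
- Hi DazWilkin, thanks for your comment. As far as I understand, you can use the --gpu flag when executing docker run ..., but how would you do this when running docker-compose up? According to the documentation of docker-compose up, there is no --gpu...
- Docker-Compose is effectively doing a docker run ... for you. You can provide arguments to a container in Compose using command: at the same level as image:, environment:, etc. You would have command:, and then below it - --gpu. NB that's a single hyphen to denote an array item for command, followed by the double hyphen in front of gpu. Alternatively (but messy) you can mix JSON with YAML and write: command: ["--gpu"]
- Hi DazWilkin, thanks for your comment. Unfortunately, your suggestion seems to work for docker-compose version 3.x (at least it did for 3.7), but not for version 2.3, which I think I am supposed to use. So, I adjusted the command for tensorflow as follows: command: ["/bin/sh -c", "--gpus all python", "import tensorflow as tf", "print(tf.reduce_sum(tf.random.normal([1000, 1000])))"]. Is this what you mean? Unfortunately, I can't test this right now...
- For docker-compose version 2.3, I think you can use the runtime option. So runtime: nvidia, together with the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. This was removed in later docker-compose versions, so in v3+ there seems to be a debate about how to support NVIDIA GPUs.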
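For illustration, here is a minimal sketch of the approach the last comment describes for compose file version 2.3. It is untested and rests on several assumptions: the NVIDIA driver and nvidia-container-runtime are installed on the host (as the daemon.json in the question suggests), the tensorflow/tensorflow:2.0.1-gpu image already bundles the CUDA libraries it needs, and a python executable is on the image's PATH. The NVIDIA_DRIVER_CAPABILITIES value shown is one plausible choice, not something confirmed in this thread. The command is written as a list so that the first item is the program to execute; the error in the question arises because Docker tried to run the first word of the command string, import, as an executable.

```yaml
# Sketch only: assumes the host provides the NVIDIA driver and the
# nvidia runtime, so no separate driver or CUDA services are defined.
version: '2.3'
services:
  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # Assumed value; adjust to the capabilities you actually need.
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - ./src:/src
      - ./configs:/configs
    # List form: "python" is the executable, "-c" and the script are its
    # arguments, so Docker no longer tries to exec "import".
    command:
      - python
      - -c
      - 'import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")'
```

If this works on your setup, running sudo docker-compose -f docker-compose-test.yml up should print the reduced sum followed by SUCCESS; whether the GPU is actually used still depends on the host driver being compatible with the CUDA version in the image.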
Tags: docker tensorflow docker-compose gpu nvidia