TensorFlow on shared GPUs: how to automatically select the one that is unused

【Question title】: TensorFlow on shared GPUs: how to automatically select the one that is unused
【Posted】: 2017-01-13 12:27:54
【Question】:

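I have access through ssh to a cluster of n GPUs. TensorFlow automatically names them gpu:0, ..., gpu:(n-1).

Other people have access too, and sometimes they take GPUs at random. I have not placed any tf.device() explicitly, because that is cumbersome, and even if I selected GPU number j, someone else might already be on GPU number j, which would be a problem.

I would like to go through the GPUs' usage, find the first one that is unused, and use only that one. I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.

I have never seen an example of this. I imagine it is a pretty common problem. What would be the simplest way to do it? Is there a pure TensorFlow one?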

【Question comments】:

Tags: tensorflow gpu distributed-system

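【Solution 1】:

I'm not aware of a pure TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add this, and there is no mechanism for process-global configuration (though there should be, also to make it possible to configure the process-global Eigen thread pool). So you need to do it at the process level, using the CUDA_VISIBLE_DEVICES environment variable.

Something like this: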

import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""

    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                                       11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    # Start every known GPU at 0 MiB so idle GPUs still appear in the map.
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    # Tuples compare by memory first, so ties go to the lower GPU id.
    best_memory, best_gpu = min(memory_gpu_map)
    return best_gpu

You can then put this in utils.py and select the GPU in your TensorFlow script before the first tensorflow import. For example:

import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
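
The question asked for the first unused GPU rather than the least loaded one. A minimal sketch of that variant follows, assuming a driver whose nvidia-smi supports the --query-gpu CSV output. The helper names parse_memory_csv, query_gpu_memory and pick_first_unused_gpu, and the threshold_mib parameter, are hypothetical and not part of the answer above.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'index, memory.used' CSV rows into {gpu_id: used_mib}."""
    result = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        gpu_id, used = (field.strip() for field in line.split(","))
        result[int(gpu_id)] = int(used)
    return result


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU used memory in MiB (needs an NVIDIA driver)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"])
    return parse_memory_csv(output.decode("ascii"))


def pick_first_unused_gpu(memory_map, threshold_mib=0):
    """Return the lowest GPU id using at most threshold_mib, or None."""
    for gpu_id in sorted(memory_map):
        if memory_map[gpu_id] <= threshold_mib:
            return gpu_id
    return None
```

An idle GPU may still report a few MiB in use on some setups, so threshold_mib may need to be raised above 0. If pick_first_unused_gpu returns None, every GPU is busy and the caller has to decide whether to wait or fall back to pick_gpu_lowest_memory().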

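【Comments】:

Thanks for the great answer!
Apparently nvidia-smi can give mismatched device numbers in some cases; it looks like you have to combine it with lspci to get the correct numbers, as described in 152
I'll check that, thanks! But so far your solution seems to work well for me!
If it stops working, the workaround is to set the environment variable: export CUDA_DEVICE_ORDER=PCI_BUS_ID
I saw it on GitHub. Thanks again!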
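
The environment-variable workaround from the comments can also be applied from Python instead of the shell. A minimal sketch, assuming both variables are set before TensorFlow is first imported; the helper name select_gpu and its environ parameter are hypothetical.

```python
import os


def select_gpu(gpu_id, environ=os.environ):
    """Expose only gpu_id to CUDA, numbered in PCI bus order."""
    # Make CUDA enumerate devices in PCI bus order, which is the order
    # nvidia-smi reports, instead of CUDA's default ordering.
    environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every other GPU from this process; the chosen one is the only
    # device the process will see.
    environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return environ
```

Calling select_gpu(...) has to happen before import tensorflow, since the variables are read when CUDA is initialized in the process.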

【Solution 2】:

An implementation similar to Yaroslav Bulatov's solution is available at https://github.com/bamos/setGPU.

【Comments】: