TensorFlow: automatically choose the least loaded GPU

【Question title】: TensorFlow automatically choose least loaded GPU
【Posted】: 2017-02-14 07:04:19
【Question】:

We specify which GPU device to use like this:

with tf.device('/gpu:'+gpu_id):

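gpu_id is a string variable in which I manually set the id of the GPU to use. I need to run several experiments, each on a different GPU, so I change the value of gpu_id by hand before launching each instance of the code.

Can I write some code that automatically detects the first unused GPU and assigns it to gpu_id?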

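【Comments】:

Yes, but you have to write that code yourself. Probably the simplest approach is a static assignment for your experiments, using CUDA_VISIBLE_DEVICES to pin each experiment to a GPU.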
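The static assignment suggested in this comment could look roughly like the sketch below. It is an illustration, not the commenter's code: the script name `train.py`, the `--config` flag, the config file names, and the helper `build_launch_plan` are hypothetical. Each experiment is started with its own copy of the environment in which CUDA_VISIBLE_DEVICES names a single GPU, so inside that process the chosen GPU shows up as device 0.

```python
import os
import subprocess
import sys


def build_launch_plan(experiments, gpu_ids):
    """Pair each experiment with a GPU id, round-robin (hypothetical helper)."""
    return [(exp, gpu_ids[i % len(gpu_ids)]) for i, exp in enumerate(experiments)]


def launch(experiment, gpu_id, script="train.py"):
    """Start one experiment pinned to a single GPU.

    `train.py` and `--config` are placeholders for your own entry point.
    """
    env = dict(os.environ)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # same numbering as nvidia-smi
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the child process sees only this GPU
    return subprocess.Popen([sys.executable, script, "--config", experiment], env=env)


if __name__ == "__main__":
    plan = build_launch_plan(["exp_a.yaml", "exp_b.yaml", "exp_c.yaml"], gpu_ids=[0, 1])
    for experiment, gpu_id in plan:
        print(f"{experiment} -> GPU {gpu_id}")
        # launch(experiment, gpu_id)  # uncomment once your training script exists
```

Because the variable is set per child process, the training code itself can keep using a fixed device string and never needs to know which physical GPU it received.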

Tags: python tensorflow gpu

【Answer 1】:

There is already an option that tells you which GPU each tensor is placed on:

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

With log_device_placement set to True, TensorFlow logs output like this:

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus
id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

Source: the TensorFlow guide "Using GPUs"

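【Comments】:

Thanks. But how can I use this to determine at runtime which GPU is not in use?
The a and b lines show which GPU is being used; in the example above, none is. If one were, it would look like this: /job:localhost/replica:0/task:0/gpu:1 -> device: 1 (see the last example on the page I linked: tensorflow.org/how_tos/using_gpu)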
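To make it easier to read a placement log like the one in this answer, here is a small sketch that extracts the `op: device` pairs from it. The helper name `parse_placements` and the regular expression are my own assumptions, based only on the sample output shown above. Note that this log describes where the current session placed its operations; it says nothing about load caused by other processes.

```python
import re

# Matches lines such as "MatMul: /job:localhost/replica:0/task:0/gpu:0".
PLACEMENT_RE = re.compile(r"^(?P<op>[^\s:]+): (?P<device>/job:\S+)$")


def parse_placements(log_text):
    """Return {op_name: device_string} from a log_device_placement dump."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT_RE.match(line.strip())
        if match:
            placements[match.group("op")] = match.group("device")
    return placements


# Sample text copied from the output in the answer above.
sample = """Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus
id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
"""

for op, device in parse_placements(sample).items():
    print(op, "->", device)
```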

【Answer 2】:

I work with both TF-2.1 and torch, so I did not want to tie this automatic selection to any particular ML framework. I simply use plain nvidia-smi and os.environ to pick a vacant GPU.

import os
import subprocess


def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
    """Auto set CUDA_VISIBLE_DEVICES to the first vacant GPU.

    :param usage_max: max fraction of GPU utilization for a GPU to count as vacant
    :param mem_max: max fraction of GPU memory in use for a GPU to count as vacant
    :return: None
    """
    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    # Skip the table header; assumes the classic nvidia-smi layout (3 lines per GPU).
    log = subprocess.check_output("nvidia-smi", shell=True).decode().split("\n")[6:-1]
    gpu = 0

    # Maximum number of GPUs to check; 8 is enough for most machines
    for i in range(8):
        idx = i*3 + 2
        if idx > len(log) - 1:
            break
        inf = log[idx].split("|")
        if len(inf) < 4:  # not a GPU status row (inf[3] is read below)
            break
        usage = int(inf[3].split("%")[0].strip())
        mem_now = int(inf[2].split("/")[0].strip()[:-3])  # drop the "MiB" suffix
        mem_all = int(inf[2].split("/")[1].strip()[:-3])
        # print("GPU-%d : Usage:[%d%%]" % (gpu, usage))
        if usage < 100*usage_max and mem_now < mem_max*mem_all:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
            print("\nAuto choosing vacant GPU-%d : Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]\n" %
                  (gpu, mem_now, mem_all, usage))
            return
        print("GPU-%d is busy: Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]" %
              (gpu, mem_now, mem_all, usage))
        gpu += 1
    print("\nNo vacant GPU, use CPU instead\n")
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

If a vacant GPU is found, the function sets CUDA_VISIBLE_DEVICES to that GPU's index (in PCI bus order):

GPU-0 is busy: Memory:[5738MiB/11019MiB] , GPU-Util:[60%]
GPU-1 is busy: Memory:[9688MiB/11019MiB] , GPU-Util:[78%]

Auto choosing vacant GPU-2 : Memory:[1MiB/11019MiB] , GPU-Util:[0%]

Otherwise it sets the variable to -1 so that the CPU is used:

GPU-0 is busy: Memory:[8900MiB/11019MiB] , GPU-Util:[95%]
GPU-1 is busy: Memory:[4674MiB/11019MiB] , GPU-Util:[35%]
GPU-2 is busy: Memory:[9784MiB/11016MiB] , GPU-Util:[74%]

No vacant GPU, use CPU instead

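Note: call this function before importing any ML framework that needs a GPU, so that the GPU is selected automatically. This also makes it easy to set up multiple tasks.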
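Since the question asks for the least loaded GPU rather than the first vacant one, here is an alternative sketch along the same lines. It is not the answer author's code: it swaps the table parsing for nvidia-smi's CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`) and picks the GPU with the lowest memory fraction, using utilization as a tie-breaker. The function names and the ranking rule are my own assumptions.

```python
import os
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV rows produced by QUERY into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, mem_used, mem_total, util = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "mem_used": int(mem_used),    # MiB
            "mem_total": int(mem_total),  # MiB
            "util": int(util),            # percent
        })
    return stats


def least_loaded_gpu(stats):
    """Return the index of the GPU with the lowest memory fraction, then utilization."""
    if not stats:
        return None
    best = min(stats, key=lambda g: (g["mem_used"] / max(g["mem_total"], 1), g["util"]))
    return best["index"]


def select_least_loaded_gpu():
    """Set CUDA_VISIBLE_DEVICES to the least loaded GPU, or to -1 if none is found."""
    try:
        output = subprocess.check_output(QUERY).decode()
        choice = least_loaded_gpu(parse_gpu_stats(output))
    except (OSError, subprocess.CalledProcessError, ValueError):
        choice = None  # nvidia-smi missing, no GPU, or unexpected output
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1" if choice is None else str(choice)
    return choice
```

As with the function above, `select_least_loaded_gpu()` would need to run before TensorFlow or torch is imported. Two jobs started at the same moment can still pick the same GPU, because the reading is only a snapshot.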

【Comments】: