【Question title】: TensorFlow out of Memory error running Inception v3 distributed on 4 machines
【Posted】: 2016-09-19 07:43:43
【Question】:

I am trying to run Inception v3 (https://github.com/tensorflow/models/tree/master/inception) on up to 32 machines.

I see an out-of-memory error when I run it on 4 machines.

Here is the error:

INFO:tensorflow:Started 0 queues for processing input data.
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[2048,1001]
     [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
     [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Traceback (most recent call last):
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 286, in train
    loss_value, step = sess.run([train_op, global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[2048,1001]
     [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
     [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Caused by op u'gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul', defined at:
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 215, in train
    grads = opt.compute_gradients(total_loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 229, in compute_gradients
    return self._opt.compute_gradients(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
    in_grads = _AsList(grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 402, in _L2LossGrad
    return op.inputs[0] * grad
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 754, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 903, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1427, in mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'logits/logits/weights/Regularizer/L2Regularizer/L2Loss', defined at:
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
[elided 1 identical lines from previous traceback]
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 154, in train
    logits = inception.inference(images, num_classes, for_training=True)
  File "/home/ubuntu/indu/models/inception/inception/inception_model.py", line 87, in inference
    scope=scope)
  File "/home/ubuntu/indu/models/inception/inception/slim/inception_model.py", line 326, in inception_v3
    restore=restore_logits)
  File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/indu/models/inception/inception/slim/ops.py", line 300, in fc
    restore=restore)
  File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/indu/models/inception/inception/slim/variables.py", line 290, in variable
    trainable=trainable, collections=collections)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 830, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 673, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 217, in get_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 202, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)

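I am using EC2 G2.8XL instances. Each instance has:

  1. An Intel Xeon E5-2670 (Sandy Bridge) processor
  2. 60 GB of memory
  3. Four GK104GL [GRID K520] GPUs, each with 4 GB of memory
  4. A 10 Gigabit NIC

The machines run Ubuntu 14.04.4 LTS.

I run one worker per GPU, so there are 16 workers in total.

I run one PS per machine, so there are 4 PS tasks in total.

I am using a batch size of 8. (With a batch size of 8, the 4-machine run goes out of memory. With 32 machines, it goes out of memory even with a batch size of 2.)

Installed versions of CUDA and cuDNN: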

ubuntu@ip-172-31-16-180:~$ ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root 322936 Aug 15 2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root 19 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 383336 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root 720192 Aug 15 2015 /usr/local/cuda/lib64/libcudart_static.a

I installed TensorFlow from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl

ubuntu@ip-172-31-16-180:~$ python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.10.0rc0

Can anyone help me figure out how to fix this and run Inception v3 on a cluster of 32 machines?

More info: these are the commands I run on the machines in the cluster:

On machine1:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=3 > /tmp/worker3 2>&1 &


On machine2:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=5 > /tmp/worker5 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=6 > /tmp/worker6 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=7 > /tmp/worker7 2>&1 &


On machine3:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=2 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=8 > /tmp/worker8 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=9 > /tmp/worker9 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=10 > /tmp/worker10 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=11 > /tmp/worker11 2>&1 &


On machine4:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=3 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=12 > /tmp/worker12 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=13 > /tmp/worker13 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=14 > /tmp/worker14 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=15 > /tmp/worker15 2>&1 &
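
The --ps_hosts and --worker_hosts values above follow a regular pattern (one PS per machine on port 2222, one worker per GPU on ports 2230 to 2233), so they can be generated rather than typed by hand, which matters when scaling to 32 machines. The following is only a sketch: the helper name build_host_flags is hypothetical, and it assumes the remaining hosts continue the worker1, worker2, ... naming used above.

```python
def build_host_flags(num_machines, gpus_per_machine=4,
                     ps_port=2222, worker_base_port=2230):
    """Return (ps_hosts, worker_hosts) strings: one PS per machine, one worker per GPU.

    Assumes hosts are named worker1..workerN, as in the commands above.
    """
    machines = ["worker%d" % (i + 1) for i in range(num_machines)]
    ps_hosts = ",".join("%s:%d" % (m, ps_port) for m in machines)
    worker_hosts = ",".join(
        "%s:%d" % (m, worker_base_port + gpu)
        for m in machines
        for gpu in range(gpus_per_machine)
    )
    return ps_hosts, worker_hosts


ps_hosts, worker_hosts = build_host_flags(num_machines=4)
print("--ps_hosts=" + ps_hosts)
print("--worker_hosts=" + worker_hosts)
```

Calling build_host_flags(32) would produce the flags for the 32-machine cluster under the same naming assumption.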

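Update 1:

I tried the following experiments:

Experiment 1:

  • worker1, worker2, worker3 and worker4 on machine1
  • ps1 on machine1, ps2 on machine2, ps3 on machine3, ps4 on machine4.

This is the same as the failing 4-machine configuration, except that the workers on 3 of the 4 machines are removed. The compute load on machine1 is unchanged. The communication load on machine1 (talking to 4 PS tasks) is unchanged. I expected this to run out of memory, but it worked perfectly fine.

Experiment 2:

  • worker1, worker2, worker3 and worker4 on machine1.
  • ps1 (the only PS) on machine2.

This worked, and training was faster than in Experiment 1.

Given this, I wonder why four machines that each use all four GPUs run out of memory.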

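【Comments】:

  • One suggestion: try setting CUDA_VISIBLE_DEVICES=0 for the first task on each machine, CUDA_VISIBLE_DEVICES=1 for the second task, and so on. This changes the GPU naming (each worker task will have a single GPU device called /gpu:0, corresponding to its one visible device), but it should prevent the different TensorFlow processes from interfering with each other.
  • @mrry: That solved the problem. Thank you! Earlier, in inception_distributed_train.py, I had changed with tf.device('/job:worker/task:%d' % FLAGS.task_id): to with tf.device('/gpu:%d' % gpunum):, where gpunum = FLAGS.task_id%4. I wonder why doing that is not equivalent to setting CUDA_VISIBLE_DEVICES=gpunum.
  • I added some theories to my answer... I think there are some issues when multiple TensorFlow processes share the same physical device, but I am not 100% sure which of them would cause the failure you are seeing.

Tags: tensorflow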


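【Answer 1】:

As discussed in the comments, setting CUDA_VISIBLE_DEVICES=i for the i-th task on each machine fixes the problem. This changes the GPU naming (each worker task has a single GPU device called "/gpu:0", corresponding to the one device visible to that task), but it prevents the different TensorFlow processes on the same machine from interfering with each other.

The following commands should work: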

# On machine1:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 &
CUDA_VISIBLE_DEVICES=0 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 &
CUDA_VISIBLE_DEVICES=1 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 &
CUDA_VISIBLE_DEVICES=2 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 &
CUDA_VISIBLE_DEVICES=3 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=3 > /tmp/worker3 2>&1 &


# On machine2:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 &
CUDA_VISIBLE_DEVICES=0 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 &
...
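
The pattern in these commands is that worker task t runs on machine t // 4 + 1 and is given GPU t % 4 on that machine. A minimal sketch of that mapping follows; the function names are hypothetical, and "<flags>" stands in for the unchanged --batch_size, --data_dir, --ps_hosts and --worker_hosts arguments.

```python
def machine_for_task(task_id, gpus_per_machine=4):
    """1-based index of the machine that hosts this worker task."""
    return task_id // gpus_per_machine + 1


def launch_prefix(task_id, gpus_per_machine=4):
    """Environment prefix that exposes exactly one local GPU to the task."""
    return "CUDA_VISIBLE_DEVICES=%d" % (task_id % gpus_per_machine)


for task_id in range(16):
    # "<flags>" is a placeholder for the flags shown in the commands above.
    print("machine%d: %s python imagenet_distributed_train.py <flags> "
          "--job_name=worker --task_id=%d > /tmp/worker%d 2>&1 &"
          % (machine_for_task(task_id), launch_prefix(task_id),
             task_id, task_id))
```

Because each process then sees a single device, the training script can keep referring to "/gpu:0" regardless of which physical GPU it was given.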

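The exact cause is not entirely clear, but there are two possibilities:

  1. In your initial setup, all four worker tasks on each machine would create a device object for every GPU on the machine, so together they may try to allocate 4x as much memory on each device.

  2. With all four GPUs on the system visible to every process, TensorFlow's placer has more options and, depending on your setup/training program, it could inadvertently place ops from two worker tasks on the same GPU.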
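
The first possibility can be put into rough numbers. The sketch below is only illustrative arithmetic, not a measurement: it assumes each process reserves a large share of every GPU it can see, and the 0.9 fraction is a made-up value chosen for the example. It also shows that the allocation that failed in the traceback, a float tensor of shape [2048, 1001], is small, which is consistent with the GPU already being nearly full when the request arrived.

```python
GPU_MEMORY_GIB = 4.0        # GRID K520 memory, from the question
ASSUMED_FRACTION = 0.9      # hypothetical share of a GPU each process reserves


def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (4 bytes per element for DT_FLOAT)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / float(2 ** 20)


def demand_per_gpu_gib(processes_seeing_gpu, fraction=ASSUMED_FRACTION,
                       gpu_memory_gib=GPU_MEMORY_GIB):
    """Total memory that the processes together would try to reserve on one GPU."""
    return processes_seeing_gpu * fraction * gpu_memory_gib


failed_mib = tensor_mib([2048, 1001])
shared = demand_per_gpu_gib(processes_seeing_gpu=4)    # every worker sees every GPU
isolated = demand_per_gpu_gib(processes_seeing_gpu=1)  # CUDA_VISIBLE_DEVICES=i

print("failed allocation: %.1f MiB" % failed_mib)
print("all GPUs visible:  %.1f GiB wanted on a %.1f GiB GPU" % (shared, GPU_MEMORY_GIB))
print("one GPU per task:  %.1f GiB wanted on a %.1f GiB GPU" % (isolated, GPU_MEMORY_GIB))
```

Under these assumptions the shared setup asks for several times the physical memory of each GPU, while the isolated setup stays below it.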


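    【Answer 2】:

    4 GB of GPU memory is a bit low for models that were tuned on GPU cards with 12 GB of memory. A smaller batch size reduces the size of the activations, but not the size of the parameters.

    Once you are sure there is no unnecessary memory use in your model, you can try disabling the cuDNN convolution scratch memory by setting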

    TF_CUDNN_WORKSPACE_LIMIT_IN_MB=0

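    This disables the use of scratch memory in your model. The model will be slower, but hopefully that frees just enough memory for it to run to completion.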

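    【Comments】:

    • Thanks for the answer. Could you take a look at "Update 1" in the question? If the large number of parameters were the problem, shouldn't those experiments have run out of memory as well? Given that, should I still try setting TF_CUDNN_WORKSPACE_LIMIT_IN_MB to 0? If so, is there a way to do that without compiling from source?
    • TF_CUDNN_WORKSPACE_LIMIT_IN_MB is an environment variable. You do not have to compile from source. I agree there is no guarantee that this will fix the problem, but it is a good idea to try it anyway. If you are only slightly short of memory, it might work.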
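
Since the setting is an environment variable, it only has to be present in the environment of the training process when it starts. A minimal sketch of passing it from a Python launcher follows; the helper name env_with_workspace_limit is hypothetical, and the child command here is a stand-in that just echoes the variable, where a real launcher would run imagenet_distributed_train.py with its usual flags.

```python
import os
import subprocess
import sys


def env_with_workspace_limit(limit_mb=0, base_env=None):
    """Copy the environment and set TF_CUDNN_WORKSPACE_LIMIT_IN_MB in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = str(limit_mb)
    return env


# Stand-in child process: prints the variable it received.
child = [sys.executable, "-c",
         "import os; print(os.environ['TF_CUDNN_WORKSPACE_LIMIT_IN_MB'])"]
output = subprocess.check_output(child, env=env_with_workspace_limit(0))
print(output.decode().strip())
```

On a shell command line the equivalent is to prefix the launch command with TF_CUDNN_WORKSPACE_LIMIT_IN_MB=0, in the same way CUDA_VISIBLE_DEVICES is set in the commands above.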