【发布时间】:2016-09-19 07:43:43
【问题描述】:
我正在尝试在多达 32 台机器上运行 Inception v3 (https://github.com/tensorflow/models/tree/master/inception)。
我在 4 台机器上运行它时看到内存不足错误。
这是错误:
INFO:tensorflow:Started 0 queues for processing input data.
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[2048,1001]
[[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
[[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Traceback (most recent call last):
File "imagenet_distributed_train.py", line 65, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "imagenet_distributed_train.py", line 61, in main
inception_distributed_train.train(server.target, dataset, cluster_spec)
File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 286, in train
loss_value, step = sess.run([train_op, global_step])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[2048,1001]
[[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
[[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Caused by op u'gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul', defined at:
File "imagenet_distributed_train.py", line 65, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "imagenet_distributed_train.py", line 61, in main
inception_distributed_train.train(server.target, dataset, cluster_spec)
File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 215, in train
grads = opt.compute_gradients(total_loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 229, in compute_gradients
return self._opt.compute_gradients(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
in_grads = _AsList(grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 402, in _L2LossGrad
return op.inputs[0] * grad
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 754, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 903, in _mul_dispatch
return gen_math_ops.mul(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1427, in mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()
...which was originally created as op u'logits/logits/weights/Regularizer/L2Regularizer/L2Loss', defined at:
File "imagenet_distributed_train.py", line 65, in <module>
tf.app.run()
[elided 1 identical lines from previous traceback]
File "imagenet_distributed_train.py", line 61, in main
inception_distributed_train.train(server.target, dataset, cluster_spec)
File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 154, in train
logits = inception.inference(images, num_classes, for_training=True)
File "/home/ubuntu/indu/models/inception/inception/inception_model.py", line 87, in inference
scope=scope)
File "/home/ubuntu/indu/models/inception/inception/slim/inception_model.py", line 326, in inception_v3
restore=restore_logits)
File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
return func(*args, **current_args)
File "/home/ubuntu/indu/models/inception/inception/slim/ops.py", line 300, in fc
restore=restore)
File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
return func(*args, **current_args)
File "/home/ubuntu/indu/models/inception/inception/slim/variables.py", line 290, in variable
trainable=trainable, collections=collections)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 830, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 673, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 217, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 202, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
我正在使用 EC2 G2.8XL 实例。这些实例具有:
- 英特尔至强 E5-2670(桑迪桥)处理器
- 60 GB 内存和
- 四个 GK104GL [GRID K520] GPU,每个上都有 4 GB 内存。
- 10 千兆网卡
我在这些机器上运行 Ubuntu 14.04.4 LTS。
我在每个 GPU 上运行一名工作人员。因此,总共有 16 名工人。
我在每台机器上运行一个 PS。所以,总共 4 PS。
我使用的批量大小为 8。(批量大小为 8 时,4 台机器内存不足。即使批量大小为 2,32 台机器内存不足)。
CUDA 和 cuDNN 的安装版本:
ubuntu@ip-172-31-16-180:~$ ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root 322936 Aug 15 2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root 19 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 383336 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root 720192 Aug 15 2015 /usr/local/cuda/lib64/libcudart_static.a
我从https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl安装了TensorFlow
ubuntu@ip-172-31-16-180:~$ python -c "import tensorflow; print(tensorflow.version)"
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.10.0rc0
谁能帮我弄清楚如何解决这个问题并在一个有 32 台机器的集群中运行 Inception v3?
更多信息: 以下是我在集群中的机器上执行的命令:
On machine1:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=3 > /tmp/worker3 2>&1 &
On machine2:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=5 > /tmp/worker5 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=6 > /tmp/worker6 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=7 > /tmp/worker7 2>&1 &
On machine3:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=2 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=8 > /tmp/worker8 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=9 > /tmp/worker9 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=10 > /tmp/worker10 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=11 > /tmp/worker11 2>&1 &
On machine4:
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=3 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=12 > /tmp/worker12 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=13 > /tmp/worker13 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=14 > /tmp/worker14 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=15 > /tmp/worker15 2>&1 &
更新 1:
我尝试了以下实验:
实验一:
- 机器 1 上的 Worker1、worker2、worker3 和 worker4
- ps1 或 machine1,machine2 上的 ps2,machine3 上的 ps3,machine4 上的 ps4。
这与失败的 4 台机器配置相同,只是移除了 4 台机器中的 3 台的工人。 machine1 上的工作负载保持不变。 machine1 上的通信负载(与 4 个 ps 通话)保持不变。 我预计这会耗尽内存,但这非常好。
实验 2:
- 机器 1 上的 Worker1、worker2、worker3 和 worker4。
- 机器 2 上的 ps1(仅 ps)。
这很有效,学习速度比实验 1 快。
鉴于此,我想知道为什么使用所有四个 GPU 的四台机器内存不足。
【问题讨论】:
-
一个建议:尝试为每台机器上的第一个任务设置
CUDA_VISIBLE_DEVICES=0,为第二个任务设置CUDA_VISIBLE_DEVICES=1,等等。这将改变GPU命名(每个工作任务将有一个GPU设备称为/gpu:0,对应单个可见设备),但要防止不同的TensorFlow进程相互干扰。 -
@mrry:这解决了问题。谢谢!我之前在 inception_distributed_train.py 中将
with tf.device('/job:worker/task:%d' % FLAGS.task_id):更改为with tf.device('/gpu:%d' % gpunum):。 Gpunum 是gpunum = FLAGS.task_id%4。想知道为什么这样做和使用CUDA_VISIBLE_DEVICES=gpunum做同样的事情不一样。 -
我在回答中添加了一些理论...我认为当多个 TensorFlow 进程共享同一个物理设备时会出现一些问题,但我不是 100% 会导致您看到的失败的原因.
标签: tensorflow