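【Posted】: 2021-04-01 17:16:21
【Question】:
I am trying to train a sequential model on a Colab Pro GPU runtime (up to 25 GB of memory). Following the instructions here, I set the memory limit to 22 GB. My code and logs are below.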
import tensorflow as tf

mem_limit = 22000  # GPU memory cap, in MB

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Create a single logical device on the first GPU, capped at mem_limit MB
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=mem_limit)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
According to this log, the limit appears to be applied:
Dec 22, 2020, 7:57:15 PM WARNING 2020-12-23 01:57:15.673093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22000 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
Dec 22, 2020, 7:57:15 PM WARNING 2020-12-23 01:57:15.673030: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
However, whenever a statement is executed, it tries to allocate 37 GB of memory and the runtime crashes. Here is the log:
Dec 22, 2020, 8:01:01 PM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Dec 22, 2020, 8:00:47 PM WARNING tcmalloc: large alloc 37200994304 bytes == 0x7f48b828a000 @ 0x7f5249f5a001 0x7f52414564ff 0x7f52414a6ab8 0x7f52414aabb7 0x7f5241549003 0x50a4a5 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x5161c5 0x50a12f 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x508ec2 0x594a01 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd
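For scale, the size in the tcmalloc warning above can be converted to gigabytes. This is only an arithmetic sketch: the 25 GB figure is the runtime limit quoted at the top of this question (assumed here to be decimal GB), and the variable names are made up for illustration. Note that tcmalloc is a host-side allocator, so this request is for system RAM, which the GPU `memory_limit` setting does not cap.

```python
# Size reported by "tcmalloc: large alloc" in the log above.
alloc_bytes = 37200994304

alloc_gb = alloc_bytes / 1e9      # decimal gigabytes
alloc_gib = alloc_bytes / 2**30   # binary gibibytes

# Runtime memory quoted in the question (assumption: decimal GB).
runtime_gb = 25

print(f"{alloc_gb:.1f} GB ({alloc_gib:.1f} GiB) requested in one allocation")
print("exceeds runtime memory:", alloc_gb > runtime_gb)
```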
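My dataset is large and may need more than 128 GB of memory. Is there a way to limit the amount of memory TF uses? I can accept a longer execution time if it comes to that.
Thanks in advance.
【Discussion】:
Tags: tensorflow out-of-memory google-colaboratory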
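One common way to trade execution time for memory, sketched here under the assumption that the samples can be read one at a time, is to stream fixed-size batches instead of materializing the whole dataset. The helper names `iter_batches` and `read_samples` are hypothetical, and `read_samples` is only a stand-in for whatever actually reads records from disk.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(samples: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, holding one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def read_samples(n: int) -> Iterator[int]:
    # Hypothetical stand-in for lazily reading one record at a time from disk.
    for i in range(n):
        yield i


batches = list(iter_batches(read_samples(10), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In TensorFlow, a generator of this kind could presumably be wrapped with `tf.data.Dataset.from_generator`, so that peak host memory scales with the batch size rather than with the dataset size.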