Google Colab Pro crashed while allocating large memory

【Question title】: Google Colab Pro crashed while allocating large memory
【Posted】: 2021-04-01 17:16:21
【Question description】:

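I am trying to train a sequential model on a Colab Pro GPU runtime (25 GB of memory at most). Following the instructions here, I set the memory limit to 22 GB. My code and logs are below.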

import tensorflow as tf

mem_limit = 22000  # memory limit in MB

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=mem_limit)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

According to this log, the limit appears to be applied:

Dec 22, 2020, 7:57:15 PM    WARNING 2020-12-23 01:57:15.673093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22000 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)

Dec 22, 2020, 7:57:15 PM    WARNING 2020-12-23 01:57:15.673030: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

However, when one statement is executed, it always tries to allocate 37 GB of memory and the runtime crashes. Here is the log:

Dec 22, 2020, 8:01:01 PM    INFO    KernelRestarter: restarting kernel (1/5), keep random ports

Dec 22, 2020, 8:00:47 PM    WARNING tcmalloc: large alloc 37200994304 bytes == 0x7f48b828a000 @ 0x7f5249f5a001 0x7f52414564ff 0x7f52414a6ab8 0x7f52414aabb7 0x7f5241549003 0x50a4a5 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x5161c5 0x50a12f 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x508ec2 0x594a01 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd

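My dataset is large and may need more than 128 GB of memory. Is there a way to limit the amount of memory TF uses? I can accept a longer execution time if that is what it takes.

Thanks in advance.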

【Comments】:

Tags: tensorflow out-of-memory google-colaboratory

【Solution 1】:

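I ran into the same problem and had to change my TF code. Setting a maximum GPU memory does not mean TF will find a way to run your code without trying to allocate more than the amount you specified. The cap works for what I would call allocation "units", but if a single operation is very large, it will still crash.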
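A rough way to see whether a single allocation can stay under a cap is to estimate the tensor size beforehand. The sketch below is only an illustration: `tensor_bytes`, `fits`, and the example shape are hypothetical names and values, not taken from the question, and it assumes the limit is expressed in units of 2**20 bytes.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor; 4 bytes per element corresponds to float32."""
    return math.prod(shape) * bytes_per_element


def fits(shape, limit_mb, bytes_per_element=4):
    """True if one tensor of this shape stays under limit_mb (assumed to mean 2**20-byte units)."""
    return tensor_bytes(shape, bytes_per_element) <= limit_mb * 2**20


# Hypothetical shape, chosen only for illustration: 100,000 x 93,000 float32 values.
example_shape = (100_000, 93_000)
print(tensor_bytes(example_shape) / 2**30)  # size in GiB
print(fits(example_shape, limit_mb=22000))  # compared with the 22000 MB cap from the question
```

If an estimate like this exceeds the limit for even one tensor, lowering or raising the cap will not help; the operation itself has to be made smaller.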

So, suppose you have a massive matrix multiplication that cannot fit on the GPU. Colab will crash.
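One way such a multiplication could be restructured, not necessarily what was done here, is to compute the product a block of rows at a time so that only a small part of the result exists at once. The sketch below uses plain Python lists to stay self-contained; `matmul_row_blocks` and `block_rows` are hypothetical names. In TensorFlow the same idea would presumably mean slicing the left operand and calling `tf.matmul` on each slice.

```python
def matmul_row_blocks(a, b, block_rows):
    """Yield the product of a and b a few rows at a time.

    a and b are plain lists of rows. Only block_rows rows of the result
    are materialized per step, so the output's peak size is bounded by
    the block size rather than by the full result.
    """
    b_cols = list(zip(*b))  # columns of b, computed once
    for start in range(0, len(a), block_rows):
        block = a[start:start + block_rows]
        yield [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
               for row in block]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]

# Each block could be written to disk or reduced immediately; here the blocks
# are simply collected to show that the pieces add up to the full product.
result = [row for block in matmul_row_blocks(a, b, block_rows=2) for row in block]
print(result)  # [[25, 28], [57, 64], [89, 100]]
```

The trade-off is the one the question already accepts: more steps and a longer run time in exchange for a smaller peak allocation.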

In my limited experience, you have two options:

Change your settings so the GPU is not used (and accept the performance hit)
Change your code
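For the first option, a minimal sketch is shown below. It assumes the standard CUDA environment variable `CUDA_VISIBLE_DEVICES` is honored and that it is set before TensorFlow is imported; in a Colab session where TensorFlow is already loaded, the runtime would need to be restarted first. The import is guarded only so the sketch also runs where TensorFlow is not installed.

```python
import os

# Must be set before TensorFlow is imported; "-1" hides every CUDA device from this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import tensorflow as tf

    # With no visible CUDA device, the GPU list is expected to be empty
    # and operations are placed on the CPU.
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
except ImportError:
    print("TensorFlow is not installed in this environment")
```

An alternative inside TensorFlow itself is `tf.config.set_visible_devices([], 'GPU')`, which also has to run before the GPU has been initialized.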

【Comments】: