Allocator ran out of memory - how to clear GPU memory from TensorFlow dataset? (Answers)

【Question title】: Allocator ran out of memory - how to clear GPU memory from TensorFlow dataset?
【Posted】: 2022-01-21 16:38:00
【Question】:

Assume a NumPy array X_train of shape (4559552, 13, 22) and the following code:

train_dataset = tf.data.Dataset \
    .from_tensor_slices((X_train, y_train)) \
    .shuffle(buffer_size=len(X_train) // 10) \
    .batch(batch_size)

This runs fine only once. When I rerun it (after slight modifications to X_train), it raises an InternalError because the GPU runs out of memory:

2021-12-19 15:36:58.460497: W tensorflow/core/common_runtime/bfc_allocator.cc:457]
Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.71GiB requested by op _EagerConst

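It seems that the first time, it finds 100% of the GPU memory available, so everything works. On subsequent runs the GPU memory is already almost full, hence the error.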
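As a quick cross-check of the figure in the log, the size of X_train can be estimated from its shape. This is a minimal sketch that assumes the array uses 8-byte float64 elements (NumPy's default, not stated in the question). The helper name array_size_gib is made up for illustration.

```python
import math


def array_size_gib(shape, itemsize):
    """Return the size in GiB of a dense array with the given shape and bytes per element."""
    return math.prod(shape) * itemsize / 2**30


shape = (4559552, 13, 22)  # shape of X_train from the question

# Assumption: float64 (8 bytes per element); float32 is shown for comparison.
size_float64 = array_size_gib(shape, 8)
size_float32 = array_size_gib(shape, 4)

print(f"float64: {size_float64:.2f} GiB, float32: {size_float32:.2f} GiB")
```

With 8-byte elements the result is about 9.7 GiB, in line with the 9.71GiB that op _EagerConst tries to allocate. This suggests the whole array is being copied to the GPU as a single constant, and that casting to float32 (if the precision is acceptable) would halve the footprint.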

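As far as I understand, clearing the GPU memory held by the old train_dataset should be enough to solve the problem, but I couldn't find any way to do that in TensorFlow. Currently the only way to reassign the dataset is to kill the Python kernel and rerun everything from scratch.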

Is there a way to avoid restarting the Python kernel from scratch and instead free the GPU memory so that the new dataset can be loaded into it?

The dataset doesn't need the full GPU memory, so I would consider switching to TFRecord a non-ideal solution here (it comes with additional complexity).

【Comments on the question】:

Along the same lines as the answer given below, you can also try this solution.

Tags: python tensorflow gpu out-of-memory

【Solution 1】:

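Try limiting how much GPU memory TensorFlow takes, as shown here. Note that the snippet below enables memory growth (memory is allocated on demand instead of being mapped up front) rather than setting a hard limit on the total GPU memory: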

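import tensorflow as tf

# Must run before any op initializes the GPU; otherwise TensorFlow raises a RuntimeError.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:  # the setting has to be the same for all GPUs; an empty list is handled
    tf.config.experimental.set_memory_growth(gpu, True)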
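For an actual hard limit, the TensorFlow GPU guide describes configuring a logical device with a memory cap. The following is a minimal sketch of that approach, not something the answer itself posted. The helper name limit_gpu_memory and the 4096 MiB value are arbitrary choices for illustration. As far as I know, this setting is an alternative to memory growth and the two cannot be combined on the same device.

```python
import tensorflow as tf


def limit_gpu_memory(memory_limit_mb):
    """Cap the first visible GPU at memory_limit_mb. Returns False if no GPU is visible."""
    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return False
    # Like memory growth, this must be set before the GPU is initialized,
    # otherwise TensorFlow raises a RuntimeError.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
    )
    return True


# 4096 is an arbitrary example value, not a recommendation.
applied = limit_gpu_memory(4096)
print("Memory limit applied:", applied)
```

Neither setting releases memory that the process has already claimed, so both need to run at the very start of the session, before the first dataset is created.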

【Comments】:

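Interesting. In the link, they say "By default, TensorFlow maps nearly all of the GPU memory of all GPUs". Is it possible to "unmap" the memory on demand, so that we get the best of both worlds?
I think this would help: stackoverflow.com/questions/69031604/…
Yes, I'm aware of that Q&A, but I'd like to avoid the TFRecords overhead for this, since the dataset in this case is smaller than the available GPU memory.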
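One option that avoids TFRecords is to keep the source arrays in host memory by building the dataset inside a CPU device scope, so that the constant created by from_tensor_slices is not placed on the GPU. The following is a minimal sketch under that assumption. It has not been verified against the asker's setup. The helper make_dataset and the small stand-in arrays are hypothetical.

```python
import numpy as np
import tensorflow as tf


def make_dataset(features, labels, batch_size):
    """Build the shuffled, batched dataset with its source tensors pinned to the CPU."""
    # Assumption: creating the dataset under a CPU device scope keeps the
    # eager constant holding the arrays in host memory instead of on the GPU.
    with tf.device("/CPU:0"):
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    buffer_size = max(1, len(features) // 10)
    return dataset.shuffle(buffer_size=buffer_size).batch(batch_size)


# Small stand-in arrays; the question uses X_train of shape (4559552, 13, 22).
X_small = np.zeros((100, 13, 22), dtype=np.float32)
y_small = np.zeros((100,), dtype=np.int32)

train_dataset = make_dataset(X_small, y_small, batch_size=32)
```

The batches are then copied to the GPU only as the model consumes them, so rebuilding train_dataset after modifying the array should not require a second multi-gigabyte GPU allocation.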