How to free GPU memory in PyTorch

【Question title】: How to free GPU memory in PyTorch
【Posted】: 2022-01-27 05:08:24
【Question description】:

I have a list of sentences and I am trying to compute their perplexity with several models, using this code:

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def score(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill( masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input.cuda(), labels=labels.cuda()).loss
    return np.exp(loss.item())


print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
# 4.541251105675365

Most models work fine, but some sentences seem to throw an error:

RuntimeError: CUDA out of memory. Tried to allocate 10.34 GiB (GPU 0; 23.69 GiB total capacity; 10.97 GiB already allocated; 6.94 GiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This makes sense, because some of the sentences are very long. So what I did was add something like try, except RuntimeError, pass.
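A bare try / except RuntimeError / pass also hides errors that are not about memory. The sketch below is one hedged way to skip only out-of-memory failures. `score_or_skip` and `score_fn` are hypothetical names that do not appear in the original code, and matching on the message text is an assumption about how the error is worded. The cleanup runs after the except block because the exception object keeps references to the failing frame's tensors for as long as it is alive.

```python
import gc
import torch


def score_or_skip(score_fn, sentence):
    """Return score_fn(sentence), or None if it ran out of GPU memory."""
    oom = False
    try:
        return score_fn(sentence)
    except RuntimeError as err:
        # Only swallow out-of-memory errors; anything else is re-raised.
        if "out of memory" not in str(err):
            raise
        oom = True
    if oom:
        # Clean up outside the except block, once the exception (and the
        # references it holds to the failing frame) is gone.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return None
```

With the function from the question this could be called as score_or_skip(lambda s: score(model, tokenizer, s), sentence). Other CUDA errors are deliberately re-raised: an illegal memory access generally leaves the CUDA context unusable, so the process has to be restarted rather than continued.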

This seemed to work until around sentence 210, and then it just outputs this error:

CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

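I found a thread with a lot of discussion and ideas about this error; some of it was about a potentially faulty GPU. But I know my GPU works, because this exact code works for other models. Another thread talks about batch size, which is why I thought the problem might be related to freeing memory.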

I tried running torch.cuda.empty_cache() after every epoch to free memory, as suggested in another thread, but it did not help (the same error was raised).
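torch.cuda.empty_cache() only returns cached blocks that are no longer in use, so it cannot shrink what a single forward pass needs. Because score builds one masked copy of the sentence per token, the batch itself grows with sentence length. As a hedged sketch (not part of the original post), the copies could be processed in chunks. `masked_chunks`, `score_chunked`, `chunk_size` and `device` are names introduced here for illustration.

```python
import math
import torch


def masked_chunks(tensor_input, mask_token_id, chunk_size=32):
    """Yield (masked_input, labels) for at most chunk_size masked positions at a time."""
    n = tensor_input.size(-1)
    # Positions 1 .. n-2: the first and last token (special tokens) are never
    # masked, matching the mask construction in the question.
    for start in range(1, n - 1, chunk_size):
        positions = torch.arange(start, min(start + chunk_size, n - 1))
        rows = torch.arange(positions.numel())
        chunk = tensor_input.repeat(positions.numel(), 1)
        masked = chunk.clone()
        masked[rows, positions] = mask_token_id
        labels = torch.full_like(chunk, -100)
        labels[rows, positions] = chunk[rows, positions]
        yield masked, labels


def score_chunked(model, tokenizer, sentence, chunk_size=32, device="cuda"):
    tensor_input = tokenizer.encode(sentence, return_tensors="pt")
    total, count = 0.0, 0
    with torch.inference_mode():
        for masked, labels in masked_chunks(tensor_input, tokenizer.mask_token_id, chunk_size):
            loss = model(masked.to(device), labels=labels.to(device)).loss
            # Each row has exactly one labelled token, so weighting the chunk's
            # mean loss by its row count reproduces the mean over all rows.
            total += loss.item() * masked.size(0)
            count += masked.size(0)
    return math.exp(total / count)
```

A smaller chunk_size lowers peak memory at the cost of more forward passes. The sketch assumes the sentence has at least one token between the special tokens; an empty input would leave count at zero.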

Update: I filtered out sentences with length over 550, and this seems to remove the CUDA error: an illegal memory access was encountered error.
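One possible explanation, offered as an assumption rather than something this thread confirms: BERT-style models have a fixed number of position embeddings, and an input longer than that limit indexes past the end of the embedding table. On the CPU this shows up as a plain IndexError, while on the GPU an out-of-range index can surface as an asynchronous CUDA error. The toy nn.Embedding below only stands in for a model's position embeddings; `fits_model` and `max_positions` are hypothetical names.

```python
import torch
from torch import nn


def fits_model(input_ids, max_positions):
    """True if the tokenized input is no longer than the position table."""
    return input_ids.size(-1) <= max_positions


# Toy stand-in for a position-embedding table with 8 slots.
max_positions = 8
position_embeddings = nn.Embedding(max_positions, 4)

short_ids = torch.zeros(1, 6, dtype=torch.long)
long_ids = torch.zeros(1, 12, dtype=torch.long)

for ids in (short_ids, long_ids):
    positions = torch.arange(ids.size(-1))
    if fits_model(ids, max_positions):
        position_embeddings(positions)  # fine: every index is < max_positions
    else:
        try:
            position_embeddings(positions)
        except IndexError as err:
            print("too long for the model:", err)
```

With a Hugging Face model the limit is usually available as model.config.max_position_embeddings, and tokenizer.encode accepts truncation=True together with max_length to cut inputs down before they reach the model.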

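【Comments】:

How are you testing the different models? Do you run one program per model, or simply loop over them in a single program?
@LucaClissa Honestly, I tried both. I have about 11 models to test, and 3 of them threw this error. For the remaining 8, I just looped over them and they worked fine.
I see. I will try to summarize my experience with similar problems in an answer below.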

Tags: python memory pytorch huggingface-transformers

【Solution 1】:

You need to call gc.collect() before torch.cuda.empty_cache(). I also move the model to the CPU and then delete the model and its checkpoint. Try whatever works for you:

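import gc
import torch

# `model` and `checkpoint` are assumed to already exist in this scope.
model.cpu()               # move the weights off the GPU
del model, checkpoint     # drop the Python references
gc.collect()              # collect unreachable objects, including reference cycles
torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver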
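When several models are evaluated in one process, the same cleanup can be applied between models. A minimal sketch, assuming each model is built by a zero-argument factory: `run_models`, `factories` and `evaluate` are hypothetical names, and the tiny nn.Linear models only stand in for real checkpoints.

```python
import gc
import torch
from torch import nn


def run_models(factories, evaluate):
    """Build, evaluate and release one model at a time."""
    results = {}
    for name, make_model in factories.items():
        model = make_model()
        try:
            results[name] = evaluate(model)
        finally:
            # Drop this function's reference, then let the allocator
            # release whatever it had cached for the model.
            del model
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
    return results


factories = {
    "small": lambda: nn.Linear(2, 1),
    "large": lambda: nn.Linear(4, 3),
}
param_counts = run_models(factories, lambda m: sum(p.numel() for p in m.parameters()))
print(param_counts)
```

The cleanup only helps if nothing else (a list of outputs, a notebook variable) still refers to the model or to tensors it produced.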

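【Comments】:

I'm a bit confused: when am I supposed to delete the model? Or call gc.collect()? Should I do it every few sentences?
^ I think you should do it whenever the error occurs.
So deleting the model while iterating over the sentences does not seem to be an option, because then I would have no model left to evaluate the remaining sentences. As you suggested, I tried running gc.collect() before torch.cuda.empty_cache(), but it did not seem to do anything (it still fails at about 210 sentences with the same error).
You should be able to run inference on millions of sentences with a single model, not just 210. Something seems wrong. I had to delete the model because I had to load a new model for a different inference task. Mine is a different use case.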

【Solution 2】:

I don't have an exact answer, but I can share some troubleshooting techniques I have used in similar situations... I hope they help.

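1. First, CUDA errors are unfortunately vague at times, so you should consider running your code on the CPU to see whether something else is actually going on.
2. If the problem is about memory, here are two custom utilities I use: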

from torch import cuda


def get_less_used_gpu(gpus=None, debug=False):
    """Inspect reserved (cached) and allocated memory on the given GPUs and return the id of the least used device."""
    warn = ''  # default, so print(warn) below never hits an undefined name
    if gpus is None:
        warn = 'Falling back to default: all gpus'
        gpus = range(cuda.device_count())
    elif isinstance(gpus, str):
        gpus = [int(el) for el in gpus.split(',')]

    # check gpus arg VS available gpus
    sys_gpus = list(range(cuda.device_count()))
    if len(gpus) > len(sys_gpus):
        # build the message before overwriting gpus, so it reports the requested count
        warn = f'WARNING: Specified {len(gpus)} gpus, but only {cuda.device_count()} available. Falling back to default: all gpus.\nIDs:\t{sys_gpus}'
        gpus = sys_gpus
    elif set(gpus).difference(sys_gpus):
        # keep the valid ids and replace each invalid one with an unused system GPU, while any remain
        available_gpus = set(gpus).intersection(sys_gpus)
        unavailable_gpus = set(gpus).difference(sys_gpus)
        unused_gpus = set(sys_gpus).difference(gpus)
        gpus = list(available_gpus) + list(unused_gpus)[:len(unavailable_gpus)]
        warn = f'GPU ids {unavailable_gpus} not available. Falling back to {len(gpus)} device(s).\nIDs:\t{list(gpus)}'

    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in gpus:
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    min_allocated = min(cur_allocated_mem, key=cur_allocated_mem.get)
    if debug:
        print(warn)
        print('Current allocated memory:', {f'cuda:{k}': v for k, v in cur_allocated_mem.items()})
        print('Current reserved memory:', {f'cuda:{k}': v for k, v in cur_cached_mem.items()})
        print('Maximum allocated memory:', {f'cuda:{k}': v for k, v in max_allocated_mem.items()})
        print('Maximum reserved memory:', {f'cuda:{k}': v for k, v in max_cached_mem.items()})
        print('Suggested GPU:', min_allocated)
    return min_allocated


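def free_memory(to_delete: list, debug=False):
    """Remove the named variables from the caller's namespace, then release cached GPU memory."""
    import gc
    import inspect
    calling_namespace = inspect.currentframe().f_back
    if debug:
        print('Before:')
        get_less_used_gpu(debug=True)

    # Note: popping from f_locals reliably removes names only when the caller is
    # module-level or notebook code; locals of a calling function generally
    # cannot be removed this way.
    for _var in to_delete:
        calling_namespace.f_locals.pop(_var, None)
    # one collection pass after all names are dropped is enough
    gc.collect()
    cuda.empty_cache()
    if debug:
        print('After:')
        get_less_used_gpu(debug=True)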

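2.1 free_memory combines gc.collect and cuda.empty_cache to delete chosen objects from the namespace and free their memory (you pass the list of variable names as the to_delete argument). This is useful because unused objects may still be occupying memory. For example, suppose you loop over 3 models: by the second iteration, the first model may still hold some GPU memory (I don't know why, but I have seen this happen in notebooks, and the only fixes I found were restarting the notebook or releasing the memory explicitly). That said, this is not always practical, because you need to know which variables are holding GPU memory... and that is not always the case, especially when many gradients are attached to the model's internals. One more thing you could try is using with torch.no_grad(): instead of with torch.inference_mode():; for inference they should behave the same, but it may be worth a try...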

2.2 If you have a multi-GPU environment, you could consider switching to the least used GPU each time, using the other utility, get_less_used_gpu.
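The suggestion in 2.1 to swap torch.inference_mode() for torch.no_grad() can be tried in isolation. The sketch below uses a toy nn.Linear (an assumption for illustration, not the model from the question). Both context managers skip gradient tracking; the visible difference is that inference_mode marks its outputs as inference tensors.

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)
x = torch.ones(2, 3)

with torch.no_grad():
    y_no_grad = layer(x)

with torch.inference_mode():
    y_inference = layer(x)

# Neither output is attached to an autograd graph.
print(y_no_grad.requires_grad, y_inference.requires_grad)
# Only the inference_mode output is an inference tensor.
print(y_no_grad.is_inference(), y_inference.is_inference())
```

Since both outputs carry no graph, switching between the two is unlikely to change memory use much, but it is a cheap experiment.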

In addition, you can try tracking GPU usage to see when the error occurs and debug from there. The best and simplest tool I can suggest is nvtop, if you are on a Linux platform.
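GPU usage can also be logged from inside the script with PyTorch's own counters. A minimal sketch: `gpu_memory_mib` is a hypothetical helper name, and it returns None on machines without CUDA so the surrounding loop still runs.

```python
import torch


def gpu_memory_mib(device=0):
    """Return allocated/reserved/peak memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated(device) / mib,
        "reserved": torch.cuda.memory_reserved(device) / mib,
        "peak_allocated": torch.cuda.max_memory_allocated(device) / mib,
    }


for i, sentence in enumerate(["a short one", "a somewhat longer sentence"]):
    # ... score the sentence here ...
    print(i, len(sentence), gpu_memory_mib())
```

Logging the sentence length next to the counters makes it easier to see whether the failure follows input size or accumulates over iterations.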

Hope something here is useful :)

【Comments】: