Python pytorch function consumes memory excessively quickly

【Question title】: Python pytorch function consumes memory excessively quickly
【Posted】: 2021-05-11 03:45:17
【Question description】:

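I am writing a function with pytorch that feeds inputs through a transformer model and then condenses the last embedding layer by computing the average along a specific axis (using a subset of indices defined by a mask). Since the output of the model is very, very large, I need to process the inputs in batches.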

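My question is not about the logic of this function, as I believe I have the correct implementation. My issue is that the function I wrote consumes memory excessively quickly, which makes it practically unusable.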

Here is my function:

def get_chunk_embeddings(encoded_dataset, batch_size):
  chunk_embeddings = torch.empty([0,768])
  for i in range(len(encoded_dataset['input_ids'])//batch_size):
    input_ids = encoded_dataset['input_ids'][i*batch_size:i*batch_size + batch_size]
    attention_mask = encoded_dataset['attention_mask'][i*batch_size:i*batch_size + batch_size]
    embeddings = model.forward(input_ids=input_ids, attention_mask=attention_mask)['last_hidden_state']
    embeddings = embeddings * attention_mask[:,:,None]
    embeddings = embeddings.sum(dim=1)/attention_mask.sum(dim=1)[:,None]
    chunk_embeddings = torch.cat([chunk_embeddings, embeddings],0)
  return chunk_embeddings

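Now let's talk about memory (the numbers below assume I pass a batch_size of 8):

I am using Google Colab, where I have about 25 GB of RAM available
model is a BERT model that takes up 413 MB
encoded_dataset consumes 0.48 GB
input_ids consumes 0.413 MB
attention_mask consumes 4.096 KB
embeddings consumes 12.6 MB at its peak
chunk_embeddings grows by 0.024576 MB per iteration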

So by my understanding, I should be able to let chunk_embeddings grow to: 25 GB - 413 MB - 0.48 GB - 0.413 MB - 4.096 KB - 12.6 MB ~= 24 GB. That is enough for nearly 1 million iterations.
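The per-iteration figure and the iteration estimate above can be checked with a little arithmetic. This is only a rough sketch: it assumes float32 values (4 bytes each) and decimal units (1 GB = 10^9 bytes), neither of which is stated explicitly in the question.

```python
# Rough check of the estimate above.
# Assumptions (not stated in the question): float32 values, decimal units.
batch_size = 8
hidden_size = 768
bytes_per_float32 = 4

# Each iteration appends a (batch_size, hidden_size) block to chunk_embeddings.
growth_per_iteration = batch_size * hidden_size * bytes_per_float32  # in bytes

free_bytes = 24e9  # the ~24 GB left over after the fixed costs
iterations = free_bytes / growth_per_iteration

print(growth_per_iteration)  # 24576 bytes, i.e. 0.024576 MB
print(int(iterations))       # 976562, i.e. nearly 1 million
```

So the estimate is sound as far as the returned tensor's own data goes; it just does not account for anything else that the tensor keeps alive.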

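Here I will walk through an example of what I am experiencing:

Before running my function, Google Colab tells me I have plenty of memory

Now, for the sake of example, I will run the function for only 3 iterations. To be explicit, I put this at the end of my for loop: if (i == 2): return chunk_embeddings
Now I run the code val = get_chunk_embeddings(train_encoded_dataset, 8). Even with only 3 iterations, I have consumed almost 5.5 GB of RAM.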

Why is this happening? Also, after I return from the function, all the local variables should be deleted, and there is no way val is that big.

Can someone tell me what I am doing wrong or not understanding? Please let me know if any more information is required.

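【Comments】:

Is your return chunk_embeddings supposed to be inside the for loop?
Sorry, good catch. It shouldn't be. I've edited it.
Do you need to backpropagate from chunk_embeddings later? Currently, when you cat embeddings in each iteration, the whole computation graph of each forward pass is retained to allow this. If you don't, then you can call detach() on embeddings before the cat.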
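The detach() suggestion in the last comment can be sketched as follows. This is a minimal illustration, not the asker's code: toy_model is a hypothetical stand-in for the BERT model, and the shapes are made up.

```python
import torch
from torch import nn

# Hypothetical stand-in for the BERT model: any module with trainable weights.
toy_model = nn.Linear(16, 4)

collected = torch.empty([0, 4])
for _ in range(3):
    batch = torch.randn(8, 16)
    out = toy_model(batch)  # out.requires_grad is True; a graph is attached
    out = out.detach()      # same values, but no reference to the graph
    collected = torch.cat([collected, out], 0)

print(collected.shape)          # torch.Size([24, 4])
print(collected.requires_grad)  # False
```

Because the detached tensor no longer references the graph, each forward pass's intermediate buffers can be released as soon as the loop moves on.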

Tags: python memory-management memory-leaks pytorch ram

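【Solution 1】:

To expand upon @GoodDeeds' comment: by default, computations in torch.nn modules (models) build a computation graph and track gradients (unless you are using with torch.no_grad() or something similar). That means that in every iteration of your loop, the tensor embeddings keeps a reference to the computation graph that produced it (through its grad_fn). That graph is likely much larger than embeddings itself, because it holds the intermediate values of every layer that a backward pass would need. Next, because you use torch.cat, chunk_embeddings ends up referencing embeddings together with its graph. After a few iterations, chunk_embeddings keeps the graphs of all forward passes alive, which is where your memory is going. There are a couple of solutions: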

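If you need to backpropagate through the chunk embeddings (i.e. training), you should move the loss computation and the optimizer step inside the loop, so that each iteration's graph is freed automatically after its backward pass.
If this function is only used during inference, you can disable gradient tracking entirely with torch.no_grad() (this should also speed up the computation slightly), or you can call detach() on embeddings in each iteration, as suggested in the comments.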
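For the first option (training), the idea of keeping the loss and the optimizer step inside the loop might look like the sketch below. Everything in it is hypothetical and chosen only for illustration: the tiny linear model, the random data, the MSE loss and the SGD optimizer are not part of the question.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny model, random data, and an MSE objective.
toy_model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(24, 16)
targets = torch.randn(24, 4)
batch_size = 8

losses = []
for start in range(0, len(inputs), batch_size):
    x = inputs[start:start + batch_size]
    y = targets[start:start + batch_size]

    optimizer.zero_grad()
    loss = loss_fn(toy_model(x), y)
    loss.backward()   # the graph's saved buffers are freed after this call
    optimizer.step()

    # .item() stores a plain Python float, so no graph is kept in the list.
    losses.append(loss.item())

print(len(losses))  # 3
```

The point of the design is that nothing carrying a graph survives from one iteration to the next; only plain numbers are accumulated.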

Example using torch.no_grad():

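import torch

# model is the BERT model from the question, assumed to be defined globally.
def get_chunk_embeddings(encoded_dataset, batch_size):
  # Inside no_grad() no computation graph is built, so nothing is kept alive between iterations.
  with torch.no_grad():
    chunk_embeddings = torch.empty([0,768])
    for i in range(len(encoded_dataset['input_ids'])//batch_size):
      input_ids = encoded_dataset['input_ids'][i*batch_size:i*batch_size + batch_size]
      attention_mask = encoded_dataset['attention_mask'][i*batch_size:i*batch_size + batch_size]
      # Token-level output of the last layer, one 768-dim vector per token.
      embeddings = model.forward(input_ids=input_ids, attention_mask=attention_mask)['last_hidden_state']
      # Zero out padding positions, then average over the unmasked tokens only.
      embeddings = embeddings * attention_mask[:,:,None]
      embeddings = embeddings.sum(dim=1)/attention_mask.sum(dim=1)[:,None]
      chunk_embeddings = torch.cat([chunk_embeddings, embeddings],0)
  return chunk_embeddings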
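A quick way to confirm that a function like the one above no longer drags a graph along is to inspect requires_grad and grad_fn on its output. The sketch below shows the difference on a hypothetical toy module (toy_model is a stand-in, not the BERT model from the question).

```python
import torch
from torch import nn

# Hypothetical stand-in for the model in the question.
toy_model = nn.Linear(16, 4)
batch = torch.randn(8, 16)

tracked = toy_model(batch)        # default: autograd records the graph
with torch.no_grad():
    untracked = toy_model(batch)  # no graph is recorded here

print(tracked.requires_grad, tracked.grad_fn is not None)      # True True
print(untracked.requires_grad, untracked.grad_fn is not None)  # False False
```

If the returned tensor reports requires_grad as False and grad_fn as None, it holds only its own data.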
