【问题标题】:Running out of memory with pytorchpytorch 内存不足
【发布时间】:2021-10-07 23:20:07
【问题描述】:

我正在尝试使用 huggingface 的 wav2vec 训练模型进行音频分类。我不断收到此错误:

The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: name, emotion, path.
***** Running training *****
  Num examples = 2708
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 42
 [ 2/42 : < :, Epoch 0.02/1]
Step    Training Loss   Validation Loss

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "<ipython-input-81-dd9fe3ea0f13>", line 77, in forward
    return_dict=return_dict,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1073, in forward
    return_dict=return_dict,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 732, in forward
    hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 574, in forward
    hidden_states = hidden_states + self.feed_forward(self.final_layer_norm(hidden_states))
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 510, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)

我使用的是 AWS ubuntu 深度学习 AMI ec2。

我一直在研究这个。我已经试过了:

  • 减少批量大小(我想要 4 个,但我已经减少到 1 个,但错误没有变化)
  • 添加:
    import gc
    gc.collect()
    torch.cuda.empty_cache()
    
  • 删除我的数据集中所有超过 6 秒的 wav 文件

还有什么我可以做的吗?我在一个安装了 105 GiB 的 p2.8xlarge 数据集上。

运行torch.cuda.memory_summary(device=None, abbreviated=False) 给了我:

|===========================================================================|\n|                  PyTorch CUDA memory summary, device ID 0                 |\n|---------------------------------------------------------------------------|\n|            CUDA OOMs: 3            |        cudaMalloc retries: 4         |\n|===========================================================================|\n|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |\n|---------------------------------------------------------------------------|\n| Allocated memory      |    7550 MB |   10852 MB |  209624 MB |  202073 MB |\n|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |\n|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |\n|---------------------------------------------------------------------------|\n| Active memory         |    7550 MB |   10852 MB |  209624 MB |  202073 MB |\n|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |\n|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |\n|---------------------------------------------------------------------------|\n| GPU reserved memory   |   10936 MB |   10960 MB |   63236 MB |   52300 MB |\n|       from large pool |   10928 MB |   10954 MB |   63124 MB |   52196 MB |\n|       from small pool |       8 MB |      98 MB |     112 MB |     104 MB |\n|---------------------------------------------------------------------------|\n| Non-releasable memory |  443755 KB |    1309 MB |  155426 MB |  154992 MB |\n|       from large pool |  443551 KB |    1306 MB |  155081 MB |  154648 MB |\n|       from small pool |     204 KB |      12 MB |     344 MB |     344 MB |\n|---------------------------------------------------------------------------|\n| Allocations           |    1940    |    2622    |   32288    |   30348    |\n|       from large pool |    1036    |    1618    |   21855    |   20819    |\n|       from small pool |     904    |    1203    |   10433    |    9529    |\n|---------------------------------------------------------------------------|\n| Active allocs         |    1940    |    2622    |   32288    |   30348    |\n|       from large pool |    1036    |    1618    |   21855    |   20819    |\n|       from small pool |     904    |    1203    |   10433    |    9529    |\n|---------------------------------------------------------------------------|\n| GPU reserved segments |     495    |     495    |    2169    |    1674    |\n|       from large pool |     491    |     491    |    2113    |    1622    |\n|       from small pool |       4    |      49    |      56    |      52    |\n|---------------------------------------------------------------------------|\n| Non-releasable allocs |     179    |     335    |   15998    |   15819    |\n|       from large pool |     165    |     272    |   12420    |   12255    |\n|       from small pool |      14    |      63    |    3578    |    3564    |\n|===========================================================================|\n'

在仅将数据减少到长度小于 tahn 2 秒的输入之后,它会进一步训练,但仍然会出现以下错误:

The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running training *****
  Num examples = 1411
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 22
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 [11/22 01:12 < 01:28, 0.12 it/s, Epoch 0.44/1]
Step    Training Loss   Validation Loss Accuracy
10  2.428100    2.257138    0.300283
The following columns in the evaluation set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running Evaluation *****
  Num examples = 353
  Batch size = 32
Saving model checkpoint to trainingArgs/checkpoint-10
Configuration saved in trainingArgs/checkpoint-10/config.json
Model weights saved in trainingArgs/checkpoint-10/pytorch_model.bin
Configuration saved in trainingArgs/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    378             with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 379                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    380                 return

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
    498         num_bytes = storage.size() * storage.element_size()
--> 499         zip_file.write_record(name, storage.data_ptr(), num_bytes)
    500 

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1334                     self.control = self.callback_handler.on_step_end(args, self.state, self.control)
   1335 
-> 1336                     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1337                 else:
   1338                     self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1441 
   1442         if self.control.should_save:
-> 1443             self._save_checkpoint(model, trial, metrics=metrics)
   1444             self.control = self.callback_handler.on_save(self.args, self.state, self.control)
   1445 

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   1531         elif self.args.should_save and not self.deepspeed:
   1532             # deepspeed.save_checkpoint above saves model/optim/sched
-> 1533             torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
   1534             with warnings.catch_warnings(record=True) as caught_warnings:
   1535                 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    378             with _open_zipfile_writer(opened_file) as opened_zipfile:
    379                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 380                 return
    381         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    382 

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
    257 
    258     def __exit__(self, *args) -> None:
--> 259         self.file_like.write_end_of_file()
    260         self.buffer.flush()
    261 

RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 1849920000 vs 1849919888

当我在笔记本中运行!free 时,我得到:

The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
              total        used        free      shared  buff/cache   available
Mem:      503392908     6223452   478499292      346492    18670164   492641984
Swap:             0           0           0

对于训练代码,我基本上是以运行这个 colab 笔记本为例: https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=6M8bNvLLJnG1

我要更改的只是传入的数据/标签,我有意将其放入教程笔记本中使用的相同目录结构中。出于某种原因,教程笔记本运行良好,即使我的数据具有可比较的大小/数量类。

【问题讨论】:

  • 你总共有多少内存?
  • 我添加了一些编辑以希望得到答复。这对您的诊断有帮助吗?
  • 我指的是您的 GPU 设备中的可用内存。能否提供您的培训代码?
  • 如何判断 GPU 内存量?我有一个带有 8 个 GPU 的 p2.8xlarge 实例。至于我的代码,请参阅编辑。谢谢!
  • "'database or disk is full'" 是一个非常明显的错误,你不觉得吗? :-)

标签: deep-learning pytorch huggingface-transformers


【解决方案1】:

您可以在 Pytorch 中使用 DataParallelDistributedDataParallel 框架

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

在这种方法中,模型会在每个设备 (gpu) 上复制,并且数据分布在设备之间

DataParallel 自动拆分您的数据并将工作订单发送到 多个 GPU 上的多个模型。每个模型完成工作后, DataParallel 在返回之前收集并合并结果 你。

这里有更多示例https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

如果模型不适合一个 gpu 的内存,则应采用模型并行方法。

从您现有的模型中,您可能会通过.to('cuda:0').to('cuda:1') 等来判断哪个层位于哪个 gpu 上。

class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,

            self.layer1,
            self.layer2
        ).to('cuda:0')

        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')

        self.fc.to('cuda:1')

    def forward(self, x):
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))

由于您可能会损失性能,因此可能会使用流水线方法,即将输入数据进一步分块成在不同设备上并行运行的批次。

【讨论】:

    猜你喜欢
    • 2020-12-22
    • 1970-01-01
    • 2021-12-16
    • 2023-04-05
    • 1970-01-01
    • 2022-07-15
    • 2018-04-15
    • 1970-01-01
    • 2023-02-09
    相关资源
    最近更新 更多