[Posted]: 2021-10-07 23:20:07
[Question]:
I'm trying to train a model for audio classification using Hugging Face's wav2vec. I keep getting this error:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: name, emotion, path.
***** Running training *****
Num examples = 2708
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 42
[ 2/42 : < :, Epoch 0.02/1]
Step Training Loss Validation Loss
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "<ipython-input-81-dd9fe3ea0f13>", line 77, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1073, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 732, in forward
hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 574, in forward
hidden_states = hidden_states + self.feed_forward(self.final_layer_norm(hidden_states))
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 510, in forward
hidden_states = self.intermediate_act_fn(hidden_states)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)
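A quick cross-check of the batch figures in the log above. This is only a sketch: it assumes 8 GPUs (the count the author gives for the p2.8xlarge in the comments), and it assumes the Trainer rounds the number of batches per epoch up and the number of optimizer steps down. `trainer_schedule` is a hypothetical helper, not part of transformers.

```python
import math


def trainer_schedule(num_examples, per_device_batch, num_gpus, grad_accum, epochs=1):
    """Recompute the total batch size and optimizer step count shown in the log.

    Assumption: batches per epoch are rounded up, optimizer steps are rounded down.
    """
    # Examples consumed by one forward pass across all GPU replicas.
    batch_per_forward = per_device_batch * num_gpus
    # Examples contributing to one optimizer update.
    total_batch = batch_per_forward * grad_accum
    batches_per_epoch = math.ceil(num_examples / batch_per_forward)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return total_batch, steps_per_epoch * epochs


# Figures from the first run: 2708 examples, 4 per device, accumulation 2.
print(trainer_schedule(2708, per_device_batch=4, num_gpus=8, grad_accum=2))  # -> (64, 42)
```

If these assumptions hold, each GPU replica still processes 4 clips per forward pass, so per-GPU memory would depend on the per-device batch size and the clip length rather than on the total of 64.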
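I'm using the AWS Ubuntu Deep Learning AMI on EC2.
I've been researching this. So far I've tried:
- Reducing the batch size (I want 4, but I went down to 1 with no change in the error)
- Adding:
import gc
gc.collect()
torch.cuda.empty_cache()
- Removing all wav files longer than 6 seconds from my dataset
Is there anything else I can do? I'm on a p2.8xlarge with a 105 GiB volume mounted.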
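The length filter in the last item could look roughly like the sketch below. It assumes the clips are uncompressed PCM WAV files that the standard `wave` module can read; `wav_duration_seconds` and `keep_short_clips` are hypothetical helper names, not taken from the notebook.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its header (frames / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def keep_short_clips(paths, max_seconds):
    """Return only the paths whose audio lasts at most max_seconds."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```

Filtering the list of paths before building the dataset, for example `keep_short_clips(all_paths, 6.0)`, avoids deleting anything from disk, so the cutoff can be changed later without re-downloading data.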
Running torch.cuda.memory_summary(device=None, abbreviated=False) gives me:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 3            |        cudaMalloc retries: 4         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| Active memory         |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10936 MB |   10960 MB |   63236 MB |   52300 MB |
|       from large pool |   10928 MB |   10954 MB |   63124 MB |   52196 MB |
|       from small pool |       8 MB |      98 MB |     112 MB |     104 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  443755 KB |    1309 MB |  155426 MB |  154992 MB |
|       from large pool |  443551 KB |    1306 MB |  155081 MB |  154648 MB |
|       from small pool |     204 KB |      12 MB |     344 MB |     344 MB |
|---------------------------------------------------------------------------|
| Allocations           |       1940 |       2622 |      32288 |      30348 |
|       from large pool |       1036 |       1618 |      21855 |      20819 |
|       from small pool |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| Active allocs         |       1940 |       2622 |      32288 |      30348 |
|       from large pool |       1036 |       1618 |      21855 |      20819 |
|       from small pool |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| GPU reserved segments |        495 |        495 |       2169 |       1674 |
|       from large pool |        491 |        491 |       2113 |       1622 |
|       from small pool |          4 |         49 |         56 |         52 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |        179 |        335 |      15998 |      15819 |
|       from large pool |        165 |        272 |      12420 |      12255 |
|       from small pool |         14 |         63 |       3578 |       3564 |
|===========================================================================|
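The sizes quoted in the CUDA out-of-memory message from the traceback can be pulled apart with a small parser. This is a sketch for reading the numbers, not a diagnostic tool; `parse_oom` is a hypothetical helper and the regular expressions assume the exact message wording shown in the traceback.

```python
import re

# The message copied from the traceback in the question.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; "
    "10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in a CUDA OOM message, converted to MiB."""
    sizes = {}
    labelled = re.findall(
        r"([\d.]+) (KiB|MiB|GiB) (total capacity|already allocated|free|reserved)", message
    )
    for value, unit, label in labelled:
        sizes[label] = float(value) * UNIT_TO_MIB[unit]
    requested = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", message)
    sizes["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory PyTorch holds in its cache without a live tensor in it.
cached_unused = sizes["reserved"] - sizes["already allocated"]
print(f"capacity {sizes['total capacity']:.0f} MiB, reserved {sizes['reserved']:.0f} MiB")
print(f"cached but unallocated: {cached_unused:.0f} MiB")
```

The reserved figure works out to about 10936 MiB, which lines up with the "GPU reserved memory" row of the summary. The gap between reserved and allocated is comparatively small, which suggests (without proving it) that most of the card is occupied by live tensors and that `torch.cuda.empty_cache()` has little left to release.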
After cutting the data down to only inputs shorter than 2 seconds, training gets further, but it still fails with the following error:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running training *****
Num examples = 1411
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 22
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
[11/22 01:12 < 01:28, 0.12 it/s, Epoch 0.44/1]
Step Training Loss Validation Loss Accuracy
10 2.428100 2.257138 0.300283
The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running Evaluation *****
Num examples = 353
Batch size = 32
Saving model checkpoint to trainingArgs/checkpoint-10
Configuration saved in trainingArgs/checkpoint-10/config.json
Model weights saved in trainingArgs/checkpoint-10/pytorch_model.bin
Configuration saved in trainingArgs/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
380 return
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
498 num_bytes = storage.size() * storage.element_size()
--> 499 zip_file.write_record(name, storage.data_ptr(), num_bytes)
500
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1334 self.control = self.callback_handler.on_step_end(args, self.state, self.control)
1335
-> 1336 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
1337 else:
1338 self.control = self.callback_handler.on_substep_end(args, self.state, self.control)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
1441
1442 if self.control.should_save:
-> 1443 self._save_checkpoint(model, trial, metrics=metrics)
1444 self.control = self.callback_handler.on_save(self.args, self.state, self.control)
1445
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
1531 elif self.args.should_save and not self.deepspeed:
1532 # deepspeed.save_checkpoint above saves model/optim/sched
-> 1533 torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
1534 with warnings.catch_warnings(record=True) as caught_warnings:
1535 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 380 return
381 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
382
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
257
258 def __exit__(self, *args) -> None:
--> 259 self.file_like.write_end_of_file()
260 self.buffer.flush()
261
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 1849920000 vs 1849919888
When I run !free in the notebook, I get:
The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
              total        used        free      shared  buff/cache   available
Mem:      503392908     6223452   478499292      346492    18670164   492641984
Swap:             0           0           0
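Note that `free` reports RAM and swap, whereas `OSError: [Errno 28] No space left on device` refers to the filesystem being written to. A minimal sketch for checking disk space from the notebook with the standard library follows; `free_disk_gib` and `has_room_for` are hypothetical helper names, and `"."` stands in for whichever directory holds the checkpoints (`trainingArgs` in the log above).

```python
import shutil


def free_disk_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room_for(path, needed_bytes):
    """True if the filesystem containing `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


print(f"free disk: {free_disk_gib('.'):.1f} GiB")
```

If the filesystem holding the output directory turns out to be nearly full, one possibility is to point the Trainer's output directory at the larger mounted volume.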
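For the training code, I'm basically following this Colab notebook as an example: https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=6M8bNvLLJnG1
The only thing I'm changing is the data and labels passed in, which I have deliberately placed in the same directory structure the tutorial notebook uses. For some reason the tutorial notebook runs fine, even though my data is comparable in size and number of classes.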
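[Comments]:
- How much memory do you have in total?
- I've added some edits in the hope of getting an answer. Do they help with your diagnosis?
- I meant the memory available on your GPU device. Could you provide your training code?
- How can I tell how much GPU memory I have? I have a p2.8xlarge instance with 8 GPUs. As for my code, please see the edit. Thanks!
- "'database or disk is full'" is a pretty obvious error, don't you think? :-)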
Tags: deep-learning pytorch huggingface-transformers