RuntimeError: CUDA error: device assert triggered when trying to train on Google Colab while using YOLOv5

【Question title】: RuntimeError: CUDA error: device assert triggered when trying to train on Google Colab while using YOLOv5
【Posted】: 2023-01-13 04:57:54
【Question description】:

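I'm not sure what the problem is; it only happens when training reaches the testing (validation) part. I have already checked the files and renamed all of them, and everything seems correct. I would appreciate any help. I am using the YOLOv5 repository from GitHub.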

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [69,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [103,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [104,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  36% 16/45 [00:09<00:16,  1.77it/s]
Traceback (most recent call last):
  File "train.py", line 625, in <module>
    main(opt)
  File "train.py", line 522, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 365, in train
    compute_loss=compute_loss)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/content/yolov5/val.py", line 186, in run
    targets[:, 2:] *= torch.Tensor([width, height, width, height]).to(device)  # to pixels
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

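【Comments】:

This is not a CUDA problem. It looks like something at the tensor indexing level in PyTorch is going out of bounds.
I'm not sure why this error persists. I previously trained on the same dataset and it worked fine; the only change is that I added augmented images.
That's interesting. Maybe your labels contain some very small boxes, and the augmentation is creating zero-size versions of them? I'm looking at a similar problem; if I solve it, I will add an answer.
I resolved what appears to be the same problem by noticing that some label files contained incorrect label indices, i.e. label numbers greater than the set defined in the data/*.yaml file that defines the labels.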
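
The check described in the last comment can be sketched roughly as follows. This is a minimal illustration, not part of the YOLOv5 repository: it assumes the usual YOLO label layout (one .txt file per image, each line starting with the class index), and the function name find_bad_labels and the directory path in the usage comment are hypothetical. The class count is passed in by hand; it should match the nc value in your data/*.yaml file.

```python
from pathlib import Path


def find_bad_labels(label_dir, num_classes):
    """Return (file name, line number, class index) for every label line
    whose class index falls outside the range 0 .. num_classes - 1."""
    bad = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        lines = path.read_text().splitlines()
        for line_no, line in enumerate(lines, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            # The first field is the class index; tolerate values like "3.0".
            class_idx = int(float(fields[0]))
            if not 0 <= class_idx < num_classes:
                bad.append((path.name, line_no, class_idx))
    return bad


# Hypothetical usage: the path and class count are placeholders.
# for name, line_no, idx in find_bad_labels("dataset/labels/train", 2):
#     print(f"{name}:{line_no}: class index {idx} is out of range")
```

Any entry this returns points at a label line whose class index the model cannot index into, which is consistent with the "index out of bounds" assertion in the log above.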

Tags: python pytorch

【Solution 1】:

I found this link, which seems to partly address the problem: https://builtin.com/software-engineering-perspectives/cuda-error-device-side-assert-triggered
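
Independently of the link, the error message in the question itself suggests passing CUDA_LAUNCH_BLOCKING=1, which makes CUDA kernel launches synchronous so that the Python traceback points at the operation that actually failed. A minimal sketch of doing that from Python is below; the helper name run_with_blocking_launches is hypothetical, and the train.py arguments in the usage comment are placeholders rather than the asker's actual command.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd, base_env=None):
    """Run cmd in a child process with CUDA_LAUNCH_BLOCKING=1 set.

    The variable has to be in the environment before CUDA is initialised,
    so it is set for a fresh process instead of the current one.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return subprocess.run(cmd, env=env, check=False)


# Hypothetical usage: the script arguments are placeholders.
# run_with_blocking_launches(
#     [sys.executable, "train.py", "--data", "data/custom.yaml"]
# )
```

Training runs slower with this setting, so it is best used only while tracking down the failing operation.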

【Discussion】: