在训练 ONNX 的预训练模型 Emotion FerPlus 时抛出异常 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED'答案

【问题标题】：Throw exception 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED' in training ONNX's pretrained model Emotion FerPlus在训练 ONNX 的预训练模型 Emotion FerPlus 时抛出异常 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED'
【发布时间】：2020-10-29 13:41:43
【问题描述】：

我正在测试训练Emotion FerPlus 情感识别模型。训练有cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED 错误。我正在使用Nvidia GPU TitanRTX 24G。然后更改minibatch_size from 32 to 1。但是还是有错误。我正在使用 CNTK-GPU 泊坞窗。完整的错误信息是

About to throw exception 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))'
cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    main(args.base_folder, args.training_mode)
  File "train.py", line 124, in main
    trainer.train_minibatch({input_var : images, label_var : labels})
  File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/train/trainer.py", line 184, in train_minibatch
    device)
  File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/cntk_py.py", line 3065, in train_minibatch
    return _cntk_py.Trainer_train_minibatch(self, *args)
RuntimeError: cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))

[CALL STACK]
[0x7fc04da7ce89]                                                       + 0x732e89
[0x7fc045a71aaf]                                                       + 0xeabaaf
[0x7fc045a7b613]    Microsoft::MSR::CNTK::CuDnnConvolutionEngine<float>::  ForwardCore  (Microsoft::MSR::CNTK::Matrix<float> const&,  Microsoft::MSR::CNTK::Matrix<float> const&,  Microsoft::MSR::CNTK::Matrix<float>&,  Microsoft::MSR::CNTK::Matrix<float>&) + 0x1a3
[0x7fc04dd4f8d3]    Microsoft::MSR::CNTK::ConvolutionNode<float>::  ForwardProp  (Microsoft::MSR::CNTK::FrameRange const&) + 0xa3
[0x7fc04dfba654]    Microsoft::MSR::CNTK::ComputationNetwork::PARTraversalFlowControlNode::  ForwardProp  (std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const&,  Microsoft::MSR::CNTK::FrameRange const&) + 0xf4
[0x7fc04dcb6e33]    std::_Function_handler<void (std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const&),void Microsoft::MSR::CNTK::ComputationNetwork::ForwardProp<std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>(std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>> const&)::{lambda(std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const&)#1}>::  _M_invoke  (std::_Any_data const&,  std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const&) + 0x63
[0x7fc04dd04ed9]    void Microsoft::MSR::CNTK::ComputationNetwork::  TravserseInSortedGlobalEvalOrder  <std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>(std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>> const&,  std::function<void (std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const&)> const&) + 0x5b9
[0x7fc04dca64da]    CNTK::CompositeFunction::  Forward  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&,  CNTK::DeviceDescriptor const&,  std::unordered_set<CNTK::Variable,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<CNTK::Variable>> const&,  std::unordered_set<CNTK::Variable,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<CNTK::Variable>> const&) + 0x15da
[0x7fc04dc3d603]    CNTK::Function::  Forward  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&,  CNTK::DeviceDescriptor const&,  std::unordered_set<CNTK::Variable,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<CNTK::Variable>> const&,  std::unordered_set<CNTK::Variable,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<CNTK::Variable>> const&) + 0x93
[0x7fc04ddbf91b]    CNTK::Trainer::  ExecuteForwardBackward  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&,  CNTK::DeviceDescriptor const&,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&) + 0x36b
[0x7fc04ddc06e4]    CNTK::Trainer::  TrainLocalMinibatch  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&,  bool,  CNTK::DeviceDescriptor const&) + 0x94
[0x7fc04ddc178a]    CNTK::Trainer::  TrainMinibatch  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  bool,  std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>>&,  CNTK::DeviceDescriptor const&) + 0x5a
[0x7fc04ddc1852]    CNTK::Trainer::  TrainMinibatch  (std::unordered_map<CNTK::Variable,std::shared_ptr<CNTK::Value>,std::hash<CNTK::Variable>,std::equal_to<CNTK::Variable>,std::allocator<std::pair<CNTK::Variable const,std::shared_ptr<CNTK::Value>>>> const&,  bool,  CNTK::DeviceDescriptor const&) + 0x52
[0x7fc04eb2db22]                                                       + 0x229b22
[0x7fc057ea15e9]    PyCFunction_Call                                   + 0xf9
[0x7fc057f267c0]    PyEval_EvalFrameEx                                 + 0x6ba0
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f29cd8]    PyEval_EvalCodeEx                                  + 0x48
[0x7fc057f29d1b]    PyEval_EvalCode                                    + 0x3b
[0x7fc057f4f020]    PyRun_FileExFlags                                  + 0x130
[0x7fc057f50623]    PyRun_SimpleFileExFlags                            + 0x173
[0x7fc057f6b8c7]    Py_Main                                            + 0xca7
[0x400add]          main                                               + 0x15d
[0x7fc056f06830]    __libc_start_main                                  + 0xf0
[0x4008b9]

【问题讨论】：

标签： cntk onnx onnxruntime

【解决方案1】：

CNTK 现在处于维护模式（基本上已弃用）。虽然 CNTK 可以很好地导出到 ONNX，但导入 ONNX 模型并没有得到很好的支持。

ONNX Runtime https://github.com/microsoft/onnxruntime 现在支持训练，请尝试一下。 ONNX 运行时培训正在积极开发并得到支持，因此如果有些事情不太奏效，问题可能会很快得到解决。

【讨论】：

感谢您的信息。我有用于培训的 CNTK 代码。如果使用 ONNX 运行时，使用 ONNX 运行时进行训练的最佳方法是什么。可以直接使用 CNTK 代码吗？或者我必须选择 ONNX-Pytorch 进行训练（例如）并将 CNTK 训练代码中的所有代码行转换为 Pytorch 代码并使用 ONNX 运行时进行训练。
您必须使用 PyTorch/ONNX，将所有 CNTK 代码行转换为 PyTorch，并针对 ONNXRuntime 进行修改。
谢谢。还有一个问题。如果我想在 Nvidia Tensorrt 上测试 Emotion FerPlus，我必须为 LINUX/C++/x64/TensorRT 选择 ONNX 运行时来构建 ONNX 运行时。直接使用 onnx 模型进行 tensorrt 进行推理，是吗？我在哪里可以获得 Emotion FerPlus 测试代码？提供的 src 代码不包含 test.pyy，尤其是 Tensorrt 输出的解析器部分。
对于训练，CUDA 执行提供者应该可以工作。您可能只需使用pip install onnxruntime-gpu 进行安装。然后您可以在训练结束时将模型保存为 ONNX 格式，并与 TensorRT 执行提供程序一起使用进行推理（请参阅github.com/microsoft/onnxruntime/blob/master/docs/…；不确定与 CUDA 执行提供程序相比它的速度有多快）。
你的意思是，将 CNTK 代码转换为 Pytorch 并使用 CUDA 执行提供程序进行训练？但是对于 Tensorrt 部署，我可以使用模型动物园中的 ONNX 模型并在 tensorrt 进行部署，对吧？