• Problem:

Training the tdnn fails with an error at iteration 110.


The corresponding log file shows:

ERROR (nnet3-chain-train[5.5.0-]:AllocateNewRegion():cu-allocator.cc:519) Failed to allocate a memory region of 2502950912 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:4773M, used:6244M, total:11018M, free/total:0.433275 CUDA error: 'out of memory'
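
To put the numbers in that message side by side, here is a small shell sketch that extracts the requested size and the reported free memory and converts the request to MiB. The variable names are illustrative, and the sample string is an abbreviated copy of the log line above.

```shell
#!/usr/bin/env bash
# Abbreviated copy of the log line (illustrative variable name).
log_line='Failed to allocate a memory region of 2502950912 bytes.  Memory info: free:4773M, used:6244M, total:11018M'

# Extract the requested byte count and the reported free memory (in M).
requested_bytes=$(printf '%s\n' "$log_line" | sed -n 's/.*region of \([0-9]*\) bytes.*/\1/p')
free_m=$(printf '%s\n' "$log_line" | sed -n 's/.*free:\([0-9]*\)M.*/\1/p')

# Integer conversion from bytes to MiB.
requested_mib=$(( requested_bytes / 1024 / 1024 ))

echo "requested: ${requested_mib} MiB, reported free: ${free_m}M"
# prints: requested: 2387 MiB, reported free: 4773M
```

The request is about half of what the log reports as free, so the total free figure alone does not explain the failure. The message itself suggests that the GPU may be shared, and that is the hint the fix below follows.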
 

• Solution:

Switch the GPUs to exclusive-process compute mode:

        sudo nvidia-smi -c 3

Modify run_e2e_tdnn.sh:

        [Screenshot: the edit made to run_e2e_tdnn.sh]
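
The exact edit is not available as text here. The log message recommends passing --use-gpu=wait to steps/nnet3/chain/train.py, so the sketch below assumes that this option is what gets added to the train.py call inside run_e2e_tdnn.sh. It also checks that the compute mode change took effect. The variable name use_gpu_opt is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: if an NVIDIA driver is present, confirm the current compute mode.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,compute_mode --format=csv
fi

# Assumed edit to run_e2e_tdnn.sh, based on the hint in the log:
# append this option to the existing steps/nnet3/chain/train.py call
# and leave all other options unchanged.
use_gpu_opt="--use-gpu=wait"   # hypothetical variable name
echo "add to the train.py call: ${use_gpu_opt}"
```

Note that the compute mode set with nvidia-smi -c does not persist across reboots, so the command has to be run again after a restart.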

Then rerun the script. This resolved the problem.

 
