Training DeepSpeech on the Common Voice dataset gives an error on GPU

【Question title】: Train DeepSpeech on Common Voice dataset gives error on gpu
【Posted】: 2021-04-27 16:15:59
【Question】:

I am trying to train a DeepSpeech model on the Common Voice dataset, as described in the documentation. However, it gives the following error:

I0421 11:34:32.779112 140581195995008 utils.py:157] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
    load_or_init_graph_for_training(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

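My local machine specs are as follows:

Python 3.7; CUDA 10.1; cuDNN 7.6.5; tensorflow-gpu 1.15.2; GPU: GTX 1050

I also install the following packages and libraries to prepare the environment: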

!apt-add-repository universe
!apt-get install sox libsox-fmt-mp3 cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
!python3.7 -m pip install sox
!python3.7 -m pip install deepspeech-gpu
!python3.7 -m pip install tensorflow-gpu==1.15.2
!python3.7 -m pip install numpy==1.19.5
!python3.7 -m pip install progressbar2
!python3.7 -m pip install progressbar
!python3.7 -m pip install progressbar33
!python3.7 -m pip install ds_ctcdecoder==0.10.0-alpha.3
!python3.7 -m pip install pyogg==0.6.14a1
!python3.7 -m pip install deepspeech
!git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech
!python3.7 -m pip install --upgrade --force-reinstall -e ./DeepSpeech/
!git clone https://github.com/kpu/kenlm.git
!mkdir -p build
!cmake kenlm
!make -j 4
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-checkpoint.tar.gz
!curl -LO "https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz"
!mkdir native_client
!tar xvf native_client.amd64.cuda.linux.tar.xz -C native_client

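I run into the same problem on both my local machine and a Google Colab VM.

Edit: I also changed the CUDA and cuDNN versions to 10.0 and 7.5.6, respectively, but the error persists.

【Comments】:

Tags: python tensorflow deep-learning speech-recognition mozilla-deepspeech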

【Answer 1】:

I saw a similar error posted on the DeepSpeech Discourse, where the problem was the CUDA installation.

What is the value of your $LD_LIBRARY_PATH environment variable?

You can find it with:

$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/local/cuda/bin:/usr/local/cuda/lib64:/usr/local/cuda-11.2/targets/x86_64-linux/lib

My suspicion is that CUDA is unable to find the correct libraries.
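To check more than the variable's value, the following minimal sketch walks each entry of LD_LIBRARY_PATH and lists the CUDA and cuDNN shared libraries it actually contains. The function name `scan_library_path` and the list of library name prefixes are assumptions made for illustration; they are not part of DeepSpeech or TensorFlow.

```python
import os
from pathlib import Path

# Library name prefixes to look for. This list is an assumption based on the
# CUDA/cuDNN libraries a TensorFlow 1.15 GPU build normally loads.
WANTED_PREFIXES = ("libcudart.so", "libcudnn.so", "libcublas.so")


def scan_library_path(ld_library_path, prefixes=WANTED_PREFIXES):
    """Map each LD_LIBRARY_PATH entry to the matching library files found there.

    Entries that are not existing directories map to None.
    """
    report = {}
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # skip empty entries produced by "::" or a trailing ":"
        directory = Path(entry)
        if not directory.is_dir():
            report[entry] = None
            continue
        report[entry] = sorted(
            item.name for item in directory.iterdir()
            if item.name.startswith(prefixes)
        )
    return report


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for entry, libs in scan_library_path(current).items():
        if libs is None:
            print(f"{entry}: missing directory")
        else:
            print(f"{entry}: {', '.join(libs) or 'no CUDA/cuDNN libraries'}")
```

An entry reported as a missing directory, or a path with no libcudnn file at all, would be consistent with the suspicion above.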

【Comments】:

This is my LD_LIBRARY_PATH: /usr/local/lib:/usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/targets/x86_64-linux/lib/:/usr/local/cuda-10.0/targets/x86_64-linux/lib/

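【Answer 2】:

Thanks for the additional information, Soroush.

The LD_LIBRARY_PATH looks fine, and I will assume the libraries are actually present in those paths.

Next, I would make sure the code is actually executing on the GPU.

There are many reasons why code might fail to execute on the GPU. You mentioned that your environment was set up according to the DeepSpeech PlayBook, which means it uses Docker. Is that right? If so, did you start the Docker container with the --gpus all argument?

The next thing to check is whether nvtop reports GPU activity from DeepSpeech. While the DeepSpeech.py script is running, it should put a high load on compute, which is observable in nvtop. If you don't see this, the code is probably not executing on the GPU, which would explain the No OpKernel error.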
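The traceback in the question already hints at this: it reports "Registered devices: [CPU, XLA_CPU]", with no GPU entry. As a small illustration, the sketch below extracts that list from a saved log. The helper names `registered_devices` and `gpu_registered` are hypothetical and exist only for this example.

```python
import re


def registered_devices(log_text):
    """Return the device types from the first 'Registered devices: [...]' line."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def gpu_registered(log_text):
    """True only if a device type named exactly 'GPU' is registered."""
    return "GPU" in registered_devices(log_text)


# Excerpt copied from the error output in the question.
log = (
    "Registered devices: [CPU, XLA_CPU]\n"
    "Registered kernels:\n"
    "  <no registered kernels>\n"
)
print(registered_devices(log))  # ['CPU', 'XLA_CPU']
print(gpu_registered(log))      # False
```

Inside the training environment itself, TensorFlow 1.15 can report the same information directly through tensorflow.python.client.device_lib.list_local_devices().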

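【Comments】:

Thanks for your reply. Actually, I have run other models on this machine, and both TensorFlow and PyTorch made full use of the GPU. DeepSpeech does not. On Colab, it sometimes runs fine and sometimes raises this error.

【Answer 3】:

I have solved the problem. It was caused by the TensorFlow version. As I mentioned earlier, I was using TF 1.15.2, whereas I needed to use TF 1.15.4.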
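A quick way to catch this kind of mismatch before training is to compare the installed version string with the one that worked here. The sketch below is only an illustration: `parse_version` and `needs_upgrade` are hypothetical helpers, and the parser assumes plain numeric versions such as 1.15.2.

```python
def parse_version(version):
    """Turn '1.15.2' into (1, 15, 2); assumes plain numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed, required="1.15.4"):
    """True if the installed version is older than the required one."""
    return parse_version(installed) < parse_version(required)


print(needs_upgrade("1.15.2"))  # True
print(needs_upgrade("1.15.4"))  # False
```

In the training environment the installed value can be read from tf.__version__, and the pin in the setup cell would become tensorflow-gpu==1.15.4 instead of tensorflow-gpu==1.15.2.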

【Comments】: