tensorflow: Fail to find dnn implementation

【Question title】: tensorflow: Fail to find dnn implementation
【Posted】: 2019-10-13 11:04:47
【Question description】:

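I am trying to run my code, which uses the Keras CuDNNGRU layer, on TensorFlow with a GPU, but it always fails with the error "Fail to find the dnn implementation", even though I have installed CUDA and CuDNN.

I have reinstalled CUDA and CuDNN several times and upgraded CuDNN from 7.2.1 to 7.5.0, but that did not fix anything. I also tried running the code both in a Jupyter Notebook and with the Python interpreter in a terminal, and the result is the same in both. Here are the details of my hardware and software.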

Tesla V100 PCIE 16GB
Ubuntu 18.04
NVIDIA-SMI 384.183
CUDA 9.0
CuDNN 7.5.0
Miniconda 3
Python 3.6
TensorFlow 1.12
Keras 2.1.6

Here is my code.

encoder_LSTM = tf.keras.layers.CuDNNGRU(hidden_unit,return_sequences=True,return_state=True)
encoder_LSTM_rev=tf.keras.layers.CuDNNGRU(hidden_unit,return_state=True,return_sequences=True,go_backwards=True)

encoder_outputs, state_h = encoder_LSTM(x)
encoder_outputsR, state_hR = encoder_LSTM_rev(x)

Here is the error message.

2019-05-27 19:08:06.814896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-27 19:08:06.814956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 19:08:06.814971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-05-27 19:08:06.814978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-05-27 19:08:06.815279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14678 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)
2019-05-27 19:08:08.050226: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050350: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050378: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: Fail to find the dnn implementation.
2019-05-27 19:08:08.050483: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050523: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050541: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[{{node cu_dnngru/CudnnRNN}} = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
     [[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ta_skenario1.py", line 271, in <module>
    losss, op = sess.run([loss, optimizer], feed_dict={x:data,y_label:label,initial_input:begin_sentence})
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
     [[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205)  = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
     [[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'cu_dnngru/CudnnRNN', defined at:
  File "ta_skenario1.py", line 205, in <module>
    encoder_outputs, state_h = encoder_LSTM(x)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 619, in __call__
    return super(RNN, self).__call__(inputs, **kwargs)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 109, in call
    output, states = self._process_batch(inputs, initial_state)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 299, in _process_batch
    rnn_mode='gru')
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 116, in cudnn_rnn
    is_training=is_training, name=name)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Fail to find the dnn implementation.
     [[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205)  = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
     [[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Any ideas? Thanks.

Update: I tried downgrading CuDNN from 7.5.0 to 7.1.4, but the result is the same.
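The log above repeats two diagnostic lines before the final UnknownError: the cuDNN handle failure and a hint about the driver version. As a rough sketch, not part of the original post, those two lines can be pulled out of a long log with regular expressions. The function name and the sample string are illustrative; the patterns are copied from the log lines quoted in the question.

```python
import re

# Patterns taken from the log lines quoted in the question.
CUDNN_HANDLE_RE = re.compile(r"Could not create cudnn handle: (\w+)")
DRIVER_HINT_RE = re.compile(r"Possibly insufficient driver version: ([\d.]+)")


def summarize_cudnn_errors(log_text):
    """Return the distinct cuDNN status codes and driver versions found in a log."""
    statuses = sorted(set(CUDNN_HANDLE_RE.findall(log_text)))
    drivers = sorted(set(DRIVER_HINT_RE.findall(log_text)))
    return {"cudnn_status": statuses, "driver_versions": drivers}


# Two lines copied from the question's log (timestamps removed).
sample_log = """\
E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
"""
summary = summarize_cudnn_errors(sample_log)
print(summary)
```

This only surfaces what TensorFlow already reported; it does not decide whether the driver is actually too old.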

【Question comments】:

Tags: python tensorflow gpu nvidia cudnn

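【Solution 1】:

Configuring the GPU for memory growth worked for me with TF 2.0. I found this solution in another question a few months ago, when I was having issues before running TF 2.0. I can't remember where.

Adding the following may help.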

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
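A related option, not mentioned in the answer above, is the TF_FORCE_GPU_ALLOW_GROWTH environment variable described in the TensorFlow GPU guide. The sketch below assumes the installed TensorFlow version honours that variable (older 1.x releases may not) and that the lines run before TensorFlow is imported for the first time.

```python
import os

# Assumption: the installed TensorFlow reads TF_FORCE_GPU_ALLOW_GROWTH.
# Set it before the first `import tensorflow` so it is in place when
# TensorFlow initializes the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The same variable can be exported in the shell that launches the script or notebook, which avoids touching the code at all.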

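【Comments】:

【Solution 2】:

Not sure whether this helps, but in my case the problem was caused by using more than one Jupyter notebook file.

I was writing some simple code for a neural network and decided to split it into two notebooks, one for training and one for prediction (in case you don't have the resources or time to train the network yourself, I provide my trained model saved in a file).

If I ran the two notebooks "together", that is, training first and then prediction without disconnecting the kernel of the first one, I got this error.

Disconnecting the kernel of the first Jupyter notebook before using the second one solved my problem.

【Comments】:

I had a similar issue. I had a notebook running, then tried to run a Python script and got this error. After shutting down the notebook kernel, the script worked as expected.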
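Both reports above come down to another process (a notebook kernel) still holding the GPU. As a minimal sketch, assuming nvidia-smi is on the PATH, the processes currently using GPU memory can be listed before a second job is started. The helper names and the sample row are hypothetical; the query fields are the ones nvidia-smi documents for --query-compute-apps.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` rows into a list of dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines or anything that is not a data row
        pid, name, memory = parts
        apps.append({"pid": int(pid), "name": name, "memory": memory})
    return apps


def gpu_processes():
    """List processes holding GPU memory, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_compute_apps(out)


# Hypothetical sample row, for illustration only.
sample = "12345, /usr/bin/python3, 15000 MiB\n"
print(parse_compute_apps(sample))
print(gpu_processes())
```

If a leftover kernel shows up in the list, shutting it down as described above frees the memory it holds.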

【Solution 3】:

This worked for me in TensorFlow 2, as suggested here

import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)

【Comments】:

.set_memory_growth() gave me an error with two GPUs, so I used .set_visible_devices(physical_devices[0], device_type='GPU') instead, which worked well for me.
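The snippet in this answer configures only physical_devices[0]. The TensorFlow GPU guide applies the setting to every GPU in a loop and notes that memory growth needs to be the same across GPUs, which may be (this is an assumption, not something the commenter confirmed) why the one-device version failed on a two-GPU machine. A sketch of the loop form, assuming TensorFlow 2.1 or later; the function name is illustrative:

```python
def enable_memory_growth():
    """Enable memory growth on every visible GPU; return how many were set."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed, so there is nothing to configure

    configured = 0
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPUs were already initialized before this call.
            print("Could not set memory growth:", err)
    return configured


count = enable_memory_growth()
print(count, "GPU(s) configured")
```

Call it at the top of the program, before any model or tensor is created.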

【Solution 4】:

Have you tested your installation (cuda, cudnn, tensorflow-gpu)?

Testing cuda: first check whether:

$ nvcc -V

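shows the correct version of your cuda toolkit. Then you can test it with the following steps:

First (this takes a few minutes):

$ cd ~/NVIDIA_CUDA-9.0_Samples
$ make

Then:

$ cd ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release
$ ./deviceQuery

If you get "Result = PASS" at the end, you are fine!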
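The nvcc -V check at the start of this answer can also be scripted, for example to compare the toolkit release with the one a TensorFlow build expects. This is a sketch only: the helper names are hypothetical, and the sample string follows the "release X.Y" format that nvcc prints.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_toolkit_release():
    """Run `nvcc -V` and parse it; None if nvcc is not on the PATH."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(
        ["nvcc", "-V"], capture_output=True, text=True, check=False
    ).stdout
    return parse_nvcc_release(out)


# Sample line in the format nvcc prints for a CUDA 9.0 toolkit.
sample = "Cuda compilation tools, release 9.0, V9.0.176"
print(parse_nvcc_release(sample))
print(cuda_toolkit_release())
```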

Testing cudnn:

$ cp -r /usr/src/cudnn_samples_v7/ $HOME
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
$ make clean && make
$ ./mnistCUDNN

The result should be: 'Test passed!'

Testing tensorflow-gpu:

If cuda and cudnn are working, you can test your tensorflow installation with:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()
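list_local_devices() returns every device, CPU included, so it can help to filter for GPUs explicitly. A minimal sketch; the function name is illustrative, and it returns an empty list when TensorFlow is not installed:

```python
def local_gpu_names():
    """Return the names of GPU devices TensorFlow can see ([] if TF is missing)."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [
        device.name
        for device in device_lib.list_local_devices()
        if device.device_type == "GPU"
    ]


names = local_gpu_names()
print(names or "No GPU visible to TensorFlow")
```

An empty result on a machine with a working GPU usually points at the installation (for example a CPU-only tensorflow package) rather than at the model code.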

I suggest installing tensorflow in a conda environment:

conda create --name tf_gpu tensorflow-gpu

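For me (after running into a lot of problems) it worked just fine.

Sources: gpu installation for Ubuntu 18.04, tensorflow-gpu installation

【Comments】:

I tried all of your suggestions and every test passed, but the error was still there. So I tried installing tensorflow-gpu outside of conda, and now it works. Thanks for your answer.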

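【Solution 5】:

For anyone hitting this with TF 2.0 and CUDA 10.0 using cuDNN 7: you are probably getting it because you accidentally upgraded cuDNN from 7.6.2 to >7.6.5. Although the TF docs state that anything >=7.4.1 works, that is not actually the case! Downgrade cuDNN as follows: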

sudo apt-get install --no-install-recommends \
  cuda-10-0 \
  libcudnn7=7.6.2.24-1+cuda10.0  \
  libcudnn7-dev=7.6.2.24-1+cuda10.0

In future, you can hold back cuDNN updates on Ubuntu/Debian by putting the packages on hold with apt-mark:

sudo apt-mark hold libcudnn7 libcudnn7-dev
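To see whether an unintended upgrade has happened, the installed package version can be read back and compared with the pinned one. The sketch below assumes a Debian/Ubuntu system with dpkg-query and the libcudnn7 package name used above; the helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_cudnn_version(pkg_version):
    """Turn a package version such as '7.6.2.24-1+cuda10.0' into (7, 6, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", pkg_version)
    if match is None:
        raise ValueError("unrecognized version: %r" % pkg_version)
    return tuple(int(group) for group in match.groups())


def installed_cudnn_version(package="libcudnn7"):
    """Ask dpkg for the package version; None if dpkg or the package is missing."""
    if shutil.which("dpkg-query") is None:
        return None
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        capture_output=True, text=True, check=False,
    )
    version = result.stdout.strip()
    if result.returncode != 0 or not version:
        return None
    try:
        return parse_cudnn_version(version)
    except ValueError:
        return None


pinned = parse_cudnn_version("7.6.2.24-1+cuda10.0")
installed = installed_cudnn_version()
print("pinned:", pinned, "installed:", installed)
if installed is not None and installed > pinned:
    print("cuDNN is newer than the pinned version")
```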

【Comments】: