NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

【Question title】: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
【Posted】: 2017-08-16 12:29:57
【Question】:

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I would like to observe GPU utilization while training my TensorFlow models, but I get an error when I try to run "nvidia-smi".

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
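One detail in the lspci -k output above is worth calling out: the Cirrus device has a "Kernel driver in use" line, while the NVIDIA device has none, which suggests no kernel module is bound to the GPU. A minimal sketch of that check follows. The function name nvidia_bound_driver is made up for illustration, and the parsing assumes the lspci -k layout shown above (device lines start at column 0, detail lines are indented).

```shell
# Print the kernel driver bound to the first NVIDIA PCI device found in
# `lspci -k` output read from stdin; print nothing if no driver is bound.
# (nvidia_bound_driver is a hypothetical helper, not part of any tool.)
nvidia_bound_driver() {
    awk '
        /^[0-9a-f]/ { in_nvidia = ($0 ~ /NVIDIA/) }   # a new device record starts
        in_nvidia && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use:[ \t]*/, "")
            print
            exit
        }
    '
}

# Usage on a real system:
#   lspci -k | nvidia_bound_driver
```

An empty result means the device is visible on the PCI bus but neither nvidia nor nouveau has claimed it, which matches the nvidia-smi failure.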

I installed CUDA 7 and cuDNN by following these instructions:

$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot

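=========================================================================

After the reboot, update the initramfs by running '$ sudo update-initramfs -u'.

Now edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit the file.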
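The blacklist step can be verified before rebooting. The sketch below only checks that an uncommented "blacklist nouveau" line is present in a given file; the helper name nouveau_blacklisted is an assumption made for illustration.

```shell
# Succeed if FILE contains an active (uncommented) "blacklist nouveau" line.
# (nouveau_blacklisted is a hypothetical helper name.)
nouveau_blacklisted() {
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$1"
}

# Usage on a real system:
#   nouveau_blacklisted /etc/modprobe.d/blacklist.conf && echo "nouveau is blacklisted"
#   lsmod | grep nouveau    # should print nothing after the reboot
```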

Now install the build-essential tools, update the initramfs, and reboot again, as follows:

$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot

=========================================================================

After the reboot, run the following commands to install NVIDIA.

$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot

=========================================================================

Now that the system is back up, verify the installation by running the following commands.

$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head

You should see output similar to "nvidia.png".

Now run the following commands.

$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

However, "nvidia-smi" still shows no GPU activity while TensorFlow is training a model:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

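【Comments】:

What worked for me was running nvidia-settings and selecting the NVIDIA GPU (Performance or On-Demand mode, as you prefer). It had previously been set to Intel.

Tags: gpu

【Solution 1】:

I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop (GTX 950M, Ubuntu 18.04) by disabling Secure Boot Control in the BIOS.

【Discussion】:

Worked for me. Now I can use CUDA again.
Worked on a Dell Inspiron 7460 with a 940MX. Thanks a lot!
Disabling Secure Boot worked on an Acer Aspire VN7 with a GeForce 950M.
I don't think disabling Secure Boot is a good idea. You can enroll a MOK (Machine Owner Key) instead, and then you don't need to disable Secure Boot.
How can I do that? Enroll a MOK without disabling Secure Boot?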

【Solution 2】:

Run the following command to list the correct NVIDIA drivers for your hardware:

sudo ubuntu-drivers devices

Then pick the right one and install it:

sudo apt install <version>
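Picking "the right one" usually means the entry that ubuntu-drivers tags as recommended. As a rough sketch, the helper below extracts that package name. It assumes driver lines of the form "driver : <package> - distro non-free recommended"; the function name recommended_driver is made up, and the output layout may differ between releases.

```shell
# Print the package name on the line tagged "recommended" in
# `ubuntu-drivers devices` output read from stdin.
# (recommended_driver is a hypothetical helper; the line format is an assumption.)
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Usage on a real system:
#   pkg=$(ubuntu-drivers devices | recommended_driver)
#   [ -n "$pkg" ] && sudo apt install "$pkg"
```

If nothing is printed, no entry carried the recommended tag and the list has to be read by hand.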

【Discussion】:

【Solution 3】:

I hit the same error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. Upgrading the kernel from 4.14 to 4.15 resolved the problem. Here is how I upgraded the Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

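You should see that your kernel has been upgraded, and hopefully nvidia-smi should now work.

【Discussion】:

This worked for me with the nvidia-390 driver; it now modprobes. I updated from kernel 4.4.0 to 4.15.0.
After wasting more than 4 hours, this solved my problem. Thanks.
Did you upgrade to 4.14 (as stated in the text) or 4.15 (as shown in the code)? I am running 4.15 and have the same problem.
Sorry for the confusion, there were some typos. Just edited.
This helped on my Amazon server running 16.04. nvidia-driver = 410, cuda 10.0

【Solution 4】:

I was using an AWS DeepAMI P2 instance and suddenly found that the NVIDIA driver commands stopped working and that the torch and tensorflow libraries could not find the GPU. I resolved the problem as follows.

Run nvcc --version to check whether it works.

If it does not, run the following: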

apt install nvidia-cuda-toolkit

Hopefully this solves the problem.

【Discussion】:

This worked for me. In my case a reboot was needed before nvidia-smi worked again.

【Solution 5】:

In my case, none of the solutions above helped.

Root cause: an incompatible gcc version.

Solution:

1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot

System: AWS EC2 18.04 instance

Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
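One way to look for a gcc mismatch is to compare the compiler recorded in /proc/version with the installed gcc, since DKMS compiles the NVIDIA module with the system compiler. The sketch below is only an illustration: the helper names are made up, and it handles only the "gcc version X.Y.Z" wording found in older kernels' /proc/version (newer kernels word this differently, in which case nothing is printed).

```shell
# Print the major gcc version recorded in a /proc/version-style line on stdin.
# Only the "gcc version X.Y.Z" wording is handled; other wordings print nothing.
# (kernel_gcc_major and major_of are hypothetical helper names.)
kernel_gcc_major() {
    sed -n 's/.*gcc version \([0-9][0-9]*\)\..*/\1/p'
}

# Print the major component of a dotted version string such as "7.5.0" or "7".
major_of() {
    printf '%s\n' "${1%%.*}"
}

# Usage on a real system:
#   built_with=$(kernel_gcc_major < /proc/version)
#   installed=$(major_of "$(gcc -dumpversion)")
#   [ "$built_with" = "$installed" ] || echo "gcc major differs: kernel $built_with, installed $installed"
```

A differing major version does not prove that this is the cause, but it is a cheap thing to rule out before purging and reinstalling the driver.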

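【Discussion】:

My machine suddenly stopped showing the NVIDIA card after an update. This helped me fix it. Thanks.

【Solution 6】:

I just want to thank @Heapify for a practical answer, and to post an update to it because the links attached there are no longer current.

Step 1: Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64-bit, I would download the following deb files:

# Up to date as of 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:

Reboot your machine and check whether the kernel has been updated by running:

uname -a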

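【Discussion】:

【Solution 7】:

What fixed the problem for me, regardless of kernel version, was to skip the wget downloads and have apt reinstall the headers for the running kernel: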

sudo apt-get install --reinstall linux-headers-$(uname -r)
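Before reinstalling, it can help to confirm that the headers for the running kernel are actually missing. A minimal sketch, assuming the usual layout in which /lib/modules/<release>/build resolves to the header tree; the function name headers_present and its optional root argument are made up (the root argument exists only so the check can be tried against a scratch directory).

```shell
# Succeed if kernel headers for RELEASE appear to be installed, i.e.
# ROOT/lib/modules/RELEASE/build resolves to a directory.
# ROOT defaults to the real filesystem root.
# (headers_present is a hypothetical helper name.)
headers_present() {
    release=$1
    root=${2:-}
    [ -d "$root/lib/modules/$release/build" ]
}

# Usage on a real system:
#   headers_present "$(uname -r)" || sudo apt-get install --reinstall "linux-headers-$(uname -r)"
```

Without matching headers, DKMS cannot build the NVIDIA module for the running kernel, which would explain why this fix works independently of the kernel version.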

Driver version: 390.138 on Ubuntu Server 18.04.4

【Discussion】:

【Solution 8】:

I tried the solutions above, but only the following worked for me.

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot

Credit: https://deeptalk.lambdalabs.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver/148

【Discussion】:

【Solution 9】:

My system: Ubuntu 20.04 LTS.

I solved this by generating a new MOK and enrolling it in shim.
I did not disable Secure Boot, although that also worked for me.
Just run this command and follow its prompts:
```
sudo update-secureboot-policy --enroll-key
```

According to the Ubuntu wiki: How can I do non-automated signing of drivers
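Whether any of the Secure Boot answers apply can be checked first: with Secure Boot enabled, the kernel refuses to load modules that are not signed with an enrolled key, which is why disabling it or enrolling a MOK both help. A small sketch, assuming mokutil is installed and prints "SecureBoot enabled" or "SecureBoot disabled"; the function name secure_boot_enabled is made up.

```shell
# Read `mokutil --sb-state` output on stdin; succeed if Secure Boot is reported enabled.
# (secure_boot_enabled is a hypothetical helper name.)
secure_boot_enabled() {
    grep -q '^SecureBoot enabled'
}

# Usage on a real system:
#   if mokutil --sb-state | secure_boot_enabled; then
#       echo "Secure Boot is on: the nvidia module must be signed with an enrolled MOK"
#   fi
```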

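【Discussion】:

It says that no MOK was found.
@Stepan Yakovenko I simply followed what the link suggests and everything went fine. I no longer use Ubuntu, though, because developing software on it was not comfortable for me. I also tried Manjaro, but I am back on Windows 10 now. Maybe I will use Linux again later. Sorry I could not help you.

【Solution 10】:

I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar.

Now the GRID K520 GPU is working while I train TensorFlow models: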

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

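【Discussion】:

【Solution 11】:

None of the above helped me.

I am using Kubernetes on Google Cloud with a Tesla K80 GPU.

Follow this guide to make sure you have installed everything correctly: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

I was missing a few important things:

Install the NVIDIA GPU device drivers on your NODES. To do so, use:

For COS nodes: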

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

For UBUNTU nodes:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

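Make sure the update has rolled out to your nodes. If upgrades are turned off, restart them.

I use the image nvidia/cuda:10.1-base-ubuntu16.04 in my Docker container.
You must set a GPU limit! That is the only way the node driver can communicate with the pod. In your YAML configuration, add this under your container: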
```
resources:
  limits:
    nvidia.com/gpu: 1
```

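【Discussion】:

【Solution 12】:

An important but not widely known fact about the NVIDIA driver is that it is built by DKMS. This allows an automatic rebuild after a kernel upgrade, which happens at system startup. Because of that, error messages are easy to miss, especially on cloud VMs or on servers without an extra IPMI/management interface. However, you can trigger the DKMS build right after installing the package by running dkms autoinstall. If that fails, you get a meaningful error message about a missing dependency or something similar. If dkms autoinstall builds the module correctly, you can simply load it with modprobe; there is no need to reboot the system (which is often used as a way to trigger the DKMS rebuild). You can see an example here.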
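Whether DKMS already built the module for the running kernel can be read from dkms status. The sketch below is a rough illustration: the function name dkms_installed_for is made up, and it assumes status lines shaped like "nvidia, 440.82, <kernel>, x86_64: installed" or "nvidia/440.82, <kernel>, x86_64: installed", which may not cover every dkms release.

```shell
# Read `dkms status` output on stdin and succeed if MODULE is reported as
# installed for kernel RELEASE.
# (dkms_installed_for is a hypothetical helper; the line formats are assumptions.)
dkms_installed_for() {
    module=$1
    release=$2
    awk -v m="$module" -v k="$release" '
        {
            line = $0
            gsub("[,:/]", " ", line)             # normalise separators to spaces
            n = split(line, f, /[ \t]+/)
            if (f[1] != m) next                  # a different module
            has_kernel = 0; has_installed = 0
            for (i = 2; i <= n; i++) {
                if (f[i] == k) has_kernel = 1
                if (f[i] == "installed") has_installed = 1
            }
            if (has_kernel && has_installed) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a real system:
#   dkms status | dkms_installed_for nvidia "$(uname -r)" || sudo dkms autoinstall
#   sudo modprobe nvidia
```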

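【Discussion】:

【Solution 13】:

This can happen after your Linux kernel has been updated. If you run into this error, you can rebuild your NVIDIA driver with the following commands to fix it:

First, you need dkms, which can automatically regenerate modules after the kernel version changes.
sudo apt-get install dkms
Second, rebuild your NVIDIA driver. My NVIDIA driver version is 440.82; if you installed the driver before, you can find your installed version under /usr/src.
sudo dkms build -m nvidia -v 440.82
Finally, reinstall the NVIDIA driver, then reboot your computer.
sudo dkms install -m nvidia -v 440.82

Now you can check whether it works by running sudo nvidia-smi.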
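The version passed to -v has to match the source directory under /usr/src. As a sketch, assuming the driver sources live in directories named nvidia-<version> (the helper name nvidia_dkms_versions is made up), the installed versions can be listed like this:

```shell
# Print the version suffix of every nvidia-<version> directory under SRC_DIR
# (default /usr/src), one per line.
# (nvidia_dkms_versions is a hypothetical helper; the naming scheme is an assumption.)
nvidia_dkms_versions() {
    src_dir=${1:-/usr/src}
    for entry in "$src_dir"/nvidia-*; do
        [ -d "$entry" ] || continue      # also skips the pattern when nothing matches
        basename "$entry" | sed 's/^nvidia-//'
    done
}

# Usage on a real system:
#   v=$(nvidia_dkms_versions | head -n 1)
#   sudo dkms build -m nvidia -v "$v" && sudo dkms install -m nvidia -v "$v"
```

If more than one version is listed, leftovers from an older driver are probably still present and the version should be chosen by hand.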

【Discussion】:

What do you mean by "reinstall the driver"? Run the original installer?

【Solution 14】:

I solved the problem by reinstalling CUDA:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb

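【Discussion】:

【Solution 15】:

I struggled with this problem for two days, so I am sharing my solution here in case anyone needs it.

The VM I use is a standard N-series GPU server on the Azure platform with 2 K80 cards, running Ubuntu 18.04.

Apparently the Linux kernel had been updated a few days before I ran into this problem, and the driver stopped working after the update.

At first I purged and reinstalled as the replies above suggest. Nothing worked. Then at some point (I do not remember why I did this), I updated the default gcc and g++ versions on one of my VMs as follows.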

sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90

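Then I purged the NVIDIA software and reinstalled it following the official documentation (please choose the right one for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal).

sudo apt-get purge 'nvidia-*'

After that, the nvidia-smi command finally worked again.

P.S.:

If you use an Azure Linux VM like I do, the recommended way to install CUDA is actually to enable the "NVIDIA GPU Driver Extension" in the Azure portal (after you have configured the right gcc version, of course).

I tried this on another of my VMs and it worked as well.

【Discussion】:

【Solution 16】:

For everyone else with the same problem for whom none of the solutions work: my fix was simply to disable Secure Boot and then reinstall the driver.

【Discussion】:

【Solution 17】:

Try unplugging the NVIDIA graphics card and plugging it back in.

【Discussion】:

How could this possibly help?
How do you do this on an AWS instance?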