TPU returning "failed call to cuInit: UNKNOWN ERROR (303)" on Google Cloud with Kubernetes Cluster: Answers

【Question Title】: TPU returning "failed call to cuInit: UNKNOWN ERROR (303)" on Google Cloud with Kubernetes Cluster
【Posted】: 2021-09-04 04:31:33
【Question Description】:

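I'm trying to use a TPU with Google Cloud's Kubernetes Engine. When I try to initialize the TPU, my code returns several errors, and every other operation runs only on the CPU. To run this program, I pull a Python file from my Dockerhub workspace into Kubernetes and execute it on a single preemptible v2 TPU. The TPU uses TensorFlow 2.3, which as far as I know is the latest version supported for Cloud TPUs. (When I try TensorFlow 2.4 or 2.5, I get an error message saying the version is not yet supported.)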

When I run my code, Google Cloud sees the TPU but fails to connect to it and uses the CPU instead. It returns this error:

tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)

tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (resnet-tpu-fxgz7): /proc/driver/nvidia/version does not exist

tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz

tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561fb2112c20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001

TPU name grpc://10.8.16.2:8470

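These errors seem to indicate that TensorFlow needs the NVIDIA packages installed, but my understanding from the Google Cloud TPU documentation is that I don't need tensorflow-gpu for a TPU. I tried tensorflow-gpu anyway and got the same errors, so I'm not sure how to fix this. I've deleted and recreated my cluster and TPU several times, but I can't seem to make any progress. I'm relatively new to Google Cloud, so I may be missing something obvious; any help would be greatly appreciated.

Here is the Python script I'm trying to run: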

import os
import sys

import tensorflow as tf

# Parse the TPU name argument 
tpu_name = sys.argv[1]
tpu_name = tpu_name.replace('--tpu=', '')
print("TPU name", tpu_name)


tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection

tpu_name = 'grpc://' + str(tpu.cluster_spec().as_dict()['worker'][0])

print("TPU name", tpu_name)
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
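
The script above extracts the endpoint by string replacement on sys.argv[1] and never checks which devices TensorFlow ends up seeing. The following is a minimal sketch, not the asker's code: it parses the same --tpu=<endpoint> argument with argparse and lists the logical TPU devices after initialization. The helper names parse_tpu_endpoint and list_tpu_devices are hypothetical, and TensorFlow is imported lazily so the parsing helper can be used without it.

```python
import argparse


def parse_tpu_endpoint(argv):
    """Return the value passed as --tpu=<endpoint> (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tpu", required=True, help="TPU gRPC endpoint")
    args = parser.parse_args(argv)
    return args.tpu


def list_tpu_devices(endpoint):
    """Connect to the TPU and return the logical TPU devices TensorFlow sees."""
    # Imported here so that argument parsing works even without TensorFlow.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=endpoint)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.config.list_logical_devices("TPU")
```

Calling list_tpu_devices(parse_tpu_endpoint(sys.argv[1:])) and printing the result would show whether any TPU devices are visible; an empty list would suggest the TPU was not reached.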

Here is the YAML configuration file for my Kubernetes cluster (for this post I've substituted placeholders for my real workspace and image names):

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    metadata:
      name: test 
      annotations:
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regcred
      containers:
        - name:  test
          image: my_workspace/image 
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]

          resources:
            limits:
              cloud-tpus.google.com/preemptible-v2: 8
  backoffLimit: 0
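
The manifest names a TensorFlow version in two places: the tf-version.cloud-tpus.google.com annotation ("2.3") and the pip pin in the container command (tensorflow==2.3.0). Assuming the two are meant to agree on major and minor version, a small check like the one below could be run inside the container. The function names are hypothetical, and comparing only major.minor is an assumption of this sketch.

```python
def major_minor(version):
    """Reduce a version string such as '2.3.0' or '2.3' to (major, minor)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def versions_agree(client_version, tpu_annotation):
    """True when the client TF version matches the TPU annotation on major.minor."""
    return major_minor(client_version) == major_minor(tpu_annotation)
```

For example, versions_agree(tf.__version__, "2.3") could be evaluated before connecting, with tf being the TensorFlow module installed in the container.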

【Question Discussion】:

Tags: tensorflow kubernetes google-cloud-platform tpu google-cloud-tpu

【Solution 1】:

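There are actually no errors in the workload or the logs you've provided. A few comments that I think may help:

As you noted, pip install tensorflow installs tensorflow-gpu. By default it tries to run GPU-specific initialization and fails (failed call to cuInit: UNKNOWN ERROR (303)), so it falls back to local CPU execution. That would be an error if you were developing on a GPU VM, but it doesn't matter in a typical CPU workload. In essence, tensorflow == tensorflow-gpu, and with no GPU available it is equivalent to tensorflow-cpu with additional error messages. Installing tensorflow-cpu will make these warnings go away.
In this workload, the TPU server also runs its own TensorFlow installation. It doesn't actually matter whether your local VM (e.g. your GKE container) has tensorflow-gpu or tensorflow-cpu, as long as it is the same TF version as the TPU server. Your workload is connecting to the TPU server successfully, as these lines show: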

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}

tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}

tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
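
To make the distinction above concrete, here is a rough sketch that sorts log lines into the GPU-probe messages the answer describes as harmless and the worker channel lines it cites as evidence of the TPU connection. The patterns are derived only from the log text quoted in this thread; the function name summarize_log and the returned dictionary keys are hypothetical.

```python
import re

# Messages emitted when TensorFlow probes for a GPU that is not present.
GPU_PROBE_PATTERNS = [
    re.compile(r"Could not load dynamic library 'libcuda\.so\.1'"),
    re.compile(r"failed call to cuInit"),
    re.compile(r"kernel driver does not appear to be running"),
]

# Channel setup line for the TPU worker, capturing its host:port.
WORKER_CHANNEL = re.compile(
    r"Initialize GrpcChannelCache for job worker -> \{0 -> ([\d.]+:\d+)\}"
)


def summarize_log(lines):
    """Count GPU-probe lines and collect the distinct TPU worker addresses."""
    gpu_probe = [ln for ln in lines if any(p.search(ln) for p in GPU_PROBE_PATTERNS)]
    workers = set()
    for ln in lines:
        match = WORKER_CHANNEL.search(ln)
        if match:
            workers.add(match.group(1))
    return {"gpu_probe_lines": len(gpu_probe), "tpu_workers": sorted(workers)}
```

Applied to the log in the question, the tpu_workers entry would hold the worker address, while the GPU-probe count covers only messages that do not affect the TPU.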

【Comments】:

OK, thank you very much for the explanation!