【Posted】: 2021-09-04 04:31:33
【Question】:
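I am trying to use a TPU with Google Cloud's Kubernetes Engine. When I try to initialize the TPU, my code returns several errors, and every other operation runs only on the CPU. To run the program, I pull a Python file from my Dockerhub workspace into Kubernetes and then execute it on a single preemptible v2 TPU. The TPU uses TensorFlow 2.3, which as far as I know is the latest version supported for Cloud TPUs. (When I try TensorFlow 2.4 or 2.5, I get an error message saying that version is not yet supported.)
When I run my code, Google Cloud sees the TPU but cannot connect to it and uses the CPU instead. It returns this error: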
tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (resnet-tpu-fxgz7): /proc/driver/nvidia/version does not exist
tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz
tensorflow/compiler/xla/service/service.cc:168] XLA service 0x561fb2112c20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.8.16.2:8470}
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30001}
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30001
TPU name grpc://10.8.16.2:8470
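These errors seem to indicate that TensorFlow needs the NVIDIA packages installed, but from the Google Cloud TPU documentation I understood that I don't need tensorflow-gpu for TPUs. I tried tensorflow-gpu anyway and got the same errors, so I don't know how to fix this. I have deleted and recreated my cluster and TPU several times, but I can't seem to make any progress. I'm relatively new to Google Cloud, so I may be missing something obvious, but any help would be appreciated.
Here is the Python script I am trying to run: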
import tensorflow as tf
import os
import sys
# Parse the TPU name argument
tpu_name = sys.argv[1]
tpu_name = tpu_name.replace('--tpu=', '')
print("TPU name", tpu_name)
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name) # TPU detection
tpu_name = 'grpc://' + str(tpu.cluster_spec().as_dict()['worker'][0])
print("TPU name", tpu_name)
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
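The script above never confirms which devices TensorFlow ends up with after initialization, so the log output is the only evidence. A minimal sketch of a more explicit check follows. The helper names `parse_tpu_endpoint` and `connect_and_list_tpu_devices` are hypothetical, and the sketch assumes the endpoint arrives either as `--tpu=...` or through the `KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS` variable used in the Job command; it is not a confirmed fix for the errors shown.

```python
import argparse
import os


def parse_tpu_endpoint(argv, env=None):
    """Return the TPU endpoint from --tpu=..., else from the environment.

    Hypothetical helper: replaces the manual sys.argv[1] string handling.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tpu", default=env.get("KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS", "")
    )
    # parse_known_args ignores any unrelated arguments instead of exiting.
    args, _ = parser.parse_known_args(argv)
    endpoint = args.tpu.strip()
    if not endpoint:
        raise ValueError(
            "no TPU endpoint given via --tpu or KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS"
        )
    return endpoint


def connect_and_list_tpu_devices(endpoint):
    """Connect to the TPU and return the logical TPU devices TensorFlow sees.

    TensorFlow is imported lazily so the parsing helper stays usable
    (and testable) on machines without TensorFlow installed.
    """
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=endpoint)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # An empty list here would mean no TPU cores were attached.
    return tf.config.list_logical_devices("TPU")
```

In the script this could be called as `devices = connect_and_list_tpu_devices(parse_tpu_endpoint(sys.argv[1:]))` followed by `print(devices)`, which would show directly whether any TPU cores are visible to the process.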
Here is the YAML configuration file for my Kubernetes cluster (for this post I have replaced my real workspace name and image with placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    metadata:
      name: test
      annotations:
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regcred
      containers:
        - name: test
          image: my_workspace/image
          command: ["/bin/bash","-c","pip3 install cloud-tpu-client tensorflow==2.3.0 && python3 DebugTPU.py --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]
          resources:
            limits:
              cloud-tpus.google.com/preemptible-v2: 8
  backoffLimit: 0
【Discussion】:
Tags: tensorflow kubernetes google-cloud-platform tpu google-cloud-tpu