GKE imagePullBackOff on gcr.io: answers

【Question title】: GKE imagePullBackOff on gcr.io
【Posted】: 2018-10-26 04:01:23
【Question】:

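I am trying to run my own container on GKE using gcr.io, but the pod keeps failing with ImagePullBackOff.

Assuming I had made a mistake, I went back to the tutorial at https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app and followed every step, and I get the same error. It looks like a credentials problem, but I followed the tutorial exactly and still had no luck.

How can I debug this error? The logs do not seem to help.

Steps 1-4 of the tutorial work. The next command is where it fails: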

kubectl run hello-web --image=gcr.io/${PROJECT_ID}/hello-app:v1 --port 8080

This fails with ImagePullBackOff. I thought GKE and gcr.io handled credentials automatically. What am I doing wrong, and how do I debug it?

kubectl describe pods hello-web-6444d588b7-tqgdm

Name:           hello-web-6444d588b7-tqgdm
Namespace:      default
Node:           gke-aia-default-pool-9ad6a2ee-j5g7/10.152.0.2
Start Time:     Sat, 27 Oct 2018 06:51:38 +1000
Labels:         pod-template-hash=2000814463
                run=hello-web
Annotations:    kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container hello-web
Status:         Pending
IP:             10.12.2.5
Controlled By:  ReplicaSet/hello-web-6444d588b7
Containers:
  hello-web:
    Container ID:
    Image:          gcr.io/<project-id>/hello-app:v1
    Image ID:
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qgv8h (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-qgv8h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qgv8h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type     Reason                 Age                  From                                         Message
----     ------                 ----                 ----                                         -------
Normal   Scheduled              45m                  default-scheduler                            Successfully assigned hello-web-6444d588b7-tqgdm to gke-aia-default-pool-9ad6a2ee-j5g7
Normal   SuccessfulMountVolume  45m                  kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  MountVolume.SetUp succeeded for volume "default-token-qgv8h"
Normal   Pulling                44m (x4 over 45m)    kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  pulling image "gcr.io/<project-id>/hello-app:v1"
Warning  Failed                 44m (x4 over 45m)    kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  Failed to pull image "gcr.io/<project-id>/hello-app:v1": rpc error: code = Unknown desc = Error response from daemon: repository gcr.io/<project-id>/hello-app not found: does not exist or no pull access
Warning  Failed                 44m (x4 over 45m)    kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  Error: ErrImagePull
Normal   BackOff                5m (x168 over 45m)   kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  Back-off pulling image "gcr.io/<project-id>/hello-app:v1"
Warning  Failed                 48s (x189 over 45m)  kubelet, gke-aia-default-pool-9ad6a2ee-j5g7  Error: ImagePullBackOff

Cluster permissions:

User info Disabled
Compute Engine Read/Write
Storage Read Only
Task queue Disabled
BigQuery Disabled
Cloud SQL Disabled
Cloud Datastore Disabled
Stackdriver Logging API Write Only
Stackdriver Monitoring API Full
Cloud Platform Disabled
Bigtable Data Disabled
Bigtable Admin Disabled
Cloud Pub/Sub Disabled
Service Control Enabled
Service Management Read Only
Stackdriver Trace Write Only
Cloud Source Repositories Disabled
Cloud Debugger Disabled

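【Comments on the question】:

The steps above were run from a command-line terminal. I also tried the same example from the browser console: I went to the Kubernetes clusters page, selected the hello-app image from the drop-down (gcr.io), and clicked the Deploy button. It generated the YAML and attempted the deployment, which failed the same way. Could this be a region issue? I am in zone australia-southeast1-b.
It is not a region issue. I deleted my cluster, created a new one through the browser interface in us-central1-a, deployed the sample hello-app, and hit the same image pull failure.
Can you describe the pod and provide the full ImagePullBackOff error message?
Also, can you confirm that you are using the default scopes for the GKE cluster?
@patrick-w I have edited the post to add the pod description and the cluster scopes (I did not change the scopes).

Tags: credentials google-kubernetes-engine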

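【Solution 1】:

After reading some documentation, I added the access manually by following these instructions: https://cloud.google.com/container-registry/docs/access-control

The sample code now deploys. The automatic access from GKE to GCR does not seem to work.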
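
The answer does not say which step from that page was used. One common form of manual access is granting the node's service account read access on the Cloud Storage bucket that backs the registry. The sketch below is only an illustration under that assumption: `gcr_bucket` is a hypothetical helper, `NODE_SA` and `PROJECT_ID` are placeholders, and the bucket naming shown applies to standard (not domain-scoped) projects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Cloud Storage bucket that stores images for
# a Container Registry host in a given project.
gcr_bucket() {
  local host="$1" project="$2"
  case "$host" in
    gcr.io)
      echo "gs://artifacts.${project}.appspot.com" ;;
    us.gcr.io|eu.gcr.io|asia.gcr.io)
      # Regional hosts prefix the bucket with the region, e.g. "eu".
      echo "gs://${host%%.*}.artifacts.${project}.appspot.com" ;;
    *)
      echo "unknown registry host: $host" >&2
      return 1 ;;
  esac
}

# Example usage (requires gsutil; NODE_SA and PROJECT_ID are placeholders):
#   gsutil iam ch "serviceAccount:${NODE_SA}:objectViewer" \
#     "$(gcr_bucket gcr.io "$PROJECT_ID")"
```

The `objectViewer` role is enough for pulling; pushing images needs broader storage permissions.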

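【Comments】:

【Solution 2】:

When creating the GKE cluster, make sure your nodes have the Storage RO scope, i.e. https://www.googleapis.com/auth/devstorage.read_only.

I ran into this problem when creating a GKE cluster through Terraform. My configuration had: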

node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]

...

instead of

node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only"
    ]

...
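
For a cluster that already exists, the same check can be made from the command line. The following is a minimal sketch, not an official tool: `has_gcr_pull_scope` is a hypothetical helper, and `CLUSTER` and `ZONE` in the usage comment are placeholders. It treats any scope that includes Cloud Storage read access as sufficient, on the assumption that gcr.io images are served from Cloud Storage.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of gcloud or kubectl.
# Reads OAuth scope URLs on stdin (any separator) and succeeds if at least
# one of them grants read access to Cloud Storage, which backs gcr.io.
has_gcr_pull_scope() {
  grep -Eq 'auth/(devstorage\.(read_only|read_write|full_control)|cloud-platform)'
}

# Example usage (requires gcloud; CLUSTER and ZONE are placeholders):
#   gcloud container clusters describe "$CLUSTER" --zone "$ZONE" \
#     --format="value(nodeConfig.oauthScopes)" | has_gcr_pull_scope \
#     || echo "node scopes do not include Cloud Storage read access"
```

For clusters with several node pools, `gcloud container node-pools describe` reports the scopes of each pool under `config.oauthScopes`. Scopes on an existing node pool cannot be edited in place, so a failing check usually means creating a new pool with the right scopes.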

【Comments】:

Thanks, this solved it for me. It was a hard problem to track down.

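【Solution 3】:

The service account used by kubectl needs the permissions required to perform the deployment and to access GCR (Storage Admin).

Step 1. Create a service account on GCP and assign it roles with Kubernetes and GCR permissions.
Step 2. Save the generated service account JSON file.
Step 3. Authenticate gcloud with that JSON file.
Step 4. Perform the deployment.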
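
The four steps above could look roughly like the script below. This is a sketch under several assumptions: every name (`my-project`, `deployer`, `key.json`, `my-cluster`, `us-central1-a`) is a placeholder, and `roles/container.developer` is one possible choice for the "Kubernetes permissions" the answer mentions. By default the script only prints the commands so they can be reviewed; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# All values below are hypothetical placeholders; override via environment.
PROJECT_ID="${PROJECT_ID:-my-project}"
SA_NAME="${SA_NAME:-deployer}"
KEY_FILE="${KEY_FILE:-key.json}"
CLUSTER="${CLUSTER:-my-cluster}"
ZONE="${ZONE:-us-central1-a}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

setup_deployer() {
  # Step 1: create the service account and grant Kubernetes and GCR roles.
  run gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:${SA_EMAIL}" --role roles/container.developer
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:${SA_EMAIL}" --role roles/storage.admin
  # Step 2: save the service account key as a JSON file.
  run gcloud iam service-accounts keys create "$KEY_FILE" --iam-account "$SA_EMAIL"
  # Step 3: authenticate gcloud with that JSON file.
  run gcloud auth activate-service-account --key-file "$KEY_FILE"
  # Step 4: fetch cluster credentials and perform the deployment.
  run gcloud container clusters get-credentials "$CLUSTER" --zone "$ZONE" --project "$PROJECT_ID"
  run kubectl run hello-web --image="gcr.io/${PROJECT_ID}/hello-app:v1" --port 8080
}

setup_deployer
```

Note that this changes the identity used by gcloud and kubectl on the client side. The image pull itself is performed by the cluster nodes, so the node service account and its scopes still need read access to the registry.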

【Comments】: