cmt

我们的博客系统是部署在用阿里云服务器自己搭建的 Kubernetes 集群上,故障在 k8s 部署更新 pod 的过程中就出现了,昨天发布时,我们特地观察一下,在这1集中分享一下。

在部署过程中,k8s 会进行3个阶段的 pod 更新操作:

  1. "xxx new replicas have been updated"
  2. "xxx replicas are pending termination"
  3. "xxx updated replicas are available"

正常发布情况下,整个部署操作通常在5-8分钟左右完成(这与livenessProbe和readinessProbe的配置有关),下面是部署期间的控制台输出

Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
Waiting for deployment spec update to be observed...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
...
Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
...
Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
deployment "blog-web" successfully rolled out

而在故障场景下,整个部署操作需要在15分钟左右才能完成,3个阶段的 pod 更新都比正常情况下慢,尤其是"old replicas are pending termination"阶段。

在部署期间通过 kubectl get pods -l app=blog-web -o wide 命令查看 pod 的状态,新部署的 pod 处于 Running 状态,说明 livenessProbe 健康检查成功,但多数 pod 没有进入 ready 状态,说明这些 pod 的 readinessProbe 健康检查失败,restarts 大于0 说明 livenessProbe 健康检查失败对 pod 进行了重启。

NAME                        READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
blog-web-55d5677cf-2854n    0/1     Running   1          5m1s    192.168.107.213   k8s-node3    <none>           <none>
blog-web-55d5677cf-7vkqb    0/1     Running   2          6m17s   192.168.228.33    k8s-n9       <none>           <none>
blog-web-55d5677cf-8gq6n    0/1     Running   2          5m29s   192.168.102.235   k8s-n19      <none>           <none>
blog-web-55d5677cf-g8dsr    0/1     Running   2          5m54s   192.168.104.78    k8s-node11   <none>           <none>
blog-web-55d5677cf-kk9mf    0/1     Running   2          6m9s    192.168.42.3      k8s-n13      <none>           <none>
blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-lmbvf    0/1     Running   2          5m54s   192.168.201.123   k8s-n14      <none>           <none>
blog-web-55d5677cf-ms2tk    0/1     Pending   0          6m9s    <none>            <none>       <none>           <none>
blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s   192.168.254.129   k8s-n7       <none>           <none>
blog-web-55d5677cf-nnjdx    0/1     Pending   0          4m48s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-pqgpr    0/1     Pending   0          4m33s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-qrjr5    0/1     Pending   0          2m38s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-t5wvq    1/1     Running   3          6m17s   192.168.10.100    k8s-n12      <none>           <none>
blog-web-55d5677cf-w52xc    1/1     Running   3          6m17s   192.168.73.35     k8s-node10   <none>           <none>
blog-web-55d5677cf-zk559    0/1     Running   1          5m21s   192.168.118.6     k8s-n4       <none>           <none>
blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m     192.168.168.77    k8s-n6       <none>           <none>
blog-web-5b57b7fcb6-cgfr4   1/1     Running   4          19m     192.168.89.250    k8s-n8       <none>           <none>
blog-web-5b57b7fcb6-cz278   1/1     Running   3          19m     192.168.218.99    k8s-n18      <none>           <none>
blog-web-5b57b7fcb6-hvzwp   1/1     Running   3          18m     192.168.195.242   k8s-node5    <none>           <none>
blog-web-5b57b7fcb6-rhgkq   1/1     Running   1          16m     192.168.86.126    k8s-n20      <none>           <none>

在我们的 k8e deployment 配置中 livenessProbe 与 readinessProbe 检查的是同一个地址,具体配置如下

livenessProbe:
    httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
        value: https
    - name: Host
        value: www.cnblogs.com
    initialDelaySeconds: 30
    periodSeconds: 3
    successThreshold: 1
    failureThreshold: 5
    timeoutSeconds: 5
readinessProbe:
    httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
        value: https
    - name: Host
        value: www.cnblogs.com
    initialDelaySeconds: 40
    periodSeconds: 5
    successThreshold: 1
    failureThreshold: 5
    timeoutSeconds: 5

由于潜藏的并发问题造成 livenessProbe 与 readinessProbe 健康检查频繁失败,造成 k8s 更新 pod 的过程跌跌撞撞,在这个过程中,由于有部分旧 pod 分担负载,新 pod 出现问题会暂停更新,等正在部署的 pod 恢复正常,所以这时故障的影响局限在一定范围内,访问网站的表现是时好时坏。

这个跌跌撞撞的艰难部署过程最终会完成,而部署完成之际,就是故障全面爆发之时。部署完成后,新 pod 全面接管负载,存在并发问题的新 pod 在并发请求的重压下溃不成军,多个 pod 因 livenessProbe 健康检查失败被重启,重启后因为 readinessProbe 健康检查失败很难进入 ready 状态分担负载,仅剩的 pod 不堪重负,CrashLoopBackOff 此起彼伏,在源源不断的并发请求的冲击下,始终没有足够的 pod 应付当前的负载,故障就一直无法恢复。

分类:

技术点:

.NET