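Without the full logs, it is hard to guess what exactly caused your nginx pod to be deleted. Also, since you mentioned this is a customer environment, there could be many reasons. As I asked in the comments, it might be HPA or CA, preemptible nodes, temporary network issues, and so on.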
As for the second part, about pod deletion and Liveness: the Liveness probe failed because the nginx pod was already in the deletion process.
One of the Kubernetes defaults is a grace-period of 30 seconds. In short, this means the Pod stays in the Terminating state for up to 30 seconds, after which it is removed.
Tests
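If you want to verify this yourself, you can run a few tests to confirm it. This requires a kubeadm master and a change to the kubelet Verbosity. You can do that by editing the /var/lib/kubelet/kubeadm-flags.env file (you need root privileges) and adding --v=X, where X is a number from 0 to 9, then restarting the kubelet so the flag takes effect. The Kubernetes documentation describes which level shows which logs.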
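The edit described above can be scripted. The following is only a sketch: the function name set_kubelet_verbosity is made up for illustration, and it assumes GNU sed and that the file contains a single KUBELET_KUBEADM_ARGS="..." line, which is the format kubeadm normally writes.

```shell
# Hypothetical helper: set --v=<level> inside KUBELET_KUBEADM_ARGS="...".
# Usage: set_kubelet_verbosity <flags-file> <level>
set_kubelet_verbosity() {
  file="$1"
  level="$2"
  # 1st expression: drop an existing --v=N flag, if there is one.
  # 2nd expression: add the new flag just before the closing quote.
  sed -i -E \
    -e 's/ ?--v=[0-9]+//' \
    -e "s/^(KUBELET_KUBEADM_ARGS=\".*)\"\$/\1 --v=${level}\"/" \
    "$file"
}

# Run as root on the node, then restart the kubelet:
#   set_kubelet_verbosity /var/lib/kubelet/kubeadm-flags.env 5
#   systemctl restart kubelet
```

Removing the old flag before appending the new one keeps the helper safe to run more than once, for example when moving from level 5 to level 8.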
- Set the verbosity to at least level=5; I tested with level=8
- Deploy the Nginx Ingress Controller
- Delete the Nginx Ingress Controller pod manually
- Use $ journalctl -u kubelet to view the logs; you can narrow the output with grep and save it to a file ($ journalctl -u kubelet | grep ingress-nginx-controller-s2kfr > nginx.log)
Below is an example from my test:
# Liveness and Readiness probes work properly:
Feb 24 14:18:35 kubeadm kubelet[11922]: I0224 14:18:35.399156 11922 prober.go:126] Readiness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" succeeded
Feb 24 14:18:40 kubeadm kubelet[11922]: I0224 14:18:40.587129 11922 prober.go:126] Liveness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" succeeded
# Once the deletion process starts, you can find the DELETE api entry and other information
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.900957 11922 kubelet.go:1931] SyncLoop (DELETE, "api"): "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)"
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.901057 11922 kubelet_pods.go:1482] Generating status for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)"
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.901914 11922 round_trippers.go:422] GET https://10.154.15.225:6443/api/v1/namespaces/ingress-nginx/pods/ingress-nginx-controller-s2kfr
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.909123 11922 event.go:291] "Event occurred" object="ingress-nginx/ingress-nginx-controller-s2kfr" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container controller"
# This entry appears because the default grace period was kept
Feb 24 14:18:46 kubeadm kubelet[11922]: I0224 14:18:46.947193 11922 kubelet_pods.go:952] Pod "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)" is terminated, but some containers are still running
# As the Pod was being deleted, the probes failed.
Feb 24 14:18:50 kubeadm kubelet[11922]: I0224 14:18:50.584208 11922 prober.go:117] Liveness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" failed (failure): HTTP probe failed with statuscode: 500
Feb 24 14:18:50 kubeadm kubelet[11922]: I0224 14:18:50.584338 11922 event.go:291] "Event occurred" object="ingress-nginx/ingress-nginx-controller-s2kfr" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Liveness probe failed: HTTP probe failed with statuscode: 500"
Feb 24 14:18:52 kubeadm kubelet[11922]: I0224 14:18:52.045155 11922 kubelet_pods.go:952] Pod "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21)" is terminated, but some containers are still running
Feb 24 14:18:55 kubeadm kubelet[11922]: I0224 14:18:55.398025 11922 prober.go:117] Readiness probe for "ingress-nginx-controller-s2kfr_ingress-nginx(9046e404-1b9e-44fd-86f3-5a16ebf27c21):controller" failed (failure): HTTP probe failed with statuscode: 500
In these logs, the time between SyncLoop (DELETE, "api") and the failed Liveness probe is 4 seconds. In other test runs the gap was also a few seconds (between 4 and 7 seconds).
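To measure that gap in your own saved log instead of reading timestamps by eye, something like the following could be used. It is a rough sketch: the function name delete_to_probe_gap is hypothetical, it assumes the default journalctl line format shown above (time of day in the third field), and it does not handle a log that crosses midnight.

```shell
# Hypothetical helper: print the number of seconds between the first
# SyncLoop (DELETE, "api") entry and the first failed Liveness probe after it.
# Usage: delete_to_probe_gap <kubelet-log-file>
delete_to_probe_gap() {
  awk '
    # Convert an HH:MM:SS timestamp to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }

    # Remember when the kubelet first saw the DELETE from the API server.
    /SyncLoop \(DELETE, "api"\)/ && !have_del { del = secs($3); have_del = 1; next }

    # First prober.go failure line for the Liveness probe after the DELETE.
    have_del && /Liveness probe for .* failed/ { print secs($3) - del; exit }
  ' "$1"
}

# Example: delete_to_probe_gap nginx.log
```

The pattern matches only the prober.go lines ("Liveness probe for ... failed"), not the follow-up "Event occurred" lines, so each failure is counted once.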
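If you want to run your own tests, you can change the Readiness and Liveness probe interval to 1 second (instead of the default 10); you will then see the probe failure in the same second as the DELETE api entry.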
Feb 24 15:09:40 kubeadm kubelet[11922]: I0224 15:09:40.865718 11922 prober.go:126] Liveness probe for "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6):controller" succeeded
Feb 24 15:09:41 kubeadm kubelet[11922]: I0224 15:09:41.488819 11922 kubelet.go:1931] SyncLoop (DELETE, "api"): "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6)"
...
Feb 24 15:09:41 kubeadm kubelet[11922]: I0224 15:09:41.865422 11922 prober.go:117] Liveness probe for "ingress-nginx-controller-wwrdw_ingress-nginx(427bc9d6-261e-4427-b034-7abe8cbbfea6):controller" failed (failure): HTTP probe failed with statuscode: 500
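One possible way to switch both probes to a 1-second interval for such a test is a JSON patch applied with kubectl patch. This is a sketch under assumptions that are not in the original setup: the helper name probe_period_patch is made up, the controller is assumed to be the first container in the pod template, and the workload kind and name in the commented command (daemonset ingress-nginx-controller) must be adjusted to your deployment.

```shell
# Hypothetical helper: build a JSON patch that sets periodSeconds on the
# liveness and readiness probes of the first container in the pod template.
# Usage: probe_period_patch <seconds>
probe_period_patch() {
  period="$1"
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":%d},{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/periodSeconds","value":%d}]\n' \
    "$period" "$period"
}

# Assumed workload kind and name; change them to match your cluster:
#   kubectl -n ingress-nginx patch daemonset ingress-nginx-controller \
#     --type=json -p "$(probe_period_patch 1)"
```

Patching the pod template recreates the controller pods, so apply it before the manual delete step rather than in the middle of a test.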
You can find a good explanation of syncLoop in the Alibaba docs:
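As indicated in the comments, the syncLoop function is the main loop of the Kubelet. This function listens for updates, obtains the latest Pod configurations, and synchronizes the running state with the desired state. In this way, all Pods on the local node can run in the expected state. In fact, syncLoop only encapsulates syncLoopIteration, and the synchronization itself is carried out by syncLoopIteration.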
Conclusion
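If you do not have additional logging that saves the pod's output before it terminates, it is very hard to determine the root cause some time after the event.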
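In the setup you provided, the Liveness probe failed because the nginx-ingress pod was already terminating. The Liveness probe failure did not trigger the pod deletion; it was a result of the deletion.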
In addition, you can also look at the Kubelet and Prober source code.