【Question title】: Kuberhealthy deployment health check fails frequently saying cluster ClusterUnhealthy kuberhealth
【Posted】: 2023-01-19 20:08:39
【Question】:

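The Kuberhealthy deployment health check fails frequently, saying [Prometheus]: [FIRING:2] kuberhealthy (ClusterUnhealthy kuberhealthy http kuberhealthy observability/kube-prometheus-stack-prometh

Steps to reproduce: Kuberhealthy runs the deployment check periodically. Although the deployment appears to complete, the check fails to report its status to the Kuberhealthy service.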

$ k get events -nkuberhealthy | grep deployment | tail
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 2
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXXto 4
12m         Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 0
3m31s       Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 4
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 2
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-69459d778b to 2
3m9s        Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled up replica set deployment-deployment-XXX to 4
3m          Normal    ScalingReplicaSet        deployment/deployment-deployment                      Scaled down replica set deployment-deployment-XXX to 0
63m         Warning   FailedToUpdateEndpoint   endpoints/deployment-svc                              Failed to update endpoint kuberhealthy/deployment-svc: Operation cannot be fulfilled on endpoints "deployment-svc": the object has been modified; please apply your changes to the latest version and try again
53m         Warning   FailedToUpdateEndpoint   endpoints/d

debug logs 
$ k logs deployment-XXX -nkuberhealthy
time="2022-12-16T12:36:43Z" level=info msg="Found instance namespace: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2022-12-16T12:36:43Z" level=info msg="Debug logging enabled."
time="2022-12-16T12:36:43Z" level=debug msg="[/app/deployment-check]"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_IMAGE: XXXX"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_IMAGE_ROLL_TO: XXX"
time="2022-12-16T12:36:43Z" level=info msg="Found pod namespace: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Performing check in kuberhealthy namespace."
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_DEPLOYMENT_REPLICAS: 2"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_SERVICE_ACCOUNT: default"
time="2022-12-16T12:36:43Z" level=info msg="Check time limit set to: 14m46.760673918s"
time="2022-12-16T12:36:43Z" level=info msg="Parsed CHECK_DEPLOYMENT_ROLLING_UPDATE: true"
time="2022-12-16T12:36:43Z" level=info msg="Check deployment image will be rolled from [XXX] to [XXXX]"
time="2022-12-16T12:36:43Z" level=debug msg="Allowing this check 14m46.760673918s to finish."
time="2022-12-16T12:36:43Z" level=info msg="Kubernetes client created."
time="2022-12-16T12:36:43Z" level=info msg="Waiting for node to become ready before starting check."
time="2022-12-16T12:36:43Z" level=debug msg="Checking if the kuberhealthy endpoint: XXX is ready."
time="2022-12-16T12:36:43Z" level=debug msg="XXX."
time="2022-12-16T12:36:43Z" level=debug msg="Kuberhealthy endpoint: XXX is ready. Proceeding to run check."
time="2022-12-16T12:36:43Z" level=info msg="Starting check."
time="2022-12-16T12:36:43Z" level=info msg="Wiping all found orphaned resources belonging to this check."
time="2022-12-16T12:36:43Z" level=info msg="Attempting to find previously created service(s) belonging to this check."
time="2022-12-16T12:36:43Z" level=debug msg="Found 1 service(s)."
time="2022-12-16T12:36:43Z" level=debug msg="Service: kuberhealthy"
time="2022-12-16T12:36:43Z" level=info msg="Did not find any old service(s) belonging to this check."
time="2022-12-16T12:36:43Z" level=info msg="Attempting to find previously created deployment(s) belonging to this check."
time="2022-12-16T12:36:44Z" level=debug msg="Found 1 deployment(s)"
time="2022-12-16T12:36:44Z" level=debug msg=kuberhealthy
time="2022-12-16T12:36:44Z" level=info msg="Did not find any old deployment(s) belonging to this check."
time="2022-12-16T12:36:44Z" level=info msg="Successfully cleaned up prior check resources."
time="2022-12-16T12:36:44Z" level=info msg="Creating deployment resource with 2 replica(s) in kuberhealthy namespace using image XXX]"
time="2022-12-16T12:36:44Z" level=info msg="Creating container using image [XXX]"
time="2022-12-16T12:36:44Z" level=info msg="Created deployment resource."
time="2022-12-16T12:36:44Z" level=info msg="Creating deployment in cluster with name: deployment-deployment"
time="2022-12-16T12:36:44Z" level=info msg="Watching for deployment to exist."
time="2022-12-16T12:36:44Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event ADDED"
time="2022-12-16T12:36:47Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:48Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:53Z" level=debug msg="Received an event watching for deployment changes: deployment-deployment got event MODIFIED"
time="2022-12-16T12:36:53Z" level=info msg="Deployment is reporting Available with True."
time="2022-12-16T12:36:53Z" level=info msg="Created deployment in kuberhealthy namespace: deployment-deployment"
time="2022-12-16T12:36:53Z" level=info msg="Creating service resource for kuberhealthy namespace."
time="2022-12-16T12:36:53Z" level=info msg="Created service resource."
time="2022-12-16T12:36:53Z" level=info msg="Creating service in cluster with name: deployment-svc"
time="2022-12-16T12:36:53Z" level=info msg="Watching for service to exist."
time="2022-12-16T12:36:53Z" level=debug msg="Received an event watching for service changes: ADDED"
time="2022-12-16T12:36:53Z" level=info msg="Cluster IP found:XXX"
time="2022-12-16T12:36:53Z" level=info msg="Created service in kuberhealthy namespace: deployment-svc"
time="2022-12-16T12:36:53Z" level=debug msg="Retrieving a cluster IP belonging to: deployment-svc"
time="2022-12-16T12:36:53Z" level=info msg="Found service cluster IP address: XXX"
time="2022-12-16T12:36:53Z" level=info msg="Looking for a response from the endpoint."
time="2022-12-16T12:36:53Z" level=debug msg="Setting timeout for backoff loop to: 3m0s"
time="2022-12-16T12:36:53Z" level=info msg="Beginning backoff loop for HTTP GET request."
time="2022-12-16T12:36:53Z" level=debug msg="Making GET to XXX"
time="2022-12-16T12:36:53Z" level=debug msg="Got a 401"
time="2022-12-16T12:36:53Z" level=info msg="Retrying in 5 seconds."
time="2022-12-16T12:36:58Z" level=error msg="error occurred making request to service in cluster: could not get a response from the given address: XXX"
time="2022-12-16T12:36:58Z" level=info msg="Cleaning up deployment and service."
time="2022-12-16T12:36:58Z" level=info msg="Attempting to delete service deployment-svc in kuberhealthy namespace."
time="2022-12-16T12:36:58Z" level=debug msg="Checking if service has been deleted."
time="2022-12-16T12:36:58Z" level=debug msg="Delete service and wait has not yet timed out."
time="2022-12-16T12:36:58Z" level=debug msg="Waiting 5 seconds before trying again."
time="2022-12-16T12:37:03Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."
time="2022-12-16T12:37:03Z" level=debug msg="Checking if deployment has been deleted."
time="2022-12-16T12:37:03Z" level=debug msg="Delete deployment and wait has not yet timed out."
time="2022-12-16T12:37:03Z" level=debug msg="Waiting 5 seconds before trying again."
time="2022-12-16T12:37:08Z" level=info msg="Finished clean up process."
time="2022-12-16T12:37:08Z" level=error msg="Reporting errors to Kuberhealthy: [could not get a response from the given address: XXX"

【Comments】:

  • It seems that the service you deployed cannot be reached. Did you check that?

Tags: kubernetes monitoring


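【Answer 1】:

It looks like your Kubernetes deployment is working fine. Telling a k8s client (a controller) to try again is common k8s behavior. This is perfectly fine and you can safely ignore it.

Let me try to explain the general cause of this kind of warning in the events:

The K8s API server implements something called "optimistic concurrency control" (sometimes referred to as optimistic locking). With this method, a piece of data is not locked to prevent it from being read or updated while the lock is held. Instead, the piece of data carries a version number, and every time the data is updated the version number increases.

When the data is updated, the server checks whether the version number has increased between the time the client read the data and the time it submitted the update. If it has, the update is rejected and the client must re-read the new data and try the update again. The result is that when two clients try to update the same data entry, only the first one succeeds.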
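The read, compare, and retry cycle described above can be sketched with a small in-memory store. This is only an illustration of the technique, not the Kubernetes API or client code: `VersionedStore`, `ConflictError`, and `update_with_retry` are hypothetical names invented for this example, and the integer version stands in for the object's `metadata.resourceVersion`. The rejected write plays the role of the "the object has been modified" warning seen in the events.

```python
import copy


class ConflictError(Exception):
    """Raised when an update carries a stale version (the conflict case)."""


class VersionedStore:
    """Toy in-memory store that uses optimistic concurrency control."""

    def __init__(self):
        self._objects = {}  # name -> (version, data)

    def create(self, name, data):
        self._objects[name] = (1, copy.deepcopy(data))

    def get(self, name):
        version, data = self._objects[name]
        # Return a copy so callers cannot mutate stored state directly.
        return version, copy.deepcopy(data)

    def update(self, name, data, expected_version):
        current_version, _ = self._objects[name]
        if expected_version != current_version:
            # Someone else wrote after this caller read: reject the write.
            raise ConflictError(
                f'"{name}": the object has been modified; '
                "re-read the latest version and try again"
            )
        new_version = current_version + 1
        self._objects[name] = (new_version, copy.deepcopy(data))
        return new_version


def update_with_retry(store, name, mutate, max_attempts=5):
    """Re-read and re-apply `mutate` until the write is accepted."""
    for _ in range(max_attempts):
        version, data = store.get(name)
        try:
            return store.update(name, mutate(data), version)
        except ConflictError:
            continue  # stale read; loop re-reads the latest version
    raise ConflictError(f'"{name}": still conflicting after {max_attempts} attempts')


if __name__ == "__main__":
    store = VersionedStore()
    store.create("deployment-svc", {"addresses": []})

    # Two clients read the same version of the object.
    v_a, data_a = store.get("deployment-svc")
    v_b, data_b = store.get("deployment-svc")

    data_a["addresses"].append("10.0.0.1")
    store.update("deployment-svc", data_a, v_a)  # first writer wins

    try:
        store.update("deployment-svc", data_b, v_b)  # stale version
    except ConflictError as err:
        print("rejected:", err)

    # The losing client retries from a fresh read and now succeeds.
    update_with_retry(
        store,
        "deployment-svc",
        lambda d: {"addresses": d["addresses"] + ["10.0.0.2"]},
    )
    print(store.get("deployment-svc"))
```

In this sketch the second writer loses nothing: it re-reads the object, re-applies its change on top of the first writer's result, and its next write is accepted. That is why a single conflict warning followed by a successful retry is usually harmless.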

Also refer to SO for related information.

Also go through Kubernetes Health Checks: Everything You Need to Know for more information.

【Comments】:
