Kubernetes autoscaler will not scale down nodes

【Question title】: Kubernetes autoscaler will not scale down nodes
【Posted】: 2020-09-30 12:24:49
【Question description】:

I am using the Kubernetes autoscaler on AWS. I deployed it with the following command:

          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:nodes.k8s-1-17.dev.platform

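However, the autoscaler never seems to initiate a scale-down. The logs show that it finds an unused node, but it does not scale it down and does not report an error (the nodes that show "no node group config" are the master nodes).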

I0610 22:09:37.164102       1 static_autoscaler.go:147] Starting main loop
I0610 22:09:37.164462       1 utils.go:471] Removing autoscaler soft taint when creating template from node ip-10-141-10-176.ec2.internal
I0610 22:09:37.164805       1 utils.go:626] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0610 22:09:37.164823       1 static_autoscaler.go:303] Filtering out schedulables
I0610 22:09:37.165083       1 static_autoscaler.go:320] No schedulable pods
I0610 22:09:37.165106       1 static_autoscaler.go:328] No unschedulable pods
I0610 22:09:37.165123       1 static_autoscaler.go:375] Calculating unneeded nodes
I0610 22:09:37.165141       1 utils.go:574] Skipping ip-10-141-12-194.ec2.internal - no node group config
I0610 22:09:37.165155       1 utils.go:574] Skipping ip-10-141-15-159.ec2.internal - no node group config
I0610 22:09:37.165167       1 utils.go:574] Skipping ip-10-141-11-28.ec2.internal - no node group config
I0610 22:09:37.165181       1 utils.go:574] Skipping ip-10-141-13-239.ec2.internal - no node group config
I0610 22:09:37.165197       1 utils.go:574] Skipping ip-10-141-10-69.ec2.internal - no node group config
I0610 22:09:37.165378       1 scale_down.go:379] Scale-down calculation: ignoring 4 nodes unremovable in the last 5m0s
I0610 22:09:37.165397       1 scale_down.go:410] Node ip-10-141-10-176.ec2.internal - utilization 0.023750
I0610 22:09:37.165692       1 cluster.go:90] Fast evaluation: ip-10-141-10-176.ec2.internal for removal
I0610 22:09:37.166115       1 cluster.go:225] Pod metrics-storage/querier-6bdfd7c6cf-wm7r8 can be moved to ip-10-141-13-253.ec2.internal
I0610 22:09:37.166227       1 cluster.go:225] Pod metrics-storage/querier-75588cb7dc-cwqpv can be moved to ip-10-141-12-116.ec2.internal
I0610 22:09:37.166398       1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553       1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s
I0610 22:09:37.166608       1 static_autoscaler.go:402] Scale down status: unneededOnly=true lastScaleUpTime=2020-06-10 21:45:31.739421421 +0000 UTC m=+22.218633767 lastScaleDownDeleteTime=2020-06-10 21:45:31.739421531 +0000 UTC m=+22.218633877 lastScaleDownFailTime=2020-06-10 22:06:44.128044684 +0000 UTC m=+1294.607257070 scaleDownForbidden=false isDeleteInProgress=false

Why is the autoscaler not scaling down the node?

【Discussion】:

Do you have any pod disruption budgets?
Can you check this issue: github.com/kubernetes/autoscaler/issues/2936
Can you try auto-discovery using this link: github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/…

Tags: amazon-web-services kubernetes autoscaling

【Solution 1】:

It looks to me like cluster-autoscaler is behaving correctly so far. It has decided that one of the nodes can be scaled down:

I0610 22:09:37.166398       1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553       1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s

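However, by default cluster-autoscaler waits 10 minutes before it actually terminates a node. See "How does scale-down work?" in the FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work

The logs above show that your node has been unneeded for a duration of only 2m41s. Once that duration reaches 10 minutes, the scale-down will occur. The node was marked unneeded at 22:06:55 UTC, so that would be at around 22:16:55 UTC, provided the node stays unneeded the whole time.

After 10 minutes, you should see something like this: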

I0611 14:58:02.384101       1 static_autoscaler.go:382] <node_name> is unneeded since 2020-06-11 14:47:59.621770178 +0000 UTC m=+1299856.757452427 duration 10m2.760318899s
<...snip...>
I0611 14:58:02.385035       1 scale_down.go:754] Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>
I0611 14:58:02.386146       1 event.go:209] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"cluster-autoscaler", Name:"cluster-autoscaler-status", UID:"31a72ce9-9c4e-11ea-a0a8-0201be076001", APIVersion:"v1", ResourceVersion:"13431409", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>

I believe this setting is there to prevent thrashing.
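
For reference, the waiting period described above corresponds to the `--scale-down-unneeded-time` flag, which the FAQ lists with a default of 10 minutes. The fragment below is only a sketch of how the command from the question might look with that flag set explicitly; the `5m` value is an assumption chosen for illustration, not a recommendation.

```yaml
# Fragment of the cluster-autoscaler container spec (sketch, not a full manifest).
command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --nodes=1:10:nodes.k8s-1-17.dev.platform
  # How long a node must stay unneeded before it is removed (default 10m).
  # Assumption: 5m is an illustrative value only.
  - --scale-down-unneeded-time=5m
```

A shorter value makes scale-down react faster, at the cost of more node churn when load fluctuates.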

【Discussion】:

【Solution 2】:

We recently ran into a similar problem with cluster-autoscaler. After upgrading our EKS cluster to 1.18, we saw similar logs in the autoscaler:

Skipping ip-xx-xx-xx-xx.ec2.internal - no node group config

The problem was with auto-discovery. Instead of kubernetes.io/cluster/YOUR_CLUSTER_NAME, the tags listed below should be present on the ASG:

k8s.io/cluster-autoscaler/YOUR_CLUSTER_NAME

k8s.io/cluster-autoscaler/enabled

For details, see: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.4.0
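
Once the ASG carries both tags, the autoscaler can be told to discover node groups by tag instead of through a static `--nodes` entry. The fragment below is a minimal sketch based on the auto-discovery setup described in the cluster-autoscaler documentation for AWS; `YOUR_CLUSTER_NAME` is a placeholder, and the other flags are copied from the question rather than from this answer.

```yaml
# Fragment of the cluster-autoscaler container spec (sketch, not a full manifest).
command:
  - ./cluster-autoscaler
  - --v=4
  - --cloud-provider=aws
  # Discover every ASG that carries both tags.
  # This takes the place of a static --nodes=<min>:<max>:<asg-name> entry.
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/YOUR_CLUSTER_NAME
```

With auto-discovery, the minimum and maximum sizes are taken from the ASG itself rather than from the command line.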

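【Discussion】:

【Solution 3】:

We recently found that this happened because the autoscaler had been started without the correct region specified; eu-west-1 was set there by default. After resetting this value to the correct region and restarting the autoscaler, our nodes started being discovered correctly.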
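
How the region is set depends on how the autoscaler was deployed, and this answer does not say which mechanism was used. As one possibility, the example manifests for AWS pass it through the `AWS_REGION` environment variable on the container. The fragment below is a sketch under that assumption, with `us-east-1` as a placeholder value.

```yaml
# Fragment of the cluster-autoscaler container spec (sketch, not a full manifest).
env:
  - name: AWS_REGION
    # Assumption: placeholder value. Use the region that hosts your ASGs.
    value: us-east-1
```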

【Discussion】: