Why isn't GKE scaling down cluster nodes even though I only have one pod? Answers

【Question Title】: Why isn't GKE scaling down cluster nodes even though I only have one pod?
【Posted】: 2020-09-10 00:44:54
【Question Description】:

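I know there are existing questions about this, and they usually point to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why

But I still can't debug it. Only 1 pod is running on my cluster, so I don't understand why it won't scale down to 1 node. How can I debug this further?

Here is some information: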

kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-qua-gke-foobar1234-default-pool-6302174e-4k84   Ready    <none>   4h14m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-6wfs   Ready    <none>   16d     v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-74lm   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-m223   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-srlg   Ready    <none>   66d     v1.14.10-gke.27

kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
qua-gke-foobar1234-5959446675-njzh4   1/1     Running   0          14m

nodePools:
- autoscaling:
    enabled: true
    maxNodeCount: 10
    minNodeCount: 1
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-highcpu-32
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/datastore
    - https://www.googleapis.com/auth/devstorage.full_control
    - https://www.googleapis.com/auth/pubsub
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/fooooobbbarrr-dev/zones/us-central1-a/instanceGroupManagers/gke-qua-gke-foobar1234-default-pool-6302174e-grp
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/ffoooobarrrr-dev/locations/us-central1/clusters/qua-gke-foobar1234/nodePools/default-pool
  status: RUNNING
  version: 1.14.10-gke.27

kubectl describe horizontalpodautoscaler
Name:               qua-gke-foobar1234
Namespace:          default
Labels:             <none>
Annotations:        autoscaling.alpha.kubernetes.io/conditions:
                      [{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-17T19:59:19Z","reason":"ReadyForNewScale","message":"recommended size...
                    autoscaling.alpha.kubernetes.io/current-metrics:
                      [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                    autoscaling.alpha.kubernetes.io/metrics:
                      [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                    kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"qua-gke-foobar1234","namespace":...
CreationTimestamp:  Tue, 17 Mar 2020 12:59:03 -0700
Reference:          Deployment/qua-gke-foobar1234
Min replicas:       1
Max replicas:       10
Deployment pods:    1 current / 1 desired
Events:             <none>

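【Question Discussion】:

Check kubectl get pods --all-namespaces
HPA is for pod autoscaling, not nodes. Have you enabled the node autoscaler? What minimum node count is set for scale-down?
You should check the logs to see what decisions the autoscaler is making: cloud.google.com/kubernetes-engine/docs/how-to/…
Check the pods in all namespaces and add more details to your question
@coderanger Ah, I see gist.github.com/danielyaa5/0779e29ca72869e7b290ae33c6817157, so some of those may be preventing the nodes from shutting down

Tags: kubernetes google-cloud-platform google-kubernetes-engine autoscaling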

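【Solution 1】:

The HorizontalPodAutoscaler increases or decreases the number of pods, not nodes. It has nothing to do with node scaling.

Node scaling is handled by the cloud provider, in your case Google Cloud Platform.

You should check in the GCP console whether the node autoscaler is enabled.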

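Follow these steps:
1. Go to the Kubernetes clusters screen in the GCP console.
2. Click your cluster.
3. At the bottom, click the node pool you want to enable autoscaling for.
4. Click "Edit".
5. Enable autoscaling, define the minimum and maximum number of nodes, and save.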

Alternatively, use the gcloud CLI:

gcloud container clusters update cluster-name --enable-autoscaling \
    --min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool

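【Discussion】:

My cluster is already set to autoscale (nodePools: - autoscaling: enabled: true maxNodeCount: 10 minNodeCount: 1). That part of my post comes from gcloud container clusters describe cluster-name
The reason I posted the horizontal scaling object is that in my case each pod needs its own node, so when horizontal scaling adds a new pod, it has to request a new node from gcloud

【Solution 2】:

The initial problem with my debugging attempts was that I ran kubectl get pods instead of kubectl get pods --all-namespaces, so I couldn't see the system pods that were running. I then added PDBs for all the system pods.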

kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-fluentd-scaler --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1

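Then I started seeing this error in the event logs of some of the PDBs: controllermanager Failed to calculate the number of expected pods: found no controllers for pod. I saw it on the PDBs when running kubectl describe pdb --all-namespaces. I don't know why it occurred, but I deleted those PDBs, and then everything started working!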
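One way to spot PDBs in that state without reading every describe output is to filter on the allowed-disruption count. The sketch below is not from the original answer. The function name pdbs_blocking_eviction is hypothetical, and it assumes that a PDB whose expected pod count cannot be calculated reports zero (or no) allowed disruptions in .status.disruptionsAllowed, which is worth confirming on your own cluster version.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl): reads lines of the form
#   NAMESPACE NAME DISRUPTIONS_ALLOWED
# on stdin and prints NAMESPACE/NAME for every PDB that currently allows
# no voluntary disruptions. Such a PDB can keep a node from being drained.
pdbs_blocking_eviction() {
  awk '$3 == "0" || $3 == "<none>" { print $1 "/" $2 }'
}

# Intended usage against a live cluster (left commented so the snippet
# can be sourced without cluster access):
# kubectl get pdb --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | pdbs_blocking_eviction
```

Any PDB this prints is a candidate for a closer look with kubectl describe pdb before deciding whether to fix its selector or delete it.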

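【Discussion】:

【Solution 3】:

I had the same problem. The cause was that workloads running in the kube-system namespace had no PDBs. You can check the "Autoscaler Logs" tab.

If you don't configure PDBs for those workloads, the cluster autoscaler won't remove the surplus GKE nodes they run on. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

There is an interesting discussion about whether there should be some default behavior or default PDBs. https://github.com/kubernetes/kubernetes/issues/35318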
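To see which nodes are being held only by kube-system pods, it can help to count pods per node. This is a minimal sketch, assuming input in three whitespace-separated columns (namespace, pod name, node name). summarize_pods_by_node is a hypothetical helper name, not a kubectl feature.

```shell
#!/bin/sh
# Hypothetical helper: reads "NAMESPACE POD NODE" lines on stdin and prints,
# for each node, the total pod count and how many of those pods live in
# kube-system. Output is sorted by node name.
summarize_pods_by_node() {
  awk '
    { total[$3]++; if ($1 == "kube-system") sys[$3]++ }
    END {
      for (n in total)
        printf "%s total=%d kube-system=%d\n", n, total[n], sys[n] + 0
    }
  ' | sort
}

# Intended usage against a live cluster (left commented so the snippet
# can be sourced without cluster access):
# kubectl get pods --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName \
#   | summarize_pods_by_node
```

A node whose total equals its kube-system count hosts nothing but system pods, so those are the pods whose PDBs, or lack of them, decide whether the autoscaler can remove that node.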

【Discussion】: