Vertical / Horizontal Scaling for GKE Autopilot: Answers

【Question Title】: Vertical / Horizontal Scaling for GKE Autopilot
【Posted】: 2021-11-06 23:11:00
【Question】:

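My team and I are trying to deploy a very compute-intensive workload on GCP's serverless infrastructure. Because Cloud Run's resource limits are quite narrow (4 vCPUs and 8 GB of memory), we moved on to testing GKE with Autopilot.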

With a default Autopilot cluster, I managed to configure a single Deployment and container with up to 8 vCPUs, but no more than that.

My question now is whether there is a way to deploy a Deployment and container with resources.requests.cpu > 8 and, if so, how.

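What I have tried so far:

Setting the resource requests - this works fine, up to 8 vCPUs
Horizontal, vertical and multidimensional autoscaling - this does not seem to help
A nodeSelector so that the pod is deployed on a more powerful node - this is forbidden on Autopilot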

Here is my deployment.yaml:

---
apiVersion: "apps/v1"
kind: "Deployment"
metadata:
  name: "backend-flask"
  namespace: "default"
  labels:
    app: "backend-flask"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "backend-flask"
  template:
    metadata:
      labels:
        app: "backend-flask"
    spec:
      containers:
      - name: "backend-flask1"
        image: "{...}backend-flask:latest"
        resources:
          requests:
            memory: "6Gi"
            cpu: "8"
          limits:
            memory: "32Gi"
            cpu: "32"
      # nodeSelector:
      #   beta.kubernetes.io/instance-type: e2-highcpu-32
---
# apiVersion: autoscaling.gke.io/v1beta1
# kind: MultidimPodAutoscaler
# metadata:
#   name: backend-flask-autoscaler
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: backend-flask
#   goals:
#     metrics:
#     - type: Resource
#       resource:
#       # Define the target CPU utilization request here
#         name: cpu
#         target:
#           type: Utilization
#           averageUtilization: 80
#   constraints:
#     global:
#       minReplicas: 1
#       maxReplicas: 2
#     containerControlledResources: [ memory ]
#     container:
#     - name: '*'
#     # Define boundaries for the memory request here
#       requests:
#         minAllowed:
#           memory: 4Gi
#           cpu: 4
#         maxAllowed:
#           memory: 32Gi
#           cpu: 32
#   policy:
#     updateMode: Auto
# ---
apiVersion: "autoscaling/v2beta1"
kind: "HorizontalPodAutoscaler"
metadata:
  name: "backend-flask-horizontal-autoscaler"
  namespace: "default"
  labels:
    app: "backend-flask"
spec:
  scaleTargetRef:
    kind: "Deployment"
    name: "backend-flask"
    apiVersion: "apps/v1"
  minReplicas: 1
  maxReplicas: 1
  metrics:
  - type: "Resource"
    resource:
      name: "cpu"
      targetAverageUtilization: 80
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-flask-horizontal-autoscaler
  namespace: "default"
  labels:
    app: "backend-flask"
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       backend-flask
  updatePolicy:
    updateMode: "Auto"
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "backend-flask-service"
  namespace: "default"
  labels:
    app: "backend-flask"
spec:
  ports:
  - protocol: "TCP"
    port: 5000
    targetPort: 5000
  selector:
    app: "backend-flask"
  type: "LoadBalancer"
  loadBalancerIP: ""

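【Comments】:

When the deployment requests 16 CPUs, I can get an additional e2-highcpu-16 node added, but it just sits idle and the pod cannot be scheduled
I was able to deploy 16 CPU / 16G on Autopilot a few minutes ago without any problem
Is there anything in the logs of your Deployment or of the failing pod?
Could it be that your CPU quota is not high enough?
I was also able to deploy 28 vCPU / 28G. The Autopilot limit is 28 vCPUs per pod. gist.github.com/mastersingh24/dbdf181569522c23ad70a6a2881870ec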

Tags: google-cloud-platform google-kubernetes-engine autoscaling

【Solution 1】:

It turned out this was indeed a quota issue. For some reason, the quota kept showing more instances than I was actually using at the time.

The quota increase only took effect after I deleted and recreated the cluster.

Finally, my own autoscalers were interfering with my deployment, because of the resource ranges I had specified for the requests.

Thanks for your answer, @GariSingh. After removing the autoscalers and increasing the quota, I was also able to deploy with up to 24 CPUs.
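
For illustration, here is a minimal sketch of what a Deployment along these lines might look like once the HorizontalPodAutoscaler, VerticalPodAutoscaler and MultidimPodAutoscaler objects are removed. It is not the manifest that was actually applied. The 16 CPU / 16Gi values are borrowed from the combination a commenter reported working, setting limits equal to requests is an assumption, and the manifest only schedules if the project's CPU quota is high enough.

```yaml
# Sketch only: illustrative values, not the asker's final manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-flask
  namespace: default
  labels:
    app: backend-flask
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend-flask
  template:
    metadata:
      labels:
        app: backend-flask
    spec:
      containers:
      - name: backend-flask1
        image: "{...}backend-flask:latest"  # registry path elided in the question
        resources:
          requests:
            cpu: "16"       # above 8; assumes the CPU quota has been raised
            memory: "16Gi"  # mirrors the 16 CPU / 16G case from the comments
          limits:
            cpu: "16"       # assumption: limits kept equal to requests
            memory: "16Gi"
      # No nodeSelector and no autoscaler objects: Autopilot picks the node.
```

If the pod stays Pending with a manifest like this, the comments above suggest checking the pod events and the CPU quota first.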

【Comments】: