【Question title】: Kubernetes pod termination doesn't happen immediately; I have to wait until the grace period expires
【Posted】: 2021-02-01 21:34:13
【Question】:

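I have a Helm chart that contains one Deployment/Pod and one Service. I set the Deployment's terminationGracePeriodSeconds to 300s. I don't have any Pod lifecycle hooks, so if I terminate the Pod, it should terminate immediately. However, right now the Pod is not terminated until my grace period ends!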

Below is the spec of my pod, as created from the deployment template:

$ kubectl get pod hpa-poc---jcc-7dbbd66d86-xtfc5 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-02-01T18:12:34Z"
  generateName: hpa-poc-jcc-7dbbd66d86-
  labels:
    app.kubernetes.io/instance: hpa-poc
    app.kubernetes.io/name: -
    pod-template-hash: 7dbbd66d86
  name: hpa-poc-jcc-7dbbd66d86-xtfc5
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hpa-poc-jcc-7dbbd66d86
    uid: 66db29d8-9e2d-4097-94fc-b0b827466e10
  resourceVersion: "127938945"
  selfLink: /api/v1/namespaces/default/pods/hpa-poc-jcc-7dbbd66d86-xtfc5
  uid: 82ed4134-95de-4093-843b-438e94e408dd
spec:
  containers:
  - env:
    - name: _CONFIG_LINK
      value: xxx
    - name: _USERNAME
      valueFrom:
        secretKeyRef:
          key: username
          name: hpa-jcc-poc
    - name: _PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: hpa-jcc-poc
    image: xxx
    imagePullPolicy: IfNotPresent
    name: -
    resources:
      limits:
        cpu: "2"
        memory: 8Gi
      requests:
        cpu: 500m
        memory: 2Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-hzmwh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: xxx
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 300
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-hzmwh
    secret:
      defaultMode: 420
      secretName: default-token-hzmwh
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-02-01T18:12:34Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-02-01T18:12:36Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-02-01T18:12:36Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-02-01T18:12:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://c4c969ec149f43ff4494339930c8f0640d897b461060dd810c63a5d1f17fdc47
    image: xxx
    imageID: xxx
    lastState: {}
    name: -
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2021-02-01T18:12:35Z"
  hostIP: 10.0.35.137
  phase: Running
  podIP: 10.0.21.35
  qosClass: Burstable
  startTime: "2021-02-01T18:12:34Z"

When I try to terminate the pod (I used the helm delete command), you can see that it only terminates after 5 minutes, which is the grace period.

$ helm delete hpa-poc
release "hpa-poc" uninstalled
$ kubectl get pod -w | grep hpa
hpa-poc-jcc-7dbbd66d86-xtfc5         1/1     Terminating   0          3h10m
hpa-poc-jcc-7dbbd66d86-xtfc5         0/1     Terminating   0          3h15m
hpa-poc-jcc-7dbbd66d86-xtfc5         0/1     Terminating   0          3h15m

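So I suspect this is an issue with my pod/container configuration, because I have tried other simple Java app deployments, and they terminate immediately as soon as I delete the pod.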

By the way, I'm using an AWS EKS cluster. I'm not sure whether this is AWS-specific.

So, any suggestions?

【Comments】:

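  • If you push your container/application/pod logs to a logging system, what do you see in the logs of a container/application/pod whose termination is delayed like this?
  • I'd lean more toward your Java application not handling the SIGTERM sent by the kubelet than toward a problem with your Pod or Container. That's why it gets killed after 300 s (when SIGKILL is sent). You can read more here: cloud.google.com/blog/products/gcp/…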
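To illustrate the SIGTERM point with a hypothetical setup (the question does not show the image's entrypoint): if the entrypoint is a shell script that starts the app and then runs tail -f in the foreground, the shell never acts on SIGTERM. As PID 1 of the container it only receives signals it has a handler for, and even with a trap, the handler would not run until the foreground tail exits. The kubelet then falls back to SIGKILL when terminationGracePeriodSeconds expires. Below is a minimal sketch of an entrypoint that avoids this, assuming a POSIX sh; the entrypoint function name and the APP_LOG path are hypothetical. It runs both processes in the background and blocks in wait, which returns as soon as a trapped signal arrives.

```shell
#!/bin/sh
# Sketch of a container entrypoint: start the main process in the background,
# mirror its log to stdout, and exit promptly on SIGTERM.
# The function name and APP_LOG default path are hypothetical.

entrypoint() {
  log_file=${APP_LOG:-/tmp/app.log}
  : >"$log_file"                      # create the log so tail can open it

  "$@" >>"$log_file" 2>&1 &           # main process, e.g. the Java app
  app_pid=$!

  tail -f "$log_file" &               # follow the log in the background
  tail_pid=$!

  # On SIGTERM (sent by the kubelet when termination starts), stop both
  # children and exit instead of waiting for SIGKILL.
  trap 'kill -TERM "$app_pid" "$tail_pid" 2>/dev/null || true; wait "$app_pid" 2>/dev/null || true; exit 0' TERM INT

  # Unlike a foreground `tail -f`, `wait` is interrupted by a trapped
  # signal, so the trap above runs right away.
  app_status=0
  wait "$app_pid" || app_status=$?

  # If the app exits on its own, stop the log follower too.
  kill "$tail_pid" 2>/dev/null || true
  return "$app_status"
}

# In an image, this file would end with a call such as:
#   entrypoint java -jar /app/app.jar   (hypothetical command)
```

The design choice here is to keep nothing in the foreground: with both children in the background, the shell spends its time in wait, where POSIX requires a trapped signal to interrupt it immediately.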

Tags: kubernetes kubernetes-helm kubernetes-pod amazon-eks


【Answer 1】:

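I found the problem. When I exec'd into the container, I noticed that a process was still running: the log-tailing process.

So I needed to kill that process, and I added that to a preStop hook. After that, my container shuts down immediately.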
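The answer does not include the exact change. Below is a minimal sketch of the kind of command such a preStop hook could run, assuming the lingering process is named tail and the image provides a POSIX sh and Linux /proc; the stop_by_name function and any script path are hypothetical. Inside the container's PID namespace, /proc lists only the container's own processes. The kubelet runs the preStop hook before sending SIGTERM, and time spent in the hook counts against terminationGracePeriodSeconds. The hook would be wired through the container's lifecycle.preStop.exec.command, for example ["/bin/sh", "/prestop.sh"] with a hypothetical script path.

```shell
#!/bin/sh
# Sketch of a preStop command: send SIGTERM to every process whose
# executable name equals the argument (Linux /proc only).
# The function name is hypothetical; "tail" as the culprit comes from the answer.

stop_by_name() {
  target=$1
  for comm_file in /proc/[0-9]*/comm; do
    proc_pid=${comm_file#/proc/}
    proc_pid=${proc_pid%/comm}
    # /proc/<pid>/comm holds the executable name (truncated to 15 characters).
    if [ "$(cat "$comm_file" 2>/dev/null)" = "$target" ]; then
      # A process may exit between listing and kill; ignore that case.
      kill -TERM "$proc_pid" 2>/dev/null || true
    fi
  done
  return 0
}

# A /prestop.sh script would end with:
#   stop_by_name tail
```

Fixing the entrypoint so it handles SIGTERM itself removes the need for such a hook; the hook is the smaller change when the image cannot easily be modified.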

【Comments】:

  • Could you share your exact change as a YAML example to complete the answer?