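I use Mark's solution together with spec.jobTemplate.spec.activeDeadlineSeconds.
There is just one more thing to be aware of. From the K8S docs:
Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
What actually happens on Pod termination is that K8S sends SIGTERM to the container's main process (PID 1). It does not wait for that process to actually exit. If your container does not terminate gracefully, the Pod stays in the Terminating state for 30 seconds (the terminationGracePeriodSeconds value), after which K8S sends SIGKILL. In the meantime K8S may schedule another Pod, so the terminating Pod can overlap with the newly scheduled Pod for up to 30 seconds.
This is easy to reproduce with this CronJob definition: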
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cj-sleep
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      activeDeadlineSeconds: 50
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - "/usr/local/bin/bash"
            - "-c"
            - "--"
            args:
            - "tail -f /dev/null & wait $!"
            image: bash
            imagePullPolicy: IfNotPresent
            name: cj-sleep
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: '* * * * *'
  startingDeadlineSeconds: 100
  successfulJobsHistoryLimit: 5
This is how the scheduling plays out:
while true; do date; kubectl get pods -A | grep cj-sleep; sleep 1; done
Thu Sep 3 09:50:51 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Running 0 49s
Thu Sep 3 09:50:53 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 50s
Thu Sep 3 09:50:54 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 51s
Thu Sep 3 09:50:55 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 52s
Thu Sep 3 09:50:56 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 54s
Thu Sep 3 09:50:58 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 56s
Thu Sep 3 09:51:00 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 57s
Thu Sep 3 09:51:01 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 58s
Thu Sep 3 09:51:02 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 59s
Thu Sep 3 09:51:03 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 60s
default cj-sleep-1599126660-l69gd 0/1 ContainerCreating 0 0s
Thu Sep 3 09:51:04 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 61s
default cj-sleep-1599126660-l69gd 0/1 ContainerCreating 0 1s
Thu Sep 3 09:51:05 UTC 2020
default cj-sleep-1599126600-kzzxg 1/1 Terminating 0 62s
default cj-sleep-1599126660-l69gd 1/1 Running 0 2s
....
Thu Sep 3 09:51:29 UTC 2020
default cj-sleep-1599126600-kzzxg 0/1 Terminating 0 86s
default cj-sleep-1599126660-l69gd 1/1 Running 0 26s
Thu Sep 3 09:51:30 UTC 2020
default cj-sleep-1599126660-l69gd 1/1 Running 0 28s
Thu Sep 3 09:51:32 UTC 2020
default cj-sleep-1599126660-l69gd 1/1 Running 0 29s
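The overlap in the output above can be estimated with a little arithmetic. This is only a rough sketch using the values from the CronJob definition; it ignores controller and kubelet delays, which is why the old Pod is still listed at 86s in the output rather than disappearing at exactly 80s.

```shell
#!/usr/bin/env bash
# Values taken from the CronJob definition above.
deadline=50   # activeDeadlineSeconds
grace=30      # terminationGracePeriodSeconds
interval=60   # schedule '* * * * *' fires once per minute

# The old Pod receives SIGTERM at the deadline and SIGKILL when the grace period ends.
sigkill_at=$((deadline + grace))

# The next Pod is created one schedule interval after the first one.
overlap=$((sigkill_at - interval))
if [ "$overlap" -lt 0 ]; then overlap=0; fi

echo "SIGKILL at ~${sigkill_at}s, expected overlap ~${overlap}s"
```

With these numbers the script prints an expected overlap of about 20 seconds.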
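There is one detail specific to processes running as PID 1 (init) in a container: the kernel applies no default SIGTERM action to them, so the signal is ignored unless the process installs its own handler. In the case of bash, you do that by adding a trap: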
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cj-sleep
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      activeDeadlineSeconds: 50
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - "/usr/local/bin/bash"
            - "-c"
            - "--"
            args:
            - "trap 'exit' SIGTERM; tail -f /dev/null & wait $!"
            image: bash
            imagePullPolicy: IfNotPresent
            name: cj-sleep
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: '* * * * *'
  startingDeadlineSeconds: 100
  successfulJobsHistoryLimit: 5
Now the scheduling plays out like this:
Thu Sep 3 09:47:54 UTC 2020
default cj-sleep-1599126420-sm887 1/1 Terminating 0 52s
Thu Sep 3 09:47:56 UTC 2020
default cj-sleep-1599126420-sm887 0/1 Terminating 0 54s
Thu Sep 3 09:47:57 UTC 2020
default cj-sleep-1599126420-sm887 0/1 Terminating 0 55s
Thu Sep 3 09:47:58 UTC 2020
default cj-sleep-1599126420-sm887 0/1 Terminating 0 56s
Thu Sep 3 09:47:59 UTC 2020
default cj-sleep-1599126420-sm887 0/1 Terminating 0 57s
Thu Sep 3 09:48:00 UTC 2020
default cj-sleep-1599126420-sm887 0/1 Terminating 0 58s
Thu Sep 3 09:48:01 UTC 2020
Thu Sep 3 09:48:02 UTC 2020
default cj-sleep-1599126480-rlhlw 0/1 ContainerCreating 0 1s
Thu Sep 3 09:48:04 UTC 2020
default cj-sleep-1599126480-rlhlw 0/1 ContainerCreating 0 2s
Thu Sep 3 09:48:05 UTC 2020
default cj-sleep-1599126480-rlhlw 0/1 ContainerCreating 0 3s
Thu Sep 3 09:48:06 UTC 2020
default cj-sleep-1599126480-rlhlw 1/1 Running 0 4s
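The trap pattern can also be tried locally, without a cluster. The following is a minimal sketch, not the exact container setup: demo_trap is a hypothetical helper name, and sleep 300 stands in for the real long-running workload. It shows why the work is started in the background and followed by wait: bash runs a trap handler only after the current foreground command finishes, whereas the wait builtin returns as soon as a trapped signal arrives.

```shell
#!/usr/bin/env bash
# demo_trap is a hypothetical name used only for this illustration.
demo_trap() {
  # Handler: stop the background worker, report, and exit cleanly.
  trap 'kill "$worker" 2>/dev/null; echo "got SIGTERM, exiting"; exit 0' TERM
  # Long-running work goes to the background ...
  sleep 300 &
  worker=$!
  echo "ready"
  # ... so the shell blocks in wait, which a trapped signal interrupts.
  wait "$worker"
}

out=$(mktemp)
demo_trap >"$out" &   # background subshell, playing the role of the container entrypoint
pid=$!

# Wait until the handler is installed, then send the same signal K8S sends.
until grep -q ready "$out"; do sleep 0.2; done
kill -TERM "$pid"
wait "$pid"
status=$?

cat "$out"
echo "exit status: $status"
```

If the trap line is removed, the same kill -TERM still ends the subshell here, because an ordinary process keeps the default SIGTERM action; only a process running as PID 1 in a container ignores the signal, so that part needs a real container to reproduce.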