【Posted】: 2021-04-14 03:02:20
【Question】:
I am developing a service that calls Spring Cloud Data Flow (SCDF) to spawn a new k8s pod for a Spring Batch job.
Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
LOGGER.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);
LOGGER.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);
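Everything works fine. The problem is that whenever the image cannot be pulled (for example, the image does not exist, or there is a connection issue), the pod cannot start and we end up with a hanging task (no batch job is ever created).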
For example, the task_execution table (an SCDF table) will contain a task whose end time is null,
but the batch_job_execution table has no related job.
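The symptom described above (null end time, no related batch job) can be turned into a simple filter. The following is only a minimal sketch, not SCDF's API: `HangingTaskFinder`, `TaskRow`, and the `launchedAt` timestamp are hypothetical names, and it assumes the calling service has already loaded the relevant rows and recorded when it called `launch`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds task executions that look "hanging".
final class HangingTaskFinder {

    // Hypothetical in-memory view of one task_execution row.
    static final class TaskRow {
        final long executionId;
        final Instant launchedAt;  // assumption: recorded by the calling service at launch
        final Instant endTime;     // null while no end time has been written
        final boolean hasBatchJob; // true if a related batch_job_execution row exists

        TaskRow(long executionId, Instant launchedAt, Instant endTime, boolean hasBatchJob) {
            this.executionId = executionId;
            this.launchedAt = launchedAt;
            this.endTime = endTime;
            this.hasBatchJob = hasBatchJob;
        }
    }

    private HangingTaskFinder() {
    }

    // Returns the ids of rows with no end time and no batch job
    // that were launched longer ago than the grace period.
    static List<Long> findHanging(List<TaskRow> rows, Instant now, Duration gracePeriod) {
        List<Long> hanging = new ArrayList<>();
        for (TaskRow row : rows) {
            boolean oldEnough =
                    Duration.between(row.launchedAt, now).compareTo(gracePeriod) > 0;
            if (row.endTime == null && !row.hasBatchJob && oldEnough) {
                hanging.add(row.executionId);
            }
        }
        return hanging;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<TaskRow> rows = List.of(
                new TaskRow(3L, now.minus(Duration.ofMinutes(10)), null, false));
        System.out.println(findHanging(rows, now, Duration.ofMinutes(5)));
    }
}
```

The grace period is there because a freshly launched task also has a null end time for a while; the right value depends on how long a healthy pod normally takes to start.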
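At first this looks harmless: no pod is running, so we are not consuming any resources. But once the number of these "pending jobs" reaches 20, we hit the well-known error: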
Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]
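I have been trying to find a way to detect that the pod failed to spin up (so that we can mark the task as failed), but to no avail.
Is there a way to detect that a task launch has failed when the task is supposed to start a new k8s pod?
Update
Not sure if it is relevant: we are using SCDF 1.7.3.RELEASE
Describe output of the failing pod: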
Name: podname-lp2nyowgmm
Namespace: my-namespace
Priority: 1000
Priority Class Name: test-cluster-default
Node: some-ip.compute.internal/XX.XXX.XXX.XX
Start Time: Thu, 14 Jan 2021 18:47:52 +0700
Labels: role=spring-app
spring-app-id=podname-lp2nyowgmm
spring-deployment-id=podname-lp2nyowgmm
task-name=podname
Annotations: iam.amazonaws.com/role: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX
kubernetes.io/psp: eks.privileged
Status: Pending
IP: XX.XXX.XXX.XXX
IPs:
IP: XX.XXX.XXX.XXX
Containers:
podname-lp2nyowgmm:
Container ID:
Image: image_host:XXX/mysystem/myapp:notExist
Image ID:
Port: <none>
Host Port: <none>
Args:
--spring.datasource.username=postgres
--spring.cloud.task.name=podname
--spring.datasource.url=jdbc:postgresql://...
--spring.datasource.driverClassName=org.postgresql.Driver
--spring.datasource.password=XXXX
--fileId=XXXXXXXXXXX
--spring.application.name=app-name
--fileName=file_name.csv
...
--spring.cloud.task.executionid=3
State: Waiting
Reason: ErrImagePull
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 8Gi
Requests:
cpu: 2
memory: 8Gi
Environment:
ELASTIC_SEARCH_PORT: 80
ELASTIC_SEARCH_PROTOCOL: http
SPRING_RABBITMQ_PORT: ${RABBITMQ_SERVICE_PORT}
ELASTIC_SEARCH_URL: elasticsearch
SPRING_PROFILES_ACTIVE: kubernetes
CLIENT_SECRET: ${CLIENT_SECRET}
SPRING_RABBITMQ_HOST: ${RABBITMQ_SERVICE_HOST}
RELEASE_ENV_NAME: QA_TEST
SPRING_CLOUD_APPLICATION_GUID: ${HOSTNAME}
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx(ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-xxxxx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xxxxx
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m22s default-scheduler Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
Normal Pulling 103s (x4 over 3m21s) kubelet Pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning Failed 102s (x4 over 3m19s) kubelet Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
Warning Failed 102s (x4 over 3m19s) kubelet Error: ErrImagePull
Normal BackOff 88s (x6 over 3m19s) kubelet Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning Failed 73s (x7 over 3m19s) kubelet Error: ImagePullBackOff
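The describe output above carries the signal one would want to check for: the pod Status is Pending and the container State is Waiting with Reason ErrImagePull (later ImagePullBackOff in the events). As a minimal sketch, assuming those two values have already been read from the Kubernetes API by some client (not shown here), the check itself could look like this. `PodLaunchCheck` and `isImagePullFailure` are hypothetical names, not part of SCDF or any Kubernetes client.

```java
import java.util.Set;

// Hypothetical helper: decides whether a pod failed to launch because of its image.
// The pod phase and waiting reason must be supplied by a Kubernetes client (not shown).
final class PodLaunchCheck {

    // Waiting reasons that appear in the describe output above.
    private static final Set<String> IMAGE_PULL_FAILURES =
            Set.of("ErrImagePull", "ImagePullBackOff");

    private PodLaunchCheck() {
    }

    /**
     * @param podPhase      pod status, e.g. "Pending" or "Running"
     * @param waitingReason reason of the container's Waiting state, or null if none
     * @return true if the pod is still Pending because the image could not be pulled
     */
    static boolean isImagePullFailure(String podPhase, String waitingReason) {
        return "Pending".equals(podPhase)
                && waitingReason != null
                && IMAGE_PULL_FAILURES.contains(waitingReason);
    }

    public static void main(String[] args) {
        // Values taken from the describe output of the failing pod.
        System.out.println(isImagePullFailure("Pending", "ErrImagePull"));
    }
}
```

Since a pull can also fail for transient reasons such as a connection problem, a caller would probably want to see this state persist for some time before marking the task as failed.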
【Discussion】:
Tags: java spring kubernetes spring-batch spring-cloud-dataflow