[Posted]: 2021-09-29 00:14:09
[Question]:
During an overnight Buildkite pipeline run, CloudFormation failed to create an ECS service and timed out.
What could be causing the timeout?
CFN log
cfn INFO 03:08:51 CREATE_IN_PROGRESS build-amis-564-registrations: User Initiated
cfn INFO 03:08:56 CREATE_IN_PROGRESS TaskDefinition
cfn INFO 03:08:59 CREATE_IN_PROGRESS TaskDefinition: Resource creation Initiated
cfn INFO 03:08:59 CREATE_COMPLETE TaskDefinition
cfn INFO 03:09:02 CREATE_IN_PROGRESS ECSService
ecs INFO 03:09:10 Waiting for cluster to scale up. Please wait...
Traceback (most recent call last):
  File ".buildkite/scripts/deploy/ecs_deploy.py", line 282, in <module>
    main()
  File ".buildkite/scripts/deploy/ecs_deploy.py", line 232, in main
    ecs.wait_for_service_steady(cluster, stack_name, project_name, desired_count)
  File "/app/ecs/__init__.py", line 680, in wait_for_service_steady
    raise Exception("Timed out waiting for service deployment")
Exception: Timed out waiting for service deployment
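Python script
Excerpt of the Python script that raises the error after 15 minutes (CloudFormation itself goes on to time out three hours after it first tries and fails to create the service).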
...
for event in filter_events_response(response, last_event_id) or []:
    if "insufficient memory" in event["message"]:
        message = info("Waiting for cluster to scale up. Please wait...")
    else:
        message = event["message"]
    if log_progress:
        logger.info(
            "%s\t%s", event["createdAt"].strftime("%H:%M:%S"), message
        )
    last_event_id = event["id"]
    waited = 0
    if "steady" in event["message"]:
        logger.debug(event)
        return
    if "deregistered" in event["message"]:
        killed_tasks += 1
    if killed_tasks > allowed_killed_tasks:
        raise ServiceUnstableException(
            "%s-%s service tasks are failing to start"
            % (stack, service)
        )
time.sleep(20)
waited += 20
if waited > 900:
    raise Exception("Timed out waiting for service deployment")
...
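To make the 15-minute figure concrete, here is a minimal simulation of the wait counter in the excerpt. It is a sketch under assumptions, not the original script: it assumes the elided code polls in an outer loop, and the function name `seconds_until_timeout` and its parameters are hypothetical. Only the 20-second sleep, the `waited = 0` reset on each new event, and the `waited > 900` check come from the excerpt.

```python
POLL_INTERVAL = 20  # seconds, from time.sleep(20) in the excerpt
TIMEOUT = 900       # seconds, from `if waited > 900` in the excerpt


def seconds_until_timeout(new_event_polls, max_polls=1000):
    """Simulate the excerpt's wait counter (hypothetical helper).

    new_event_polls: set of 0-based poll indexes at which at least one
    new service event arrives; each such poll resets `waited` to 0.
    Returns the simulated sleep time in seconds at which the timeout
    would be raised, or None if it is not raised within max_polls polls.
    """
    waited = 0
    elapsed = 0
    for poll in range(max_polls):
        if poll in new_event_polls:
            waited = 0  # the excerpt resets the counter on every new event
        elapsed += POLL_INTERVAL  # stands in for time.sleep(20)
        waited += POLL_INTERVAL
        if waited > TIMEOUT:
            return elapsed
    return None


# One event on the first poll, then silence.
print(seconds_until_timeout({0}))
# A second event on poll 10 pushes the deadline back.
print(seconds_until_timeout({0, 10}))
```

Under these assumptions the exception is raised once no new service event has been seen for roughly 15 minutes, not 15 minutes after the deployment started. That would fit the log above, where the last event printed is the "insufficient memory" message at 03:09:10; the simulation ignores the time spent in the API calls themselves.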
[Discussion]:
Tags: python amazon-web-services amazon-ec2 amazon-cloudformation