【Question title】: How to perform AWS CloudFormation autoscaling for ECS instance when cluster has insufficient memory available
【Posted】: 2018-10-08 02:38:38
【Question】:

I have created a CloudFormation template that creates an ECS service and task definition and autoscales the tasks. It is very basic: if a task's MemoryUtilization reaches a certain value, add 1 task, and vice versa. Here are the most relevant parts of the template.

  EcsTd:
    Type: AWS::ECS::TaskDefinition
    DependsOn: LogGroup
    Properties:
      Family: !Sub ${EnvironmentName}-${PlatformName}-${Type}
      ContainerDefinitions:
      - Name: !Sub ${EnvironmentName}-${PlatformName}-${Type}
        Image: !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${PlatformName}:${ImageVersion}
        Environment:
        - Name: APP_ENV
          Value: !If [isProd, "production", "staging"]
        - Name: APP_DEBUG
          Value: "false"
        ...

        PortMappings:
        - ContainerPort: 80
          HostPort: 0
        Memory: !Ref Memory
        Essential: true
  EcsService:
    Type: AWS::ECS::Service
    DependsOn: WaitForLoadBalancerListenerRulesCondition
    Properties:
      ServiceName: !Sub ${EnvironmentName}-${PlatformName}-${Type}
      Cluster:
        Fn::ImportValue: !Sub ${EnvironmentName}-ECS-${Type}
      DesiredCount: !Sub ${DesiredCount}
      TaskDefinition: !Ref EcsTd
      Role: "learningEcsServiceRole"
      LoadBalancers:
      - !If
        - isWeb
        - ContainerPort: 80
          ContainerName: !Sub ${EnvironmentName}-${PlatformName}-${Type}
          TargetGroupArn: !Ref AlbTargetGroup
        - !Ref AWS::NoValue
  ServiceScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      MaxCapacity: !Sub ${MaxCount}
      MinCapacity: !Sub ${MinCount}
      ResourceId: !Join
      - /
      - - service
        - !Sub ${EnvironmentName}-${Type}
        - !GetAtt EcsService.Name
      RoleARN: arn:aws:iam::645618565575:role/learningEcsServiceRole
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs

  ServiceScaleOutPolicy:
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Sub ${EnvironmentName}-${PlatformName}-${Type}-ScaleOutPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref ServiceScalableTarget
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity
        Cooldown: 1800
        MetricAggregationType: Average
        StepAdjustments:
        - MetricIntervalLowerBound: 0
          ScalingAdjustment: 1
  MemoryScaleOutAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub ${EnvironmentName}-${PlatformName}-${Type}-MemoryOver70PercentAlarm
      AlarmDescription: Alarm if memory utilization greater than 70% of reserved memory
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
      - Name: ClusterName
        Value: !Sub ${EnvironmentName}-${Type}
      - Name: ServiceName
        Value: !GetAtt EcsService.Name
      Statistic: Maximum
      Period: '60'
      EvaluationPeriods: '1'
      Threshold: '70'
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
      - !Ref ServiceScaleOutPolicy
      - !Ref EmailNotification

  ...

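So when a task starts to run out of memory, we add a new task. At some point, however, we hit the limit of the memory available in the cluster.

For example, if the cluster consists of one t2.small instance, we have 2 GB of RAM. A small part of it is used by the ECS agent running on the instance, so less than 2 GB is available. If we set the task's Memory value to 512 MB, we can place only 3 tasks in that cluster unless we scale the cluster up.

By default, ECS provides a MemoryReservation metric that can be used to autoscale the cluster. We would say: when MemoryReservation exceeds 75%, add 1 instance to the cluster. That is relatively easy.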

  EcsCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub ${EnvironmentName}-${Type}
  SgEcsHost:
    ...
  ECSLaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !FindInMap [AWSRegionToAMI, !Ref 'AWS::Region', AMIID]
      InstanceType: !Ref InstanceType
      SecurityGroups: [ !Ref SgEcsHost ]
      AssociatePublicIpAddress: true
      IamInstanceProfile: "ecsInstanceRole"
      KeyName: !Ref KeyName
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          echo ECS_CLUSTER=${EnvironmentName}-${Type} >> /etc/ecs/ecs.config
  ECSAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
      - Fn::ImportValue: !Sub ${EnvironmentName}-SubnetEC2AZ1
      - Fn::ImportValue: !Sub ${EnvironmentName}-SubnetEC2AZ2
      LaunchConfigurationName: !Ref ECSLaunchConfiguration
      MinSize: !Ref AsgMinSize
      MaxSize: !Ref AsgMaxSize
      DesiredCapacity: !Ref AsgDesiredSize
      Tags:
      - Key: Name
        Value: !Sub ${EnvironmentName}-ECS
        PropagateAtLaunch: true
  ScalePolicyUp:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName:
        Ref: ECSAutoScalingGroup
      Cooldown: '1'
      ScalingAdjustment: '1'
  MemoryReservationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      EvaluationPeriods: '1'
      Statistic: Average
      Threshold: '75'
      AlarmDescription: Alarm if MemoryReservation is more than 75%
      Period: '60'
      AlarmActions:
      - Ref: ScalePolicyUp
      - Ref: EmailNotification
      # MemoryReservation is published in the AWS/ECS namespace, per cluster
      Namespace: AWS/ECS
      Dimensions:
      - Name: ClusterName
        Value: !Ref EcsCluster
      ComparisonOperator: GreaterThanThreshold
      MetricName: MemoryReservation

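But this does not make sense, because the alarm fires when the third task is added, so the new instance sits empty until a fourth task is scaled out. That means we pay for an instance we do not use.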
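
As a rough check of why the alarm fires on the third task, assume the t2.small registers about 2000 MiB with ECS. This is an assumed round figure; the real value depends on the AMI and on agent overhead. With n tasks of 512 MiB each:

```latex
\[
\text{MemoryReservation}(n) = \frac{n \times 512}{2000} \times 100\%
\qquad\Rightarrow\qquad
n = 2:\ 51.2\%, \quad n = 3:\ 76.8\% > 75\%
\]
```

A fourth 512 MiB task would need 2048 MiB in total, so it would not fit on the first instance under this assumption.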

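I noticed that when the ECS service tries to add a task to a cluster that does not have enough free memory, I get

service Production-admin-worker was unable to place a task because no container instance met all of its requirements. The closest matching container-instance ################### has insufficient memory available.

In this example, the template parameters are: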

EnvironmentName=Production
PlatformName=Admin
Type=worker

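Is it possible to create an AWS::CloudWatch::Alarm that watches ECS cluster events and looks for a specific pattern? The idea is to increase the instance count in the cluster through AWS::AutoScaling::AutoScalingGroup only when AWS::ApplicationAutoScaling::ScalingPolicy adds a task for which there is no room in the cluster, and to scale the cluster in when MemoryReservation drops below 25% (which means no tasks are running there, because AWS::ApplicationAutoScaling::ScalingPolicy has removed them).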
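
The scale-in half of that idea could look roughly like the sketch below. It assumes MemoryReservation is read from the AWS/ECS namespace with a ClusterName dimension. ScalePolicyDown and MemoryReservationLowAlarm are hypothetical names, and the cooldown and evaluation-period values are arbitrary placeholders.

```yaml
  # Hypothetical scale-in policy: remove one instance from the group.
  ScalePolicyDown:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref ECSAutoScalingGroup
      Cooldown: '300'            # assumed value
      ScalingAdjustment: '-1'
  # Hypothetical alarm: fires when less than 25% of cluster memory is reserved.
  MemoryReservationLowAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alarm if MemoryReservation is less than 25%
      Namespace: AWS/ECS
      MetricName: MemoryReservation
      Dimensions:
      - Name: ClusterName
        Value: !Ref EcsCluster
      Statistic: Average
      Period: '60'
      EvaluationPeriods: '5'     # assumed value
      Threshold: '25'
      ComparisonOperator: LessThanThreshold
      AlarmActions:
      - !Ref ScalePolicyDown
```

The Auto Scaling group does not know which instance is empty, so it may terminate one that still runs tasks. The service would then have to place those tasks again.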

【Comments】:

    Tags: amazon-web-services amazon-cloudformation amazon-cloudwatch amazon-cloudwatch-metrics


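    【Solution 1】:

    "That means we pay for an instance we do not use."

    You either pay for extra/backup capacity in advance, or implement logic to retry the placements that failed because of insufficient capacity.

    A few approaches I can think of:

    From the docs: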

    "When a metric filter finds one of the terms, phrases, or values in your log events, you can increment the value of a CloudWatch metric."
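
    A minimal sketch of the metric-filter approach from the quote above, under one big assumption: the ECS service event messages already arrive in a CloudWatch Logs log group. As far as I know ECS does not write them there by itself, so something else would have to forward them. EcsEventsLogGroup, the log group name, the Custom/... namespace, and the metric name are all hypothetical. ScalePolicyUp is the policy from the question.

```yaml
  # Hypothetical log group that receives ECS service event messages
  # (assumption: some other component forwards the events here).
  EcsEventsLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub /ecs-events/${EnvironmentName}-${Type}
      RetentionInDays: 7

  # Emits a value of 1 for each log event containing the placement-failure phrase.
  InsufficientMemoryMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref EcsEventsLogGroup
      FilterPattern: '"insufficient memory available"'
      MetricTransformations:
      - MetricNamespace: !Sub Custom/${EnvironmentName}-${Type}   # hypothetical
        MetricName: InsufficientMemoryPlacementFailures          # hypothetical
        MetricValue: '1'

  # Adds one instance when at least one placement failure was logged.
  InsufficientMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Task placement failed because the cluster lacks memory
      Namespace: !Sub Custom/${EnvironmentName}-${Type}
      MetricName: InsufficientMemoryPlacementFailures
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching   # no matching log events means no alarm
      AlarmActions:
      - !Ref ScalePolicyUp
```

    With this arrangement the cluster grows only after a placement has actually failed, so the task waits for the new instance to register before it can start.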

    【Comments】:
