Automatically update instances in an Amazon ECS cluster using the AMI ID parameter

This post is contributed by Adam McLean – Solutions Developer at AWS and Chirill Cucereavii – Application Architect at AWS 

In this post, we show you how to automatically refresh the container instances in an active Amazon Elastic Container Service (ECS) cluster with instances built from a newly released AMI.

The Amazon ECS-optimized AMI comes prepackaged with the ECS container agent, the Docker daemon, and the ecs-init upstart service. We recommend that you use the Amazon ECS-optimized AMI for your container instances unless your application requires any of the following:

  • A specific operating system
  • Custom security and monitoring agents installed
  • Root volume encryption enabled
  • A Docker version that is not yet available in the Amazon ECS-optimized AMI

Regardless of the type of AMI that you choose, AWS recommends updating your ECS container instance fleet with the latest AMI whenever possible. It’s easier than trying to patch existing instances in place.

Solution overview

In this solution, you deploy the ECS cluster and specify the cluster size, instance type, AMI ID, and other parameters. After the ECS cluster has been created and its instances registered, you can update the ECS cluster with another AMI ID to trigger the following events:

  1. A new launch configuration is created using the new AMI ID.
  2. The Auto Scaling group adds one new instance using the new launch configuration. This executes the ‘Adding Instances’ process described below.
  3. The adding instances process finishes for the single new node with the new AMI. Then, the removing instances process is started against the oldest instance with the old AMI ID.
  4. After the removing instances process is finished, steps 2 and 3 are repeated until all nodes in the cluster have been replaced.
  5. If an error is encountered during the rollout, the new launch configuration is deleted, and the old one is put back in place.
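To make the one-in, one-out ordering concrete, here is a small, self-contained simulation of steps 2 through 4. It only illustrates the sequence described above and is not how Auto Scaling is implemented; the function name and the AMI labels are made up for the example.

```python
def simulate_rolling_update(cluster, new_ami):
    """Replace instances one at a time: add one new instance, then remove
    the oldest instance that still runs an old AMI.

    `cluster` is a list of AMI IDs, oldest instance first. Returns the final
    cluster and the largest size the cluster reached during the rollout.
    """
    cluster = list(cluster)
    peak = len(cluster)
    # Keep going while any instance still runs an AMI other than the new one.
    while any(ami != new_ami for ami in cluster):
        cluster.append(new_ami)  # step 2: scale out by one new instance
        peak = max(peak, len(cluster))
        oldest_old = next(i for i, ami in enumerate(cluster) if ami != new_ami)
        cluster.pop(oldest_old)  # step 3: drain and remove the oldest old instance
    return cluster, peak


final, peak = simulate_rolling_update(["ami-old", "ami-old"], "ami-new")
print(final, peak)  # -> ['ami-new', 'ami-new'] 3
```

In this simulation, the cluster never drops below its starting size and grows by at most one instance at a time.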

Scaling a cluster out (adding instances)

Take a closer look at each step in scaling out a cluster:

  1. A stack update changes the AMI ID parameter.
  2. CloudFormation updates the launch configuration and tells the Auto Scaling group to add an instance.
  3. Auto Scaling launches an instance using new AMI ID to join the ECS cluster.
  4. Auto Scaling invokes the Launch Lambda function.
  5. Lambda asks the ECS cluster if the newly launched instance has joined and is showing healthy.
  6. Lambda tells Auto Scaling whether the launch succeeded or failed.
  7. Auto Scaling tells CloudFormation whether the scale-up has succeeded.
  8. The stack update succeeds, or rolls back.

Scaling a cluster in (removing instances)

Take a closer look at each step in scaling in a cluster:

  1. CloudFormation tells the Auto Scaling group to remove an instance.
  2. Auto Scaling chooses an instance to be terminated.
  3. Auto Scaling invokes the Terminate Lambda function.
  4. The Lambda function performs the following tasks:
    1. Sets the instance to be terminated to DRAINING mode.
    2. Confirms that all ECS tasks are drained from the instance marked for termination.
    3. Confirms that the ECS cluster services and tasks are stable.
  5. Lambda tells Auto Scaling to proceed with termination.
  6. Auto Scaling tells CloudFormation whether the scale-in has succeeded.
  7. The stack update succeeds, or rolls back.

Solution technologies

Here are the technologies used for this solution, with more details.

  • AWS CloudFormation
  • AWS Auto Scaling
  • Amazon CloudWatch Events
  • AWS Systems Manager Parameter Store
  • AWS Lambda

AWS CloudFormation

AWS CloudFormation is used to deploy the stack, and should be used for lifecycle management. Do not directly edit Auto Scaling groups, Lambda functions, and so on. Instead, update the CloudFormation template.

Updating the stack forces the resolution of the latest AMI and provides an opportunity to change the size or instance type of the ECS cluster.

CloudFormation has rollback capabilities to return to the last known good state if errors are encountered. It is the recommended mechanism for management through the cluster’s lifecycle.

AWS Auto Scaling

For ECS, the primary scaling and rollout mechanism is AWS Auto Scaling. Auto Scaling allows you to define a desired state environment, and keep that desired state as necessary by launching and terminating instances.

When a new AMI has been selected, CloudFormation informs Auto Scaling that it should replace the existing fleet of instances. This is controlled by an Auto Scaling update policy.

This solution rolls a single instance out to the ECS cluster, then drains and terminates a single instance in response. This cycle continues until all instances in the ECS cluster have been replaced.

Auto Scaling lifecycle hooks

Auto Scaling permits the use of lifecycle hooks. A lifecycle hook runs code when a scaling operation occurs. This solution uses a Lambda function that is informed when an instance is launched or terminated.

A lifecycle hook informs Auto Scaling whether it can proceed with the activity or if it should abandon it. In this case, the Lambda function confirms that the ECS cluster remains healthy and that all tasks have been redistributed before allowing Auto Scaling to proceed.

Lifecycle hooks also have a timeout. In this solution, it is 3600 seconds (1 hour). If the hook has not reported a result by then, Auto Scaling gives up, and the default action is to abandon the operation.
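As a rough sketch of how a Lambda function could report its verdict, the following uses the boto3 Auto Scaling call complete_lifecycle_action. The fields read from event["detail"] are the ones Auto Scaling publishes in lifecycle action events. The function name and the way the client is passed in are assumptions for illustration, not the project's actual code.

```python
def report_lifecycle_result(autoscaling, event, success):
    """Tell Auto Scaling to CONTINUE or ABANDON the lifecycle action in `event`.

    `autoscaling` is expected to behave like boto3.client("autoscaling").
    `event` is the CloudWatch Events payload for the lifecycle action.
    """
    detail = event["detail"]
    result = "CONTINUE" if success else "ABANDON"
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult=result,
    )
    return result
```

Inside a Lambda handler, this might be called as report_lifecycle_result(boto3.client("autoscaling"), event, success=True) after the health checks pass.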

Amazon CloudWatch Events

CloudWatch Events is a mechanism for watching calls made to the AWS APIs, and then activating functions in response. This is the mechanism used to launch the Lambda functions when a lifecycle event occurs. It’s also the mechanism used to re-launch the Lambda function when it times out (Lambda maximum execution time is 15 minutes).

In this solution, four CloudWatch Events rules are created: two to pick up the initial scale-up event, and two more to pick up a continuation from the Lambda function.
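One way to implement the continuation is for the function to watch its remaining execution time and publish a custom event before it runs out. The sketch below assumes a made-up event source name (custom.ecs.lifecycle) and a 30-second safety margin; the project's own rules and names may differ. It relies on the Lambda context method get_remaining_time_in_millis and the CloudWatch Events put_events call.

```python
import json

# Hypothetical source name for continuation events; not taken from the project.
CONTINUATION_SOURCE = "custom.ecs.lifecycle"


def should_hand_off(context, safety_margin_ms=30_000):
    """Return True when the Lambda invocation is close to its time limit."""
    return context.get_remaining_time_in_millis() < safety_margin_ms


def send_continuation(events_client, original_event):
    """Re-publish the lifecycle details so a rule can invoke the function again.

    `events_client` is expected to behave like boto3.client("events").
    """
    entry = {
        "Source": CONTINUATION_SOURCE,
        "DetailType": original_event["detail-type"],
        "Detail": json.dumps(original_event["detail"]),  # Detail must be a JSON string
    }
    events_client.put_events(Entries=[entry])
    return entry
```

A CloudWatch Events rule that matches the custom source would then invoke the same function again, which picks up where the previous invocation stopped.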

AWS Systems Manager Parameter Store

AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management.

This solution relies on the AMI IDs stored in Parameter Store. Given a naming standard of /ami/ecs/latest, this always resolves to the latest available AMI for ECS.

CloudFormation now supports using the values stored in Parameter Store as inputs to CloudFormation templates. The template can simply be passed a value such as /ami/ecs/latest, and it resolves that to the latest AMI.

AWS Lambda

The Lambda functions are used to handle the Auto Scaling lifecycle hooks. They operate against the ECS cluster to ensure that it is healthy, and tell Auto Scaling either to proceed or to abandon its current operation.

The functions are invoked by CloudWatch Events in response to scaling operations so they are idle unless there are changes happening in the cluster.

They’re written in Python and use the boto3 SDK to communicate with the ECS cluster and Auto Scaling.

The launch Lambda function waits until the instance has fully joined the ECS cluster. This is shown by the instance being marked ‘ACTIVE’ by the ECS control plane, and its ECS agent status showing as connected. This means that the new instance is ready to run tasks for the cluster.
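The check described above could look roughly like the following. This is a minimal sketch, not the project's code: it uses the boto3 ECS calls list_container_instances and describe_container_instances, and it assumes the cluster has fewer than 100 container instances so that a single page of results is enough. The function names are made up.

```python
def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance record for an EC2 instance, or None.

    `ecs` is expected to behave like boto3.client("ecs").
    Assumption: the cluster fits in one page of results (under 100 instances).
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance
    return None


def instance_has_joined(ecs, cluster, ec2_instance_id):
    """True when the instance is registered, ACTIVE, and its agent is connected."""
    instance = find_container_instance(ecs, cluster, ec2_instance_id)
    if instance is None:
        return False  # not registered with the cluster yet
    return instance["status"] == "ACTIVE" and bool(instance["agentConnected"])
```

A launch handler would call instance_has_joined repeatedly, with a pause between attempts, until it returns True or the lifecycle hook times out.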

The terminate Lambda function waits until the instance has fully drained all running tasks. It also checks that all tasks and services are in a stable state before allowing Auto Scaling to terminate an instance. This ensures that the instance is truly idle and the cluster is stable before an instance can be removed.
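A hedged sketch of these checks follows. It sets the instance to DRAINING with the boto3 call update_container_instances_state, then evaluates records in the shape returned by describe_container_instances and describe_services. The stability rule used here (one deployment per service, and running count equal to desired count) is an assumption about what "stable" means, and the function names are made up.

```python
def start_draining(ecs, cluster, container_instance_arn):
    """Ask ECS to stop placing tasks on the instance and move existing ones off.

    `ecs` is expected to behave like boto3.client("ecs").
    """
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )


def instance_is_drained(instance):
    """True when no tasks are running or pending on the container instance."""
    return instance["runningTasksCount"] == 0 and instance["pendingTasksCount"] == 0


def services_are_stable(services):
    """True when every service has one deployment and all desired tasks running."""
    return all(
        len(service["deployments"]) == 1
        and service["runningCount"] == service["desiredCount"]
        for service in services
    )


def safe_to_terminate(instance, services):
    """Combine both checks before telling Auto Scaling to proceed."""
    return instance_is_drained(instance) and services_are_stable(services)
```

Only when safe_to_terminate returns True would the handler tell Auto Scaling to proceed with the termination.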

Deployment

Before you begin deployment, you need the following:

  • An AWS account with enough room to accommodate the additional EC2 instances required by the ECS cluster.
  • (Optional) A Linux system. Use the AWS CLI and optionally jq to deploy the solution. Although a Linux system is recommended, it’s not required.
  • An IAM admin user with permissions to create IAM policies and roles and to create and update CloudFormation stacks. The user must also be able to deploy the ECS cluster, Lambda functions, Systems Manager parameters, and other resources.

Clone or download the project from https://github.com/awslabs/ecs-cluster-manager on GitHub:

git clone git@github.com:awslabs/ecs-cluster-manager.git

AMI ID parameter

Create a Systems Manager parameter where the desired AMI ID is stored.

The first run does not use the latest ECS optimized AMI. Later, you update the ECS cluster to the latest AMI.

Use the 2017.09 release of the AMI. Run the following commands to create the /ami/ecs/latest parameter in Parameter Store with the corresponding AMI value.

AMI_ID=$(aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux/amzn-ami-2017.09.l-amazon-ecs-optimized --region us-east-1 --query "Parameters[].Value" --output text | jq -r .image_id)
aws ssm put-parameter \
  --overwrite \
  --name "/ami/ecs/latest" \
  --type "String" \
  --value "$AMI_ID" \
  --region us-east-1 \
  --profile devAdmin

Substitute us-east-1 with your desired Region.
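If you prefer Python to the CLI and jq, the same two steps can be sketched with boto3. This is an illustrative equivalent rather than part of the project: the function names are made up, and the client is passed in so that you choose the Region and profile when you create it. The public ECS AMI parameter stores a JSON document, so the code extracts its image_id field just as the jq filter does.

```python
import json


def extract_image_id(parameter_value):
    """The public ECS AMI parameter holds JSON; return its image_id field."""
    return json.loads(parameter_value)["image_id"]


def publish_ami_id(ssm, source_name, target_name="/ami/ecs/latest"):
    """Copy the AMI ID from a public ECS parameter into your own parameter.

    `ssm` is expected to behave like boto3.client("ssm").
    """
    response = ssm.get_parameters(Names=[source_name])
    if not response["Parameters"]:
        raise LookupError(f"Parameter not found: {source_name}")
    ami_id = extract_image_id(response["Parameters"][0]["Value"])
    ssm.put_parameter(Name=target_name, Value=ami_id, Type="String", Overwrite=True)
    return ami_id
```

For example, publish_ami_id(boto3.Session(profile_name="devAdmin", region_name="us-east-1").client("ssm"), "/aws/service/ecs/optimized-ami/amazon-linux/amzn-ami-2017.09.l-amazon-ecs-optimized") would mirror the commands above.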

In the AWS Management Console, choose AWS Systems Manager, Parameter Store.

You should see the /ami/ecs/latest parameter that you just created.

Select the /ami/ecs/latest parameter and make sure that the AMI ID is present in the parameter value. If you are using the us-east-1 Region, you should see the following value:

ami-aff65ad2

Upload the Lambda function code to Amazon S3

The Lambda functions are too large to embed in the CloudFormation template. Therefore, they must be loaded into an S3 bucket before the CloudFormation stack is created.

Assuming you’re using an S3 bucket called ecs-deployment, copy each Lambda function zip file as follows:

cd ./ecs-cluster-manager
aws s3 cp lambda/ecs-lifecycle-hook-launch.zip s3://ecs-deployment
aws s3 cp lambda/ecs-lifecycle-hook-terminate.zip s3://ecs-deployment

Refer to these when running your CloudFormation template later so that CloudFormation knows where to find the Lambda files.

Lambda function role

The Lambda functions that execute require read permissions to EC2, write permissions to ECS, and permissions to submit a result or heartbeat to Auto Scaling.

Create a new LambdaECSScaling IAM policy in your AWS account. Use the following JSON as the policy body:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:DescribeScalingActivities",
                "autoscaling:RecordLifecycleActionHeartbeat",
                "ecs:UpdateContainerInstancesState",
                "ecs:Describe*",
                "ecs:List*"
            ],
            "Resource": "*"
        }
    ]
}

Now, create a new LambdaECSScalingRole IAM role. For Trusted Entity, choose AWS Service, Lambda. Attach the following permissions policies:

  • LambdaECSScaling (created in the previous step)
  • ReadOnlyAccess (AWS managed policy)
  • AWSLambdaBasicExecutionRole (AWS managed policy)

ECS cluster instance profile

The ECS cluster nodes must have an instance profile attached that allows them to speak to the ECS service. This profile can also contain any other permissions that they would require (Systems Manager for management and executing commands for example).

These are all AWS managed policies so you only add the role.

Create a new IAM role called EcsInstanceRole. For Trusted Entity, choose AWS Service, EC2. Attach the following AWS managed permissions policies:

  • AmazonEC2RoleforSSM
  • AmazonEC2ContainerServiceforEC2Role
  • AWSLambdaBasicExecutionRole

The AWSLambdaBasicExecutionRole policy may look out of place, but this allows the instance to create new CloudWatch Logs groups. These permissions facilitate using CloudWatch Logs as the primary logging mechanism with ECS. This managed policy grants the required permissions without you needing to manage a custom role.

CloudFormation parameter file

We recommend using a parameter file for the CloudFormation template. This documents the desired parameters for the template. It is usually less error prone to do this versus using the console for inputting parameters.

There is a file called blank_parameter_file.json in the source code project. Copy this file to a new file with a more meaningful name (such as dev-cluster.json), then fill out the parameters.

The file looks like this:

[
  {
    "ParameterKey": "EcsClusterName",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "EcsAmiParameterKey",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "IamRoleInstanceProfile",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "EcsInstanceType",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "EbsVolumeSize",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "ClusterSize",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "ClusterMaxSize",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "KeyName",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "SubnetIds",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "SecurityGroupIds",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "DeploymentS3Bucket",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LifecycleLaunchFunctionZip",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LifecycleTerminateFunctionZip",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LambdaFunctionRole",
    "ParameterValue": ""
  }
]

Here are the details for each parameter:

  • EcsClusterName:  The name of the ECS cluster to create.
  • EcsAmiParameterKey:  The Systems Manager parameter that contains the AMI ID to be used. This defaults to /ami/ecs/latest.
  • IamRoleInstanceProfile:  The name of the EC2 instance profile used by the ECS cluster members. Discussed in the prerequisite section.
  • EcsInstanceType:  The instance type to use for the cluster. Use whatever is appropriate for your workloads.
  • EbsVolumeSize:  The size of the Docker storage setup that is created using LVM. ECS typically defaults to 100 GB.
  • ClusterSize:  The desired number of EC2 instances for the cluster.
  • ClusterMaxSize:  This value should always be double the amount contained in ClusterSize. CloudFormation has no ‘math’ operators or we wouldn’t prompt for this. This allows rolling updates to be performed safely by doubling the cluster size, then contracting back.
  • KeyName:  The name of the EC2 key pair to place on the ECS instance to support SSH.
  • SubnetIds: A comma-separated list of subnet IDs that the cluster should be allowed to launch instances into. These should map to at least two zones for a resilient cluster, for example subnet-a70508df,subnet-e009eb89.
  • SecurityGroupIds:  A comma-separated list of security group IDs that are attached to each node, for example sg-bd9d1bd4,sg-ac9127dca (a single value is fine).
  • DeploymentS3Bucket: The bucket where the two Lambda functions for the scale-in and scale-out lifecycle hooks can be found.
  • LifecycleLaunchFunctionZip: The full path within the DeploymentS3Bucket where the ecs-lifecycle-hook-launch.zip contents can be found.
  • LifecycleTerminateFunctionZip:  The full path within the DeploymentS3Bucket where the ecs-lifecycle-hook-terminate.zip contents can be found.
  • LambdaFunctionRole:  The name of the role that the Lambda functions use. Discussed in the prerequisite section.

A completed parameter file looks like the following:

[
  {
    "ParameterKey": "EcsClusterName",
    "ParameterValue": "DevCluster"
  }, 
  {
    "ParameterKey": "EcsAmiParameterKey",
    "ParameterValue": "/ami/ecs/latest"
  },
  {
    "ParameterKey": "IamRoleInstanceProfile",
    "ParameterValue": "EcsInstanceRole"
  }, 
  {
    "ParameterKey": "EcsInstanceType",
    "ParameterValue": "m4.large"
  },
  {
    "ParameterKey": "EbsVolumeSize",
    "ParameterValue": "100"
  }, 
  {
    "ParameterKey": "ClusterSize",
    "ParameterValue": "2"
  },
  {
    "ParameterKey": "ClusterMaxSize",
    "ParameterValue": "4"
  },
  {
    "ParameterKey": "KeyName",
    "ParameterValue": "dev-cluster"
  },
  {
    "ParameterKey": "SubnetIds",
    "ParameterValue": "subnet-a70508df,subnet-e009eb89"
  },
  {
    "ParameterKey": "SecurityGroupIds",
    "ParameterValue": "sg-bd9d1bd4"
  },
  {
    "ParameterKey": "DeploymentS3Bucket",
    "ParameterValue": "ecs-deployment"
  },
  {
    "ParameterKey": "LifecycleLaunchFunctionZip",
    "ParameterValue": "ecs-lifecycle-hook-launch.zip"
  },
  {
    "ParameterKey": "LifecycleTerminateFunctionZip",
    "ParameterValue": "ecs-lifecycle-hook-terminate.zip"
  },
  {
    "ParameterKey": "LambdaFunctionRole",
    "ParameterValue": "LambdaECSScalingRole"
  }
]
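Before deploying, it can help to sanity-check the parameter file. The following sketch is not part of the project. It assumes that every parameter listed above should be non-empty, and it checks the rule that ClusterMaxSize is double ClusterSize. The function name and messages are made up.

```python
import json

# Parameter keys taken from the blank parameter file shown earlier.
REQUIRED_KEYS = [
    "EcsClusterName", "EcsAmiParameterKey", "IamRoleInstanceProfile",
    "EcsInstanceType", "EbsVolumeSize", "ClusterSize", "ClusterMaxSize",
    "KeyName", "SubnetIds", "SecurityGroupIds", "DeploymentS3Bucket",
    "LifecycleLaunchFunctionZip", "LifecycleTerminateFunctionZip",
    "LambdaFunctionRole",
]


def check_parameter_file(text):
    """Return a list of problems found in a CloudFormation parameter file."""
    params = {p["ParameterKey"]: p["ParameterValue"] for p in json.loads(text)}
    problems = [
        f"{key} is missing or empty" for key in REQUIRED_KEYS if not params.get(key)
    ]
    size, max_size = params.get("ClusterSize"), params.get("ClusterMaxSize")
    if size and max_size:
        try:
            if int(max_size) != 2 * int(size):
                problems.append("ClusterMaxSize should be double ClusterSize")
        except ValueError:
            problems.append("ClusterSize and ClusterMaxSize must be whole numbers")
    return problems
```

For example, check_parameter_file(open("dev-cluster.json").read()) returns an empty list when the file passes these checks.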

Deployment

Given the CloudFormation template and the parameter file, you can deploy the stack using the AWS CLI or the console.

Here’s an example deploying through the AWS CLI. This example uses a stack named ecs-dev and a parameter file named dev-cluster.json. It also uses the --profile argument to assure that the CLI assumes a role in the right account for deployment. Use the corresponding Region and profile from your local ~/.aws/config file.

This command outputs the stack ID as soon as it is executed, even though the other required resources are still being created.

aws cloudformation create-stack \
  --stack-name ecs-dev \
  --template-body file://./ecs-cluster.yaml \
  --parameters file://./dev-cluster.json \
  --region us-east-1 \
  --profile devAdmin

Use the AWS Management Console to check whether the stack is done creating. Or, run the following command:

aws cloudformation wait stack-create-complete \
  --stack-name ecs-dev \
  --region us-east-1 \
  --profile devAdmin

After the CloudFormation stack has been created, go to the ECS console and open the DevCluster cluster that you just created. There are no tasks running, although you should see two container instances registered with the cluster.

You also see a warning message indicating that the container instances are not running the latest version of the Amazon ECS container agent. The reason is that you did not use the latest available version of the ECS-optimized AMI.

Fix this issue by updating the container instances’ AMI.

Update the cluster instances AMI

Run the following commands to set the /ami/ecs/latest parameter to the latest AMI ID.

AMI_ID=$(aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux/recommended --region us-east-1 --query "Parameters[].Value" --output text | jq -r .image_id)

aws ssm put-parameter \
  --overwrite \
  --name "/ami/ecs/latest" \
  --type "String" \
  --value "$AMI_ID" \
  --region us-east-1 \
  --profile devAdmin

Make sure that the parameter value has been updated in the console.

To update your ECS cluster, run the update-stack command without changing any parameters. CloudFormation evaluates the value stored by /ami/ecs/latest. If it has changed, CloudFormation makes updates as appropriate.

aws cloudformation update-stack \
  --stack-name ecs-dev \
  --template-body file://./ecs-cluster.yaml \
  --parameters file://./dev-cluster.json \
  --region us-east-1 \
  --profile devAdmin
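A rolling update can take a while, so you may want a script to wait for the outcome. The following is a minimal polling sketch built on the boto3 CloudFormation call describe_stacks. The function names, the 30-second delay, and the attempt limit are assumptions for illustration. It treats any finished status other than CREATE_COMPLETE or UPDATE_COMPLETE, such as UPDATE_ROLLBACK_COMPLETE, as a failure.

```python
import time

SUCCESS_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def classify_stack_status(status):
    """Map a CloudFormation StackStatus to in_progress, succeeded, or failed."""
    if status.endswith("_IN_PROGRESS"):
        return "in_progress"
    return "succeeded" if status in SUCCESS_STATUSES else "failed"


def wait_for_stack(cloudformation, stack_name, delay_seconds=30, max_attempts=240):
    """Poll the stack until it leaves the in-progress states.

    `cloudformation` is expected to behave like boto3.client("cloudformation").
    Returns (final_status, outcome).
    """
    for _ in range(max_attempts):
        stack = cloudformation.describe_stacks(StackName=stack_name)["Stacks"][0]
        outcome = classify_stack_status(stack["StackStatus"])
        if outcome != "in_progress":
            return stack["StackStatus"], outcome
        time.sleep(delay_seconds)
    raise TimeoutError(f"Stack {stack_name} did not finish in time")
```

The AWS CLI offers the same idea through aws cloudformation wait stack-update-complete.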

Supervising updates

We recommend supervising your updates to the ECS cluster while they are being deployed. This assures that the cluster remains stable. For the majority of situations, there is no manual intervention required.

  • Keep an eye on Auto Scaling activities. In the Auto Scaling groups section of the EC2 console, select the Auto Scaling group for a cluster and choose Activity History.
  • Keep an eye on the ECS instances to ensure that new instances are joining and draining instances are leaving. In the ECS console, choose Cluster, ECS Instances.
  • Lambda function logs help troubleshoot things that aren’t behaving as expected. In the Lambda console, select the LifeCycleLaunch or LifeCycleTerminate functions, and choose Monitoring, View logs in CloudWatch. Expand the logs for the latest executions and see what’s going on.
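The Auto Scaling activity history can also be read from a script. This sketch uses the boto3 call describe_scaling_activities and prints one line per activity. The function names are made up, and the Auto Scaling group name is whatever your stack created.

```python
def summarize_activities(autoscaling, group_name, limit=10):
    """Return 'StatusCode: Description' lines for recent scaling activities.

    `autoscaling` is expected to behave like boto3.client("autoscaling").
    """
    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=group_name, MaxRecords=limit
    )
    return [
        f'{activity["StatusCode"]}: {activity["Description"]}'
        for activity in response["Activities"]
    ]


def failed_activities(autoscaling, group_name, limit=10):
    """Return only the lines for activities that Auto Scaling marked as Failed."""
    lines = summarize_activities(autoscaling, group_name, limit)
    return [line for line in lines if line.startswith("Failed:")]
```

An empty list from failed_activities, together with new instances showing up in the ECS console, suggests that the rollout is progressing normally.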

When you go back to the ECS cluster page, notice that the “Outdated Amazon ECS container agent” warning message has disappeared.

Select one of the cluster’s EC2 instance IDs and observe that the latest ECS optimized AMI is used.

Summary

In this post, you saw how to use CloudFormation, Lambda, CloudWatch Events, and Auto Scaling lifecycle hooks to update your ECS cluster instances with a new AMI.

The sample code is available on GitHub for you to use and extend. Contributions are always welcome!