AWS API Gateway for HPC Job Submission

AWS ParallelCluster simplifies the creation and deployment of HPC clusters. AWS API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.

In this post we combine AWS ParallelCluster and AWS API Gateway to allow an HTTP interaction with the scheduler. You can submit, monitor, and terminate jobs using the API, instead of connecting to the master node via SSH. This makes it possible to integrate ParallelCluster programmatically with other applications running on premises or on AWS.

The API uses AWS Lambda and AWS Systems Manager to execute the user commands without granting direct SSH access to the nodes, thus enhancing the security of the whole cluster.

VPC configuration

The VPC used for this configuration can be created using the VPC Wizard. You can also use an existing VPC that respects the AWS ParallelCluster network requirements.

 

Launch VPC Wizard

 

In Select a VPC Configuration, choose VPC with Public and Private Subnets and then Select.

 

Select a VPC Configuration

Before starting the VPC Wizard, allocate an Elastic IP Address. This will be used to configure a NAT gateway for the private subnet. A NAT gateway is required to enable compute nodes in the AWS ParallelCluster private subnet to download the required packages and to access the AWS services public endpoints. See AWS ParallelCluster network requirements.

You can find more details about the VPC creation and configuration options in VPC with Public and Private Subnets (NAT).

The example below uses the following configuration:

IPv4 CIDR block: 10.0.0.0/16
VPC name: Cluster VPC
Public subnet’s IPv4 CIDR: 10.0.0.0/24
Availability Zone: eu-west-1a
Public subnet name: Public subnet
Private subnet’s IPv4 CIDR: 10.0.1.0/24
Availability Zone: eu-west-1b
Private subnet name: Private subnet
Elastic IP Allocation ID: <id of the allocated Elastic IP>
Enable DNS hostnames: yes

VPC with Public and Private Subnets
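As a quick sanity check before creating the VPC, the CIDR layout above can be validated with Python's standard ipaddress module. This is just an illustrative sketch using the example values from the list above:

```python
import ipaddress

# Example values from the configuration above
vpc = ipaddress.ip_network("10.0.0.0/16")
public = ipaddress.ip_network("10.0.0.0/24")
private = ipaddress.ip_network("10.0.1.0/24")

# Both subnets must fall inside the VPC CIDR block
assert public.subnet_of(vpc) and private.subnet_of(vpc)
# ...and they must not overlap with each other
assert not public.overlaps(private)
print("CIDR layout is valid")
```

The same check applies to any CIDR blocks you choose instead of the example ones.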

AWS ParallelCluster configuration

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS cloud; to get started, see Installing AWS ParallelCluster.

After the AWS ParallelCluster command line has been configured, create the cluster template file below in .parallelcluster/config. The master_subnet_id parameter contains the ID of the created public subnet and compute_subnet_id contains the private one. The ec2_iam_role is the role that will be used for all the instances of the cluster. The steps to create this role are explained below.

[aws]
aws_region_name = eu-west-1

[cluster slurm]
scheduler = slurm
compute_instance_type = c5.large
initial_queue_size = 2
max_queue_size = 10
maintain_initial_size = false
base_os = alinux
key_name = AWS_Ireland
vpc_settings = public
ec2_iam_role = parallelcluster-custom-role

[vpc public]
master_subnet_id = subnet-01fc20e143543f8af
compute_subnet_id = subnet-0b1ae2790497d83ec
vpc_id = vpc-0cdee679c5a6163bd

[global]
update_check = true
sanity_check = true
cluster_template = slurm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
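The template is standard INI syntax, so a quick pre-flight check with Python's ConfigParser can catch an inconsistent configuration before running pcluster. The sketch below parses a fragment of the template above and verifies that the vpc_settings reference and subnet IDs line up (the values are the example ones from the template):

```python
# Validates a ParallelCluster config fragment before use (illustrative sketch)
from configparser import ConfigParser
from io import StringIO

config_text = """
[cluster slurm]
scheduler = slurm
compute_instance_type = c5.large
initial_queue_size = 2
max_queue_size = 10
vpc_settings = public
ec2_iam_role = parallelcluster-custom-role

[vpc public]
master_subnet_id = subnet-01fc20e143543f8af
compute_subnet_id = subnet-0b1ae2790497d83ec
vpc_id = vpc-0cdee679c5a6163bd
"""

parser = ConfigParser()
parser.read_file(StringIO(config_text))

# The vpc_settings value must point at an existing [vpc ...] section
vpc_section = "vpc " + parser.get("cluster slurm", "vpc_settings")
assert parser.has_section(vpc_section)
# Both subnet IDs must be present for the public/private layout
for key in ("master_subnet_id", "compute_subnet_id"):
    assert parser.get(vpc_section, key).startswith("subnet-")
print("config fragment looks consistent")
```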

IAM custom Roles for SSM endpoints

To allow ParallelCluster nodes to call Lambda and SSM endpoints, it is necessary to configure a custom IAM Role.

See AWS Identity and Access Management Roles in AWS ParallelCluster for details on the default AWS ParallelCluster policy.

From the AWS console:

  • Access the AWS Identity and Access Management (IAM) service and click on Policies.
  • Choose Create policy and paste the following policy into the JSON section. Be sure to modify <REGION> and <AWS ACCOUNT ID> to match the values for your account, and also update the S3 bucket name from pcluster-data to the bucket you want to use to store the input/output data from jobs and save the output of SSM execution commands.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:AttachVolume",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions"
            ],
            "Sid": "EC2",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "dynamodb:ListTables"
            ],
            "Sid": "DynamoDBList",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
            ],
            "Action": [
                "sqs:SendMessage",
                "sqs:ReceiveMessage",
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl"
            ],
            "Sid": "SQSQueue",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:DescribeTags",
                "autoscaling:UpdateAutoScalingGroup",
                "autoscaling:SetInstanceHealth"
            ],
            "Sid": "Autoscaling",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
            ],
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable"
            ],
            "Sid": "DynamoDBTable",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:s3:::<REGION>-aws-parallelcluster/*"
            ],
            "Action": [
                "s3:GetObject"
            ],
            "Sid": "S3GetObj",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
            ],
            "Action": [
                "cloudformation:DescribeStacks"
            ],
            "Sid": "CloudFormationDescribe",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "sqs:ListQueues"
            ],
            "Sid": "SQSList",
            "Effect": "Allow"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeAssociation",
                "ssm:GetDeployablePatchSnapshotForInstance",
                "ssm:GetDocument",
                "ssm:DescribeDocument",
                "ssm:GetManifest",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:ListAssociations",
                "ssm:ListInstanceAssociations",
                "ssm:PutInventory",
                "ssm:PutComplianceItems",
                "ssm:PutConfigurePackageResult",
                "ssm:UpdateAssociationStatus",
                "ssm:UpdateInstanceAssociationStatus",
                "ssm:UpdateInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2messages:AcknowledgeMessage",
                "ec2messages:DeleteMessage",
                "ec2messages:FailMessage",
                "ec2messages:GetEndpoint",
                "ec2messages:GetMessages",
                "ec2messages:SendReply"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::pcluster-data/*"
            ]
        }
    ]
}
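Rather than editing the placeholders by hand, the substitution can be scripted. The sketch below is a hypothetical helper that fills <REGION> and <AWS ACCOUNT ID> into a policy fragment with plain string replacement and validates the result as JSON; only one statement of the policy above is reproduced here for brevity:

```python
import json

# Minimal fragment of the policy above; the full document substitutes the same way
policy_template = """{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SQSQueue",
            "Effect": "Allow",
            "Action": ["sqs:GetQueueUrl"],
            "Resource": ["arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"]
        }
    ]
}"""

def render_policy(template, region, account_id):
    """Fill the <REGION> / <AWS ACCOUNT ID> placeholders and validate the JSON."""
    rendered = template.replace("<REGION>", region).replace("<AWS ACCOUNT ID>", account_id)
    return json.loads(rendered)  # raises ValueError if the result is not valid JSON

policy = render_policy(policy_template, "eu-west-1", "123456789012")
print(policy["Statement"][0]["Resource"][0])
# arn:aws:sqs:eu-west-1:123456789012:parallelcluster-*
```

The account ID shown is a placeholder; use your own twelve-digit account ID.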

Choose Review policy and, in the next section, enter parallelcluster-custom-policy as the name and choose Create policy.

Now you can create the role. Choose Roles in the left menu and then Create role.

Select AWS service as the type of trusted entity and EC2 as the service that will use this role, as shown here:

 

Create role

 

Choose Next: Permissions to proceed.

In the policy selection, select the parallelcluster-custom-policy that you just created.

Choose Next: Tags and then Next: Review.

In the Role name box, enter parallelcluster-custom-role and confirm by choosing Create role.

Slurm commands execution with AWS Lambda

AWS Lambda allows you to run your code without provisioning or managing servers. In this solution, Lambda is used to execute the Slurm commands on the master node. The AWS Lambda function can be created from the AWS console as explained in the Create a Lambda Function with the Console documentation.

For Function name, enter slurmAPI.

For Runtime, select Python 2.7.

Choose Create function to create it.

 

Create function

The code below should be pasted into the Function code section, which you can see by scrolling further down the page. The Lambda function uses AWS Systems Manager to execute the scheduler commands, preventing any SSH access to the node. Please modify <REGION> appropriately, and update the S3 bucket name from pcluster-data to the name you chose earlier.

import boto3
import time
import json
import random
import string

def lambda_handler(event, context):
    # Parameters arrive as query strings via the Lambda proxy integration
    instance_id = event["queryStringParameters"]["instanceid"]
    selected_function = event["queryStringParameters"]["function"]
    if selected_function == 'list_jobs':
        command = 'squeue'
    elif selected_function == 'list_nodes':
        command = 'scontrol show nodes'
    elif selected_function == 'list_partitions':
        command = 'scontrol show partitions'
    elif selected_function == 'job_details':
        jobid = event["queryStringParameters"]["jobid"]
        command = 'scontrol show jobs %s' % jobid
    elif selected_function == 'submit_job':
        # Copy the job script from S3 to the master node under a random name
        script_name = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(10)])
        jobscript_location = event["queryStringParameters"]["jobscript_location"]
        command = 'aws s3 cp s3://%s %s.sh; chmod +x %s.sh' % (jobscript_location, script_name, script_name)
        s3_tmp_out = execute_command(command, instance_id)
        # Optional scheduler options are passed in the 'submitopts' header
        try:
            submitopts = event["headers"]["submitopts"]
        except Exception:
            submitopts = ''
        command = 'sbatch %s %s.sh' % (submitopts, script_name)
    else:
        return {
            'statusCode': 400,
            'body': 'Unknown function: %s' % selected_function
        }
    body = execute_command(command, instance_id)
    return {
        'statusCode': 200,
        'body': body
    }

def execute_command(command, instance_id):
    bucket_name = 'pcluster-data'
    ssm_client = boto3.client('ssm', region_name="<REGION>")
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    username = 'ec2-user'
    # Run the command on the master node as ec2-user through SSM Run Command
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        OutputS3BucketName=bucket_name,
        OutputS3KeyPrefix="ssm",
        Parameters={
            'commands': ['sudo su - %s -c "%s"' % (username, command)]
        },
    )
    command_id = response['Command']['CommandId']
    # Poll until the command reaches a terminal state
    time.sleep(1)
    output = ssm_client.get_command_invocation(
        CommandId=command_id,
        InstanceId=instance_id,
    )
    while output['Status'] != 'Success':
        time.sleep(1)
        output = ssm_client.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
        if output['Status'] in ('Failed', 'Cancelled', 'TimedOut'):
            break
    # Collect the stdout/stderr that SSM wrote to the S3 bucket
    body = ''
    files = list(bucket.objects.filter(Prefix='ssm/%s/%s/awsrunShellScript/0.awsrunShellScript' % (command_id, instance_id)))
    for obj in files:
        body += obj.get()['Body'].read()
    return body
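The function expects the Lambda proxy integration event format, with the parameters passed as query strings. The sketch below shows a hypothetical event (the instance ID is made up) and exercises the function-name-to-command mapping locally, without any AWS access:

```python
# Sketch of the proxy event API Gateway sends to the handler above
# (the instance ID is a placeholder)
event = {
    "queryStringParameters": {
        "instanceid": "i-0123456789abcdef0",
        "function": "list_jobs",
    },
    "headers": {},
}

# The same function-name -> scheduler-command mapping used by the handler
COMMANDS = {
    "list_jobs": "squeue",
    "list_nodes": "scontrol show nodes",
    "list_partitions": "scontrol show partitions",
}

selected = event["queryStringParameters"]["function"]
print(COMMANDS[selected])
# squeue
```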

In the Basic settings section, set 10 seconds as Timeout.

Choose Save in the top right to save the function.

In the Execution role section, follow the View the … role link to open the function's automatically created role on the IAM console (indicated by the red arrow in the image below).

 

Execution role

 

In the newly-opened tab, choose Attach policies and then Create policy.

 

Permissions policies

This last action opens a new tab in your browser. From this new tab, choose Create policy and then JSON.

 

Attach Permissions

Create policy

 

Modify <REGION> and <AWS ACCOUNT ID> appropriately, and also update the S3 bucket name from pcluster-data to the name you chose earlier.

 

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:SendCommand"
            ],
            "Resource": [
                "arn:aws:ec2:<REGION>:<AWS ACCOUNT ID>:instance/*",
                "arn:aws:ssm:<REGION>::document/AWS-RunShellScript",
                "arn:aws:s3:::pcluster-data/ssm"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetCommandInvocation"
            ],
            "Resource": [
                "arn:aws:ssm:<REGION>:<AWS ACCOUNT ID>:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::pcluster-data",
                "arn:aws:s3:::pcluster-data/*"
            ]
        }
    ]
}

In the next section, enter ExecuteSlurmCommands as the Name and then choose Create policy.

Close the current tab and move to the previous one.

Refresh the list, select the ExecuteSlurmCommands policy and then Attach policy, as shown here:

 

Attach Permissions

Execute the AWS Lambda function with AWS API Gateway

The AWS API Gateway allows the creation of REST and WebSocket APIs that act as a “front door” for applications to access data, business logic, or functionality from your backend services like AWS Lambda.

Sign in to the API Gateway console.

If this is your first time using API Gateway, you will see a page that introduces you to the features of the service. Choose Get Started. When the Create Example API popup appears, choose OK.

If this is not your first time using API Gateway, choose Create API.

Create an empty API as follows and choose Create API:

 

Create API

 

You can now create the slurm resource by choosing the root resource (/) in the Resources tree and selecting Create Resource from the Actions dropdown menu, as shown here:

 

Actions dropdown

 

The new resource can be configured as follows:

Configure as proxy resource: unchecked
Resource Name: slurm
Resource Path: /slurm
Enable API Gateway CORS: unchecked

To confirm the configuration, choose Create Resource.

 

New Child Resource

In the Resource list, choose /slurm and then Actions and Create method as shown here:

 

Create Method

Choose ANY from the dropdown menu, and choose the checkmark icon.

In the “/slurm – ANY – Setup” section, use the following values:

Integration type: Lambda Function
Use Lambda Proxy integration: checked
Lambda Region: eu-west-1
Lambda Function: slurmAPI
Use Default Timeout: checked

and then choose Save.

 

slurm - ANY - Setup

Choose OK when prompted with Add Permission to Lambda Function.

You can now deploy the API by choosing Deploy API from the Actions dropdown menu as shown here:

 

Deploy API

 

For Deployment stage choose [new stage], for Stage name enter slurm, and then choose Deploy:

 

Deploy API

Take note of the API’s Invoke URL – it will be required for interacting with the API.

Deploy the Cluster

The cluster can now be created using the following command line:

pcluster create -t slurm slurmcluster

-t slurm indicates which section of the cluster template to use.
slurmcluster is the name of the cluster that will be created.

For more details, see the AWS ParallelCluster Documentation. A detailed explanation of the pcluster command line parameters can be found in AWS ParallelCluster CLI Commands.

How to interact with the slurm API

The slurm API created in the previous steps requires some parameters:

  • instanceid – the instance ID of the Master node.
  • function – the API function to execute. Accepted values are list_jobs, list_nodes, list_partitions, job_details, and submit_job.
  • jobscript_location – the S3 location of the job script (required only when function=submit_job).
  • submitopts – the submission parameters passed to the scheduler (optional, can be used when function=submit_job).

Here is an example of the interaction with the API:

#Submit a job
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=submit_job&jobscript_location=pcluster-data/job_script.sh" -H 'submitopts: --job-name=TestJob --partition=compute'
Submitted batch job 11

#List of the jobs
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=list_jobs"
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11   compute  TestJob ec2-user  R       0:14      1 ip-10-0-3-209
 
#Job details
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=job_details&jobid=11"
JobId=11 JobName=TestJob
   UserId=ec2-user(500) GroupId=ec2-user(500) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:06 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-06-26T14:42:09 EligibleTime=2019-06-26T14:42:09
   AccrueTime=Unknown
   StartTime=2019-06-26T14:49:18 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-06-26T14:49:18
   Partition=compute AllocNode:Sid=ip-10-0-1-181:28284
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=ip-10-0-3-209
   BatchHost=ip-10-0-3-209
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ec2-user/C7XMOG2hPo.sh
   WorkDir=/home/ec2-user
   StdErr=/home/ec2-user/slurm-11.out
   StdIn=/dev/null
   StdOut=/home/ec2-user/slurm-11.out
   Power=
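The same requests can be issued from any HTTP client. As a sketch, the query string for a job submission can be assembled with Python's standard library; the Invoke URL and instance ID below are the example values from the curl commands above, so substitute your own:

```python
from urllib.parse import urlencode

# Example Invoke URL from the curl calls above; replace with your own
invoke_url = "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm"

params = {
    "instanceid": "i-062155b00c02a6c8e",
    "function": "submit_job",
    "jobscript_location": "pcluster-data/job_script.sh",
}
# Scheduler options travel in the 'submitopts' header, as in the curl example
headers = {"submitopts": "--job-name=TestJob --partition=compute"}

request_url = "%s?%s" % (invoke_url, urlencode(params))
print(request_url)
# The request itself could then be sent with urllib.request, e.g.:
# urllib.request.urlopen(urllib.request.Request(request_url, method="POST", headers=headers))
```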

Authentication to the API can be managed by following Controlling and Managing Access to a REST API in API Gateway in the documentation.

Teardown

When you have finished your computation, the cluster can be destroyed using the following command:

pcluster delete slurmcluster

The additional resources created for this solution (the Lambda function, the API Gateway API, the IAM roles and policies, and the S3 bucket) can be deleted by following the official AWS documentation for each service.

Conclusion

This post has shown you how to deploy a Slurm cluster using AWS ParallelCluster, and integrate it with the AWS API Gateway.

This solution uses AWS API Gateway, AWS Lambda, and AWS Systems Manager to simplify interaction with the cluster without granting access to the command line of the master node, improving the overall security. The API can be extended with additional schedulers or interaction workflows and can be integrated with external applications.