Many customers use Amazon EMR to run big data workloads, such as Apache Spark and Apache Hive queries, in their development environment. Data analysts and data scientists frequently use these types of clusters, known as analytics EMR clusters. Users often forget to terminate the clusters after their work is done. This leads to idle running of the clusters and in turn, adds up unnecessary costs.
To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.
In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.
You must have the following before you can create and deploy this framework:
Note: This solution is designed as an additional feature. It can be applied to any existing EMR clusters by executing the scheduler script (explained later in the post) as an EMR step. If you want to implement this solution as a mandatory feature for your future clusters, you can include the EMR step as part of your cluster deployment. You can also apply this solution to EMR clusters that are spun up through AWS CloudFormation, the AWS CLI, and even the AWS Management Console.
The following are the key components of the solution.
Analytics EMR cluster
Amazon EMR provides a managed Apache Hadoop framework that lets you easily process large amounts of data across dynamically scalable Amazon EC2 instances. Data scientists use analytics EMR clusters for data analysis, machine learning using notebook applications (such as Apache Zeppelin or JupyterHub), and running big data workloads based on Apache Spark, Presto, etc.
The schedule_script.sh is the shell script to be executed as an Amazon EMR step. When executed, it copies the monitoring script from the Amazon S3 artifacts folder and schedules the monitoring script to run every 5 minutes. The S3 location of the monitoring script should be passed as an argument.
The pushShutDownMetrin.sh script is a monitoring script that is implemented using shell commands. It should be installed in the master node of the EMR cluster as an Amazon EMR step. The script is scheduled to run every 5 minutes and sends the cluster activity status to CloudWatch.
JupyterHub API token script
The jupyterhub_addAdminToken.sh script is a shell script to be executed as an Amazon EMR step if JupyterHub is enabled on the cluster. In our design, the monitoring script uses REST APIs provided by JupyterHub to check whether the application is in use.
To send the request to JupyterHub, you must pass an API token along with the request. By default, the application does not generate API tokens. This script generates the API token and assigns it to the admin user, which is then picked up by the jupyterhub module in the monitoring script to track the activity of the application.
Custom CloudWatch metric
All Amazon EMR clusters send data for several metrics to CloudWatch. Metrics are updated every 5 minutes, automatically collected, and pushed to CloudWatch. For this use case, we created the Amazon EMR metric EMR-INUSE. This metric represents the active status of the cluster based on the module checks implemented in the monitoring script. The metric is set to 0 when the cluster is inactive and 1 when active.
CloudWatch is a monitoring service that you can use to set high-resolution alarms to take automated actions. In this case, CloudWatch triggers an alarm if it receives 0 continuously for the configured number of hours.
Lambda is a serverless technology that lets you run code without provisioning or managing servers. With Lambda, you can run code for virtually any type of application or backend service—all with zero administration. You can set up your code to automatically trigger from other AWS services. In this case, the triggered CloudWatch alarm mentioned earlier signals Lambda to terminate the cluster.
The following diagram illustrates the sequence of events when the solution is enabled, showing what happens to the EMR cluster that is spun up via AWS CloudFormation.
The diagram shows the following steps:
- The AWS CloudFormation stack is launched to spin up an EMR cluster.
- The Amazon EMR step is executed (installs the pushShutDownMetric.sh and then schedules it as a cron job to run every 5 minutes).
- If the EMR cluster is active (executing jobs), the master node sets the EMR-INUSE metric to 1 and sends it to CloudWatch.
- If the EMR cluster is inactive, the master node sets the EMR-INUSE metric to 0 and sends it to CloudWatch.
- On receiving 0 for a predefined number of data points, CloudWatch triggers a CloudWatch alarm.
- The CloudWatch alarm sends notification to AWS Lambda to terminate the cluster.
- AWS Lambda executes the Lambda function.
- The Lambda function then deletes all the stack resources associated with the cluster.
- Finally, the EMR cluster is terminated, and the Stack ID is removed from AWS CloudFormation.
Modules in the monitoring script
Following are the different activity checks that are implemented in the monitoring script (pushShutDownMetric.sh). The script is designed in a modular fashion so that you can easily include new modules without modifying the core functionality.
The ActiveSSHCheck module checks whether there are any active SSH connections to the master node. If there is an active SSH connection, and it’s idle for less than 10 minutes, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch.
Apache Hadoop YARN is the resource manager of the EMR Hadoop ecosystem. All the Spark Submits and Hive queries reach YARN initially. It then schedules and processes these jobs. The YARNCheck module checks whether there are any running jobs in YARN or jobs completed within last 5 minutes. If it finds any, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch.The checks are performed by calling REST APIs exposed by YARN.
The API to fetch the running jobs is http://localhost:8088/ws/v1/cluster/apps?state=RUNNING.
The API to fetch the completed jobs is
Presto is an open-source distributed query engine for running interactive analytic queries. It is included in EMR release version 5.0.0 and later.The PRESTOCheck module checks whether there are any running Presto queries or if any queries have been completed within last 5 minutes. If there are some, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. These checks are performed by calling REST APIs exposed by the Presto server.
The API to fetch the Presto jobs is http://localhost:8889/v1/query.
Amazon EMR users use Apache Zeppelin as a notebook for interactive data exploration. The ZeppelinCheck module checks whether there are any jobs running or if any have been completed within the last 5 minutes. If so, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. These checks are performed by calling the REST APIs exposed by Zeppelin.
The API to fetch the list of notebook IDs is http://localhost:8890/api/notebook.
The API to fetch the status of each cell inside each notebook ID is http://localhost:8890/api/notebook/job/$notebookID.
Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text. JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server.The JupyterHubCheck module checks whether any Jupyter notebook is currently in use.
The function uses REST APIs exposed by JupyterHub to fetch the list of Jupyter notebook users and gathers the data about individual notebook servers. From the response, it extracts the last activity time of the servers and checks whether any server was used in the last 5 minutes. If so, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. The jupyterhub_addAdminToken.sh script needs to be executed as an EMR step before enabling the scheduler script.
The API to fetch the list of notebook users is https://localhost:9443/hub/api/users -H "Authorization: token $admin_token".
The API to fetch individual server information is https://localhost:9443/hub/api/users/$user -H "Authorization: token $admin_token.
If any one of these checks fails, the cluster is considered to be inactive, and the monitoring script sets the EMR-INUSE metric to 0 and pushes it to CloudWatch.
The scheduler script schedules the monitoring script (pushShutDownMetric.sh) to run every 5 minutes. Internal cron jobs that run for a very few minutes are not considered in calibrating the EMR-INUSE metric.
Deploying each component
Follow the steps in this section to deploy each component of the proposed design.
Step 1. Create the Lambda function and SNS subscription
The Lambda function and the SNS subscription are the core components of the design. You must set up these components initially, and they are common for every cluster. The following are the AWS resources to be created for these components:
- Execution role for the Lambda function
- Terminate Idle EMR Lambda function
- SNS topic and Lambda subscription
For one-step deployment, use this AWS CloudFormation template to launch and configure the resources in a single go.
The following parameters are available in the template.
|s3Bucket||emr-shutdown-blogartifacts||The name of the S3 bucket that contains the Lambda file|
|s3Key||EMRTerminate.zip||The Amazon S3 key of the Lambda file|
For manual deployment, follow these steps on the AWS Management Console.
Execution role for the Lambda function
- Open the AWS Identity and Access Management (IAM) consoleand choose Policies, Create policy.
- Choose the JSON tab, paste the following policy text, and then choose Review policy.
- For Name, enter TerminateEMRPolicy and choose Create policy.
- Choose Roles, Create role.
- Under Choose the service that will use this role, choose Lambda, and then choose Next: Permissions.
- For Attach permissions policies, choose the arrow next to Filter policies and choose Customer managed in the drop-down list.
- Attach the TerminateEMRPolicy policy that you just created, and choose Review.
- For Role name, enter TerminateEMRLambdaRole and then choose Create role.
Terminate idle EMR – Lambda function
I created a deployment package to use with this function.
- Open the Lambda consoleand choose Create function.
- Choose Author from scratch, and provide the details as shown in the following screenshot:
- Name: lambdaTerminateEMR
- Runtime: Python 2.7
- Role: Choose an existing role
- Existing role: TerminateEMRLambdaRole
- Choose Create function.
- In the Function code section, for Code entry type, choose Upload a file from Amazon S3, and for Runtime, choose Python 2.7.
The Lambda function S3 link URL is
Link to the function: https://s3.amazonaws.com/emr-shutdown-blogartifacts/EMRTerminate.zip
This Lambda function is triggered by a CloudWatch alarm. It parses the input event, retrieves the JobFlowId, and deletes the AWS CloudFormation stack of the corresponding JobFlowId.
SNS topic and Lambda subscription
For setting the CloudWatch alarm in the further stages, you must create an Amazon SNS topic that notifies the preceding Lambda function to execute. Follow these steps to create an SNS topic and configure the Lambda endpoint.
- Navigate to the Amazon SNS console, and choose Create topic.
- Enter the Topic name and Display name, and choose Create topic.
- The topic is created and displayed in the Topics
- Select the topic and choose Actions, Subscribe to topic.
- In the Create subscription, choose the AWS Lambda Choose lambdaterminateEMR as the endpoint, and choose Create subscription.
Step 2. Execute the JupyterHub API token script as an EMR step
This step is required only when JupyterHub is enabled in the cluster.
Navigate to the EMR cluster to be monitored, and execute the scheduler script as an EMR step.
This script generates an API token and assigns it to the admin user. It is then picked up by the jupyterhub module in the monitoring script to track the activity of the application.
Step 3. Execute the scheduler script as an EMR step
Navigate to the EMR cluster to be monitored and execute the scheduler script as an EMR step.
Ensure that termination protection is disabled in the cluster. The termination protection flag causes the Lambda function to fail.
The step function copies the pushShutDownMetric.sh script to the master node and schedules it to run every 5 minutes.
The schedule_script.sh is at https://s3.amazonaws.com/emr-shutdown-blogartifacts/schedule_script.sh.
The pushShutDownMetrin.sh is at https://s3.amazonaws.com/emr-shutdown-blogartifacts/pushShutDownMetrin.sh.
Step 4. Create a CloudWatch alarm
For single-step deployment, use this AWS CloudFormation template to launch and configure the resources in a single go.
The following parameters are available in the template.
|AlarmName||TerminateIDLE-EMRAlarm||The name for the alarm.|
|EMRJobFlowID||Requires input||The Jobflowid of the cluster.|
|EvaluationPeriod||Requires input||The idle timeout value—input should be in data points (1 data point equals 5 minutes). For example, to terminate the cluster if it is idle for 20 minutes, the input should be 4.|
|SNSSubscribeTopic||Requires input||The Amazon Resource Name (ARN) of the SNS topic to be triggered on the alarm.|
The AWS CloudFormation CLI command is as follows:
For manual deployment, follow these steps to create the alarm.
- Open the Amazon CloudWatch console and choose Alarms.
- Choose Create Alarm.
- On the Select Metric page, under Custom Metrics, choose EMRShutdown/Cluster-Metric.
- Choose the isEMRUsed metric of the EMR JobFlowId, and then choose Next.
- Define the alarm as required. In this case, the alarm is set to send notification to the SNS topic shutDownEMRTest when CloudWatch receives the IsEMRUsed metric as 0 for every data point in the last 2 hours.
- Choose Create Alarm.
In this post, we focused on building a framework to cut down the additional cost that you might incur due to the idle running of an EMR cluster. The modules implemented in the shell script, the tracking of the execution status of the Spark scripts, and the Hive/Presto queries using the lightweight REST API calls make this approach an efficient solution.
If you have questions or suggestions, please comment below.
About the Author
Praveen Krishnamoorthy Ravikumar is an associate big data consultant with Amazon Web Services.