Amazon ElastiCache has become synonymous with real-time applications. Redis’ high performance, simplicity, and support for diverse data structures have made it one of the most popular non-relational key-value stores. With the growth of business-critical, real-time use cases on Redis, ensuring availability becomes an important consideration.
To provide high availability, Amazon ElastiCache for Redis supports Redis Cluster configuration, which delivers superior scalability and availability. In addition, Amazon ElastiCache offers multiple Availability Zone (Multi-AZ) support with auto failover that enables you to set up a cluster with one or more replicas across zones. In the event of a failure on the primary node, Amazon ElastiCache for Redis automatically fails over to a replica to ensure high availability.
Recently, Amazon ElastiCache for Redis made several announcements that improve the end-to-end availability of your Redis applications.
- Improved cluster availability during planned maintenance benefits auto-failover enabled clusters during patching, updates, and other maintenance-related activities that involve node replacements. For Redis Cluster configurations set up to use Redis Cluster clients, planned maintenance and node replacements now complete without any write interruption. For non-Redis Cluster (non-sharded) configurations, you may notice a brief write interruption of up to a few seconds, associated with DNS updates.
- Self-service updates allow you to minimize any maintenance impacts by controlling when to initiate maintenance updates.
- Single reader endpoints for non-Redis Cluster configuration allow you to direct read traffic, without having to track individual replica endpoint changes. This improves availability by eliminating the need for your application to track changes to individual node endpoints. For Redis Cluster configuration, this capability is typically already handled by the Redis cluster smart clients.
- Dynamic rename for Redis commands allows you to rename Redis commands in an online manner, without any reboots or availability impact.
To get the most out of these improvements and overall availability, review your configuration and make sure that it is set up to offer the best availability. The following sections walk through best practices for configuring Amazon ElastiCache for Redis clusters, Redis clients, as well as general application tips for availability.
Configuring Amazon ElastiCache for Redis
Amazon ElastiCache for Redis can be set up by selecting the appropriate node types, Redis configuration (Redis Cluster or non-Redis Cluster), number of replicas, and other opt-in features. As a first step, review the configuration of your Amazon ElastiCache for Redis cluster:
- Enable Multi-AZ with automatic failover: Enabling Multi-AZ minimizes downtime by performing automatic failovers from primary node to replicas, in case of any planned or unplanned maintenance. For more information, see Multi-AZ auto failover.
- Three-shard Redis Cluster: Having a minimum of three shards provides improved availability by providing faster recovery during both planned and unplanned failovers.
- Set up two or more replicas across Availability Zones: Having two replicas provides improved read scalability and also read availability in scenarios where one replica is undergoing maintenance. This is important if you are not using the single reader endpoint and choose to direct your read requests to read replicas only (client setting).
- Use Nitro system-based node types: These node types—including R5 and M5—benefit from the advanced Nitro system, which delivers performance indistinguishable from bare metal and enhanced network processing. Amazon ElastiCache for Redis has further optimized performance on these nodes. As a result, you get better replication and synchronization performance, resulting in overall improved availability. For more information, see our previous blog post.
- Monitor and right-size to deal with anticipated traffic peaks: Under heavy load, the Redis engine may become unresponsive, which affects availability. BytesUsedForCache is a good indicator of your memory usage, whereas ReplicationLag is an indicator of your replication health based on your write rate. You can use these metrics to trigger cluster scaling. For more information about monitoring and sizing, see Metrics for Redis, Managing Reserved Memory, and Choosing Your Node Size.
- Avoid maintenance and upgrades during peak hours: A lower write load eases failovers and minimizes any application impact.
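The monitoring guidance above can be sketched as a small decision function. This is a minimal illustration, not an official sizing rule: the `NodeMetrics` type, the `scaling_reasons` function, and both threshold values are hypothetical, and in practice the metric values would come from Amazon CloudWatch rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical snapshot of the two metrics discussed above."""
    bytes_used_for_cache: int       # BytesUsedForCache, in bytes
    max_memory_bytes: int           # usable memory of the node type, in bytes
    replication_lag_seconds: float  # ReplicationLag, in seconds


def scaling_reasons(metrics, memory_threshold=0.75, lag_threshold_seconds=1.0):
    """Return the reasons to consider scaling; an empty list means none.

    Both thresholds are illustrative assumptions, not AWS recommendations.
    """
    reasons = []
    memory_ratio = metrics.bytes_used_for_cache / metrics.max_memory_bytes
    if memory_ratio >= memory_threshold:
        reasons.append("memory")
    if metrics.replication_lag_seconds >= lag_threshold_seconds:
        reasons.append("replication_lag")
    return reasons
```

A caller might evaluate this on each monitoring interval and raise an alarm, or start a scaling workflow, whenever the returned list is non-empty.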
Configuring the Redis client
Redis provides a robust client ecosystem, which gives you the flexibility to choose a client based on your preference. The list below provides general guidance that is applicable across most clients:
- Redis Cluster mode: Use Cluster-aware Redis clients and connect to the cluster using the configuration endpoint. This allows the client to automatically discover the shard and slot mappings. Redis Cluster mode also provides online resharding (scale in/out) for resizing your cluster, and allows you to complete planned maintenance and node replacements without any write interruptions. The Redis Cluster client can discover the primary and replica nodes and appropriately direct client-specific read and write traffic.
- Non-Redis Cluster mode: Use the primary endpoint for all write traffic. During any configuration changes or failovers, Amazon ElastiCache ensures that the DNS of the primary endpoint is updated to always point to the primary node. Use the reader endpoint to direct all read traffic. Amazon ElastiCache ensures that the reader endpoint is kept up-to-date with the cluster changes in real time as replicas are added or removed. Individual node endpoints are also available, but using the reader endpoint frees your application from tracking any individual node endpoint changes. Hence, it’s best to use the primary endpoint for writes and the single reader endpoint for reads.
- Socket timeout: Ensure that the socket timeout of the client is set to at least one second (vs. the typical “none” default in several clients). Setting the timeout too low can lead to numerous timeouts when the server load is high. Setting it too high can result in your application taking a long time to detect connection issues.
- DNS caching: If your client has a DNS caching mechanism built in, it is recommended to have a lower TTL (as low as 5–10 seconds). Having a higher TTL poses a risk of your application not reaching the desired node. Also, do not use the “cache forever” option.
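The socket timeout and DNS caching guidance above can be illustrated with the Python standard library alone. This is a rough sketch under stated assumptions: `ShortTtlResolver` and `open_connection` are hypothetical helpers, the 5-second TTL and 1-second timeout are taken from the ranges mentioned above, and a real Redis client library would normally expose its own settings for both.

```python
import socket
import time


class ShortTtlResolver:
    """Caches hostname lookups for a short, fixed TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds=5.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve  # injectable so the lookup can be stubbed
        self._clock = clock      # injectable so time can be controlled
        self._cache = {}         # hostname -> (address, expiry time)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached entry is still fresh
        # Entry is missing or expired: resolve again so that endpoint
        # changes after a failover are picked up within one TTL.
        address = self._resolve(hostname)
        self._cache[hostname] = (address, now + self._ttl)
        return address


def open_connection(resolver, hostname, port, timeout_seconds=1.0):
    """Open a TCP connection with an explicit timeout instead of none."""
    address = resolver.lookup(hostname)
    return socket.create_connection((address, port), timeout=timeout_seconds)
```

Because entries expire after a few seconds, a DNS update to the primary or reader endpoint is picked up on the next lookup after expiry rather than being cached forever.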
Application best practices
In addition to configuring your Amazon ElastiCache for Redis cluster and Redis clients, it is helpful to review your application logic for general best practices and availability tips listed below:
- Avoid long-running Lua scripts: These can cause the Redis engine to become unresponsive and affect availability. If you must use a Lua script, make sure that you are sized appropriately to deal with CPU spikes.
- Consider expiration over eviction: Your eviction policy can be computationally more expensive than expiration. To reduce memory pressure, consider expiration on your keys.
- Avoid expensive command operations: Expensive commands such as KEYS can cause degradation in performance and hamper the managed operations on the cluster. An alternative is the SCAN command, which iterates over the keyspace incrementally: each call takes constant time and returns a small batch, so the server is never blocked for a single linear-time pass, although a complete iteration is still linear in the number of keys. Likewise, large objects of the Sorted Set or Hash data types can cause sync issues and affect managed operations, including maintenance and upgrades.
To avoid accidental use of expensive commands, consider dynamic renaming of Redis commands. Amazon ElastiCache allows you to rename Redis commands while the cluster stays online, without requiring any reboots. For more information, see Amazon ElastiCache for Redis Version 5.0.3 (Enhanced).
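As an illustration of replacing KEYS with SCAN, the following sketch walks the keyspace in small batches. It assumes a client object whose `scan(cursor, match, count)` method returns a `(next_cursor, keys)` pair, which is how the redis-py client behaves; the `iter_keys` name and the batch size are hypothetical.

```python
def iter_keys(client, pattern="*", batch_size=100):
    """Yield keys matching `pattern` using incremental SCAN calls.

    `client` is assumed to expose scan(cursor, match, count) returning
    (next_cursor, keys), as redis-py does. `batch_size` is only a hint
    to the server about how much work to do per call.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        for key in keys:
            yield key
        if cursor == 0:
            # The server returns cursor 0 once the iteration is complete.
            break
```

Each call does a bounded amount of work, so other commands can be served between batches. SCAN may return the same key more than once, so callers that need uniqueness should deduplicate the results.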
We are excited to bring these availability improvements and recommendations to you. And this is just Day 1. Our team is continuing to enhance end-to-end system availability. Stay tuned for more updates and best practices. To get started with Amazon ElastiCache for Redis, access the Amazon ElastiCache console.
About the Author
Ruchita Arora is a Senior Product Manager at Amazon ElastiCache and works closely on all aspects of the Amazon ElastiCache service. Besides databases, she has worked across storage, enterprise application development, and telecommunication domains, in various engineering and product management roles.
Nirmal George Eapen is a Software Development Engineer at Amazon ElastiCache.