Amazon DynamoDB is a NoSQL cloud database service that is designed to provide low-latency and high-throughput performance for applications and services running at scale. Example use cases include:
- Massively multiplayer online games (MMOGs)
- Virtual and augmented reality
- Checkout and order processing in ecommerce
- Real-time stock pricing and trading
When you operate such systems globally, you can occasionally experience latency spikes. These spikes can occur because of retries from transient network interruptions, service-side and network-side issues, or overloaded and slow clients.
Regardless of the root causes, an application that interacts with the DynamoDB service should be tuned to follow a retry strategy that helps avoid latency spikes. Depending on the AWS SDK in use, the underlying HTTP client behavior can be reconfigured from the default settings to help ensure that low-level client-server communications over HTTP honor the application’s latency requirements. In this blog post, I discuss the AWS Java SDK configuration options that are available to fine-tune the HTTP request timeout and retry behaviors for latency-aware DynamoDB clients. I will also show two hypothetical application scenarios to illustrate the benefits of a proper configuration.
AWS Java SDK HTTP settings for DynamoDB clients
The AWS Java SDK gives you full control over the HTTP client behavior and retry strategies. For information about standard HTTP configurations, see Client Configuration. However, the more specific configurations that are required to build a latency-aware DynamoDB application client are in ClientConfiguration (JavaDocs) code implementations.
For this blog post, I build an asynchronous DynamoDB client in Java from scratch, and show how to use the ClientConfiguration implementation from the AWS SDK to define the application-specific latency requirements. In this example, I create an asynchronous DynamoDB client that can make multiple sequential DynamoDB API calls to the service endpoint, without waiting for the responses to return before issuing the next API call. Because I want the DynamoDB application to be latency sensitive, an asynchronous client application is a good choice. It can prepare and process increasing numbers of API calls from different modules or microservices to the backend, and thus decouple individual execution processes.
First, I create a Java class called MyDynamoDBClientConfig with two functions. The createDynamoDBClient() function returns an asynchronous DynamoDB client object that uses the low-level HTTP client configurations from a ClientConfiguration object that is returned by the createDynamoDBClientConfiguration() private function. As you can see in the following code example, five HTTP client configuration parameters are set during the ClientConfiguration object creation:
- ConnectionTimeout
- ClientExecutionTimeout
- RequestTimeout
- SocketTimeout
- The DynamoDB default retry policy for HTTP API calls with a custom maximum error retry count
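The post's own listing is not reproduced in this text. The following is a minimal sketch of what such a class could look like, assuming the AWS SDK for Java 1.x is on the classpath. The method parameters (plain integers in milliseconds and a region string) are assumptions made for illustration, not the author's original signatures.

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsync;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder;

public class MyDynamoDBClientConfig {

    // Builds an asynchronous DynamoDB client that uses the tuned HTTP settings.
    // The region and all numeric arguments are supplied by the caller (assumed shape).
    public static AmazonDynamoDBAsync createDynamoDBClient(
            String region, int connectionTimeout, int clientExecutionTimeout,
            int requestTimeout, int socketTimeout, int maxErrorRetries) {
        return AmazonDynamoDBAsyncClientBuilder.standard()
                .withRegion(region)
                .withClientConfiguration(createDynamoDBClientConfiguration(
                        connectionTimeout, clientExecutionTimeout,
                        requestTimeout, socketTimeout, maxErrorRetries))
                .build();
    }

    // Sets the five parameters listed above; all timeouts are in milliseconds.
    private static ClientConfiguration createDynamoDBClientConfiguration(
            int connectionTimeout, int clientExecutionTimeout,
            int requestTimeout, int socketTimeout, int maxErrorRetries) {
        return new ClientConfiguration()
                .withConnectionTimeout(connectionTimeout)
                .withClientExecutionTimeout(clientExecutionTimeout)
                .withRequestTimeout(requestTimeout)
                .withSocketTimeout(socketTimeout)
                .withRetryPolicy(PredefinedRetryPolicies
                        .getDynamoDBDefaultRetryPolicyWithCustomMaxRetries(maxErrorRetries));
    }
}
```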
In the following subsections, I provide details about these client configuration parameters, to explain their significance when creating a latency-aware DynamoDB client application.
ConnectionTimeout is the maximum amount of time that the client waits for the underlying HTTP client in the SDK to establish a TCP connection with the DynamoDB endpoint. The connection is an end-to-end, two-way communication link between the client and server, and it is used and reused to make API calls and receive responses. The default value of this setting is 10 seconds. If the establishment of a socket with TCP and TLS takes longer than 10 seconds, there might be larger issues related to the network path, packet loss, or other unspecified problems that are outside of your control.
ClientExecutionTimeout is the maximum allowed total time spent to perform an end-to-end operation and receive the desired response, including any retries that might occur. Essentially, this is the SLA of your DynamoDB operation: the time frame for completion of all HTTP requests, including retries.
ClientExecutionTimeout controls the overall execution time of an application-level operation. (If you want to control the behavior of an individual HTTP request, you can use the RequestTimeout option, discussed next.) ClientExecutionTimeout is disabled in the default HTTP client configuration. However, because this setting defines and controls the application SLA for an operation, you should set it to an appropriate value to bound the worst-case wait for a return from DynamoDB. For example, you could estimate and use the longest potential blocking time for any nonstreaming application-specific operation.
RequestTimeout is the maximum time allowed for the client to perform a single HTTP request. It is measured from the moment that a DynamoDB API call (such as GetItem) is made until a response is received from the service. Logically, the value of this timeout should be less than the ClientExecutionTimeout. As with ClientExecutionTimeout, this setting is disabled by default. When you estimate a reasonable request timeout value, be careful not to set it to an extreme value. For example, if the value is set too low, even minor transient network failures involving TCP packet loss and subsequent retransmissions on the transport layer could cause request failures. Also keep in mind that if RequestTimeout is kept to the default (disabled), retries might be prolonged until either the ClientExecutionTimeout (if set) or SocketTimeout threshold is reached.
Note that the ClientExecutionTimeout and RequestTimeout parameters set approximate limits on the time of an operation, but the timers can be activated even seconds after the actual timeout should have occurred. This means that API calls that return large responses can take several seconds to abort after a timeout occurs. At the SDK level, a thread pool is created when one of these two settings is enabled, with up to five threads across all of the request threads to monitor timers in the request and client contexts. We recommend that you set ClientExecutionTimeout along with the RequestTimeout setting so that the time taken for the individual retries of a single HTTP request can be approximated, based on real-world scenarios, with ClientExecutionTimeout acting as a higher-level safeguard.
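To see why ClientExecutionTimeout matters as a safeguard, it can help to estimate the worst case without it. The sketch below is plain Java with no SDK dependency. It assumes an exponential backoff ceiling of min(maxDelay, baseDelay * 2^retry) and uses the figures this post cites elsewhere (a 500-millisecond request timeout, 10 retries, a 25-millisecond base delay, and a 20-second maximum delay). The class and method names are hypothetical.

```java
final class TimeoutBudget {

    // Upper bound of the backoff delay before retry number `retry` (0-based),
    // assuming an exponential ceiling: min(maxDelay, baseDelay * 2^retry).
    static long backoffCeilingMillis(int retry, long baseDelayMillis, long maxDelayMillis) {
        int shift = Math.min(retry, 30); // bound the shift so the multiplication cannot overflow
        return Math.min(maxDelayMillis, baseDelayMillis << shift);
    }

    // Worst case: every attempt runs until RequestTimeout expires and every
    // backoff sleeps for its full ceiling.
    static long worstCaseMillis(long requestTimeoutMillis, int maxRetries,
                                long baseDelayMillis, long maxDelayMillis) {
        long total = requestTimeoutMillis * (maxRetries + 1L);
        for (int retry = 0; retry < maxRetries; retry++) {
            total += backoffCeilingMillis(retry, baseDelayMillis, maxDelayMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        long worst = worstCaseMillis(500, 10, 25, 20_000);
        System.out.println("Worst case without ClientExecutionTimeout: " + worst + " ms");
    }
}
```

Under these assumptions the estimate is 31,075 milliseconds (5,500 for the eleven attempts plus 25,575 of backoff), which is far above a one-second or five-second operation SLA. A ClientExecutionTimeout cuts the operation off well before that point.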
SocketTimeout defines the maximum amount of time that the HTTP client waits to read data from an already established TCP connection. This is the time between when an HTTP POST ends and the entire response of the request is received, and it includes the service and network round-trip times. In certain cases when the socket hangs (for example, due to I/O exceptions), this setting prevents the client from blocking for too long. The general recommendation is to set this value a little higher than the RequestTimeout setting if they are used together. For operations such as BatchGetItem, the best practice is to set SocketTimeout to a high value such as 5,500 milliseconds, because a high value helps ensure that the records are returned as UnprocessedItems if one or more items have problems at the service end. A high value is recommended for batch operations because DynamoDB will time out individual item read or write operations after five seconds, and if you wait a shorter time for the service response, you won’t know which item is problematic. The benefit of letting DynamoDB return the item as an unprocessed item is that the client can then retry the failed items and not the entire batch.
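The idea of retrying only the unprocessed items can be sketched without the SDK. In the following illustration, `batchCall` is a hypothetical stand-in for a batch API call that returns the subset of items the service reported as unprocessed. The class and method names are assumptions, and a production client would also back off between rounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class UnprocessedItemRetry {

    // Sends `items` through `batchCall`, which returns the subset reported as
    // unprocessed, and resubmits only that subset on the next round.
    // Returns whatever is still unprocessed after at most `maxRounds` calls.
    static <T> List<T> sendWithRetry(List<T> items,
                                     Function<List<T>, List<T>> batchCall,
                                     int maxRounds) {
        List<T> pending = new ArrayList<>(items);
        for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
            // Only the items that failed in the previous round are sent again.
            pending = new ArrayList<>(batchCall.apply(pending));
        }
        return pending;
    }
}
```

If one item out of three fails on the first call, the second call carries only that one item, so the successful writes or reads are not repeated.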
The DynamoDB default retry policy with custom maximum retry settings
The default retry policy available in the AWS Java SDK for DynamoDB is a good starting point to define client-side retry strategies for the underlying HTTP client. The default policy allows a maximum of 10 retries, with a predefined base delay of 25 milliseconds for any 5XX server-side exceptions (such as HTTP 500 Internal Server Error or HTTP 503 Service Unavailable), and 500 milliseconds for any throttled 4XX client-side exceptions (such as HTTP 400 ProvisionedThroughputExceededException).
PredefinedBackoffStrategies (JavaDocs) includes two predefined backoff strategies that are used in these retries. For the nonthrottled 5XX requests, the FullJitterBackoffStrategy is picked up and uses the base delay of 25 milliseconds with a maximum delay of 20 seconds. For the throttled 4XX requests, EqualJitterBackoffStrategy is used. It starts with a 500-millisecond base delay and grows exponentially (500 milliseconds, 1,000 milliseconds, 2,000 milliseconds, and so on) until it is capped at the maximum of 20,000 milliseconds. (These two strategies are discussed in detail in Exponential Backoff and Jitter on the AWS Architecture blog.)
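The difference between the two strategies is easier to see in code. The sketch below follows the general full-jitter and equal-jitter formulas rather than the SDK's source, so treat it as a simplified model. The class and method names are hypothetical, and the delay ceiling is assumed to be min(maxDelay, baseDelay * 2^retry).

```java
import java.util.Random;

final class JitterBackoff {

    // Exponential ceiling shared by both strategies: min(maxDelay, baseDelay * 2^retry).
    static long ceilingMillis(int retry, long baseDelayMillis, long maxDelayMillis) {
        int shift = Math.min(retry, 30); // bound the shift so the multiplication cannot overflow
        return Math.min(maxDelayMillis, baseDelayMillis << shift);
    }

    // Full jitter: a uniform random delay between 0 and the ceiling (inclusive).
    static long fullJitterMillis(int retry, long baseDelayMillis, long maxDelayMillis,
                                 Random random) {
        long ceiling = ceilingMillis(retry, baseDelayMillis, maxDelayMillis);
        return (long) (random.nextDouble() * (ceiling + 1));
    }

    // Equal jitter: half of the ceiling is always waited, the other half is random.
    static long equalJitterMillis(int retry, long baseDelayMillis, long maxDelayMillis,
                                  Random random) {
        long ceiling = ceilingMillis(retry, baseDelayMillis, maxDelayMillis);
        long half = ceiling / 2;
        return half + (long) (random.nextDouble() * (ceiling - half + 1));
    }
}
```

With a 500-millisecond base, the equal-jitter ceiling reaches the 20,000-millisecond cap on the seventh retry (index 6), because 500 * 2^6 = 32,000 exceeds the cap. Full jitter can return delays close to zero, which suits fast retries of transient 5XX errors. Equal jitter guarantees at least half of the ceiling, which gives a throttled table time to recover.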
If you use the DynamoDBMapper class in your application, it internally initiates the client with the default ClientConfiguration options that I mentioned previously. This class also uses its own retry mechanism, DefaultBatchWriteRetryStrategy, for the unprocessed items from a BatchWriteItem API call. It includes a one-second minimum delay, a maximum delay of three seconds, and an exponential backoff strategy. Therefore, using the DynamoDBMapper class with its default settings can add unexpected extra latency to your requests. To maintain maximum control, you should consider using the low-level DynamoDB API operations first wherever possible, and then tweak the SDK-level settings to define the application’s behavior in production.
Finally, if the default strategies do not address your use case, disable the default retry option by specifying PredefinedRetryPolicies.NO_RETRY_POLICY when creating the ClientConfiguration. You can also implement your own retry logic by extending the V2CompatibleBackoffStrategyAdapter class. As a best practice, you should retry the 5XX server-side exceptions at a faster rate because these types of issues are typically transient. For example, you might want to start with a base delay of 25 milliseconds and with a linear increase up to a maximum of one second. Similarly, for the 4XX client-side exceptions, start with a base delay of 100 milliseconds and a maximum of 500 milliseconds. With 4XX client-side exceptions, you should always try to slip a throttled request to the next second to fully consume the DynamoDB table’s capacity. In both cases, the number of retries to make depends on your real-world use case and your own judgment.
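The delay calculations suggested above can be sketched in plain Java. This is only the arithmetic, not an SDK backoff strategy implementation. The class and method names are hypothetical, and the step size of the linear increase (equal to the base delay) is an assumption, because the text does not specify one.

```java
final class CustomBackoff {

    // Linear growth: base, 2 * base, 3 * base, and so on, capped at maxDelay.
    // The step size (equal to the base delay) is an assumption of this sketch.
    static long linearDelayMillis(int retry, long baseDelayMillis, long maxDelayMillis) {
        return Math.min(maxDelayMillis, baseDelayMillis * (retry + 1L));
    }

    // Milliseconds left until the next wall-clock second begins, which is one way
    // to push a throttled request into the next second.
    static long millisUntilNextSecond(long nowEpochMillis) {
        return 1000 - (nowEpochMillis % 1000);
    }
}
```

For 5XX errors, `linearDelayMillis(retry, 25, 1000)` starts at 25 milliseconds and reaches the one-second cap at the fortieth retry index. For throttled 4XX errors, `linearDelayMillis(retry, 100, 500)` reaches its cap on the fifth attempt, and `millisUntilNextSecond(System.currentTimeMillis())` could be used instead when the goal is to land in the next second's capacity.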
ThrottledRetries is another helpful ClientConfiguration parameter that can be used to fail fast (in other words, to detect longer-running server-side failures that can be failed over if necessary). This feature is enabled by default in the ClientConfiguration. A finite-size retry pool is maintained in the AmazonHttpClient class in the SDK, and each retry request consumes a certain capacity from this pool, eventually draining it. The default size of this pool is 100 retries. Based on the strategy defined in RetryPolicy, the actual number of retries can vary before the throttling kicks in and the client is no longer able to make successful retry requests. When the server-side issues are resolved, the pool is filled up again and retry requests are honored. Retry throttling kicks in only when an increasing number of retry attempts fail with a 5XX HTTP response code. This means that transient retries are not affected by the retry throttling.
ThrottledRetries is not a circuit breaker, so new requests to the service endpoint are not stopped. The general recommendation is to keep this parameter set to the default. If you decide to set the value of maxErrorRetries (discussed earlier) low (for example, one or two retries), you should disable ThrottledRetries and gracefully handle the retries, or rely on your custom retry logic.
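The retry pool can be pictured as a simple counter. The sketch below is a simplified model and not the SDK's internal accounting: it assumes each retry costs one unit and each successful response returns one unit, and the class name is hypothetical.

```java
final class RetryPool {
    private final int capacity;
    private int available;

    RetryPool(int capacity) {
        this.capacity = capacity;
        this.available = capacity;
    }

    // Called before each retry. Returns false when the pool is drained,
    // which tells the caller to fail fast instead of retrying.
    synchronized boolean tryAcquire() {
        if (available == 0) {
            return false;
        }
        available--;
        return true;
    }

    // Called after a successful response. Returns one unit to the pool,
    // never exceeding the original capacity.
    synchronized void onSuccess() {
        if (available < capacity) {
            available++;
        }
    }

    synchronized int available() {
        return available;
    }
}
```

In this model, first attempts never consult the pool, which matches the point that new requests are not stopped. Only retries are refused once a sustained run of failures has drained the capacity, and they resume as successes refill it.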
In the following code example, I create a Java class called MyDynamoDBClientParameters to define the HTTP client configuration parameters with their values. This class then can be fed into the createDynamoDBClientConfiguration() function of the MyDynamoDBClientConfig class that I created earlier.
In this example, we assume that an application operation involves multiple BatchWriteItem calls, and the HTTP client times out after five seconds (via clientExecutionTimeout). We also assume that the client can time out if it fails to establish a connection within one second (via connectionTimeout) or if an established connection is idle for more than three seconds (via socketTimeout). Finally, we assume that a single BatchWriteItem request takes at most 500 milliseconds to complete (via requestTimeout), based on the number of items it processes, and we use the default retry count of 10.
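The post's own listing for this class is not reproduced in this text. A minimal sketch of such a parameter holder, using the values just described, might look like the following. The field layout, constructor, and the `batchWriteExample()` factory are assumptions made for illustration.

```java
final class MyDynamoDBClientParameters {
    // All timeouts are in milliseconds.
    final int connectionTimeout;
    final int clientExecutionTimeout;
    final int requestTimeout;
    final int socketTimeout;
    final int maxErrorRetries;

    MyDynamoDBClientParameters(int connectionTimeout, int clientExecutionTimeout,
                               int requestTimeout, int socketTimeout,
                               int maxErrorRetries) {
        this.connectionTimeout = connectionTimeout;
        this.clientExecutionTimeout = clientExecutionTimeout;
        this.requestTimeout = requestTimeout;
        this.socketTimeout = socketTimeout;
        this.maxErrorRetries = maxErrorRetries;
    }

    // Values for the BatchWriteItem example: 1 s to connect, 5 s for the whole
    // operation, 500 ms per request, 3 s of socket idle time, and 10 retries.
    static MyDynamoDBClientParameters batchWriteExample() {
        return new MyDynamoDBClientParameters(1_000, 5_000, 500, 3_000, 10);
    }
}
```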
In this section, I present two example application use-case scenarios in which DynamoDB can be configured to reduce application-level latency and maintain an application’s SLA during service or network interruptions.
Adding items to a shopping cart
Let’s consider the example of a hypothetical ecommerce web application. This application’s microservice module is responsible for adding a single item to a customer’s shopping cart. From a business perspective, you never want to lose the opportunity to add an item to the customer’s cart. Therefore, the latency SLA for this nonstreaming operation that involves a single DynamoDB PutItem call is set to as low as 20 milliseconds.
Now, imagine that you have created the DynamoDB client in your Java class as shown in the following code, without overwriting any of the ClientConfiguration parameters described in this post. In such a case, all of the default values are applied. An asynchronous DynamoDB client is created with a connection timeout value of 10 seconds and a socket timeout value of 50 seconds, with a maximum of 10 error retries.
When testing and evaluating a prototype in your development environment, this configuration would be sufficient. However, in a production environment, we need a way to safeguard the microservice module and its associated downstream applications in the event of a brief service interruption, a network flip, or unresponsive application modules that call this API. Given that client execution and request timeouts are disabled by default, a default client configuration can make things worse by rapidly increasing the overall application latency. This can propagate a transient failure in a cascading way over the time that elapses to complete the 10 default retries.
Now let’s consider a more robust configuration. In this example, I configure a DynamoDB client as shown in the following code example. Each PutItem call must be completed within 20 milliseconds. If the call encounters throttling, it retries five times. The client waits up to 50 milliseconds for connection establishment and times out if the socket is still transferring data after 25 milliseconds. Finally, in the worst-case scenario, the client terminates the execution of this operation after 100 milliseconds and returns to its caller.
The following code example creates a DynamoDB client object (dynamoDBClient) in Java using the MyDynamoDBClientConfig Java class that I created in the previous section. This creates the low-level client and the DynamoDB document API client object (dynamoDB) for any high-level API interactions. The client configuration parameters are fed through the
Saving game states by using BatchWriteItem
Let’s consider another nonstreaming operation scenario in which I save the state of an MMOG application in real time. This requires a combination of PutItem calls in the form of a BatchWriteItem into a DynamoDB table. In this scenario, creating a default client can be problematic. In a real-time MMOG application, saving a game state triggers several other state transitions and also propagates the changes to other online users. As a result, an application module can severely affect all of its downstream modules if there is a transient service or network issue, or even a delay induced by other customers using the application itself. To handle such situations, you should configure the client with a client execution timeout of no more than one second, and with a lower number of internal retries on error. In the worst-case scenario, if the application module fails on all of its retries, gamers can retry while the application survives the temporary failures without affecting its downstream in-application modules.
For this use case, I can configure the client as shown in the following code example, so that a request times out if it spends longer than 500 milliseconds sending and receiving the response for the BatchWriteItem call. Additionally, any failed request is retried three times before the client times out after one second. The underlying socket also times out if there is no network packet transmission or reception within 550 milliseconds. As always, I want to abort the connection establishment after 50 milliseconds if any network-related issues exist.
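The listings for the two scenarios are not reproduced in this text. As a complementary check, the values of both scenarios can be tested against two ordering rules stated earlier in this post: RequestTimeout should be less than ClientExecutionTimeout, and SocketTimeout should be a little higher than RequestTimeout. The sketch below is plain Java, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TimeoutRules {

    // Returns a description of every ordering rule that the settings break.
    // All values are in milliseconds.
    static List<String> violations(int requestTimeout, int socketTimeout,
                                   int clientExecutionTimeout) {
        List<String> problems = new ArrayList<>();
        if (requestTimeout >= clientExecutionTimeout) {
            problems.add("requestTimeout should be less than clientExecutionTimeout");
        }
        if (socketTimeout <= requestTimeout) {
            problems.add("socketTimeout should be a little higher than requestTimeout");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Shopping cart scenario: request 20 ms, socket 25 ms, client execution 100 ms.
        System.out.println(violations(20, 25, 100));
        // Game state scenario: request 500 ms, socket 550 ms, client execution 1,000 ms.
        System.out.println(violations(500, 550, 1_000));
    }
}
```

Both scenarios satisfy the two rules, so each call returns an empty list. A check of this kind could run at application startup to catch a configuration in which, for example, the request timeout was raised without raising the socket timeout.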
In this post, I showed how to tune your DynamoDB client configuration while using the AWS Java SDK, based both on your use case and the application-defined SLA. Tuning these SDK parameters for the underlying HTTP client for DynamoDB requires an understanding of the average latency requirements and record characteristics (such as the number of items and their average size) for your application, interdependencies between different application modules or microservices, and the deployment platform. Careful application API design, proper timeout values, and a retry strategy can prepare your application for unavoidable network and server-side issues.
About the authors
Joarder Kamal is an AWS cloud support engineer. He likes building and automating systems that combine distributed communications, real-time data, and collective intelligence.
Sean Shriver is a Dallas-based senior NoSQL specialist solutions architect focused on DynamoDB.