Best practices for Amazon RDS for PostgreSQL cross-Region read replicas

One of the managed service offerings of Amazon RDS for PostgreSQL is cross-Region read replicas. Cross-Region read replicas enable you to have a disaster recovery solution, scaling read database workload, and cross-Region migration. You can create a cross-Region read replica by using the Amazon RDS console, AWS CLI, or API create-db-instance-read-replica. There are additional data transfer charges for cross-Region read replicas.

Although Amazon RDS makes cross-Region read replicas simple, you still must be aware of several caveats to get the best experience. This post discusses some of the best practices for Amazon RDS for PostgreSQL cross-Region replication on the following topics:

  • What happens during a cross-Region read replica creation
  • How to minimize replication lag
  • Common issues while using an Amazon RDS for PostgreSQL cross-Region replica

What happens during a cross-Region read replica creation

Amazon RDS for PostgreSQL supports cross-Region read replicas for versions 9.5.2 and later. During the creation of a cross-Region replica, the system performs the following steps:

  • Takes a snapshot at the source instance.
  • Copies the snapshot to the destination Region.
  • Configures replication slot streaming between the source and cross-Region replica.
  • Starts streaming write-ahead logs (WALs) from the source to replica. WAL files are PostgreSQL transaction logs. Before writing any data changes to permanent storage, these changes are first recorded in WAL files.

The easiest way to create a cross-Region replica for Amazon RDS for PostgreSQL is by completing the following steps:

  1. On the Amazon RDS console, choose your Amazon RDS for PostgreSQL source instance.
  2. From the Actions drop-down menu, choose Create Read Replica.
  3. Choose the appropriate target Region.

You can also use AWS CLI create-db-instance-read-replica to create a cross-Region read replica. Unlike streaming physical replication for replicas in the same Region, Amazon RDS for PostgreSQL uses physical streaming replication with slots for cross-Region replicas. In the traditional streaming replication, the parameter wal_keep_segments maintains the number of WAL files at the instance. The archive_command parameter determines the archival of any additional WAL files in Amazon S3. Physical replication slots resolve dependency on this parameter. Replication slots provide an automated way to make sure that the master does not remove WAL segments until the cross-Region replica receives them. This makes sure that during any network issues between the Regions, WALs are accumulated at the source instance and not archived. Restoring WAL files from the archived location adds replication lag and hampers replication performance.

The following diagram illustrates how Amazon RDS for PostgreSQL performs replication between a source and replica in different Regions.

Minimizing replication lag

There are many scenarios in which data changes happening on the source instance are not immediately reflected on the cross-Region replica. The following section describes best practices to minimize replication lag while using Amazon RDS for PostgreSQL cross-Region read replicas.

Instance class and storage type

For optimal performance, your replica instance must be the same class or above as the source instance. Replica instances not only replay similar write activity as the master, but also serve additional read workloads. Therefore, it’s imperative to have the replica on the proper instance class. If the replica is hosted on lower instance class, it may experience deficiency of CPU, memory, and throughput. This may result in overall poor replica performance, which causes replication lag.

Similarly, it’s a good practice to have the replica on the same storage type as your source instance. For example, if the master instance is using 500 GB Provisioned IOPS SSD type storage with 20000 PIOPS, a cross-Region replica may perform worse with 500 GB General Purpose SSD storage while using the source optimally. Due to the deficiency of resources (such as CPU, memory, and IOPS) at the replica if it fails to replay the write activity of the source instance, the replication lag increases.

Write activity at the source instance

To write data on RDS storage, first log the transactions in the WAL files and then write those changes to physical storage blocks. High write activity at the source instance can create a high influx of WAL files. The high generation of WAL files and replaying of these files on the read replica slows down overall replication performance. You can monitor the write performance at the source instance by monitoring the metrics WriteIOPS and WriteThroughput in Amazon CloudWatch.

The following screenshot shows high write activity at the source instance with 100 GB GP2 storage.

The following screenshot shows high cross-Region replica lag due to high write activity at the source instance.

Exclusive locks on the source instance

PostgreSQL commands such as ALTER TABLE, TRUNCATE, DROP TABLE, REINDEX, CLUSTER, and VACUUM FULL cause ACCESS EXCLUSIVE LOCK on the table, which is the most restrictive lock mode. This mode conflicts with all other lock modes. This lock prevents all other transactions from accessing the table for the lock’s hold duration. Generally, the table remains locked until the transaction ends. The result of the transaction that took the lock can lead to a replication conflict. For example, if you drop a table on the source instance, the replica does not apply the change if queries are accessing that table. For more information, see Explicit Locking on the PostgreSQL website.

To avoid such situations, you should monitor for this situation by periodically querying the pg_locks and pg_stat_activity catalog tables. For example, the following code monitors locks in PostgreSQL 9.6 and newer PostgreSQL instances:

SELECT pid, 
       usename, 
       pg_blocking_pids(pid) AS blocked_by, 
       QUERY AS blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

Alternatively, you can also use Amazon RDS Performance Insights to monitor database locks. Performance Insights monitors Amazon RDS DB instance load so that you can analyze and troubleshoot your database performance. With the Performance Insights dashboard, you can visualize the database load and filter the load by waits, SQL statements, hosts, or users. The following graph shows high LWLock:buffer_mapping and LWLock:buffer_content wait events between 22:23 and 22:53.

Parameter settings at the cross-Region replica

With default configurations, the source instance is not aware of the read query start time and information of accessed rows at the cross-Region replica. It cleans up (VACUUM) the old versions without knowing that these rows may be used at the replica instance. The replica has to replay this cleanup and forcibly cancel all queries reading these rows. Therefore, a long-running read query at the cross-Region replica can fail with the following error if the corresponding rows are updated or deleted at the source instance:

ERROR: canceling statement due to conflict with recovery 
Detail: User query might have needed to see row versions that must be removed.

You can avoid this situation by enabling the hot_standby_feedback parameter. This setting postpones VACUUM on corresponding tables until the old version of rows is accessed at the replica. The downside of this setting is that this can bloat the table severely and cause an out-of-storage issue and risk of transaction ID wraparound. For more information, see Routine Vacuuming on the PostgreSQL website.

Alternatively, you can enable parameters like max_standby_archive_delay or max_standby_streaming_delay on the replica instance so it can complete long-running read queries. Both parameters pause WAL replay at the replica if the corresponding source data is modified while read queries are running on the replica. A value of -1 lets the WAL replay wait until the read query completes. However, this pause increases replication lag indefinitely and causes high storage consumption at the source due to WAL accumulation. High values of these parameters can cause high replication lag at the cross-Region replica. For example, if a query is running at the replica for 30 minutes, the setting of max_standby_streaming_delay at 30 minutes can cause up to 30 minutes of replication lag and an accumulation of 30 minutes of WAL logs at the source instance. If you modify any of these three parameters, keep the source instance healthy and any replication lag manageable by watching out for long-running read queries at the replica instance.

You may also find the following SQL query useful. In you enter the following code on the read replica, you kill all read transactions running longer than 5 minutes:

SELECT
   pg_terminate_backend(pid) 
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';

Common issues while using an Amazon RDS for PostgreSQL cross-Region replica

You can resolve most of the issues of cross-Region replicas by managing the workload and modifying the instance configurations. This section discusses some of the common issues and resolutions related to Amazon RDS for PostgreSQL cross-Region replicas.

Replica is experiencing replication lag

If no workload is running on the source instance, you can expect up to 5 minutes of lag at the cross-Region replica. If you have unexpected lag, look at the write workload intensity at the source, verify that the instance and storage class match for the source and replica, and look for any existing network issues between the source and replica. Also, make sure that the source or replica are not throttling on resources such as CPU, memory, or IOPS.

If you still can’t determine the source of the lag, open a case with AWS Support to track any network issues between Regions. It’s also important to look at the pg_stat_activity view at the source instance and terminate certain running queries, which can prevent the replication process. You can monitor replication lag by using the CloudWatch metric ReplicaLag. You can also check that the LSN on the replica is increasing with the command select pg_current_wal_lsn();.

Low free storage at the source instance

The cross-Region read replica uses a physical replication slot created at the source instance. Because replication slots retain WAL files, cross-Region replication lag causes high WAL accumulation at the source instance and can eventually cause serious issues, such as being out of storage. You can use the CloudWatch metrics OldestReplicationSlotLag and TransactionLogsDiskUsage to determine the size of the replica lagging and the disk space used by transaction logs. The immediate solution to this issue is to drop the existing cross-Region replica and create a new one.

Setting a CloudWatch alarm for FreeStorageSpace allows some time to fix workload-related issues at the source instance and avoid replica recreate. For more information, see How can I create CloudWatch alarms to monitor the Amazon RDS free storage space and prevent storage full issues?

Summary

AWS continues to make your database replication performance more reliable and streamlined. Amazon RDS for PostgreSQL cross-Region replicas are an excellent tool for reducing read latency across Regions, cross-Region database migration, and low RPO/RTO disaster recovery. Although Amazon RDS manages single-click replica creation, you must be aware of the workload and instance configurations to avoid any replication issues. For more information about Amazon RDS for PostgreSQL read replica limitations, monitoring, and troubleshooting, see Working with PostgreSQL Read Replicas in Amazon RDS.

As always, AWS welcomes feedback. Please submit your thoughts or questions in the comments.

 


About the Author

 

Vivek Singh is a Senior Database Specialist with AWS focusing on Amazon RDS/Aurora PostgreSQL engines. He works with enterprise customers providing technical assistance on PostgreSQL operational performance and sharing database best practices.