This is a guest blog post by Arman Zakaryan (Director of Hosting Operations) and Michael Martin (Software Engineer) from Pagely. Pagely, in their own words, “provides a massively-scalable managed hosting solution for WordPress. They work with the largest and most innovative brands in the world to create tailor-made WordPress hosting solutions. Clients enjoy service and support from an all-star team of veteran DevOps engineers who prioritize customer happiness above all else. Pagely is Enterprise WordPress hosting at its finest.”
WordPress powers 30 percent of all websites. It is the content management system that we’ve built our business on at Pagely. Our managed WordPress hosting runs entirely on Amazon Web Services. In the same way that Amazon has freed customers from the worries of managing physical hardware and data centers, Pagely enables clients to stop worrying about managing WordPress and instead focus on their mission. Pagely’s dedicated support and experience with successfully running WordPress at scale pairs nicely with Amazon’s technology offerings.
One of the most crucial aspects of running WordPress is the MySQL database. Amazon empowers companies like Pagely to manage operational duties much more gracefully and efficiently than with other solutions. This is due to the innovations that are present in Amazon Relational Database Service (Amazon RDS), and especially with Amazon Aurora.
In this post, we illustrate how Amazon Aurora makes it a breeze to increase capacity on production workloads. We walk you through how the built-in features in Aurora make this process more efficient and dependable. You can certainly achieve similar results using a self-managed approach with enough time and the right in-house talent, or even with other commercially available solutions. But the process takes far less effort, is better integrated, and is more approachable when you use Aurora.
How does Aurora go above and beyond MySQL?
In one word: practicality.
Aside from the performance benefits, upkeep is straightforward. Things like creating a new replica and performing a failover are faster and provide more control than on an Amazon RDS instance using the MySQL or MariaDB engines. Taking it a step further, compared to a “vanilla” MySQL environment not using RDS, Aurora offers more flexibility and ease of use.
We previously blogged about the performance gains for WordPress on Aurora. But those gains don’t make a difference without the flexibility to handle what various workloads might throw at you. Building these same features in-house could take longer, and deploying alternative commercial solutions like MySQL Cluster or Galera Cluster into production for the first time might take weeks. If you’re not using AWS yet, AWS Database Migration Service (AWS DMS) and AWS Direct Connect are tools that can help you migrate your onsite MySQL and PostgreSQL systems to Aurora.
What does Amazon Aurora do?
Amazon Aurora is a relational database engine that combines the speed and reliability of a high-end commercial database with the simplicity and cost-effectiveness of an open-source database. You can read more about it in the User Guide for Aurora.
The standard configuration for an Aurora deployment is a Multi-AZ cluster with a primary (writer) instance and a single secondary (read replica) instance. Splitting read and write queries is possible, and even necessary in some cases, but WordPress does not natively support it, so the replica goes unused in normal operation. Instead, the replica serves as a resizable failover target for when we need to scale up for more capacity. We keep it online because we’ve found it’s generally faster to resize an existing instance than it is to spin up a new one.
It’s important for us to know right away if a database instance starts getting overloaded. We must be able to trace the cause of that problem back to a specific query or customer database schema, increase capacity if needed, and minimize the amount of downtime during modifications.
To monitor the health of our Aurora clusters, we use Amazon CloudWatch. Alarms inform our team of operational issues like high CPU usage or a spike/drop in select throughput. We can even be notified of things like unexpected failover events using RDS event notifications. These are valuable insights when managing large-scale WordPress deployments.
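As a rough illustration of the failover notifications mentioned above, the sketch below builds the arguments for an RDS event subscription that forwards cluster failover events to an SNS topic. It assumes the boto3 SDK; the function name, the subscription naming scheme, and the identifiers in the usage comment are hypothetical.

```python
def failover_subscription_params(cluster_id, sns_topic_arn):
    """Build keyword arguments for the RDS CreateEventSubscription call."""
    return {
        # Hypothetical naming scheme; any unique name works.
        "SubscriptionName": f"{cluster_id}-failover-events",
        "SnsTopicArn": sns_topic_arn,
        # Subscribe to cluster-level events, limited to this cluster.
        "SourceType": "db-cluster",
        "SourceIds": [cluster_id],
        # Only forward events in the "failover" category.
        "EventCategories": ["failover"],
        "Enabled": True,
    }


# Usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   rds = boto3.client("rds")
#   rds.create_event_subscription(**failover_subscription_params(
#       "example-cluster",
#       "arn:aws:sns:us-east-1:123456789012:example-topic"))
```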
How do we work with Aurora?
Our clients range from Fortune 500 enterprise companies to top-tier universities in the public sector, and a wide range of B2B users in between. Spikes can come from updates, traffic, promotions, and other sources, and our clients’ sites need to be ready before they happen. Here’s how Aurora can support our efforts during scaling and for recovery from high-load events.
Managing sudden spikes
Let’s consider an Aurora DB cluster powering a segment of customer sites’ databases that experiences a sudden spike in CPU usage, consuming most of the spare capacity we’ve reserved.
This situation could happen for a number of reasons—for example, if a larger group of databases receives a buggy update from a tool that they have in common. During initial assessment of the load issue, we’ll decide if it’s necessary to resize the read replica to have higher capacity and perform a failover.
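The resize-and-failover step could be scripted along these lines. This is an illustrative sketch, not the tooling described in this post: it assumes a boto3 RDS client is passed in as `rds`, and the function name, identifiers, and instance class in the usage comment are hypothetical.

```python
def scale_up_via_replica(rds, cluster_id, replica_id, larger_class):
    """Resize the idle read replica, then promote it to writer."""
    # Resize the replica now rather than in the next maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier=replica_id,
        DBInstanceClass=larger_class,
        ApplyImmediately=True,
    )
    # Block until the replica reports "available" again. The status can lag
    # briefly after the modify call, so real tooling may also want to confirm
    # the new instance class with describe_db_instances.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=replica_id
    )
    # Promote the resized replica to be the cluster's writer.
    rds.failover_db_cluster(
        DBClusterIdentifier=cluster_id,
        TargetDBInstanceIdentifier=replica_id,
    )


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   scale_up_via_replica(boto3.client("rds"), "example-cluster",
#                        "example-replica", "db.r4.2xlarge")
```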
Next, we validate that the CPU usage is within the acceptable thresholds following failover, and that actual application performance has returned to a nominal state.
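One way to script the CPU check is to pull recent `CPUUtilization` datapoints from CloudWatch and compare their average to a threshold. The sketch below assumes a boto3 CloudWatch client is passed in; the function names, the 10-minute window, and the 50 percent default threshold are illustrative assumptions, and it does not cover application-level performance.

```python
from datetime import datetime, timedelta, timezone


def average_cpu(cloudwatch, instance_id, minutes=10, now=None):
    """Return the mean CPUUtilization over the last `minutes`, or None."""
    end = now or datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,  # one datapoint per minute
        Statistics=["Average"],
    )
    points = [p["Average"] for p in resp["Datapoints"]]
    return sum(points) / len(points) if points else None


def cpu_within_threshold(cloudwatch, instance_id, threshold=50.0):
    """True only if data exists and average CPU is at or below threshold."""
    cpu = average_cpu(cloudwatch, instance_id)
    return cpu is not None and cpu <= threshold


# Usage (requires boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   ok = cpu_within_threshold(boto3.client("cloudwatch"), "example-replica")
```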
With the immediate capacity issue addressed, we identify the cause of the higher resource usage, which could be something like the following:
- A bad plugin update on WordPress
- A bad query
- A bad site
- A site missing a web cache, or something else
With Aurora, we can scale up to absorb the unexpected increase in workload, address the issue that caused the spike, and monitor for improvement. After systems stabilize, we can switch back to the original writer and downsize the reader once again.
Facilitating uninterrupted growth
Now consider an enterprise client with a private Aurora cluster whose site has been growing in popularity. Our evaluation of the application indicates that their database endpoint requires more capacity. That can mean CPU, memory, or even network throughput, all of which a larger instance provides more of.
The client, of course, does not want to experience extended downtime, so we need a solution that keeps their site up and running well. The way we handle this is to add a temporary read replica of equivalent size to their Aurora cluster. We perform a failover to it, and then resize the original instance while it acts as the reader endpoint.
After the resize is complete, we fail over once again to the original writer and confirm that the new capacity is sufficient before removing the temporary replica. This process suits integrations with Aurora that require instance names to stay the same for automation to work correctly. If your tooling can pick up a new instance name without trouble, then just create a bigger replica, fail over, and remove the other instance.
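The temporary-replica procedure might be automated roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: it takes a boto3 RDS client as `rds`, all function names and identifiers are hypothetical, and the `engine` default of `aurora-mysql` is an assumption that must match your cluster. Removing the temporary replica is left as a separate manual step so that capacity can be verified first.

```python
import time


def current_writer(rds, cluster_id):
    """Return the identifier of the cluster's current writer instance."""
    cluster = rds.describe_db_clusters(
        DBClusterIdentifier=cluster_id
    )["DBClusters"][0]
    for member in cluster["DBClusterMembers"]:
        if member["IsClusterWriter"]:
            return member["DBInstanceIdentifier"]
    return None


def failover_to(rds, cluster_id, target_id, attempts=60, delay=10):
    """Request a failover, then poll until the target is the writer."""
    rds.failover_db_cluster(
        DBClusterIdentifier=cluster_id,
        TargetDBInstanceIdentifier=target_id,
    )
    for _ in range(attempts):
        if current_writer(rds, cluster_id) == target_id:
            return
        time.sleep(delay)
    raise TimeoutError(f"{target_id} did not become writer of {cluster_id}")


def resize_original_writer(rds, cluster_id, writer_id, temp_id,
                           current_class, new_class, engine="aurora-mysql"):
    """Resize the writer in place, using a temporary replica as a stand-in."""
    available = rds.get_waiter("db_instance_available")
    # 1. Add a temporary replica of equivalent size to the cluster.
    rds.create_db_instance(
        DBInstanceIdentifier=temp_id,
        DBClusterIdentifier=cluster_id,
        DBInstanceClass=current_class,
        Engine=engine,
    )
    available.wait(DBInstanceIdentifier=temp_id)
    # 2. Fail over so the temporary replica becomes the writer.
    failover_to(rds, cluster_id, temp_id)
    # 3. Resize the original instance while it acts as a reader.
    rds.modify_db_instance(
        DBInstanceIdentifier=writer_id,
        DBInstanceClass=new_class,
        ApplyImmediately=True,
    )
    available.wait(DBInstanceIdentifier=writer_id)
    # 4. Fail back to the original, now larger, instance.
    failover_to(rds, cluster_id, writer_id)


# After confirming capacity, remove the temporary replica (placeholder name):
#   rds.delete_db_instance(DBInstanceIdentifier="example-temp-replica")
```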
The very best part about this is that the process works as reliably for a database footprint of hundreds of thousands of tables as it does for more modest workloads. Some of the WordPress applications we host have massive amounts of data (relative to WordPress norms), and we consider Aurora the obvious choice for them.
What tips do we have for using Aurora?
Here are some things we’ve learned along the way that might save you time with your own Aurora usage.
Inform clients about upcoming maintenance
Always communicate with your clients on scheduled maintenance, even if you anticipate that little to no downtime will occur. They’ll appreciate that you are keeping them informed. It also gives them a chance to notify their own clients. If there does end up being a little downtime, it’s okay because expectations were properly set. Standing up a status page that lets your customers subscribe to get maintenance announcements helps make this an efficient and routine step.
Develop custom tools for your use case
You can use the Amazon RDS console to perform all the necessary tasks. But the underlying API and software development kit enable integration with a workflow that better fits your specific use case. We highly recommend developing custom tooling that works with the API. This lets you delegate complex operations directly to your customers or internally to a wider net of staff members in a more simplified, mission-oriented format—without giving out direct access to RDS.
What are some things that might work? Using orchestration and pipeline tools such as Rundeck, Red Hat Ansible, or Jenkins, you can tap into AWS APIs and create your own way of conducting any given scenario.
Monitor usage and other trends
It’s important to set up CloudWatch alarms ahead of time. Make sure that they are visible to your team via email or with a pager system.
At the very least, you should be monitoring your Aurora clusters for CPU usage. Integrating this with Amazon SNS and PagerDuty allows your own Site Reliability Engineering (SRE) team to be informed of worrying trends. They can then initiate the proper remediation protocols quickly. Explore all the metrics available, especially following a load issue, to better understand what additional alarms would help provide early warning of problems for your use case.
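A basic CPU alarm that notifies an SNS topic could be defined like this. The sketch assumes the boto3 SDK; the function name, the 80 percent threshold, and the five-minute evaluation window are illustrative choices rather than recommendations from this post, and forwarding from SNS to a pager service is configured separately.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold=80.0):
    """Build keyword arguments for the CloudWatch PutMetricAlarm call."""
    return {
        "AlarmName": f"{instance_id}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": instance_id}
        ],
        "Statistic": "Average",
        # Alarm when the one-minute average stays above the threshold
        # for five consecutive periods.
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic that fans out to email or a pager integration.
        "AlarmActions": [sns_topic_arn],
    }


# Usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(
#       "example-writer",
#       "arn:aws:sns:us-east-1:123456789012:example-topic"))
```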
Plan for workload spikes
Always maintain spare CPU and memory capacity on Aurora clusters handling production workloads.
Generally, maintaining a baseline CPU utilization of 50 percent or lower is the goal for workloads that experience significant usage spikes. The spare headroom lets you absorb most spikes and avoid the need for a failover wherever possible.
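To make the headroom arithmetic concrete, the sketch below estimates how large a load multiple an instance can absorb before CPU saturates. It assumes, as a simplification, that CPU usage scales linearly with load; the function names are hypothetical.

```python
def headroom_factor(baseline_cpu_percent):
    """Load multiple that would push CPU to 100%, assuming linear scaling."""
    if not 0 < baseline_cpu_percent <= 100:
        raise ValueError("baseline must be in the range (0, 100]")
    return 100.0 / baseline_cpu_percent


def exceeds_baseline_goal(baseline_cpu_percent, goal=50.0):
    """True if the baseline is above the goal and spare capacity is thin."""
    return baseline_cpu_percent > goal


# At a 50 percent baseline, load can roughly double before CPU saturates.
print(headroom_factor(50))        # 2.0
print(exceeds_baseline_goal(65))  # True
```

Under that linear assumption, a 25 percent baseline leaves room for roughly a fourfold spike, while a 65 percent baseline leaves room for only about a 1.5x increase.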
While we’re on the topic, Aurora Serverless with auto scaling capabilities is in the works, and it could be a real game-changer. Everything we covered so far is operating under the strategy of provisioning for peak and failing over to a larger instance when needed. This is an effective strategy, but nonetheless requires a human being or your own automation to drive those decisions and events. Serverless promises to mostly eliminate this as an area of concern. Although you will still experience spikes, you can focus on tracking down the cause of that problem instead of worrying about maintaining sufficient headroom. We are excited to see this new product take shape.
Keeping a read replica online as a failover target is crucial for instances handling workloads for multiple customers or high-end enterprise clients who need a rapid scale-up path. Creating multiple replicas with Aurora is an option that can be used as part of a comprehensive disaster recovery (DR) plan spanning across AWS Regions.
Even if you roll with a single-writer Aurora cluster for cost savings, adding an Aurora replica usually takes less time than adding a replica on non-Aurora RDS MySQL. The replica can be brought online as needed and used for failover/scaling-up, and then removed later on.
With a non-Aurora RDS cluster, these processes would still be doable, but Aurora helps to streamline them. On self-managed non-RDS MySQL servers, you need more in-depth knowledge and must handle each step with care.
Amazon Aurora, as a package, comes equipped with built-in tools that are available to you right off the bat. Having these tools saves us the time that we would have to spend building something ourselves. It also provides a solution that we can trust and build upon, which, in an industry like ours, is invaluable.
About the Authors
Arman Zakaryan is the Director of Hosting Operations at Pagely.
Michael Martin is a software engineer at Pagely.