
In today’s cloud-native world, database availability is critical. Amazon Aurora, a high-performance, fully managed relational database, is designed with fault tolerance in mind. But as with most things in tech, the way you configure it can make a huge difference when things go sideways.

In this post, I’ll walk you through a real incident from a production environment that we recently handled. It’s a good example of how things can still go wrong despite a seemingly resilient setup, and of what you can do to avoid similar issues.

The Scenario: Unexpected Downtime Despite Redundancy

One of our customers was running an Aurora cluster with a common setup: one Writer and one Reader instance. To manage database connections efficiently, they were using RDS Proxy, and read traffic went through the proxy’s read-only endpoint.

On paper, this sounds pretty solid:

  • A separate Reader handles read traffic.
  • RDS Proxy manages database connections efficiently.

However, the Reader instance experienced a host-level failure. At this point, the proxy’s read-only endpoint, which the application used for reads, just waited for the Reader to come back.

What didn’t happen?

  • The proxy didn’t reroute read traffic to the Writer, even though the Writer was still healthy and available.
  • As a result, read queries started failing, application performance tanked, and users experienced ten minutes of downtime for read-only queries.

Digging Into the Root Cause

When we looked deeper, here’s what we found:

  • The Reader became unreachable due to a host issue.
  • Aurora kicked off its automatic recovery, which involved replacing the host and rebooting.
  • Read queries started failing with the error:
    ERROR 9501 (HY000): Timed-out waiting to acquire database connection.
  • Since there were no other Readers, and RDS Proxy does not redirect read-only queries to the Writer, the read-only queries timed out.

Now, this might feel counterintuitive, but according to the AWS documentation, this is expected behavior.
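
Since the proxy won’t fall back on its own, any fallback has to live in the application. Here is a minimal sketch of that idea in Python. It is not something the customer was running: `read_with_fallback` and both callables are hypothetical names, and the exception types are placeholders, because a real driver will surface error 9501 as its own driver-specific exception class.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def read_with_fallback(
    run_on_reader: Callable[[], T],
    run_on_writer: Callable[[], T],
    retryable: Sequence[type] = (TimeoutError, ConnectionError),
) -> T:
    """Try the read-only path first; fall back to the writer path on failure.

    Both callables are supplied by the caller and are assumed to run the same
    read query against different endpoints.
    """
    try:
        return run_on_reader()
    except tuple(retryable):
        # The reader path is unavailable (for example, the proxy timed out
        # while waiting for a connection), so send this read to the writer.
        return run_on_writer()
```

Keep in mind that routing reads to the Writer adds load to it, so whether this trade-off is acceptable depends on your workload.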

What We Learned: Tips for Better Availability

If you’re using Aurora with RDS Proxy, there are a few things you should definitely consider to avoid this kind of scenario.

Option 1: Improve RDS Proxy Configuration

1. Always Have More Than One Reader

  • RDS Proxy’s read-only endpoint depends on at least one Reader being available.
  • If there’s only one Reader instance and it goes down, read queries will time out while waiting for it to come back.
  • Solution: Always add a second Reader. It can be a smaller instance to keep costs down.
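
A quick way to catch this before an incident is to check how many Readers a cluster actually has. The sketch below is one way to do it with boto3, assuming credentials and region are already configured. The helper names are my own, and note that it only counts cluster members; it does not tell you whether each Reader is healthy.

```python
from typing import Any, Dict, List


def reader_ids(cluster: Dict[str, Any]) -> List[str]:
    """Return identifiers of non-writer members from one DBClusters entry."""
    return [
        member["DBInstanceIdentifier"]
        for member in cluster.get("DBClusterMembers", [])
        if not member.get("IsClusterWriter", False)
    ]


def has_redundant_readers(cluster: Dict[str, Any], minimum: int = 2) -> bool:
    """True if the cluster has at least `minimum` Reader instances."""
    return len(reader_ids(cluster)) >= minimum


def fetch_cluster(cluster_id: str) -> Dict[str, Any]:
    """Fetch the cluster description from the RDS API (needs AWS access)."""
    import boto3  # imported lazily so the helpers above work without boto3

    rds = boto3.client("rds")
    response = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
    return response["DBClusters"][0]
```

For example, `has_redundant_readers(fetch_cluster("my-cluster"))` (the cluster name is a placeholder) would have returned `False` for the setup in this incident.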

2. Set Failover Priorities Wisely

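  • Aurora lets you assign a failover priority (promotion tier, from 0 to 15) to each instance in the cluster. A lower tier number means a higher priority for promotion when the Writer fails.
  • Don’t keep all instances at the same priority level.
  • Tip: Keep the Writer at a higher priority and put Readers at different lower tiers, so the promotion order during recovery is explicit rather than left to chance.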
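
As a rough sketch of how the tiers could be set with boto3: the planning function below gives each Reader its own tier, and the apply function pushes the values through `modify_db_instance` and its `PromotionTier` parameter. The function names, the instance identifiers, and the choice to start Readers at tier 1 are all assumptions for illustration.

```python
from typing import Dict, List


def plan_promotion_tiers(
    readers_in_preference_order: List[str], first_tier: int = 1
) -> Dict[str, int]:
    """Give each Reader a distinct tier; a lower number is promoted first."""
    tiers: Dict[str, int] = {}
    for offset, instance_id in enumerate(readers_in_preference_order):
        tier = first_tier + offset
        if not 0 <= tier <= 15:
            raise ValueError("Aurora promotion tiers must be between 0 and 15")
        tiers[instance_id] = tier
    return tiers


def apply_promotion_tiers(tiers: Dict[str, int]) -> None:
    """Apply the planned tiers through the RDS API (needs AWS access)."""
    import boto3  # imported lazily so planning works without boto3

    rds = boto3.client("rds")
    for instance_id, tier in tiers.items():
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            PromotionTier=tier,
            ApplyImmediately=True,
        )
```

Keeping the planning step separate from the API call makes it easy to review the intended tiers before anything is changed.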

Option 2: Remove RDS Proxy and Use Cluster Endpoints

Instead of relying on RDS Proxy, connect to Aurora’s cluster-level endpoints directly.

  • Cluster-level endpoints (the cluster endpoint for writes and the reader endpoint for reads) follow the cluster’s current topology, unlike instance endpoints, which are tied to specific instances.
  • Implement connection pooling at the application level to manage database connections efficiently.
  • This avoids the dependency on RDS Proxy and reduces unnecessary waiting time in case of failovers.
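
To give a feel for what application-level pooling involves, here is a deliberately small sketch in Python. `SimplePool` is a hypothetical class, not a library API, and `connect` stands for whatever function your driver uses to open a connection to the Aurora endpoint. In practice you would more likely use the pooling that ships with your driver or framework.

```python
import queue
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class SimplePool:
    """Fixed-size pool that opens connections lazily and drops broken ones."""

    def __init__(self, connect: Callable[[], Any], size: int = 5,
                 timeout: float = 2.0) -> None:
        self._connect = connect
        self._timeout = timeout
        self._idle: "queue.Queue[Any]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(None)  # empty slot; a connection is opened on first use

    @contextmanager
    def connection(self) -> Iterator[Any]:
        # Raises queue.Empty if every slot stays busy for `timeout` seconds.
        conn = self._idle.get(timeout=self._timeout)
        try:
            if conn is None:
                conn = self._connect()
            yield conn
        except Exception:
            # Discard the connection on any error so the next caller
            # reconnects, which re-resolves the endpoint after a failover.
            close = getattr(conn, "close", None)
            if callable(close):
                close()
            conn = None
            raise
        finally:
            self._idle.put(conn)  # return the slot (connection or empty)
```

Usage looks like `with pool.connection() as conn: ...`. Dropping connections on error is the important design choice here: a pooled connection that survived a failover may still point at the old instance.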

Wrapping Up

At the end of the day, Aurora gives you a solid foundation, but it’s up to us to build a setup that’s actually resilient.

Just adding one more Reader and adjusting failover priorities could have prevented the downtime in this case. These might sound like small tweaks, but in production, they can make the difference between smooth failover and frustrated users.

Also, remember:

  • RDS Proxy is powerful, but its read-only endpoint will not fall back to the Writer when no Reader is available.
  • Make sure you understand how its endpoints work, especially in failover scenarios.
Meet the Author
  • Himanshu Sengar
    Senior DevOps Engineer

    Himanshu is a Senior DevOps Engineer with hands-on experience in cloud infrastructure, automation, and implementing modern DevOps best practices.
