Amazon Aurora Notes

At Marqeta, we strive to continually evolve our platform to make it scalable and highly performant. We rely heavily on MySQL, and have many MySQL instances hosted across data centers, as well as on EC2 instances for various purposes. While refactoring some of our APIs, we decided to give Amazon Aurora a try. Having heard about Aurora’s performance and high availability, we saw this as a great opportunity. Setting up a single-node cluster (one db.t2.small) via the Console was the first step. After a few clicks, we had our first Aurora cluster running happily. The next step was to fire up our regression tests while pointing to a schema in Aurora. Our database fixtures worked like a charm, and we were surprised to see all of our (thousand+) tests pass. We knew Aurora was a drop-in replacement for MySQL, but we still expected some drama. Great first impression!

This post is a collection of notes from online resources on Aurora (listed under References) that helped us learn more about the service, along with the excellent documentation on AWS.

History

Work on Aurora started around 2011, and it has been available as a service for about 2.5 years. Traditional relational databases are built on a monolithic stack, with SQL, transactions, caching and logging all lumped together. This has not changed since the 1970s. Scaling out a monolith means replicating the same stack many times.

Motivation

The motivation to build Aurora was to combine the performance and availability of commercial databases with the simplicity and cost effectiveness of open source databases. Apart from being a drop-in replacement for MySQL and its clones, the following principles form the core:

Scale out, distributed design
Service oriented to leverage AWS services
Automation of administrative tasks that burden DBAs

Distributed, Scale out, Fault tolerant and Multitenant

The logging and storage layers are peeled off from the monolith and made distributed and multitenant. This allows for optimization at the network level as well as the storage level. Storage is distributed across three Availability Zones, with two copies in each zone. The storage volume is also striped across hundreds of storage nodes. Finally, everything comes together with a purpose-built storage layer protocol that uses redo logs (compared to the MySQL binlogs and buffers). The traditional storage protocols are replaced with simple redo log streams.

Integration with AWS Services

Aurora integrates with Lambda - stored procedures and triggers can invoke Lambda functions
S3 is central to Aurora - snapshots and backups are stored in S3
IAM roles and policies are used to manage DB access control
Logs can be ingested into CloudWatch and then used to create alarms
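
To make the Lambda integration a little more concrete, here is a minimal sketch of the receiving side: a Python Lambda handler that a trigger or stored procedure could invoke. The event fields (table, action, row) are assumptions made up for this illustration; the real payload is whatever JSON the database side chooses to send.

```python
import json


def handler(event, context):
    """Hypothetical Lambda entry point invoked from an Aurora trigger.

    The event shape (table, action, row) is an assumption for this sketch,
    not a format defined by Aurora.
    """
    table = event.get("table", "unknown")
    action = event.get("action", "unknown")
    row = event.get("row", {})

    # A real function might publish to a queue or call another service here.
    message = f"{action} on {table}: {json.dumps(row, sort_keys=True)}"
    return {"status": "ok", "message": message}
```

Because the handler is a plain function, it can be exercised locally by calling it with a sample dictionary before wiring it to the database.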

Automate tasks

Customers can focus on schema design and query optimization, while AWS manages failover, backup/recovery, snapshotting, etc. Because the design is very different from that of traditional databases, many database administration activities complete much more quickly, with minimal to no impact on database performance while they run.

Pricing

Aurora has a very simple pay-as-you-go pricing model. Customers pay only for the storage and IO they use. In other words, no provisioning is needed, which removes the risk of over- or underestimating IO and storage. Billing grows linearly with traffic. Aurora costs 1/10th as much as commercial databases like Oracle and SQL Server. TCO is lower even compared to MySQL on EC2, because there is no need for a standby replica, there are fewer IOPS to pay for, and the same workload needs smaller (and fewer) instances.

Performance

Aurora read and write throughput scales with instance size. In other words, MySQL running on EC2 does not scale linearly with instance size, whereas Aurora does.

The SysBench benchmark tests put Aurora up to 5X faster than MySQL on a 32-core, 244GB RAM setup.

120K writes/sec on Aurora vs. 25K writes/sec on MySQL
600K reads/sec on Aurora vs. 150K reads/sec on MySQL
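
As a quick sanity check on the "up to 5X" figure, the ratios can be computed directly from the two SysBench numbers above. This is just arithmetic on the quoted figures, not a benchmark; the function name is our own.

```python
def speedup(aurora_ops_per_sec, mysql_ops_per_sec):
    """Return how many times faster Aurora is for a given throughput pair."""
    return aurora_ops_per_sec / mysql_ops_per_sec


# Throughput figures quoted in the SysBench results above.
write_speedup = speedup(120_000, 25_000)
read_speedup = speedup(600_000, 150_000)

print(f"writes: {write_speedup:.1f}x, reads: {read_speedup:.1f}x")
# prints "writes: 4.8x, reads: 4.0x"
```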

Additionally, Amazon’s own tests report the following at 30K IOPS:

With 5K connections - 8X faster
With 10K tables - 11X faster
With 1TB DB size - 21X faster
TPCC benchmark - 136X faster with 800GB DB size

Design

Aurora deviates from a traditional storage system by using distributed, heavily replicated storage based on redo log records. This makes the design much simpler, and dramatically reduces the frequency and size of the IO needed to perform the same tasks as MySQL. This is true even after replicating the data six times (2x across 3 AZs). The IO savings come from not having to write to many buffers and logs synchronously, as MySQL does. On average, there is 7-9 times less traffic than with MySQL. Another optimization is asynchronous group commits, which replace disk IO with network IO. The threading model is based on an adaptive thread pool, which can gracefully handle 5000+ concurrent connections on an r3.8xlarge instance. The locking model allows concurrent access to the lock chains, unlike the single lock in MySQL.

A good overview can be found here.

High Availability

6-way replication - 2 replicas in each of 3 Availability Zones
4 out of 6 write quorum for durable writes
3 out of 6 read quorum for durable reads
Peer to peer replication for repairs
Volume striped across hundreds of storage nodes
Up to 15 promotable read replicas, which can be distributed across 3 AZs
Automatic monitoring of the master, and one of the replicas is promoted on failure
There are read replica endpoints, which load balance read traffic across the read replicas
Replication protocol does not use binlogs, but uses redo log streams - this means the replication lag is in milliseconds vs. seconds (or minutes) for MySQL
Automatic failover takes 15-30 seconds. The heartbeat is checked every second, and Aurora waits for 5 beats to fail before failing over. Once a failover is triggered, promotion involves DNS propagation (30s) and DB recovery (5-10s) compared to several minutes for MySQL
Cross-region read replicas can be used for faster DR and enhanced data locality
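
The quorum numbers in the list above can be checked with a little arithmetic. The sketch below uses the standard quorum conditions (read and write sets must overlap, and two write sets must overlap) and then counts the copies that survive a failure. The function names are ours, and the failure scenarios are illustrative assumptions rather than a description of Aurora internals.

```python
COPIES = 6        # 2 copies in each of 3 Availability Zones
WRITE_QUORUM = 4  # 4 out of 6
READ_QUORUM = 3   # 3 out of 6


def quorums_overlap(copies, write_quorum, read_quorum):
    """Every read set shares a copy with every write set iff Vr + Vw > V."""
    return read_quorum + write_quorum > copies


def writes_overlap(copies, write_quorum):
    """Any two write sets share at least one copy iff 2 * Vw > V."""
    return 2 * write_quorum > copies


def surviving_copies(failed_azs, extra_failed_nodes=0, copies_per_az=2, azs=3):
    """Copies still reachable after losing whole AZs plus individual nodes."""
    return copies_per_az * (azs - failed_azs) - extra_failed_nodes


print(quorums_overlap(COPIES, WRITE_QUORUM, READ_QUORUM))  # True
print(writes_overlap(COPIES, WRITE_QUORUM))                # True

# Losing one full AZ leaves 4 copies: the write quorum is still reachable.
print(surviving_copies(failed_azs=1) >= WRITE_QUORUM)      # True

# Losing one AZ plus one more node leaves 3 copies: reads are still
# possible, but the write quorum is not.
left = surviving_copies(failed_azs=1, extra_failed_nodes=1)
print(left >= READ_QUORUM, left >= WRITE_QUORUM)           # True False
```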

Ease of use

Automated storage management - storage volumes of up to 64 TB; a volume starts at 10 GB and more storage is added automatically in 10 GB increments
Continuous incremental backups with no performance impact
Automatic hotspot management, encryption, mirror repair, re-striping
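
The storage growth rule above (start at 10 GB, grow in 10 GB increments, cap at 64 TB) can be sketched as a small function. This only illustrates the arithmetic; the function name is hypothetical, and treating 1 TB as 1024 GB is an assumption of this sketch.

```python
import math

INCREMENT_GB = 10
MAX_VOLUME_GB = 64 * 1024  # 64 TB, assuming 1 TB = 1024 GB


def allocated_gb(data_gb):
    """Return the volume size implied by the 10 GB increment rule."""
    if data_gb < 0:
        raise ValueError("data size cannot be negative")
    if data_gb > MAX_VOLUME_GB:
        raise ValueError("data size exceeds the 64 TB volume limit")
    # A volume always has at least one increment, even when empty.
    increments = max(1, math.ceil(data_gb / INCREMENT_GB))
    return min(increments * INCREMENT_GB, MAX_VOLUME_GB)


print(allocated_gb(0))     # 10
print(allocated_gb(10.5))  # 20
print(allocated_gb(95))    # 100
```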

Security and compliance

Aurora uses KMS integration for encryption at rest for the storage volume
All the replication and network communication is over SSL
Compliant with industry standards - SOC, ISO, PCI DSS, HIPAA

Monitoring

More than 50 system and operating system level metrics are captured at 1-60s granularity. These metrics can be exported to CloudWatch Logs.

Use Cases

Consolidate multiple MySQL shards into 1 large Aurora instance
NoSQL workloads - massively concurrent event stores
Near real-time analytics and reporting on read replicas with minimal lag
Event driven data pipelines via Lambda
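
For the analytics and reporting use case, an application typically sends writes to the primary and reads to the read replica endpoint mentioned under High Availability. Below is a minimal sketch of that routing decision. The hostnames are placeholders, the function is hypothetical, and the statement classification is deliberately naive (for example, it would misroute a locking read such as SELECT ... FOR UPDATE).

```python
# Placeholder hostnames; real endpoints come from the cluster's configuration.
WRITER_ENDPOINT = "writer.example.internal"
READER_ENDPOINT = "reader.example.internal"

# Statements treated as read-only in this sketch.
READ_ONLY_KEYWORDS = ("select", "show", "describe", "explain")


def endpoint_for(sql):
    """Pick an endpoint based on the first keyword of the statement."""
    words = sql.split()
    first_word = words[0].lower() if words else ""
    if first_word in READ_ONLY_KEYWORDS:
        return READER_ENDPOINT
    # Anything else, including empty input, goes to the writer to be safe.
    return WRITER_ENDPOINT


print(endpoint_for("SELECT count(*) FROM transactions"))
print(endpoint_for("INSERT INTO transactions (id) VALUES (1)"))
```

Defaulting unknown statements to the writer is the conservative choice here: a read sent to the writer still works, while a write sent to a replica fails.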

Conclusion

While these notes do not cover recent developments with Aurora, such as PostgreSQL support, I hope they convince you to take Aurora for a spin. There is a lot of information available in the official documentation, as well as on other blogs like AWS Open Guide.