At Marqeta, we strive to continually evolve our platform to make it scalable and highly performant. We rely heavily on MySQL, and have many MySQL instances hosted across data centers, as well as on EC2 instances for various purposes. While refactoring some of our APIs, we decided to give Amazon Aurora a try. Having heard about Aurora’s performance and high availability, we saw this as a great opportunity.

Setting up a single-node cluster (one db.t2.small) via the Console was the first step. After a few clicks, we had our first Aurora cluster running happily. The next step was to fire up our regression tests while pointing to a schema in Aurora. Our database fixtures worked like a charm, and we were surprised to see all of our (thousand+) tests pass - while we knew Aurora was a drop-in replacement for MySQL, we still expected some drama. Great first impression!

This post is a collection of notes from online resources on Aurora (listed under References) that helped us learn more about the service, along with the excellent documentation on AWS.

History

Work on Aurora started around 2011, and it has been around as a service for about 2.5 years. Traditional relational databases are built on a monolithic stack, with SQL, Transactions, Caching and Logging all lumped together. This has not changed since the 1970s. Scaling out a monolith would mean replicating the same stack many times.

Motivation

The motivation to build Aurora was to combine the performance and availability of commercial databases with the simplicity and cost effectiveness of open source databases. Apart from being a drop-in replacement for MySQL and its clones, the following principles form the core -

  • Scale out, distributed design
  • Service oriented to leverage AWS services
  • Automation of administrative tasks that burden DBAs

Distributed, Scale out, Fault tolerant and Multitenant

The Logging and Storage layers are peeled off from the monolith to be distributed and multitenant. This allows for optimization at the network as well as the storage level. Storage is distributed across three availability zones, with two copies in each availability zone. The storage volume is also striped across hundreds of storage nodes. Finally, everything comes together with a purpose-built storage layer protocol which utilizes redo logs (compared to the MySQL binlogs and buffers). The traditional storage protocols are replaced with simple redo log streams.

Integration with AWS Services

  • Aurora integrates with Lambda - stored procedures and triggers can invoke Lambda functions
  • S3 is central to Aurora - the snapshots and backups, etc. are stored in S3
  • IAM roles and policies are used to manage DB access control
  • The logs can be ingested into CloudWatch, and can then be used to create alarms

Automate tasks

Customers can focus on schema design and query optimization, while AWS manages failover, backup/recovery, snapshotting, etc. Because the design is very different from that of traditional databases, many database management and administration activities complete much more quickly, with minimal to no impact on database performance while they run.

Pricing

Aurora has a very simple pay-as-you-go pricing model. Customers only pay for the storage and IO used. In other words, no provisioning is needed, which removes the risk of over- or underestimating IO and/or storage. Billing grows linearly with traffic. Aurora costs 1/10th as much as commercial databases like Oracle and SQL Server. TCO is lower even compared to MySQL on EC2, because there is no need for a standby replica, there are fewer IOPS to pay for, and the same workload needs smaller (and fewer) instances.

Performance

Aurora scales for reads and writes with instance size. In other words, MySQL running on EC2 will not scale linearly with the instance size, but Aurora will.

The SysBench benchmark tests put Aurora up to 5X faster than MySQL on a 32-core, 244GB RAM setup.

  • 120K writes/sec vs. 25K writes/sec on MySQL
  • 600K reads/sec on Aurora vs. 150K reads/sec on MySQL
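As a quick sanity check on the "up to 5X" figure, the ratios can be worked out directly from the two throughput numbers above. This is only arithmetic on the quoted figures; the dictionary layout and the `speedup` helper are illustrative names, not part of any benchmark tooling.

```python
# Throughput figures quoted above, in operations per second.
BENCHMARKS = {
    "writes": {"aurora": 120_000, "mysql": 25_000},
    "reads": {"aurora": 600_000, "mysql": 150_000},
}


def speedup(aurora_ops: int, mysql_ops: int) -> float:
    """Return how many times higher the Aurora throughput is."""
    return aurora_ops / mysql_ops


speedups = {
    name: speedup(figures["aurora"], figures["mysql"])
    for name, figures in BENCHMARKS.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Writes come out at roughly 4.8x and reads at 4x, which is consistent with the "up to 5X" headline.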

Additionally, Amazon’s own tests report the following at 30K IOPS -

  • With 5K connections - 8X faster
  • With 10K tables - 11X faster
  • With 1TB DB size - 21X faster
  • TPC-C benchmark - 136X faster with 800GB DB size

Design

Aurora deviates from a traditional storage system by using distributed, heavily replicated storage based on redo log records. This makes the design much simpler, and dramatically reduces the frequency and size of the IO needed to perform the same tasks as MySQL. This is true even after replicating the data six times (2x across 3 AZs). The reduction in IO comes from not having to deal with many buffers and many logs synchronously, as MySQL does. On average, there is 7-9 times less traffic than with MySQL. Another optimization is around Asynchronous Group Commits, which replace disk IO with network IO. The threading model is based on an adaptive thread pool which can gracefully handle 5000+ concurrent connections on an r3.8xlarge instance. The locking model allows for concurrent access to the lock chains, unlike the single lock in MySQL.

A good overview can be found here.

High Availability

  • 6-way replication - 2 replicas in each of 3 Availability Zones
  • 4 out of 6 write quorum for durable writes
  • 3 out of 6 read quorum for durable reads
  • Peer to peer replication for repairs
  • Volume striped across hundreds of storage nodes
  • Up to 15 promotable read replicas, which can be distributed across 3 AZs
  • Automatic monitoring of the master, and one of the replicas is promoted on failure
  • Read replica endpoints can load balance read traffic across the read replicas
  • Replication protocol does not use binlogs, but uses redo log streams - this means the replication lag is in milliseconds vs. seconds (or minutes) for MySQL
  • Automatic failover takes 15-30 seconds. The heartbeat is checked every second, and Aurora waits for 5 beats to fail before failing over. Once a failover is triggered, promotion involves DNS propagation (30s) and DB recovery (5-10s) compared to several minutes for MySQL
  • Cross-region read replicas can be used for faster DR and enhanced data locality
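The quorum numbers in the list above can be checked with a little arithmetic. The sketch below is a simplified model, not Aurora's actual implementation: it only counts healthy copies and compares them against the 4-of-6 write quorum and 3-of-6 read quorum. The overlap checks use the standard quorum conditions (read and write quorums must intersect, and two write quorums must intersect), which this post does not state explicitly. All names are illustrative.

```python
COPIES_PER_AZ = 2
AZ_COUNT = 3
TOTAL_COPIES = COPIES_PER_AZ * AZ_COUNT  # 6 copies in total
WRITE_QUORUM = 4
READ_QUORUM = 3


def can_write(healthy_copies: int) -> bool:
    """Writes need acknowledgements from at least 4 of the 6 copies."""
    return healthy_copies >= WRITE_QUORUM


def can_read(healthy_copies: int) -> bool:
    """Quorum reads need responses from at least 3 of the 6 copies."""
    return healthy_copies >= READ_QUORUM


# Standard quorum conditions (assumed, not quoted from the post):
# every read quorum overlaps every write quorum, and any two
# write quorums overlap each other.
quorums_overlap = (
    WRITE_QUORUM + READ_QUORUM > TOTAL_COPIES
    and 2 * WRITE_QUORUM > TOTAL_COPIES
)

# Scenario 1: a whole AZ is lost (2 copies gone).
after_az_loss = TOTAL_COPIES - COPIES_PER_AZ
# Scenario 2: a whole AZ plus one more copy is lost.
after_az_plus_one = after_az_loss - 1

print("quorums overlap:", quorums_overlap)
print("AZ lost      -> write:", can_write(after_az_loss),
      "read:", can_read(after_az_loss))
print("AZ + 1 lost  -> write:", can_write(after_az_plus_one),
      "read:", can_read(after_az_plus_one))
```

Under this model, losing an entire Availability Zone still leaves 4 copies, enough for both reads and writes; losing an AZ plus one more copy leaves 3, enough for reads but not for writes.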

Ease of use

  • Automated storage management - up to 64TB storage volume, which starts at 10GB and grows automatically in increments of 10GB
  • Continuous incremental backups with no performance impact
  • Automatic hotspot management, encryption, mirror repair, re-striping
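To make the storage growth rule concrete, here is a minimal sketch of how the allocated volume size would follow the data size under the description above: start at 10GB, grow in 10GB steps, cap at 64TB. The function name, the 1TB = 1024GB conversion, and the rounding behavior are assumptions for illustration; AWS does the actual allocation internally.

```python
import math

INCREMENT_GB = 10
MAX_VOLUME_GB = 64 * 1024  # 64TB, assuming 1TB = 1024GB


def allocated_gb(used_gb: float) -> int:
    """Model the volume size for a given amount of stored data.

    Hypothetical helper: rounds up to the next 10GB step, never
    below the initial 10GB and never above the 64TB cap.
    """
    if used_gb < 0:
        raise ValueError("used_gb must be non-negative")
    if used_gb > MAX_VOLUME_GB:
        raise ValueError("data exceeds the 64TB volume limit")
    steps = max(1, math.ceil(used_gb / INCREMENT_GB))
    return min(steps * INCREMENT_GB, MAX_VOLUME_GB)


for size in (0, 10, 10.5, 25):
    print(f"{size}GB used -> {allocated_gb(size)}GB allocated")
```

The point of the model is that there is nothing to provision up front: the volume simply tracks the data in small steps.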

Security and compliance

  • Aurora uses KMS integration for encryption at rest for the storage volume
  • All the replication and network communication is over SSL
  • Industry Standards Compliant - SOC, ISO, PCI/DSS, HIPAA

Monitoring

More than 50 system and operating system level metrics are captured at 1-60s granularity. These metrics can be sent to CloudWatch Logs.

Use Cases

  • Consolidate multiple MySQL shards into one large Aurora instance
  • NoSQL workloads - massively concurrent event stores
  • Near real-time analytics and reporting with no-lag read replicas
  • Event driven data pipelines via Lambda

Conclusion

While these notes do not cover recent developments with Aurora, like PostgreSQL support, I hope they convince you to take Aurora for a spin. There is a lot of information available in the official documentation, as well as in other resources like the AWS Open Guide.

References
