My CSAA from 2016 had expired, and I was in Vegas to attend re:Invent 2018. I took this opportunity to recertify the credential. This is a newer version of the exam, which made it exciting, but at the same time there were a lot of services that I had not really used, so I had to go through the FAQs and documentation for those, along with the excellent course.

It took me about 4 days of serious prep, and I scored 927/1000. Working with AWS professionally for 6 years helped me in a lot of areas, particularly the Well-Architected Framework. Also, I loved my exam experience at re:Invent, compared to the typical exam centers we go to. It was far more relaxing, and I scored quite a bit of swag.

Here are the notes that I took while preparing for the exam.


Exam Blueprint


  • Domain 1: Design Resilient Architectures
    • 1.1 Choose reliable/resilient storage.
    • 1.2 Determine how to design decoupling mechanisms using AWS services.
    • 1.3 Determine how to design a multi-tier architecture solution.
    • 1.4 Determine how to design high availability and/or fault tolerant architectures.
  • Domain 2: Define Performant Architectures
    • 2.1 Choose performant storage and databases.
    • 2.2 Apply caching to improve performance.
    • 2.3 Design solutions for elasticity and scalability.
  • Domain 3: Specify Secure Applications and Architectures
    • 3.1 Determine how to secure application tiers.
    • 3.2 Determine how to secure data.
    • 3.3 Define the networking infrastructure for a single VPC application.
  • Domain 4: Design Cost-Optimized Architectures
    • 4.1 Determine how to design cost-optimized storage.
    • 4.2 Determine how to design cost-optimized compute.
  • Domain 5: Define Operationally-Excellent Architectures
    • 5.1 Choose design features in solutions that enable operational excellence.


  • 130 minutes for 65 questions, score 100-1000, pass score 720, $150


AWS Global Infrastructure

  • Regions are physical locations spread out globally (18 as of 8/6/2018)
  • A region consists of at least two AZs or Availability Zones, which are data centers that are isolated from each other. This is to ensure high availability and failure isolation (55 as of 8/6/2018)
  • Edge Locations are where CloudFront, Amazon’s CDN (Content Delivery Network) caches content for geo-distribution. There are way more edge locations than regions or AZs. (96 as of 8/6/2018)


Compute

  • EC2 (Elastic Compute Cloud) - The VMs (can also be bare metal)
  • ECS (Elastic Container Service, formerly EC2 Container Service) - to run and manage Docker containers
  • Elastic Beanstalk - Upload the code and it provisions the load balancers, EC2s, Security Groups etc.
  • Lambda - FaaS, Serverless Platform
  • Lightsail - Amazon’s Website Hosting Service (Virtual Private Server). Get SSH access and DB access with a static (fixed) IP.
  • Batch - Used for Batch Computing where Batch Jobs are defined as Docker containers


Storage

  • S3 (Simple Storage Service) - Object based storage service
  • EFS (Elastic File System) - Network Attached Storage or NAS that can be mounted to multiple EC2s.
  • Glacier - Data archival, cold storage. High cost (and time) to retrieve, low cost to store.
  • Snowball - To import large amount (petabytes) of data into AWS. Looks like a suitcase.
  • Storage Gateway - Virtual appliances that are hosted on-prem that transfer (replicate) data to AWS


Databases

  • RDS (Relational Database Service) for MySQL, MSSQL, Oracle, Postgres, Aurora, MariaDB
  • DynamoDB - Non-relational, NoSQL database service
  • ElastiCache - Cache Service supporting Memcached and Redis
  • Redshift - Data Warehouse/OLAP


Migration

  • AWS Migration Hub
  • Application Discovery Service
  • Database Migration Service (DMS) - On-prem database to RDS migration
  • Server Migration Service - VM/Physical to EC2 for Lift and Shift type of migrations
  • Snowball - Migrate large amount of data into AWS (petabyte scale)

Networking and Content Delivery

  • VPC (Virtual Private Cloud) - A virtual datacenter
  • CloudFront - Amazon’s CDN
  • Route53 - Amazon’s DNS Service
  • API Gateway - Enables exposing services as APIs
  • Direct Connect - A dedicated line from on-prem to AWS VPC

Developer Tools

  • CodeStar - Project Managing/Collaborating the code toolchain
  • CodeCommit - Version Controlled Code Repository (like GitHub)
  • CodeBuild - Code Builder (like Jenkins)
  • CodeDeploy - Deployment Service to deploy artifacts
  • CodePipeline - Continuous Delivery Service to model, visualize, and automate the release steps
  • X-Ray - To debug, trace and troubleshoot performance bottlenecks, etc.
  • Cloud9 - in-browser IDE, mostly used to code Lambda functions in-line

Management Tools

  • CloudWatch - Monitoring (and alerting) Service
  • CloudFormation - IaaC, Infrastructure as Code. The artifacts are called Templates.
  • CloudTrail - Audit log of changes to the AWS environment; by default, the event history keeps API activity for 90 days
  • Config - Monitors the configuration and gives a visual representation of the changes, and can go back in time. Like Time-machine for your AWS.
  • OpsWorks - Managed Chef and Puppet Service (Configuration Management)
  • Service Catalog - Manage a catalog of approved services for the AWS account. Used by enterprises
  • Systems Manager - Interface for managing AWS resources like patch management for EC2s, categorize AWS resources
  • Trusted Advisor - Advice around security, and saving $$$
  • Managed Services

Media Services

  • Elastic Transcoder - Think of it as Managed ffmpeg
  • MediaConvert - File based media converter, used for VODs
  • MediaLive - Broadcast live video streams
  • MediaPackage - Prepares and protects video
  • MediaStore - Storage Service optimized for VOD and Live Video
  • MediaTailor - Targeted advertising into video streams

Machine Learning

  • SageMaker - Deep learning algorithms
  • Comprehend - Sentiment Analysis of data
  • DeepLens - A camera that runs deep learning algorithms on the device
  • Lex - AI based interaction service
  • Machine Learning - Intelligence out of data, recommendation systems
  • Polly - Text to Speech, highly customizable
  • Rekognition - Analyze video and/or images
  • Translate - Translate languages from one to another
  • Transcribe - Subtitles from video, speech to text (Opposite of Polly)


Analytics

  • Athena - Run SQL on S3 bucket data like CSV or spreadsheets, serverless
  • EMR (Elastic Map Reduce) - Big data service
  • CloudSearch
  • Elasticsearch Service - Managed Elasticsearch Cluster
  • Kinesis - To ingest and process streaming data
  • Kinesis Video Streams - To ingest and process streaming video for analytics
  • QuickSight - BI Tool to analyze and visualize data
  • Data Pipeline - Move data between different AWS services
  • Glue - ETL

Security, Identity and Compliance

  • IAM (Identity and Access Management)
  • Cognito - Managed Authentication Service, also supports federated logins, gives temporary access to AWS
  • GuardDuty - Monitors for malicious activity in the AWS account
  • Inspector - An agent installed on the EC2 instances that generates a vulnerability report
  • Macie - Scans S3 buckets for PII and secrets
  • Certificate Manager - Manage SSL certificates
  • CloudHSM - Hardware Security Module as a Service, $1.20 an hour
  • Directory Service - Integrate Microsoft Active Directory with AWS
  • WAF - Layer 7 (Application Layer) firewall
  • Shield - DDoS mitigation, free for CloudFront, ALBs, R53. Shield Advanced gives you a dedicated team (3K a month) 24x7 to help out.
  • Artifact - Audit and Compliance, lets you download security and compliance reports from AWS

Mobile Services

  • Mobile Hub - Management Console for Mobile Apps to manage the AWS services that are a backend for the mobile app
  • Pinpoint - Targeted push notifications to drive mobile engagement, like FourSquare
  • AppSync - Automatically update data in the mobile apps including offline updates
  • Device Farm - A test farm to test the app on live devices running in AWS
  • Mobile Analytics - Analytics Service for Mobile


AR and VR

  • Sumerian - for Augmented Reality/Virtual Reality, in preview

Application Integration

  • Step Functions - to manage Lambda workflows
  • Amazon MQ - Managed ActiveMQ service
  • SQS (Simple Queue Service) - Pull, used to build decoupled architectures
  • SNS (Simple Notification Service) - Push notifications
  • SWF (Simple Workflow Service) - something Amazon uses on their website to manage the workflows needed to ship orders

Customer Engagement

  • Connect - Call Center in the Cloud
  • SES (Simple Email Service) - Highly Scalable Email Service

Business Productivity

  • Alexa for Business - Helpdesk type of service
  • Chime - Video Conferencing
  • WorkDocs - Dropbox for AWS
  • WorkMail - Outlook for AWS

Desktop and App Streaming

  • Workspaces - VDI in the cloud, Desktop as a Service
  • AppStream 2.0 - Stream the application that is running in the cloud (compared to desktop or VDI)

Internet of Things

  • IoT - Scalable service to handle IoT device data coming in from sensors etc.
  • IoT Device Management - Manage connected devices at scale
  • FreeRTOS - Operating System for Microcontrollers
  • Greengrass - Allows running compute, caching and messaging locally on the connected device in a secure manner (in an offline mode)

Game Development

  • GameLift - A service to develop games in AWS


Support

  • 4 Plans - Basic (free, included by default), Developer ($29/month), Business ($100/month minimum, 10% of the bill) and Enterprise ($15k/month)

IAM - Identity and Access Management

  • Centralized control of AWS account
  • Users - end users who need access, either to the console via a password, or programmatic access to AWS resources via access keys, or both
  • Groups - collection of users who share permissions
  • Role - permissions assigned to AWS resources like EC2, ECS, Lambdas etc. Roles are a secure way to grant permissions to entities that you trust. Roles are preferred over credentials (access key and secret key). A change in the role’s policy takes effect immediately. The roles can be attached to or detached from a running EC2 instance.
  • Policies - permissions defined in a policy document in JSON
  • A policy document contains a statement, which has a collection of Effect, Action and Resource attributes. The Action and Resource can themselves be collections. These can be associated with users, groups and roles.
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": ["..."],
            "Resource": "*"
          }
        ]
      }
  • IAM is global, not tied to a region
  • We can create an alias for the account, which then replaces the account ID in the default sign-in link.
  • Always activate MFA on the root account
  • A root account is the email that is used to setup the account, and it has complete admin access by default
  • Create users in the account and assign them permissions instead of sharing root user credentials. New users have no permissions when created.
  • PowerUserAccess is AdministratorAccess minus IAM management.
  • Create groups to hold permissions and put users in those groups instead of attaching permissions to users individually
  • For programmatic access (via CLI, SDK and APIs), users get an access key ID and secret access key. For logging into the console, users get a user name and a password
  • A user can belong to multiple groups
  • Permissions (IAM Policies) can also be attached to the users, roles and groups directly
  • There can be a maximum of two pairs of key/secret active at any given time, this is to facilitate access key rotation
  • Password policies can be set in IAM like rotation, complexity, expiration, etc.
  • Roles can be assigned to cross-account IAM users, code running on AWS in an EC2, ECS, Lambda, etc., AWS Service, or Federated Identity Users
  • Web Identity Federation allows users access to AWS resources after they have authenticated with a web based identity provider like Facebook or Google. The auth code received after this authentication is exchanged for temporary AWS credentials.
  • Amazon Cognito acts as an identity broker between the web identity providers and the application.
  • Cognito User Pools are directories used to manage signup and signin for mobile and web applications.
  • Successful authentication with a User Pool generates a number of JWTs
  • Cognito Identity Pools allow creation of unique identities for the users and authenticate them with identity providers. These identities are used to obtain short lived limited privilege credentials to access other AWS services.
  • Example flow: the user signs into a User Pool using Google credentials. This results in JWTs. The Identity Pool exchanges the JWTs for AWS credentials, and these credentials are used by the user to access AWS resources.
  • 3 kinds of IAM policies
    • AWS Managed Policies are created and administered by AWS, like AmazonS3FullAccess, AmazonDynamoDBFullAccess. They cannot be modified.
    • Customer Managed Policies are created and managed by the customers within their account.
    • Inline Policies are policies that can be embedded in the user, role or group directly. They cannot be shared between entities. They are deleted if the user, role, or group they’re embedded in is deleted.
  • The AWS sign-in endpoint for SAML is
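
The policy document structure described in the list above (a Statement holding Effect, Action and Resource) can be sketched in Python using only the standard library. This is a minimal illustration, not an official helper: the function name `make_policy`, the S3 actions, and the bucket ARN `example-bucket` are assumptions made up for this example.

```python
import json


def make_policy(actions, resources, effect="Allow"):
    """Build an IAM-style policy document as a plain dict.

    Action and Resource may each be a single string or a list of strings.
    """
    if effect not in ("Allow", "Deny"):
        raise ValueError("effect must be 'Allow' or 'Deny'")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": effect,
                "Action": actions,
                "Resource": resources,
            }
        ],
    }


# Hypothetical example: read-only access to a single bucket.
policy = make_policy(
    actions=["s3:GetObject", "s3:ListBucket"],
    resources=["arn:aws:s3:::example-bucket", "arn:aws:s3:::example-bucket/*"],
)
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document that would be attached to a user, group or role as a customer managed or inline policy.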

S3 - Simple Storage Service

  • Object based storage (vs. block based like EBS) which is highly available and durable.
  • Data is stored across multiple facilities and devices.
  • 100 buckets per account by default (soft limit)
  • Can store files, but cannot run a database or operating system off of it
  • Objects can be 0 bytes to 5TB in size
  • We can upload up to 5GB in single operation, for larger objects, use multipart upload API
  • Unlimited Storage
  • Successful uploads return an HTTP 200 response code via API or CLI
  • Files are stored in buckets whose names are globally unique (they have a DNS tied to them). Buckets can have folders in them. These folders do not need to be unique names except within the bucket.
  • By default all buckets are private.
  • The URL looks like https://s3.{region}{bucketname} OR https://s3-{region} For example,
  • Read after Write consistency for new Object PUTs. You can write and immediately read the content.
  • Eventual Consistency for overwrite PUTs and DELETEs. You can overwrite or delete, but if you read immediately, you may not get the current state. This can take minutes to hours.
  • S3 Object consists of a Key which is the name of the object (file name like foo.txt), value which is the data (byte[]), Version ID, metadata (tags), Sub-resources like ACL, Torrents
  • S3 has 99.99% Availability, and 99.999999999% (11 9s) durability.
  • There are different storage classes or tiers -
    • STANDARD - 99.99% availability, 11 9s durability, designed to sustain the loss of 2 facilities (AZs) concurrently. Stored in >= 3 AZs. No retrieval fee.
    • STANDARD_IA or S3_IA - Infrequent Access, lower cost than S3 Standard, but charged a retrieval fee. Stored in >= 3 AZs. 99.9% Availability.
    • ONEZONE_IA - Infrequently Accessed but stored in only 1 AZ, lower cost than S3 IA. 99.5% Availability SLA. No resilience of data as it is stored in only one AZ.
    • GLACIER - Very cheap but only for archival. 3 retrieval modes - Expedited (1-5 mins, 0.03/GB), Standard (3-5 hours, 0.01/GB), or Bulk (5-12 hours, 0.0025/GB). No availability SLA.
    • RRS - Reduced Redundancy Storage, 99.99% durability, not offered as an option anymore.
  • Availability is 99.99% for Standard, and 99.9% for S3 IA and 99.5 for ONEZONE_IA.
  • You pay for storage, number of requests, storage management (inventory, tags), data transfer out (including cross region replication), and transfer acceleration (optional).
  • S3 Transfer Acceleration utilizes an edge location so users can perform faster uploads. The users can upload to an edge location, and from there on the data is uploaded to S3 using AWS backbone.
  • S3 objects can have lifecycles
  • S3 objects can be versioned, encrypted (SSE), locked down via ACLs and Bucket Policies
  • S3 is a global service in the AWS Console, just like IAM. However, the bucket URL does have the region name in it, and a bucket is tied to a region.
  • Bucket policies are written as JSON documents.
  • S3 buckets can have access logs, which can be stored in a different bucket, and can also be in a different AWS account.
  • Storage classes can be picked at the object level.
  • Bucket policies apply at the bucket level, while ACLs apply at the individual object level.
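
To get a feel for the availability and durability percentages quoted above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope calculation under simple assumptions (a 365-day year, independent yearly loss probability per object); the function names are made up for this example.

```python
from fractions import Fraction

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_percent):
    """Hours per year a service may be unavailable at the given availability."""
    unavailable = 1 - Fraction(str(availability_percent)) / 100
    return float(unavailable * HOURS_PER_YEAR)


def expected_objects_lost_per_year(object_count, nines=11):
    """Expected yearly object loss for a durability made of `nines` nines."""
    loss_probability = Fraction(1, 10 ** nines)
    return float(object_count * loss_probability)


for storage_class, availability in [
    ("STANDARD", "99.99"),
    ("STANDARD_IA", "99.9"),
    ("ONEZONE_IA", "99.5"),
]:
    print(storage_class, max_downtime_hours(availability), "hours/year")

print(expected_objects_lost_per_year(10_000_000), "objects/year")
```

Under these assumptions, 99.99% availability allows about 0.876 hours (roughly 53 minutes) of downtime a year, and 11 9s of durability on 10 million objects works out to an expected loss of 0.0001 objects a year, or one object every 10,000 years.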

S3 Encryption

  • S3 buckets can be encrypted Server Side via AES-256 (SSE-S3 using S3 Managed Keys) or AWS-KMS (SSE-KMS with KMS managed keys) or SSE-C with customer provided keys. This can be set at bucket, folder, and object level.
  • Encryption in transit uses SSL/TLS
  • Encryption at rest can be server side - S3 Managed (SSE-S3), SSE-KMS, or SSE-C, or client side
  • The header to specify encryption during PUT is x-amz-server-side-encryption with possible values as AES256 for SSE-S3 or aws:kms for SSE-KMS.
  • A bucket policy can be created to reject any request that does not have the x-amz-server-side-encryption header, thereby always enforcing encryption. This is done by adding a Condition in the bucket policy with StringNotEquals for Key x-amz-server-side-encryption and Value as aws:kms or AES256 for Action as s3:PutObject.
  • Each object gets a link which looks like this Note that there is no region in there.
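
The bucket policy that enforces encryption, as described in the list above, could be sketched as a Python dictionary like the one below. This is a hedged illustration: the function name `deny_unencrypted_puts` and the bucket name `example-bucket` are assumptions, and the sketch requires a single encryption value rather than accepting either one.

```python
import json


def deny_unencrypted_puts(bucket_name, required_algorithm="aws:kms"):
    """Return a bucket policy that denies PutObject requests which do not
    carry the expected x-amz-server-side-encryption header value."""
    if required_algorithm not in ("AES256", "aws:kms"):
        raise ValueError("required_algorithm must be 'AES256' or 'aws:kms'")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": required_algorithm
                    }
                },
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
encryption_policy = deny_unencrypted_puts("example-bucket", "AES256")
print(json.dumps(encryption_policy, indent=2))
```

Because the statement is a Deny, it overrides any Allow, so uploads without the header are rejected regardless of who makes them.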

S3 Versioning

  • Once enabled, versioning cannot be disabled, only suspended.
  • Buckets can be versioned, however, each version is stored in full (no incrementals or deltas). Hence, each version of the object adds to the storage cost.
  • Deleted objects are also stored as versions.
  • Delete Markers - A deleted object has an associated delete marker. If the delete marker is deleted, the object is restored to its previous version.
  • MFA Delete is an added layer of security in S3 buckets, where a second factor auth is required to delete an object version, or to change versioning state.
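
The delete marker behavior described above can be illustrated with a toy in-memory model. This is not the S3 API, just a sketch of the idea; the class `VersionedBucket` and its method names are made up for this example.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not the real S3 API)."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), oldest first
        self._ids = itertools.count(1)

    def _push(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def put(self, key, body):
        """Every PUT stores a complete new version."""
        return self._push(key, body)

    def delete(self, key):
        """A plain delete only adds a delete marker on top of the versions."""
        return self._push(key, self.DELETE_MARKER)

    def delete_version(self, key, version_id):
        """Permanently remove one specific version (or a delete marker)."""
        self._versions[key] = [
            v for v in self._versions.get(key, []) if v[0] != version_id
        ]

    def get(self, key):
        """Return the current body, or None if missing or delete-marked."""
        versions = self._versions.get(key, [])
        if not versions or versions[-1][1] is self.DELETE_MARKER:
            return None
        return versions[-1][1]

    def version_count(self, key):
        return len(self._versions.get(key, []))


bucket = VersionedBucket()
bucket.put("foo.txt", "one")
bucket.put("foo.txt", "two")
marker_id = bucket.delete("foo.txt")
hidden = bucket.get("foo.txt")               # None: the delete marker hides it
bucket.delete_version("foo.txt", marker_id)  # removing the marker restores it
restored = bucket.get("foo.txt")
```

In this model, both full versions still count toward storage even while the delete marker hides the object, which mirrors the storage cost point made above.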

Cross Region Replication

  • Versioning needs to be turned on for both buckets (in either region)
  • Destination bucket can be in the current account, or in a different account
  • Storage class of the destination bucket can be changed (to be different than that of the source bucket)
  • A cross region replication role gets created when configuring replication from the console.
  • Only new objects will be replicated, not the existing ones. Use CLI or manual steps to copy the existing objects over.
  • Important: if a delete marker is deleted, that deletion is not replicated. This is for security, to prevent someone from deleting stuff in the primary bucket and having it reflect in the replicated bucket.
  • The permissions are also replicated from source to destination bucket
  • If the versions are reverted in the source bucket, they do not replicate to the destination bucket
  • No daisy chaining of replication buckets
  • 1 bucket can be replicated to only 1 other bucket (no 1 to many replication)

Lifecycle Management

  • This is how S3 manages objects (mostly moving objects between various storage classes or expiring them based on time)
  • Can be configured on previous versions or current version
  • Can be used without versioning
  • Transition to IA after 30 days of creation, transition to Glacier 30 days after IA, can be expired and permanently deleted as well.
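
The transition timeline above can be sketched as a small function that maps an object's age to its storage class. The 30 and 60 day thresholds follow the notes (30 days to IA, another 30 days to Glacier); the 365 day expiration and the function name are assumptions chosen only for illustration.

```python
def storage_class_for_age(age_days, to_ia_after=30, to_glacier_after=60,
                          expire_after=365):
    """Return the storage class an object would be in at the given age.

    Returns None once the object has expired. `expire_after=None` means
    the object never expires.
    """
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if expire_after is not None and age_days >= expire_after:
        return None
    if age_days >= to_glacier_after:
        return "GLACIER"
    if age_days >= to_ia_after:
        return "STANDARD_IA"
    return "STANDARD"


for age in (0, 30, 60, 365):
    print(age, storage_class_for_age(age))
```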

Static Website Hosting

  • Serverless! But it is 100% static (like a blog).
  • Use a bucket policy to make the entire bucket public-read.
  • When selecting to host a website, it asks for an index page and an error page.
  • The URL is different than the S3 bucket URL - it looks like http://lobster1234-94568.s3-website.{region}
  • The bucket name comes first, while in the S3 URL, the bucket name is the path.
  • Pre-signed URLs can be inserted in a page to share private content/user specific content.

  • Read S3 FAQs before the exam


  • Cross Origin Resource Sharing
  • CORS is a way for pages from one S3 bucket to access contents from another S3 bucket.
  • CORS needs to be enabled from the bucket that is being accessed, to allow the URL of the bucket that is accessing it.
  • To do so, click on CORS tab in the bucket being accessed, enter the HTTPS URL of the bucket that is trying to access under Access-Control-Allow-Origin which will say * in the template.
  • Use the published URL of the bucket as origin, not the S3 URL (use the one with website in the name).
  • Example CORS configuration would look like this, where {originbucketname} is trying to access contents of this bucket.
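
As a rough sketch of such a configuration, the rule can be written as a Python dictionary that mirrors the CORSRules, AllowedOrigins, AllowedMethods, AllowedHeaders and MaxAgeSeconds elements of an S3 CORS configuration. The origin `http://origin-site.example.com` is a hypothetical placeholder for the website URL of the bucket doing the accessing, and the matching function is a simplification (exact match or `*` only) written for this example.

```python
def make_cors_configuration(allowed_origin):
    """Build a CORS configuration that lets one origin read from the bucket."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": [allowed_origin],
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    }


def is_request_allowed(cors_configuration, origin, method):
    """Simplified check: does any rule allow this origin and method?"""
    for rule in cors_configuration["CORSRules"]:
        origins = rule["AllowedOrigins"]
        origin_ok = "*" in origins or origin in origins
        if origin_ok and method in rule["AllowedMethods"]:
            return True
    return False


# Hypothetical origin, standing in for the accessing bucket's website URL.
cors = make_cors_configuration("http://origin-site.example.com")
print(is_request_allowed(cors, "http://origin-site.example.com", "GET"))
print(is_request_allowed(cors, "http://other-site.example.com", "GET"))
```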

Storage Gateway

  • It is a service that allows an on-prem VM to S3 to migrate/replicate data
  • This VM can run with VMware ESXi or Microsoft Hyper-V
  • 4 Types
    • They can all use DX or S2S VPN or Internet
    • File Gateway (uses NFS, only for file based) New Files are stored in S3 as objects. Think of this as an NFS interface to S3. Nothing is stored on-prem.
    • Volume Gateway (block based, iSCSI, virtual hard disks where the AWS side is EBS)
      • Stored Volume - Entire copy is stored on-prem with async backups (incremental) to EBS as EBS snapshots. 1GB to 16TB storage limits.
      • Cached Volume - Only the most recently read data is kept on prem (vs. everything like Stored Volume Gateway). The full data set is stored in S3. 1GB to 32TB storage limits.
    • VTL (Virtual Tape Library) or a Tape Gateway. Works with popular tape backup software to act as virtual tapes.


Snowball

  • Import-Export Disk’s successor
  • Puts data in S3 (and pulls from it if we want data exported out of AWS)
  • Called a suitcase at Marqeta :)
  • Petabyte Scale import/export appliance, up to 80TB.
  • Secure physically + encrypted (AES 256). Once the data is transferred, the device goes through a complete wipe.
  • Snowball Edge is up to 100 TB and also has on-device compute capability. For example, the suitcase can run code to pull data in and store it.
  • Snowmobile is a truck, Exabyte scale data transfer. 100 PB storage limit.


  • This is confusing as hell, so there you go :
    • S3 Bucket Acceleration URL - {bucketname} no region
    • S3 Object URL -{bucketname}/{key} no region
    • S3 Bucket URL - https://s3-{region}{bucketname}
    • S3 Cloudfront Origin URL - no region, no protocol
    • S3 Static Website URL (note it is NOT https)- http://{bucketname}.s3-website.{region} for the rest.

S3 Pricing

  • S3 Standard is 0.023 per GB
  • S3 IA is 0.0124 per GB
  • S3 OneZone-IA is 0.01 per GB
  • S3 RRS is 0.024 per GB (almost same as S3 Standard)
  • Glacier is 0.004 per GB
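
To compare the classes, the prices above can be plugged into a small calculation. This sketch assumes the figures are USD per GB per month (the notes do not state the unit), and it ignores request, retrieval and transfer charges; the function name is made up for this example.

```python
from decimal import Decimal

# Prices as listed in the notes, assumed to be USD per GB per month.
PRICE_PER_GB = {
    "STANDARD": Decimal("0.023"),
    "STANDARD_IA": Decimal("0.0124"),
    "ONEZONE_IA": Decimal("0.01"),
    "RRS": Decimal("0.024"),
    "GLACIER": Decimal("0.004"),
}


def monthly_storage_cost(gigabytes, storage_class):
    """Storage-only cost for one month; Decimal avoids float rounding."""
    return Decimal(str(gigabytes)) * PRICE_PER_GB[storage_class]


for storage_class in PRICE_PER_GB:
    print(storage_class, monthly_storage_cost(1000, storage_class))
```

With these numbers, storing 1000 GB costs 23 in Standard versus 4 in Glacier, before any retrieval fees.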


CloudFront

  • Amazon’s CDN (Content Delivery Network)
  • Edge locations are the ones where content is cached, which is not same as AZ or region.
  • The origin can be S3 bucket, EC2 instance, ELB, or Route53 address. It can also be a non-AWS origin (like a server in a data center).
  • Distributions
    • Web Distribution is for websites
    • RTMP (Real Time Messaging Protocol) is used for Media Streaming
  • A distribution is identified by a domain
  • Edge Locations can be written to (PUT), this is used in S3 accelerated transfers.
  • Objects are cached at the edge location for a TTL (Max: 365 days, Default: 24 hours)
  • It costs to invalidate the cache on an object basis (if we want to do it before the TTL expires)
  • For performance of GET intensive workloads, use CloudFront
  • For performance of mixed workloads, hash the S3 key by adding a random prefix to the key name. By doing so, there is no IO Contention on the same partition.
  • S3 origins look like (notice no region in the URL).
  • An Origin Access Identity is set up in the distribution which gives access to Cloudfront to read from the origin S3 bucket. The bucket policy is updated on the origin bucket to allow access to this identity.
  • CloudFront uses signed URLs and signed cookies to restrict access to content (similar to pre-signed URLs in S3)
  • The CloudFront distribution URL gets a domain name Please note that foobar is not the distribution ID.
  • Cloudfront allows geo whitelisting OR blacklisting of countries. Cannot do both, has to be either.
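
The key-prefix idea mentioned above for mixed workloads can be sketched as follows. Instead of a truly random prefix, this example derives the prefix from a hash of the key, so it spreads keys in a similar way but can be recomputed later when reading the object. The function name, the SHA-256 choice and the 4-character prefix length are assumptions made for illustration.

```python
import hashlib


def prefixed_key(key, prefix_length=4):
    """Prepend a short hash-derived prefix to spread keys across partitions."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}-{key}"


for name in ("2018-11-26/log-0001.txt", "2018-11-26/log-0002.txt"):
    print(prefixed_key(name))
```

Keys that would otherwise share a long common prefix (like a date) end up with different leading characters, while the same key always maps to the same prefixed name.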

EC2 - Elastic Compute Cloud

  • Elastic Compute Capacity in the cloud, pay for the capacity that you use.
  • Instance Allocation
    • On Demand : Pay by hour for Windows, by second for Linux. No commitment. Great for unpredictable workloads which cannot be interrupted.
    • Reserved : Like a phone contract, 1 or 3 years, pay no, partial, or full upfront. Up to 75% off on-demand, with 3 years all upfront getting the most savings. Great for predictable, steady state workloads. Comes as Standard, Convertible, or Scheduled. They’re tied to a region.
    • Spot : The cheapest option. Bid the price you are willing to pay, but use it only if the process can be interrupted. AWS will terminate the instance if the spot price goes higher than the bid. You will not be charged for the hour in which AWS terminates the instance. If you terminate the instance, you pay for the full hour. Used when you have flexible start/end times.
    • Dedicated : Non multi-tenant, Bare Metal, used for Regulatory Requirements like Healthcare. Also when software licenses are tied to a host.
  • Instance Types
    • F1 : Genomic research, financial analysis, video processing
    • I3 : Storage Optimized, DBs, DW
    • G3 : Graphics, video encoding
    • H1 : High Disk Throughput, Map Reduce, HDFS
    • T2 : Low Cost General Purpose
    • D2 : Dense Storage
    • R4 : Memory Optimized
    • M5 : General Purpose
    • C5 : Compute Optimized
    • P3 : Graphics, GPU
    • X1 : Memory Optimized
    • To Remember: F (FPGA) I (IOPS) G (Graphics) H (High Disk Throughput) T (cheap GP) D (Density) R (RAM Memory) M (GP) C (Compute) P (Graphics) X (Extreme Memory) = FIGHTDRMCPX
  • EC2 User Data is used to add bootstrap scripts to the instance. It always starts with a shebang (#!/bin/bash).
  • To log in to the instance, use ssh -i <path_to_pem> ec2-user@<IP address>. The PEM file should have 400 permissions so that only the owner can read it.
  • We can encrypt the root volume (where the OS is installed) using OS level encryption like Windows BitLocker.
  • Another way to encrypt the root volume is to snap it, encrypt the snap, create an AMI from this snap and use this to launch the EC2.
  • To retrieve instance metadata or userdata, the endpoint used is
  • Instance User-Data :
    [root@ip-172-31-56-227 log]# curl
    sudo yum update -y
  • Instance Meta-Data :
    [ec2-user@ip-172-31-51-163 ~]$ curl
    [ec2-user@ip-172-31-51-163 ~]$ curl
  • EC2 instance roles are created in IAM which eliminate the need for using security credentials (aws access key and secret) to access AWS services.
  • The role can be changed on a running instance, and the change is effective immediately (just like security groups).
  • Xen and Nitro are the underlying hypervisors for EC2
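
The FIGHTDRMCPX mnemonic from the instance types list can be turned into a small lookup, which may help when drilling the families. The descriptions are taken from the notes; the function name and the parsing rule (first letter followed by a generation digit) are assumptions made for this sketch.

```python
import re

# Family letter -> purpose, in mnemonic order (FIGHTDRMCPX).
INSTANCE_FAMILIES = {
    "F": "FPGA",
    "I": "IOPS (storage optimized)",
    "G": "Graphics",
    "H": "High disk throughput",
    "T": "Low cost general purpose",
    "D": "Dense storage",
    "R": "RAM (memory optimized)",
    "M": "General purpose",
    "C": "Compute optimized",
    "P": "Graphics, GPU",
    "X": "Extreme memory",
}


def describe_instance_type(instance_type):
    """Map an instance type such as 'm5.large' to its family description."""
    match = re.match(r"^([a-z])\d", instance_type.lower())
    if not match:
        raise ValueError(f"unrecognized instance type: {instance_type!r}")
    return INSTANCE_FAMILIES.get(match.group(1).upper(), "unknown family")


print("".join(INSTANCE_FAMILIES))
print(describe_instance_type("m5.large"))
print(describe_instance_type("r4.xlarge"))
```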

    Real world - Make sure to click on “i” on each option of the Launch Instance Wizard steps. Lots of nuggets and gotchas there.

Security Groups

  • A Security Group is a virtual firewall for the EC2 instance, to control traffic to and from the instance. One EC2 can have many security groups associated with it.
  • By default, a security group would allow all outbound traffic to any destination, any protocol.
  • Any change made to a security group is applied immediately (like adding/removing ports, etc.)
  • Security Groups are stateful. When inbound traffic is allowed by a rule, the response traffic is automatically allowed out, regardless of the outbound rules.
  • Security Group rules are only to allow traffic, not to deny. By default all inbound traffic is denied, all outbound traffic is allowed.
  • All VPCs get a default security group. This SG has only 1 rule, where all the instances associated with that SG can talk to each other (source is itself)
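
The allow-only, default-deny behavior of inbound rules described above can be modeled with a few lines of Python. This is a toy evaluation, not how AWS implements it; the rule dictionary layout and the example CIDR ranges are assumptions made for this sketch.

```python
import ipaddress


def is_inbound_allowed(rules, source_ip, port, protocol="tcp"):
    """Return True if any rule allows the traffic; otherwise it is denied.

    There is no way to express a deny rule: traffic that matches no rule
    is simply dropped.
    """
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if (
            rule["protocol"] == protocol
            and rule["from_port"] <= port <= rule["to_port"]
            and address in ipaddress.ip_network(rule["cidr"])
        ):
            return True
    return False


# Hypothetical rules: HTTPS from anywhere, SSH only from inside the VPC.
web_rules = [
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr": "0.0.0.0/0"},
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "10.0.0.0/16"},
]

print(is_inbound_allowed(web_rules, "203.0.113.10", 443))
print(is_inbound_allowed(web_rules, "203.0.113.10", 22))
print(is_inbound_allowed(web_rules, "10.0.1.5", 22))
```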


EBS

  • EBS (Elastic Block Store) is a virtual disk. EBS Volumes can be mounted to an EC2 instance. They belong to 1 availability zone and are replicated across multiple physical disks.
    • GP2 : General Purpose SSD, baseline of 3 IOPS per GB up to a maximum of 10K IOPS, which is reached at 3334 GB and above. Smaller volumes can burst up to 3000 IOPS for extended periods of time.
    • Provisioned IOPS : More than 10K IOPS, can provision up to 20K IOPS per volume
    • ST1 Throughput Optimized HDD : Cannot be boot volumes. DW, Logs are good use cases.
    • SC1 Cold HDD : Cannot be boot volumes. Good for cold storage. Lowest cost.
    • Standard Magnetic : Legacy, can be bootable. cheapest bootable
  • The min volume size for HDD is 500GB
  • EBS volumes for an instance are in the same AZ. You can only mount the EBS volumes in the same AZ as the EC2 instance.
  • EBS volume types and sizes can be changed on the fly, without any downtime. There is a performance hit for a bit.
  • To move EBS volumes across AZs or Regions, use EBS snapshots.
  • Use Snapshot Copy to create a copy in a different region. Use Create Volume to create a new volume in a different AZ.
  • To encrypt an unencrypted EBS volume:
    • Create a snapshot
    • Copy the snapshot and select encryption
    • Create a volume from this encrypted snapshot
  • An EBS volume can only be set up as encrypted at the time it is created.
  • EBS snapshots sit on S3, and are incremental
  • Snapshots of an encrypted EBS volume will always be encrypted.
  • Snapshots can be shared with other accounts and can be made public. Encrypted snapshots cannot be made public, and those encrypted with the default KMS key cannot be shared.
  • Root device types can be EBS backed or instance store backed.
  • Instance Store backed instances cannot be stopped and started, can only be rebooted.
  • Instance stores are ephemeral.
  • An instance can have many EBS volumes attached, but an EBS volume can be attached to only 1 instance at any time. There is no such thing as shared EBS, for that requirement, consider EFS.
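
The GP2 rule of 3 IOPS per GB from the volume types list can be worked through in code. The 10K ceiling comes from the notes; the floor of 100 IOPS is an assumption about the limits at the time, and the function name is made up for this example.

```python
def gp2_baseline_iops(size_gb, iops_per_gb=3, floor=100, ceiling=10_000):
    """Baseline IOPS of a GP2 volume: 3 per GB, clamped to floor and ceiling."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return max(floor, min(ceiling, size_gb * iops_per_gb))


for size in (10, 100, 1000, 3334, 5000):
    print(size, "GB ->", gp2_baseline_iops(size), "IOPS")
```

The clamp shows where the 3334 GB figure comes from: 3 x 3334 is just over 10,000, so any larger volume gets the same baseline.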

Load Balancers

  • Elastic Load Balancers - allows us to balance the load between different servers.
    • Application Load Balancer : Layer 7. They support advanced request routing based on HTTP request characteristics like path, headers, etc.
    • Network Load Balancer : Very High Performance, Layer 4, Most expensive. They support millions of requests per second.
    • Classic Load Balancer : Dumber Layer 7, Legacy. Also supports Layer 4. The only thing supported at Layer 7 is X-Forwarded-For and sticky sessions.
  • ELB responds with HTTP 504 Gateway Timeout when the application does not respond.
  • The DNS names for the load balancers are {LB-name}.{region}
  • The health check statuses for instances behind the LB can be InService or OutOfService.
  • When a health check for an instance fails, the load balancer stops sending traffic to that instance.


CloudWatch

  • Basic monitoring sends metrics every 5 minutes, detailed monitoring can send every 1 minute but that costs extra.
  • Standard EC2 metrics (by default) are CPU usage, disk IO, network IO, CPU credits, Status checks.
  • Metrics like RAM, Disk utilization, swap usage, etc. would need creation of custom metrics.
  • Cloudwatch Alarms have 3 states - INSUFFICIENT_DATA, OK and ALARM
  • CloudWatch Events let you set up rules to trigger actions - like an AWS Batch job completion can trigger a Lambda.
  • Cloudwatch Logs act as a central location for all logs (like lambda system.out, etc.)
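
The three alarm states listed above can be illustrated with a simplified evaluation function. Real alarms support more options (comparison operators, missing data handling, and so on); this sketch assumes a plain "greater than threshold" rule over the most recent datapoints, and the function and parameter names are made up for this example.

```python
def alarm_state(datapoints, threshold, evaluation_periods=1):
    """Return INSUFFICIENT_DATA, OK or ALARM for a list of metric values.

    The alarm fires only if every one of the last `evaluation_periods`
    datapoints is above the threshold.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Hypothetical CPU utilization readings with an 80% threshold.
print(alarm_state([], 80))
print(alarm_state([50, 90], 80))
print(alarm_state([70, 90], 80, evaluation_periods=2))
```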

Auto Scaling Groups

  • An autoscaling group = Launch Configuration + Scaling Policies.
  • Launch Templates are newly announced, but Launch Configurations have been around the longest
  • A Launch Configuration has the AMI, Instance Type, Instance details like IAM role, user data, IP allocation details, as well as storage and security groups.
  • Once a Launch Configuration is created, you can create an AutoScalingGroup. This is where details like VPC, subnets, number of instances, load balancer are entered.
  • Scaling policies are set up in ASG to tie cloudwatch alarms with autoscaling activities.
  • The EC2s are launched as soon as an ASG is created.
  • A Launch Configuration cannot be modified. However, it can be copied as a new configuration.
  • An ASG can be modified at any time and even the Launch Configuration associated with it can be changed.
  • Deleting an ASG will terminate the EC2 instances but will not delete the launch configuration associated with it.
  • ASG can also be created from an EC2 instance, where all the EC2 instance information is used to create a launch configuration, with these limitations -
    • The block device information from the AMI is copied over, but not the devices that were attached after the instance launch.
    • Tags are not copied over to the ASG
    • The load balancer is not copied over (if the instance was behind one) in the ASG.
  • Scaling options -
    • Manual : Update the min, max and desired number of instances manually.
    • Scheduled : Good for predictable scaling needs/traffic patterns. A cron pattern can be specified for a schedule for recurring events.
    • Dynamic : The most advanced but also the most common, based on cloudwatch alarms.
  • Multiple scaling policies can be associated with an ASG.
  • Termination Policies - decide which instance is terminated when the ASG scales down.
  • Default Termination Policy


EFS

  • Elastic File System, AWS’s NFS (NFSv4) filesystem, which is petabyte scale and scales up on demand.
  • An EFS is provisioned in multiple AZs and gets an (private) IP per AZ. The instances in each AZ mount to that IP address.
  • EFS has much better performance compared to EBS PIOPs
  • Pay for the storage used
  • Read after Write consistency model
  • Data is stored across multiple AZ within the same region.

Placement Groups

  • Two types - Clustered and Spread
  • Clustered Placement Group has been around for long, where instances share the same AZ. This is for low network latency and/or high network throughput. Only certain instance types (memory optimized, compute optimized, network optimized) can be launched in a clustered placement group.
  • Spread Placement Group ensures the instances are on different underlying hardware, and can span multiple AZs.
  • AWS recommends having the same instance types in a clustered placement group.
  • You cannot move an existing instance to an existing placement group. You can however launch an AMI from that instance into the placement group.


Lambda

  • Released at re:Invent 2014
  • Amazon’s Event Driven Compute Service where a function is run without the customer needing to provision any servers.
  • Lambda has many event sources -
    • Cloudwatch (events, logs, alarms)
    • S3
    • SNS
    • API Gateway via HTTP requests
    • DynamoDB
    • IoT
    • Alexa Skills
    • Kinesis
    • Cloudfront
    • SQS
    • SES
    • CodeCommit
    • Cognito
    • CloudFormation
  • Lambda scales out automatically, and runs concurrently (as the events occur), default limit is 1000 concurrent executions.
  • Lambda can be tied to a VPC, security group(s), and IAM role.
  • API Gateway is used to trigger Lambdas as a response to HTTP requests.
  • Each HTTP request translates into one lambda. In other words, 1 lambda function is not shared between multiple requests. This is 100% stateless by design.
  • Lambda supports Node, C#, Java, Python
  • Lambda free tier is 1M requests, and 20c per 1M requests thereafter.
  • For billing, lambda execution time is rounded to 100ms, and memory is rounded to 128MB.
  • Lambda can only run for 5 min max (recently 15 min), and max memory is 3008MB
  • Failure of asynchronous invocation (like SNS) is retried twice with delay in between (so total of 3 attempts), but sync will return 429 error for failure with no automatic retries.
  • Lambda based systems can get pretty complex when it comes to debugging/troubleshooting. AWS X-Ray helps with that.
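
The billing rules above (time rounded up to 100ms, 20c per 1M requests after the free 1M) are easy to get wrong under exam pressure, so here is a minimal sketch of the arithmetic. The function names are my own, and it deliberately stops at GB-seconds, since the per GB-second price is not covered in these notes.

```python
import math


def billed_duration_ms(actual_ms: float, increment_ms: int = 100) -> int:
    """Round an execution time up to the next billing increment."""
    return math.ceil(actual_ms / increment_ms) * increment_ms


def request_charge_usd(requests: int,
                       free_requests: int = 1_000_000,
                       usd_per_million: float = 0.20) -> float:
    """Charge for the requests beyond the free tier (20c per 1M requests)."""
    billable = max(0, requests - free_requests)
    return billable / 1_000_000 * usd_per_million


def billed_gb_seconds(invocations: int, actual_ms: float, memory_mb: int) -> float:
    """Billed compute: invocations x billed seconds x configured memory in GB."""
    seconds = billed_duration_ms(actual_ms) / 1000
    return invocations * seconds * (memory_mb / 1024)
```

For example, a function that runs for 101ms is billed as 200ms, and 3M requests in a month cost 40c in request charges once the free 1M is used up.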

Route 53

  • Route 53 is Amazon’s DNS Service, allowing domain name mapping to EC2s, Load Balancers and S3 buckets.
  • Route53 is a global service, just like IAM.
  • ipv4 has 32 bit space, ipv6 is 128 bits
  • Last word in any DNS name is the top level domain name (.com, .gov, .in), the one before is second level domain name.
  • The domain names are registered via domain name registrars (amazon, godaddy, wix, etc.) with InterNIC which maintains the whois database.
  • SOA record (Start of Authority) has the info for TTL (seconds), zone admin, zone server.
  • NS records are the name server records.
  • The ISP looks up the top level domain to ask for an NS record, which points to a name server, then the ISP will contact the NS server, which points to SOA record, which has an A (Address) record which has the IP address.
  • Cnames are used to convert one domain to another - like aliases.
  • Alias records are unique to Route 53, they’re just like cnames.
  • A cname cannot be used for naked domain names. That’s why AWS came up with alias records for route53.
  • Naked Domain Name == Zone Apex Record, is a domain name without the www
  • Alias Record Set == CNAME record set, which is created for an AWS Resource. It is only supported by A and AAAA (ipv6) DNS record types. Alias target can be ALB, CLB, NLB, Cloudfront Distribution, S3 website.
  • Routing Policies
    • Simple : This is the default routing policy. No intelligence, just a simple resolution to a resource like a web server LB.
    • Weighted : Split traffic by assigning weights. Can be used for DR tests, canaries.
    • Latency : Route traffic based on the lowest latency for the end user location. Latency Resource Record sets are needed in route53 for this.
    • Failover : For active/passive (usually DR) setup. This utilizes healthchecks on the primary site.
    • Geolocation : Send traffic to localized servers based on the user’s geo location.
    • Multivalue Answer : Works like a load balancer, where multiple targets are (optionally) health-checked and are returned as multiple IPs randomly.
    • Geoproximity : Send traffic to the nearest resource based on the client’s location. Needs Route53 traffic flow enabled.
  • There is a soft limit of 50 domain names.
  • Be sure to read this
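
To make the Weighted routing policy concrete: each record gets a share of traffic equal to its weight divided by the sum of all weights. The sketch below only simulates that math locally; it is not the Route 53 API, and the record names are made up.

```python
import random


def traffic_share(weights: dict) -> dict:
    """Fraction of queries each record receives: weight / sum of all weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def pick_record(weights: dict, rng: random.Random) -> str:
    """Pick one record at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With weights of 90 for a hypothetical "blue" record and 10 for "green", about 10% of DNS answers point at green, which is what makes this policy handy for canaries.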


RDS

  • Relational Database Service for OLTP
  • 6 database engines - Aurora, MySQL, MariaDB, Oracle, MS SQL Server, Postgres.
  • Non Relational Databases or NoSQL Databases have Collections (Tables), Documents (Rows) and Key-Value Pairs (Fields). The documents may be nested. The structure of the document is not fixed (schemaless). DynamoDB is Amazon’s NoSQL database.
  • Data Warehousing is used for BI. It is used to perform complex operations on complex data sets, which are very data intensive.
  • OLTP - Online Transaction Processing, typically small writes and reads, but happen very frequently.
  • OLAP - Online Analytics Processing, is like Data Warehousing. Very different architecture and infrastructure than OLTP. AWS has Redshift as the DW database.
  • AWS always gives an instance (service) endpoint which is a DNS address for the DB instance, never an IP.
  • The DB Security Group needs to allow inbound traffic on the database port (3306 for MySQL) from the security group of the EC2 instance that is trying to establish a connection.
  • There are 2 types of backups - Automated backups and Database snapshots. Automated backups are retained for 1-35 days.
  • Automated backups take a full daily backup. For recovery, AWS chooses the most recent daily backup and applies the transaction logs. This allows for point in time recovery within the retention window.
  • Automated backups are enabled by default, and stored in S3 (free storage). During the backup, which runs in a defined window, storage IO may be suspended. They are deleted if the RDS instance is deleted.
  • DB snapshots are manual. They survive the RDS instance deletion.
  • The restored RDS instance will have a new DNS and will be a new RDS instance.
  • Encryption at rest is supported for all RDS DB engines.
  • Encryption uses AWS KMS. If the RDS instance is encrypted, the backup, snapshots and replicas are also encrypted. Encryption has to be defined at instance creation time. For existing DB, encrypt the snapshot thereby creating a copy, and restore it to create a new, encrypted instance. (Just like what we’d do with EBS)
  • Snapshots can be copied across regions
  • Multi AZ means a copy of a database (standby) in a different AZ which is replicated synchronously. This is for DR only. The instance automatically fails over to the standby in another AZ in the event of a failure. The DNS endpoint will now have the IP of the multiAZ database.
  • Read Replicas are for performance (not the standbys) - They are read-only copies of the master, which are replicated asynchronously. This is ideal for read-heavy workloads. They are not available for MS SQL Server and Oracle. They’re used for scale-out. Up to 5 read replicas are possible.
  • Read replicas can be promoted to be their own databases, but this breaks replication.
  • Read replicas can be in a completely different region.
  • Read replicas can be encrypted even if the master is not.


Elasticache

  • Elasticache - Managed in-memory storage. Supported engines are Memcached and Redis.
  • Used to improve performance of read-heavy applications by providing low latency access.
  • Memcached is not multiAZ, Redis is. Redis also supports Master-Slave replication.
  • Memcached cluster can be scaled out just like an ASG
  • Redis supports rich data structures like lists, hashes, sets and provides persistence, pub-sub, and multi-AZ with failover just like RDS.


DynamoDB

  • Amazon’s NoSQL database
  • You pay for read and write provisioned capacity + the storage.
  • Data is replicated across 3 data centers
  • Reads can be eventually consistent or strongly consistent
  • The largest record size is 400KB
  • Read Capacity = 1 read capacity unit covers one strongly consistent read per second (or two eventually consistent reads per second) of an item up to 4 KB. So if the items are full size (400KB), we’d need a read capacity of 100 to read 1 such item in a second with strong consistency. Eventually consistent reads cost half as much, so the same throughput (1 full size record read in a second) would need only 50 units.
  • Write Capacity = Number of items which are 1 KB in size that can be written in a second. So just like the above case, if we have 1 full size item (400KB) that needs to be written, then we’d need a write capacity of 400 to be able to write it in 1 second. If we need to write 2 such items in a second, then it’d be 800 (2 x 400 x 1KB).
  • A local secondary index can only be created at table creation time.
  • A local secondary index has the same primary partition key as the main table and can have a different sort key.
  • Global secondary index is the one where the index primary key can be different than the primary partition key of the main table, and of course the sort key can be different as well.
  • There is a 10GB limit per item collection (the sum of the sizes of all items that share the same partition key value, in the table plus its local secondary indexes).
  • LSI share the provisioned read write capacity of the main table.
  • GSI need their own read write capacity, so think of GSI as another table of its own.
  • Changes to DynamoDB tables can be streamed using DynamoDB streams. These streams live in shards for 24 hours (like Kinesis).
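
Capacity math is a favorite exam topic, so here is a small sketch of it. It assumes the published rule that 1 read capacity unit is one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB, and 1 write capacity unit is one write per second of an item up to 1 KB. The function names are mine.

```python
import math


def read_capacity_units(item_kb: float, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: item size rounded up to 4 KB blocks, times reads per second."""
    units_per_read = math.ceil(item_kb / 4)
    total = units_per_read * reads_per_sec
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb: float, writes_per_sec: int) -> int:
    """WCUs needed: item size rounded up to 1 KB blocks, times writes per second."""
    return math.ceil(item_kb) * writes_per_sec
```

For a full size 400KB item this gives 100 units for one strongly consistent read per second, 50 for an eventually consistent one, and 800 for two writes per second.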


Redshift

  • Amazon’s OLAP service, fully managed data warehouse
  • Can be single node with 160GB data
  • To scale, use multi node where there is a leader node with compute nodes to do the work (128 compute nodes)
  • Redshift organizes the data based on columns (column based system).
  • You’re not charged for a leader node, only the compute nodes.
  • Redshift is only available in 1 AZ
  • Can be snapshotted and copied to another AZ
  • Redshift Spectrum allows running SQL on exabytes of unstructured data in S3. No ETL needed.
  • Supports AES256 for data encryption at rest
  • Redshift attempts to maintain at least 3 copies of the data.
  • Redshift can asynchronously replicate data to S3 in another region for DR
  • Backup retention is same as RDS - 1 to 35 days


VPC

  • A virtual datacenter in the cloud (Virtual Private Cloud).
  • Number of IPs in a CIDR notation is 2^(32-N). So, a /32 is 2^(0) which is 1.
  • /16 is the largest VPC, and smallest is /28.
  • A region has a soft limit of 5 VPCs.
  • A VPC is divided into subnets, where 1 subnet can only be in 1 AZ (a subnet cannot spread across AZs).
  • Route Tables control traffic between subnets.
  • Internet Gateways (1 per VPC) are used to provide internet (in+out) to a subnet by adding a route to IGW in that subnet’s route table.
  • We get a /16 default VPC in each region, where all /20 subnets are public.
  • VPCs can be peered (even between different accounts and regions). The peering is in a star configuration where there is no transitive connectivity.
  • NACLs (Network Access Control Lists) sit at the subnet level, and are stateless.
  • A VPC can be on a dedicated tenancy, where all the instances that are launched in this VPC will use dedicated hardware.
  • When we create a new VPC, it will create a default NACL, default route table (called main), but no subnets or gateways.
  • The default NACLs allows all inbound and outbound traffic, default SG allows all inbound within the same security group and allows all outbound to anywhere, default route table (main) allows all traffic within the VPC.
  • We lose 5 IP addresses per subnet. These are first 4 and last 1 (.0 network address, .1, .2 and .3 are reserved and .255 is the broadcast).
  • For outbound internet access for private subnets, we need to route the traffic to either NAT gateway or NAT instance.
  • NAT instances are legacy, and are created from a community AMI. They’re placed in a public subnet with a security group that allows HTTP traffic inbound and all traffic outbound. Remember to turn off the source/dest check on the instance. By default all the EC2s only allow traffic that either originates or terminates at them.
  • NAT instances are difficult to scale up and out using the traditional ASG setup.
  • Use NAT Gateways. They’re managed by AWS and are highly available. They’re also created in a public subnet. They do not sit behind a security group either. They scale automatically up to 10Gbps.
  • Egress-only internet gateway is similar to a NAT gateway, except it’s for ipv6.
  • It is a good practice to create 1 NAT Gateway per AZ for AZ failure isolation.
  • NACLs sit at the subnet level (which sits at the AZ level). There can only be 1 NACL per subnet, but multiple subnets can be associated with a NACL.
  • NACLs are stateless, so you’d need to explicitly allow outbound traffic when you enable inbound traffic on a port.
  • A new NACL has deny all for inbound and outbound (unlike default which is allow all).
  • The rules are evaluated in the increasing order (AWS recommends increments of 100) and are first match exit.
  • A * in a NACL rule set is the default, when there is no earlier match.
  • Ephemeral ports are super important - they’re ports from 1024-65535, which are used as short lived ports for the client. The client picks these ports to expect the response on. They need to be opened up for outbound traffic.
  • The changes to NACLs take effect immediately.
  • The internet facing ALBs need to be in at least 2 AZs, and both of the subnets have to be public.
  • VPC flow logs allow capturing IP traffic going to/from the network interfaces in the VPC. They are stored as cloudwatch logs. They’re created at VPC, Subnet, or ENI level.
  • Flow logs can include peered VPC only if the peered VPC is in the same account.
  • VPC endpoints allow access to AWS services (S3, SQS, SNS..) via the AWS backbone, bypassing the internet.
  • We get 5 Elastic IP Addresses per region (soft limit). These are static IPs that can be detached and attached to another resource. For example, NAT Gateway gets an EIP.
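
The CIDR arithmetic above (2^(32-N) addresses, minus the 5 that AWS reserves in every subnet) can be checked with Python's standard ipaddress module. This is just a sketch for practicing the numbers; the helper names are my own and the example CIDR is arbitrary.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first 4 addresses and the last 1


def total_addresses(prefix_len: int) -> int:
    """Number of IPv4 addresses in a /prefix_len block: 2^(32-N)."""
    return 2 ** (32 - prefix_len)


def usable_addresses(cidr: str) -> int:
    """Addresses left for our own resources after the AWS reservations."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def reserved_addresses(cidr: str) -> list:
    """The 5 addresses we lose in a subnet, as strings."""
    network = ipaddress.ip_network(cidr)
    return [str(network[i]) for i in (0, 1, 2, 3, -1)]
```

So a /24 subnet such as 10.0.1.0/24 leaves 251 usable addresses, and the smallest allowed /28 leaves only 11.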


SQS

  • A queueing service that acts as a buffer for messages between producers and consumers. Oldest AWS service. Used for decoupled architecture.
  • SQS supports encryption at rest.
  • Standard Queue -
    • At least once delivery.
    • Higher throughput
    • No guaranteed order
  • FIFO Queue -
    • More $$
    • Less Throughput (300 TPS)
    • Exactly Once Delivery
    • Retains order
  • SQS is a pull based system
  • 256 KB per message
  • Default retention period is 4 days but can be maxed at 14 days, min is 1 minute.
  • Default visibility timeout is 30 seconds, can be maxed at 12 hours.
  • Visibility timeout is the amount of time the message is hidden (in flight) from other consumers.
  • SQS Long polling is a good way to save costs, as it holds on to the connection till the timeout happens, or a message shows up. The max is 20 seconds. (Receive message wait time)
  • Delivery Delay can be set up on the queue (0-15 mins)
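
The visibility timeout is easier to remember once you see it in motion. Below is a toy in-memory model (not the SQS API, and the class and method names are made up) where time is passed in explicitly: a received message is hidden for the timeout, and it shows up again if the consumer never deletes it.

```python
class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout: float = 30):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible]
        self._next_id = 0

    def send(self, body: str, now: float = 0.0) -> int:
        self._next_id += 1
        self._messages[self._next_id] = [body, now]
        return self._next_id

    def receive(self, now: float):
        """Return (id, body) of the first visible message, hiding it for the timeout."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # message is now in flight
                return msg_id, entry[0]
        return None  # nothing visible right now

    def delete(self, msg_id: int) -> None:
        """Consumers delete a message once it has been processed successfully."""
        self._messages.pop(msg_id, None)
```

If a consumer receives a message at t=1 with a 30 second timeout and crashes, nobody else sees that message until t=31, at which point it is delivered again. That redelivery is why standard queues are at least once.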


SWF

  • Simple WorkFlow service that makes it easy to coordinate tasks between machines and humans with a task oriented API.
  • A typical example is the order fulfillment flow behind a shopping experience.
  • Workflows can last for as long as 1 year
  • SWF Starter is the actor that starts the workflow
  • SWF Decider is a program that controls the coordination of tasks.
  • SWF Activity Worker is the code that executes to perform that task.
  • SWF assigns a task to only one worker (no duplication at all)
  • SWF Domain is the container that contains the related workflows, their tasks, etc.


SNS

  • Simple notification service
  • Push based, to send notifications, or act as a trigger for some other processing.
  • SNS supports Email, Email-JSON, HTTP/S, SMS, Lambda, Application, and SQS as transport protocols.
  • All messages published are stored redundantly across multiple AZs, and can be encrypted at rest (very recently announced).
  • An SNS topic acts as an endpoint for pushing messages to consumers.

API Gateway

  • A service that provides managed, secure HTTP interface to a lambda function (or LB, or EC2)
  • Scales Automatically
  • Supports API response caching, so that during the TTL the requests never hit the backend.
  • Supports Throttling to ensure the back end is not flooded with requests (like the database).
  • CORS would need to be enabled on the API Gateway, so pages from another domain can access the API Gateway Endpoints.


Kinesis

  • A service that allows ingesting, storing and processing streaming data.
  • Three Services
    • Kinesis Streams : Stores streaming data for 1-7 days in shards. Consumers pull data from the shards. Then they can send this data to be stored in DynamoDB, S3, redshift, etc. Capacity of the stream is the sum of the capacity of the shards.
    • Kinesis Firehose : No need for shards (so no retention). Data can be (optionally) analyzed with lambda and stored in S3, ES cluster, redshift (via S3)
    • Kinesis Analytics : Allows SQL queries to be run on Kinesis Streams as well as Kinesis Firehose.
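
Since the capacity of a Kinesis stream is the sum of the capacity of its shards, sizing a stream is simple arithmetic. The sketch below assumes the per shard limits that AWS publishes (1 MB/s in, 2 MB/s out), which are not part of my notes above, so double check them before relying on the numbers.

```python
import math

# Assumed per shard limits (from AWS documentation, not from these notes).
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_READ_MB_PER_SEC = 2.0


def stream_capacity(shards: int) -> tuple:
    """(write MB/s, read MB/s) of a stream: the sum over its shards."""
    return shards * SHARD_WRITE_MB_PER_SEC, shards * SHARD_READ_MB_PER_SEC


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Smallest shard count that covers both the write and the read throughput."""
    for_writes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    for_reads = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    return max(1, for_writes, for_reads)
```

For instance, ingesting 5 MB/s while consumers read 8 MB/s needs 5 shards, because the write side is the bottleneck.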

Well Architected Framework Pillars


Security

  • Data protection at rest and transit, Privilege Management, Infrastructure Protection, Detective Controls
  • Apply security at all levels (NACL, Security Groups, WAF, IAM policies)
  • Enable traceability (Cloudtrail, GuardDuty, Config, Cloudwatch)
  • Focus on security (Hardened AMIs, KMS encryption, S3 versioning, MFA deletes, IAM Password policies and MFA)
  • Automate Security (Cloudwatch alarms)
  • Shared responsibility model
    • AWS is responsible for security of the cloud (AWS Global Infrastructure, Physical Infrastructure)
    • Customer is responsible for security in the cloud (AMIs, Data Encryption, O/S, IAM, Application Data)


Reliability

  • Test recovery procedures (chaos engineering)
  • Automated recovery (ELB, multi-AZ, Route53)
  • Scale horizontally (ELB)
  • Stop guessing capacity (Autoscale) and know the service limits
  • Assume failures
  • Change Management in AWS
  • Backup, recovery, RPO, RTO
  • IaaC, Failure Injection Queries of Aurora

Performance Efficiency

  • Evolve the platform as AWS evolves theirs
  • Use managed services for PaaS and IaaS
  • Go Global
  • Use Serverless
  • Experiment often
  • Across Storage, Network, Database and Compute - pick the right options across the stack.
  • Focus on reducing latency across the stack, and make it predictable.

Cost Optimization

  • Reduce the cost to run infrastructure
  • Transparent expenses (use tags, budgets, billing alerts, consolidated billing)
  • Use managed services
  • Pay for what you use, make resources idle when not in use (compute, autoscale, lambda)
  • Economies of Scale
  • Do not overlook data xfer charges

Operational Excellence

  • Perform operations with code
  • Apply monitoring and collect metrics
  • Make incremental changes
  • Prepare - Maintain runbooks and playbooks (cloudformation)
  • Change Visibility and Configuration Management - AWS Config, Cloudwatch, Tagging
  • Focus on No downtime deployments, focus on CI/CD
  • Have an automated rollback plan before making changes

AWS Organizations

  • AWS allows linking and managing multiple accounts together, centrally.
  • There is a paying account (root) and other accounts linked to it.
  • Consolidated billing aggregates expenses across accounts per service and you’re sent one, consolidated bill.
  • Volume pricing discount as multiple account usage adds up for lower pricing tier.
  • Reserved instances that are unused in one linked account can be applied to usage in another linked account.
  • No resources should be deployed in the paying (root) account.
  • There is a soft limit of 20 linked accounts.
  • The organizations allow using SCP (Service Control Policy) which can be used to control the AWS services that the linked accounts can use.
  • SCP will override IAM.

Cloudformation Structure

  • Resources - Define the resources to be created
  • Parameters - Parameters taken in to create the resources
  • Mappings - Used to map key values in the template
  • Outputs - Return the resources created after running the template
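
To see how the four sections fit together, here is a minimal template skeleton built as a Python dict and dumped to JSON. The logical names (EnvName, EnvSettings, NotesBucket) are made up for illustration, and this is only a sketch of the structure, not a template I deployed.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    # Parameters: values taken in when the stack is created.
    "Parameters": {
        "EnvName": {"Type": "String", "Default": "dev", "AllowedValues": ["dev", "prod"]}
    },
    # Mappings: static lookup table, keyed here by the EnvName parameter.
    "Mappings": {
        "EnvSettings": {"dev": {"Tier": "low"}, "prod": {"Tier": "high"}}
    },
    # Resources: the only mandatory section, the things to be created.
    "Resources": {
        "NotesBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "Tags": [{
                    "Key": "Tier",
                    "Value": {"Fn::FindInMap": ["EnvSettings", {"Ref": "EnvName"}, "Tier"]},
                }]
            },
        }
    },
    # Outputs: values returned after the stack has been created.
    "Outputs": {"BucketName": {"Value": {"Ref": "NotesBucket"}}},
}

template_json = json.dumps(template, indent=2)
```

The parameter feeds the mapping lookup, the lookup feeds a resource property, and the output hands the created bucket name back to whoever ran the template.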


FAQs, APIs/CLIs and Documentation


Before you schedule the exam, measure your confidence in these areas in particular -

  • Ins and outs of VPC, EBS and S3.
  • Highly available, fault tolerant and cost effective architectures (and a combination of these)
  • Disaster Recovery