My CSAA from 2016 had expired, and I was in Vegas to attend re:Invent 2018, so I took the opportunity to recertify the credential. This is a newer version of the exam, which made it exciting, but at the same time there were a lot of services that I had not really used, so I had to go through the FAQs and documentation for those, along with the excellent acloud.guru course.

It took me about 4 days of serious prep, and I scored 927/1000. Working with AWS professionally for 6 years helped me in a lot of areas, particularly the Well-Architected Framework. Also, I loved my exam experience at re:Invent, compared to the typical exam centers we go to. It was far more relaxing, and I scored quite a bit of swag.

Here are the notes that I took while preparing for the exam.

Exam Blueprint

Domains

Domain 1: Design Resilient Architectures
- 1.1 Choose reliable/resilient storage.
- 1.2 Determine how to design decoupling mechanisms using AWS services.
- 1.3 Determine how to design a multi-tier architecture solution.
- 1.4 Determine how to design high availability and/or fault tolerant architectures.
Domain 2: Define Performant Architectures
- 2.1 Choose performant storage and databases.
- 2.2 Apply caching to improve performance.
- 2.3 Design solutions for elasticity and scalability.
Domain 3: Specify Secure Applications and Architectures
- 3.1 Determine how to secure application tiers.
- 3.2 Determine how to secure data.
- 3.3 Define the networking infrastructure for a single VPC application.
Domain 4: Design Cost-Optimized Architectures
- 4.1 Determine how to design cost-optimized storage.
- 4.2 Determine how to design cost-optimized compute.
Domain 5: Define Operationally-Excellent Architectures
- 5.1 Choose design features in solutions that enable operational excellence.

Time

130 minutes for 65 questions, score 100-1000, pass score 720, $150

Overview

AWS Global Infrastructure

Regions are physical locations spread out globally (18 as of 8/6/2018)
A region consists of at least two AZs (Availability Zones), which are data centers isolated from each other to ensure high availability and failure isolation (55 AZs as of 8/6/2018)
Edge Locations are where CloudFront, Amazon’s CDN (Content Delivery Network) caches content for geo-distribution. There are way more edge locations than regions or AZs. (96 as of 8/6/2018)

Compute

EC2 (Elastic Compute Cloud) - The VMs (can also be bare metal)
ECS (Elastic Container Service, formerly EC2 Container Service) - to run and manage docker containers
Elastic Beanstalk - Upload the code and it provisions the load balancers, EC2s, Security Groups etc.
Lambda - FaaS, Serverless Platform
Lightsail - Amazon’s Website Hosting Service (Virtual Private Server). Get ssh access and DB access with a static (fixed) IP.
Batch - Used for Batch Computing where Batch Jobs are defined as docker containers

Storage

S3 (Simple Storage Service) - Object based storage service
EFS (Elastic File System) - Network Attached Storage or NAS that can be mounted to multiple EC2s.
Glacier - Data archival, cold storage. High cost (and time) to retrieve, low cost to store.
Snowball - To import large amount (petabytes) of data into AWS. Looks like a suitcase.
Storage Gateway - Virtual appliances that are hosted on-prem that transfer (replicate) data to AWS

Databases

RDS (Relational Database Service) for MySQL, MSSQL, Oracle, Postgres, Aurora, MariaDB
DynamoDB - Non-relational, NoSQL database service
ElastiCache - Cache Service supporting Memcached and Redis
Redshift - Data Warehouse/OLAP

Migration

AWS Migration Hub
Application Discovery Service
Database Migration Service (DMS)- On prem database to RDS migration
Server Migration Service - VM/Physical to EC2 for Lift and Shift type of migrations
Snowball - Migrate large amount of data into AWS (petabyte scale)

Networking and Content Delivery

VPC (Virtual Private Cloud) - A virtual datacenter
CloudFront - Amazon’s CDN
Route53 - Amazon’s DNS Service
API Gateway - Enables exposing services as APIs
Direct Connect - A dedicated line from on-prem to AWS VPC

Developer Tools

CodeStar - Project management and collaboration around the code toolchain
CodeCommit - Version Controlled Code Repository (like github)
CodeBuild - Code Builder (like Jenkins)
CodeDeploy - Deployment Service to deploy artifacts
CodePipeline - Continuous Delivery Service to model, visualize, and automate the release steps
X-Ray - To debug, trace and troubleshoot performance bottlenecks, etc.
Cloud9 - in-browser IDE, mostly used to code lambda functions in-line

Management Tools

CloudWatch - Monitoring (and alerting) Service
CloudFormation - IaC, Infrastructure as Code. The artifacts are called Templates.
CloudTrail - Audit logging of changes (API calls) to the AWS environment. By default the event history only keeps the last 90 days of API activity; create a trail to retain logs longer in S3.
Config - Monitors the configuration and gives a visual representation of the changes, and can go back in time. Like Time-machine for your AWS.
OpsWorks - Managed Chef and Puppet Service (Configuration Management)
Service Catalog - Manage a catalog of approved services for the AWS account. Used by enterprises
Systems Manager - Interface for managing AWS resources like patch management for EC2s, categorize AWS resources
Trusted Advisor - Advice around security, and saving $$$
Managed Services

Media Services

Elastic Transcoder - Think of it as Managed ffmpeg
MediaConvert - File based media converter, used for VODs
MediaLive - Broadcast live video streams
MediaPackage - Prepares and protects video
MediaStore - Storage Service optimized for VOD and Live Video
MediaTailor - Targeted advertising into video streams

Machine Learning

SageMaker - Managed service to build, train, and deploy machine learning models (including deep learning)
Comprehend - Sentiment Analysis of data
DeepLens - A camera that runs deep learning algorithms on the device
Lex - AI based interaction service
Machine Learning - Intelligence out of data, recommendation systems
Polly - Text to Speech, highly customizable
Rekognition - Analyze video and/or images
Translate - Translate languages from one to another
Transcribe - Subtitles from video, speech to text (Opposite of Polly)

Analytics

Athena - Run SQL on S3 bucket data like CSV or spreadsheets, serverless
EMR (Elastic Map Reduce) - Big data service
CloudSearch
ElasticSearch - Managed ElasticSearch Cluster
Kinesis - To ingest and process streaming data
Kinesis Video Streams - To ingest and process streaming video for analytics
QuickSight - BI Tool to analyze and visualize data
Data Pipeline - Move data between different AWS services
Glue - ETL

Security, Identity and Compliance

IAM (Identity and Access Management)
Cognito - Managed Authentication Service, also supports federated logins, gives temporary access to AWS
GuardDuty - Monitors for malicious activity in the AWS account
Inspector - An agent installed on the EC2s and generates a vulnerability report
Macie - Scans S3 buckets for PII and secrets
Certificate Manager - Manage SSL certificates
CloudHSM - Hardware Security Module as a Service, $1.20 an hour
Directory Service - Integrate Microsoft ActiveDirectory with AWS
WAF - Layer 7 (Application Layer) firewall
Shield - DDoS mitigation, free for CloudFront, ALBs, R53. Shield Advanced gives you a dedicated team 24x7 to help out ($3K a month).
Artifact - Audit and Compliance, lets you download security and compliance reports from AWS

Mobile Services

Mobile Hub - Management Console for Mobile Apps to manage the AWS services that are a backend for the mobile app
Pinpoint - Targeted push notifications to drive mobile engagement, like FourSquare
AppSync - Automatically update data in the mobile apps including offline updates
Device Farm - A test farm to test the app on live devices running in AWS
Mobile Analytics - Analytics Service for Mobile

AR/VR

Sumerian for Augmented Reality/Virtual Reality, is in preview

Application Integration

Step Functions - to manage lambda workflows
Amazon MQ - Managed ActiveMQ service
SQS (Simple Queue Service) - Pull, used to build decoupled architectures
SNS (Simple Notification Service) - Push notifications
SWF (Simple Workflow Service) - Something Amazon uses on their website to manage the workflows needed to ship orders

Customer Engagement

Connect - Call Center in the Cloud
SES (Simple Email Service) - Highly Scalable Email Service

Business Productivity

Alexa for Business - Helpdesk type of service
Chime - Video Conferencing
WorkDocs - Dropbox for AWS
WorkMail - Outlook for AWS

Desktop and App Streaming

Workspaces - VDI in the cloud, Desktop as a Service
AppStream 2.0 - Stream the application that is running in the cloud (compared to desktop or VDI)

Internet of Things

IoT - Scalable service to handle IoT device data coming in from sensors etc.
IoT Device Management - Manage connected devices at scale
FreeRTOS - Operating System for Microcontrollers
GreenGrass - Allows running compute, caching and messaging locally on the connected device in a secure manner (in an offline mode)

Game Development

GameLift - A service to develop games in AWS

Support

4 Plans - Basic (free, included by default), Developer ($29/month), Business ($100/month, 10% of the bill) and Enterprise ($15k/month)

IAM - Identity and Access Management

Centralized control of AWS account
Users - end users who need access, either to the console via a password, or programmatic access to AWS resources via access keys, or both
Groups - collection of users who share permissions
Role - permissions assigned to AWS resources like EC2, ECS, Lambdas etc. Roles are a secure way to grant permissions to entities that you trust. Roles are preferred over credentials (access key and secret key). A change in the role’s policy takes effect immediately. The roles can be attached to or detached from a running EC2 instance.
Policies - permissions defined in a policy document in JSON

A policy document contains a statement, which has a collection of Effect, Action and Resource attributes. The Action and Resource can themselves be collections. These can be associated with users, groups and roles.

{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Effect": "Allow",
          "Action": [
              "s3:*",
              "cloudwatch:*",
              "ec2:*"
          ],
          "Resource": "*"
      }
  ]
}
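
To make the Effect/Action/Resource structure concrete, here is a minimal Python sketch that builds a narrower, read-only policy than the wildcard one shown above. The helper `make_policy` and the bucket name `example-bucket` are made up for illustration; only the document layout follows the policy shown.

```python
import json


def make_policy(actions, resources, effect="Allow"):
    """Build a single-statement IAM policy document as a plain dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": effect,
                "Action": list(actions),
                "Resource": list(resources),
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
read_only = make_policy(
    actions=["s3:GetObject", "s3:ListBucket"],
    resources=[
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*",
    ],
)

print(json.dumps(read_only, indent=2))
```

Listing specific actions and resources instead of `*` is usually the safer starting point, since new users and roles have no permissions until a policy grants them.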

IAM is global, not tied to a region
We can create an alias for the account, which changes the default sign-in link from (https://account_number.signin.aws.amazon.com/console) to (https://somealias.signin.aws.amazon.com/console)
Always activate MFA on the root account
A root account is the email that is used to setup the account, and it has complete admin access by default
Create users in the account and assign them permissions instead of sharing root user credentials. New users have no permissions when created.
PowerUserAccess is AdministratorAccess minus IAM management.
Create groups to hold permissions and put users in those groups instead of attaching permissions to users individually
For programmatic access (via CLI, SDK and APIs), users get an access key ID and secret access key. For logging into the console, users get a user name and a password
A user can belong to multiple groups
Permissions (IAM Policies) can also be attached to the users, roles and groups directly
There can be a maximum of two pairs of key/secret active at any given time, this is to facilitate access key rotation
Password policies can be set in IAM like rotation, complexity, expiration, etc.
Roles can be assigned to cross-account IAM users, code running on AWS in an EC2, ECS, Lambda, etc., AWS Service, or Federated Identity Users
Web Identity Federation allows users access to AWS resources after they have authenticated with a web based identity provider like Facebook or Google. The auth code received after this authentication is exchanged for temporary AWS credentials.
Amazon Cognito acts as an identity broker between the web identity providers and the application.
Cognito User Pools are directories used to manage signup and signin for mobile and web applications.
Successful authentication with a User Pool generates a number of JWTs
Cognito Identity Pools allow creation of unique identities for the users and authenticate them with identity providers. These identities are used to obtain short lived limited privilege credentials to access other AWS services.
Example flow: a user signs into a User Pool using Google credentials. This results in JWTs. The Identity Pool exchanges the JWTs for AWS credentials, and these credentials are used by the user to access AWS resources.
3 kinds of IAM policies
- AWS Managed Policies are created and administered by AWS, like AmazonS3FullAccess, AmazonDynamoDBFullAccess. They cannot be modified.
- Customer Managed Policies are created and managed by the customers within their account.
- Inline Policies are policies that can be embedded in the user, role or group directly. They cannot be shared between entities. They are deleted if the user, role, or group they’re embedded in is deleted.
The AWS sign-in endpoint for SAML is https://signin.aws.amazon.com/saml.

S3 - Simple Storage Service

Object based storage (vs. block based like EBS) which is highly available and durable.
Data is stored across multiple facilities and devices.
100 buckets per account by default (a soft limit that can be raised)
Can store files, but cannot run a database or operating system off of it
Objects can be 0 bytes to 5TB in size
We can upload up to 5GB in a single operation. For larger objects, use the multipart upload API.
Unlimited Storage
Successful uploads return an HTTP 200 response code via API or CLI
Files are stored in buckets whose names are globally unique (they have a DNS tied to them). Buckets can have folders in them. These folders do not need to be unique names except within the bucket.
By default all buckets are private.
The URL looks like https://s3.{region}.amazonaws.com/{bucketname} OR https://s3-{region}.amazonaws.com/{bucketname}. For example, https://s3.us-east-1.amazonaws.com/lobster1234-94568
Read after Write consistency for new Object PUTs. You can write and immediately read the content.
Eventual Consistency for overwrite PUTs and DELETEs. You can overwrite or delete, but if you read immediately, you may not get the current state. This can take minutes to hours.
S3 Object consists of a Key which is the name of the object (file name like foo.txt), value which is the data (byte[]), Version ID, metadata (tags), Sub-resources like ACL, Torrents
S3 has 99.99% Availability, and 99.999999999% (11 9s) durability.
There are different storage classes or tiers -
- STANDARD - 99.99% availability, 11 9s durability, designed to sustain the loss of 2 facilities (AZs) concurrently. Stored in >= 3 AZs. No retrieval fee.
- STANDARD_IA or S3_IA - Infrequent Access, lower cost than S3 Standard, but charged a retrieval fee. Stored in >= 3 AZs. 99.9% availability.
- ONEZONE_IA - Infrequently Accessed but stored in only 1 AZ, lower cost than S3 IA. 99.5% availability SLA. No resilience against the loss of that AZ, as the data is stored in only one AZ.
- GLACIER - Very cheap but only for archival. 3 retrieval modes - Expedited (1-5 mins, 0.03/GB), Standard (3-5 hours, 0.01/GB), or Bulk (5-12 hours, 0.0025/GB). No availability SLA.
- RRS - Reduced Redundancy Storage, 99.99% durability, not offered as an option anymore.
Availability is 99.99% for Standard, 99.9% for S3 IA and 99.5% for ONEZONE_IA.
You pay for storage, number of requests, storage management (inventory, tags), data xfer out (including cross region replication), and transfer acceleration (optional).
S3 Transfer Acceleration utilizes an edge location so users can perform faster uploads. The users can upload to an edge location, and from there on the data is uploaded to S3 using AWS backbone.
S3 objects can have lifecycles
S3 objects can be versioned, encrypted (SSE), locked down via ACLs and Bucket Policies
S3 is a global service in the AWS Console, just like IAM. However, the bucket URL does have the region name in it, and a bucket is tied to a region.
Bucket policies are applied at the bucket level and are written in JSON. ACLs can be applied down to the individual object level.
S3 buckets can have access logs, which can be stored in a different bucket, and can also be in a different AWS account.
Storage classes can be picked at the object level.
Bucket policies apply at the bucket level, while ACLs apply at the individual object level.
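
The 5GB single-operation limit and the 5TB object limit noted above can be turned into a small planning helper. This is a rough sketch, not an upload client: it reads GB/TB as GiB/TiB, and the 5 MiB minimum part size and 10,000-part ceiling are the multipart limits as I understand them from the S3 documentation. The names `plan_upload` and `ceil_div` are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_SINGLE_PUT = 5 * GIB    # single-operation upload limit from the notes
MAX_OBJECT_SIZE = 5 * TIB   # largest object size from the notes
MIN_PART_SIZE = 5 * MIB     # assumption: multipart minimum part size
MAX_PARTS = 10_000          # assumption: multipart part-count ceiling


def ceil_div(a, b):
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def plan_upload(object_size, part_size=100 * MIB):
    """Return ("single", 1) or ("multipart", part_count) for object_size bytes."""
    if object_size < 0 or object_size > MAX_OBJECT_SIZE:
        raise ValueError("object size must be between 0 bytes and 5TB")
    if object_size <= MAX_SINGLE_PUT:
        return ("single", 1)
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size is below the assumed 5 MiB minimum")
    # Grow the part size if the requested one would need too many parts.
    part_size = max(part_size, ceil_div(object_size, MAX_PARTS))
    return ("multipart", ceil_div(object_size, part_size))


print(plan_upload(1 * GIB))
print(plan_upload(10 * GIB))
```

In practice multipart upload is often worth using well below the 5GB limit, since parts can be uploaded in parallel and retried individually; the sketch only encodes the hard limits.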

S3 Encryption

S3 buckets can be encrypted Server Side via AES-256 (SSE-S3 using S3 Managed Keys) or AWS-KMS (SSE-KMS with KMS managed keys) or SSE-C with customer provided keys. This can be set at bucket, folder, and object level.
Encryption in transit uses SSL/TLS
Encryption at rest can be server side - S3 Managed (SSE-S3), SSE-KMS, or SSE-C, or client side
The header to specify encryption during PUT is x-amz-server-side-encryption with possible values as AES256 for SSE-S3 or aws:kms for SSE-KMS.
A bucket policy can be created to reject any request that does not have the x-amz-server-side-encryption thereby enforcing encryption always. This is done by adding a Condition in the bucket policy with StringNotEquals for Key x-amz-server-side-encryption and Value as aws:kms or AES256 for Action as s3:PutObject.
Each object gets a link which looks like this https://s3.amazonaws.com/mpandit-452001/IMG_1070.png. Note that there is no region in there.
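
As a sketch of the bucket policy described above (rejecting any PutObject that does not carry the expected x-amz-server-side-encryption header), the Python below builds the policy document as a dict. The function name and the bucket name `example-bucket` are hypothetical, and the condition key `s3:x-amz-server-side-encryption` is my reading of how the header is referenced in a policy; verify it against the S3 documentation before relying on it.

```python
import json


def deny_unencrypted_puts(bucket, required="AES256"):
    """Bucket policy that denies s3:PutObject unless the SSE header equals `required`."""
    if required not in ("AES256", "aws:kms"):
        raise ValueError("required must be 'AES256' (SSE-S3) or 'aws:kms' (SSE-KMS)")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedPuts",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": required
                    }
                },
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
policy = deny_unencrypted_puts("example-bucket", required="aws:kms")
print(json.dumps(policy, indent=2))
```

The statement uses an explicit Deny, so it applies regardless of any Allow granted elsewhere.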

S3 Versioning

Once enabled, versioning cannot be disabled, only suspended.
Buckets can be versioned, however, each version is stored in full (no incrementals or deltas). Hence, each version of the object adds to the storage cost.
Deleted objects are also stored as versions.
Delete Markers - A deleted object has an associated delete marker. If the delete marker is deleted, the deleted version is restored.
MFA Delete is an added layer of security in S3 buckets, where a second factor auth is required to delete an object version, or to change versioning state.

Cross Region Replication

Versioning needs to be turned on for both buckets (in either region)
Destination bucket can be in the current account, or in a different account
Storage class of the destination bucket can be changed (to be different than that of the source bucket)
A cross region replication role gets created when configuring replication from the console.
Only new objects will be replicated, not the existing ones. Use CLI or manual steps to copy the existing objects over.
If the delete marker is deleted, the deletion of the delete marker is not replicated (important). This is for security, to prevent someone from deleting stuff in the primary bucket and having it reflect in the replicated bucket.
The permissions are also replicated from source to destination bucket
If the versions are reverted in the source bucket, they do not replicate to the destination bucket
No daisy chaining of replication buckets
1 bucket can be replicated to only 1 other bucket (no 1 to many replication)

Lifecycle Management

This is how S3 manages objects (mostly moving objects between various storage classes or expiring them based on time)
Can be configured on previous versions or current version
Can be used without versioning
Transition to IA after 30 days of creation, transition to Glacier 30 days after IA, can be expired and permanently deleted as well.
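
The transition timeline above (IA 30 days after creation, Glacier 30 days after that, then expiry) can be written down as a lifecycle rule. This is a hedged sketch: the dict follows the shape I believe boto3's `put_bucket_lifecycle_configuration` accepts, the rule ID and the 365-day expiry are made-up values, and nothing here calls AWS.

```python
def lifecycle_rule(ia_after_days=30, glacier_after_ia_days=30, expire_after_days=365):
    """Build one lifecycle rule: Standard -> IA -> Glacier -> expire."""
    glacier_day = ia_after_days + glacier_after_ia_days
    if ia_after_days < 30:
        raise ValueError("the notes above use a 30 day minimum before moving to IA")
    if expire_after_days <= glacier_day:
        raise ValueError("expiry must come after the Glacier transition")
    return {
        "ID": "archive-then-expire",      # hypothetical rule name
        "Filter": {"Prefix": ""},         # empty prefix: apply to the whole bucket
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_day, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


config = {"Rules": [lifecycle_rule()]}
print(config)
```

Note that "Days" is always counted from object creation, which is why the Glacier transition lands on day 60 rather than day 30.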

Static Website Hosting

Serverless! But it is 100% static (like a blog).
Use a bucket policy to make the entire bucket public-read.
When selecting to host a website, it’d ask for an index page, and an error page.
The URL is different than the S3 bucket URL - it looks like http://lobster1234-94568.s3-website.{region}.amazonaws.com
The bucket name comes first, while in the S3 URL, the bucket name is the path.
Pre-signed URLs can be inserted in a page to share private content/user specific content.
Read S3 FAQs before the exam

CORS

Cross Origin Resource Sharing
CORS is a way for pages from one S3 bucket to access contents from another S3 bucket.
CORS needs to be enabled from the bucket that is being accessed, to allow the URL of the bucket that is accessing it.
To do so, click on CORS tab in the bucket being accessed, enter the HTTPS URL of the bucket that is trying to access under Access-Control-Allow-Origin which will say * in the template.
Use the published URL of the bucket as origin, not the S3 URL (use the one with website in the name).

Example CORS configuration would look like this, where {originbucketname} is trying to access contents of this bucket.

<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>http://{originbucketname}.s3-website.us-east-1.amazonaws.com</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>

Storage Gateway

It is a service that connects an on-prem VM (a virtual appliance) to S3 to migrate/replicate data
This VM can run on VMware ESXi or Microsoft Hyper-V
4 Types
- They can all use DX or S2S VPN or Internet
- File Gateway (uses NFS, only for file based) New Files are stored in S3 as objects. Think of this as an NFS interface to S3. Nothing is stored on-prem.
- Volume Gateway (block based, iSCSI, virtual hard disks where the AWS side is EBS)
  - Stored Volume - Entire copy is stored on-prem with async backups (incremental) to EBS as EBS snapshots. 1GB to 16TB storage limits.
  - Cached Volume - Only the most recently read data is kept on prem (vs. everything like Stored Volume Gateway). All the data is replicated on EBS. 1GB to 32TB storage limits.
- VTL (Virtual Tape Library) or a Tape Gateway. Works with popular tape backup software to act as virtual tapes.

Snowball

Import-Export Disk’s successor
Puts data in S3 (and pulls from it if we want data exported out of AWS)
Called a suitcase at Marqeta :)
Petabyte scale import/export appliance, up to 80TB per device.
Secure physically + encrypted (AES 256). Once the data is xfered, it goes through a complete wipe.
Snowball Edge is up to 100 TB and also has on-device compute capability. For example, the suitcase can run code to pull data in and store it.
Snowmobile is a truck, Exabyte scale data transfer. 100 PB storage limit.

S3 URLs

This is confusing as hell, so there you go :
- S3 Bucket Acceleration URL - {bucketname}.s3-accelerate.amazonaws.com no region
- S3 Object URL - https://s3.amazonaws.com/{bucketname}/{key} no region
- S3 Bucket URL - https://s3-{region}.amazonaws.com/{bucketname}
- S3 Cloudfront Origin URL - bucketname.s3.amazonaws.com no region, no protocol
- S3 Static Website URL (note it is NOT https)- http://{bucketname}.s3-website.{region}.amazonaws.com for the rest.
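
Since the URL forms are easy to mix up, here is a small Python sketch that generates each one from a bucket, region and key exactly as listed above. The function name `s3_urls` is hypothetical, and the formats are copied from these notes rather than from an authoritative endpoint list.

```python
def s3_urls(bucket, region, key):
    """Return the S3 URL forms from the list above for one bucket/region/key."""
    return {
        # No region in these three
        "accelerate": f"{bucket}.s3-accelerate.amazonaws.com",
        "object": f"https://s3.amazonaws.com/{bucket}/{key}",
        "cloudfront_origin": f"{bucket}.s3.amazonaws.com",
        # Region appears in these two
        "bucket": f"https://s3-{region}.amazonaws.com/{bucket}",
        "website": f"http://{bucket}.s3-website.{region}.amazonaws.com",
    }


for name, url in s3_urls("lobster1234-94568", "us-east-1", "index.html").items():
    print(f"{name:18} {url}")
```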

S3 Pricing

S3 Standard is 0.023 per GB
S3 IA is 0.0124 per GB
S3 OneZone-IA is 0.01 per GB
S3 RRS is 0.024 per GB (almost same as S3 Standard)
Glacier is 0.004 per GB

Cloudfront

Amazon’s CDN (Content Delivery Network)
Edge locations are the ones where content is cached, which is not same as AZ or region.
The origin can be S3 bucket, EC2 instance, ELB, or Route53 address. It can also be a non-AWS origin (like a server in a data center).
Distributions
- Web Distribution is for websites
- RTMP (Real Time Messaging Protocol) is used for Media Streaming
A distribution is identified by a domain
Edge Locations can be written to (PUT), this is used in S3 accelerated transfers.
Objects are cached at the edge location for a TTL (Max: 365 days, Default: 24 hours)
It costs to invalidate the cache on an object basis (if we want to do it before the TTL expires)
For performance of GET intensive workloads, use cloudfront
For performance of mixed workloads, hash the S3 key by adding a random prefix to the key name. By doing so, there is no IO Contention on the same partition.
S3 origins look like bucketname.s3.amazonaws.com. (notice no region in the URL).
An Origin Access Identity is set up in the distribution which gives access to Cloudfront to read from the origin S3 bucket. The bucket policy is updated on the origin bucket to allow access to this identity.
Cloudfront uses signed URLs and signed cookies to restrict access to content (similar to pre-signed URLs in S3)
The Cloudfront distribution gets a domain name like foobar.cloudfront.net. Please note that foobar is not the distribution ID.
Cloudfront allows geo whitelisting OR blacklisting of countries. Cannot do both, has to be either.
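
The note above about hashing the S3 key with a random prefix for mixed workloads can be sketched as follows. This is one possible scheme, not an AWS-prescribed one: it derives a short, deterministic prefix from a SHA-256 hash of the key so the original name can still be recomputed. The function name, the 4-character length and the `-` separator are all arbitrary choices for illustration.

```python
import hashlib


def prefixed_key(key, prefix_len=4):
    """Prepend a short hash-derived prefix so keys spread across partitions."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"


# Sequential names end up with unrelated prefixes.
for name in ("2018-11-27-photo-0001.jpg", "2018-11-27-photo-0002.jpg"):
    print(prefixed_key(name))
```

Because the prefix is derived from the key rather than truly random, a reader who knows the original name can rebuild the stored key without a lookup table.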

EC2 - Elastic Compute Cloud

Elastic Compute Capacity in the cloud, pay for the capacity that you use.
Instance Allocation
- On Demand : Pay by hour for windows, by second for linux. No commitment. Great for unpredictable workloads which cannot be interrupted.
- Reserved : Like a contract (think of a phone contract), 1 or 3 years, pay no, partial, or all upfront. Up to 75% off on-demand. Great for predictable, steady state workloads. Types are Standard, Convertible, and Scheduled. 3 years all upfront gets the most savings (75% off). They’re tied to a region.
- Spot : The cheapest option, bid the price you are willing to pay, but use it only if the process can be interrupted. AWS will terminate the instance if the spot price goes higher than your bid. You will not be charged for the hour in which AWS terminates the instance. If you terminate the instance, you pay for the full hour. Used when you have flexible start/end times.
- Dedicated : Non multi-tenant, Bare Metal, used for Regulatory Requirements like Healthcare. Also when software licenses are tied to a host.
Instance Types
- F1 : Genomic research, financial analysis, video processing
- I3 : Storage Optimized, DBs, DW
- G3 : Graphics, video encoding
- H1 : High Disk Throughput, Map Reduce, HDFS
- T2 : Low Cost General Purpose
- D2 : Dense Storage
- R4 : Memory Optimized
- M5 : General Purpose
- C5 : Compute Optimized
- P3 : Graphics, GPU
- X1 : Memory Optimized
To Remember: F (FPGA) I (IOPS) G (Graphics) H (High Disk Throughput) T (cheap GP) D (Density) R (RAM Memory) M (GP) C (Compute) P (Graphics) X (Extreme Memory) = FIGHTDRMCPX
EC2 User Data is used to add bootstrap scripts to the instance. A shell script always starts with a shebang (#!/bin/bash).
To log in to the instance, use ssh -i <path_to_pem> ec2-user@<IP address>. The PEM file needs 400 permissions (chmod 400) so that only the owner can read it.
We can encrypt the root volume (where the OS is installed) using OS level encryption like Windows BitLocker.
Another way to encrypt the root volume is to snap it, encrypt the snap, create an AMI from this snap and use this to launch the EC2.
To retrieve instance metadata or user data from within the instance, the endpoints are http://169.254.169.254/latest/meta-data/ and http://169.254.169.254/latest/user-data

Instance User-Data :

[root@ip-172-31-56-227 log]# curl http://169.254.169.254/latest/user-data
#!/bin/bash
sudo yum update -y

Instance Meta-Data :

[ec2-user@ip-172-31-51-163 ~]$ curl http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
instance-action
instance-id
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-hostname
public-ipv4
public-keys/
reservation-id
security-groups
services/

[ec2-user@ip-172-31-51-163 ~]$ curl http://169.254.169.254/latest/meta-data/instance-id
i-00dfc2841e9b83d1e
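
The same lookups can be done from code running on the instance. Below is a minimal Python sketch using only the standard library; the helper names are hypothetical, and it mirrors the plain GET requests used in the curl examples above, so it assumes the instance accepts unauthenticated metadata requests (instances configured to require session tokens would need an extra step). The `fetch` call only works from inside an EC2 instance.

```python
import urllib.request

METADATA_HOST = "http://169.254.169.254/latest"


def meta_data_url(path=""):
    """URL for a metadata item, e.g. meta_data_url("instance-id")."""
    return f"{METADATA_HOST}/meta-data/{path}"


def user_data_url():
    """URL that returns the bootstrap script passed as user data."""
    return f"{METADATA_HOST}/user-data"


def fetch(url, timeout=2.0):
    """GET the URL and return the body as text (only reachable on an instance)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# On an instance: fetch(meta_data_url("instance-id"))
print(meta_data_url("instance-id"))
print(user_data_url())
```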

EC2 instance roles are created in IAM which eliminate the need for using security credentials (aws access key and secret) to access AWS services.
The role can be changed on a running instance, and the change is effective immediately (just like security groups).
Xen and Nitro are the underlying hypervisors for EC2

Real world - Make sure to click on “i” on each option of the Launch Instance Wizard steps. Lots of nuggets and gotchas there.

Security Groups

A Security Group is a virtual firewall for the EC2 instance, to control traffic to and from the instance. One EC2 can have many security groups associated with it.
By default, a security group would allow all outbound traffic to any destination, any protocol.
Any change made to a security group is applied immediately (like adding/removing ports, etc.)
Security Groups are stateful. When inbound traffic is allowed by a rule, the response traffic is automatically allowed out, regardless of the outbound rules.
Security Group rules are only to allow traffic, not to deny. By default all inbound traffic is denied, all outbound traffic is allowed.
All VPCs get a default security group. This SG has only 1 rule, where all the instances associated with that SG can talk to each other (source is itself)
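
The allow-only, default-deny behavior of inbound rules can be modeled in a few lines of Python. This is a toy model for intuition, not how AWS implements it: the `Rule` class and `inbound_allowed` function are made up, rules match a single port, and statefulness (return traffic) is not modeled.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    protocol: str   # e.g. "tcp"
    port: int       # single port, for simplicity
    cidr: str       # allowed source range


def inbound_allowed(rules, protocol, port, source_ip):
    """Traffic passes only if some rule allows it; there are no deny rules."""
    source = ip_address(source_ip)
    return any(
        rule.protocol == protocol
        and rule.port == port
        and source in ip_network(rule.cidr)
        for rule in rules
    )


rules = [
    Rule("tcp", 22, "10.0.0.0/16"),   # SSH only from inside the VPC range
    Rule("tcp", 443, "0.0.0.0/0"),    # HTTPS from anywhere
]

print(inbound_allowed(rules, "tcp", 443, "203.0.113.9"))
print(inbound_allowed(rules, "tcp", 22, "203.0.113.9"))
```

An empty rule list denies everything, which matches the default of all inbound traffic being denied.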

EBS

EBS is a virtual disk. EBS Volumes can be mounted to an EC2 instance. They belong to 1 availability zone and are replicated across multiple physical disks.
- GP2 : General Purpose SSD, baseline of 3 IOPS per GB up to a maximum of 10K IOPS, which is reached for volumes 3334 GB and above. Smaller volumes can burst up to 3000 IOPS for extended periods of time.
- Provisioned IOPS : More than 10K IOPS, can provision up to 20K IOPS per volume
- ST1 Throughput Optimized HDD : Cannot be boot volumes. DW, Logs are good use cases.
- SC1 Cold HDD : Cannot be boot volumes. Good for cold storage. Lowest cost.
- Standard Magnetic : Legacy, can be bootable. cheapest bootable
The min volume size for HDD (ST1 and SC1) is 500GB
EBS volumes for an instance are in the same AZ. You can only mount the EBS volumes in the same AZ as the EC2 instance.
EBS volume types and sizes can be changed on the fly, without any downtime. There is a performance hit for a bit.
To move EBS volumes across AZs or Regions, use EBS snapshots.
Use Snapshot Copy to create a copy in a different region. Use Create Volume to create a new volume in a different AZ.
To encrypt an unencrypted EBS volume:
- Create a snapshot
- Copy the snapshot and select encryption
- Create a volume from this encrypted snapshot
An EBS volume can only be set up as an encrypted volume at creation time.
EBS snapshots sit on S3, and are incremental
Snapshots of an encrypted EBS volume will always be encrypted.
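Snapshots can be shared with other accounts and can be made public. Encrypted snapshots cannot be made public, and can only be shared with other accounts when they are encrypted with a customer managed key (not the default key).
Root device types can be EBS backed or instance store backed.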
Instance Store backed instances cannot be stopped and started, can only be rebooted.
Instance stores are ephemeral.
An instance can have many EBS volumes attached, but an EBS volume can be attached to only 1 instance at any time. There is no such thing as shared EBS, for that requirement, consider EFS.
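
The GP2 numbers noted earlier in this section (3 IOPS per GB, a 10K ceiling, and the 3334 GB figure) fit together as a simple formula. Here is a hedged Python sketch: the 100 IOPS floor for small volumes is my recollection of the GP2 minimum, the 10K ceiling is the figure from these notes and may since have been raised, and the function name is hypothetical.

```python
GP2_IOPS_PER_GB = 3
GP2_MIN_IOPS = 100       # assumption: floor applied to small volumes
GP2_MAX_IOPS = 10_000    # ceiling as given in these notes


def gp2_baseline_iops(size_gb):
    """Baseline IOPS for a GP2 volume of the given size in GB."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return min(max(GP2_IOPS_PER_GB * size_gb, GP2_MIN_IOPS), GP2_MAX_IOPS)


for size in (20, 100, 1000, 3334, 8000):
    print(size, gp2_baseline_iops(size))
```

This also shows where 3334 GB comes from: it is the smallest whole-GB size at which 3 IOPS per GB reaches the 10K ceiling.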

Load Balancers

Elastic Load Balancers - allow us to balance the load between different servers.
- Application Load Balancer : Layer 7. They support advanced request routing based on HTTP request characteristics like path, headers, etc.
- Network Load Balancer : Very High Performance, Layer 4, Most expensive. They support millions of requests per second.
- Classic Load Balancer : Dumber Layer 7, Legacy. Also supports Layer 4. The only thing supported at Layer 7 is X-Forwarded-For and sticky sessions.
ELB responds with HTTP 504 Gateway Timeout when the application does not respond.
The DNS names for the load balancers are {LB-name}.{region}-elb.amazonaws.com
The healthcheck statuses for instances behind LB can be InService or OutOfService.
When a health check for an instance fails, the load balancer stops sending traffic to that instance.

Cloudwatch

Basic monitoring sends metrics every 5 minutes, detailed monitoring can send every 1 minute but that costs extra.
Standard EC2 metrics (by default) are CPU usage, disk IO, network IO, CPU credits, Status checks.
Metrics like RAM, Disk utilization, swap usage, etc. would need creation of custom metrics.
CloudWatch Alarms have 3 states - INSUFFICIENT_DATA, OK and ALARM
CloudWatch Events let you set up rules to trigger actions - like AWS Batch job completion can trigger a lambda.
CloudWatch Logs act as a central location for all logs (like lambda System.out output, etc.)

Auto Scaling Groups

An autoscaling group = Launch Configuration + Scaling Policies.
Launch Templates are newly announced, but Launch Configurations have been around the longest
A Launch Configuration has the AMI, Instance Type, Instance details like IAM role, user data, IP allocation details, as well as storage and security groups.
Once a Launch Configuration is created, you can create an AutoScalingGroup. This is where details like VPC, subnets, number of instances, load balancer are entered.
Scaling policies are set up in ASG to tie cloudwatch alarms with autoscaling activities.
The EC2s are launched as soon as an ASG is created.
A Launch Configuration cannot be modified. However, it can be copied as a new configuration.
An ASG can be modified at any time and even the Launch Configuration associated with it can be changed.
Deleting an ASG will terminate the EC2 instances but will not delete the launch configuration associated with it.
ASG can also be created from an EC2 instance, where all the EC2 instance information is used to create a launch configuration, with these limitations -
- The block device information from the AMI is copied over, but not the devices that were attached after the instance launch.
- Tags are not copied over to the ASG
- The load balancer is not copied over (if the instance was behind one) in the ASG.
Scaling options -
- Manual : Update the min, max and desired number of instances manually.
- Scheduled : Good for predictable scaling needs/traffic patterns. A cron pattern can be specified for a schedule for recurring events.
- Dynamic : The most advanced and most common option, driven by cloudwatch alarms on metrics.
Multiple scaling policies can be associated with an ASG.
Termination Policies - decide which instance the ASG terminates first when it scales down.
Default Termination Policy

EFS

Elastic File System, AWS’s NFS (NFSv4) filesystem which is Petabyte scale and scales up on demand.
An EFS is provisioned in multiple AZs and gets a (private) IP per AZ. The instances in each AZ mount the filesystem using that IP address.
EFS offers higher aggregate throughput than EBS PIOPs because it scales across many instances, although per-operation latency is lower on EBS.
Pay for the storage used
Read after Write consistency model
Data is stored across multiple AZ within the same region.

Placement Groups

Two types - Clustered and Spread
Clustered Placement Group has been around for long, where instances share the same AZ. This is for low network latency and/or high network throughput. Only certain instance types (memory optimized, compute optimized, network optimized) can be launched in a clustered placement group.
Spread Placement Group ensures the instances are on different underlying hardware, and it can span multiple AZs.
AWS recommends having the same instance types in a clustered placement group.
You cannot move an existing instance to an existing placement group. You can however launch an AMI from that instance into the placement group.

Lambda

Announced at re:Invent 2014
Amazon’s Event Driven Compute Service where a function is run without the customer needing to provision any servers.
Lambda has many event sources -
- Cloudwatch (events, logs, alarms)
- S3
- SNS
- API Gateway via HTTP requests
- DynamoDB
- IoT
- Alexa Skills
- Kinesis
- Cloudfront
- SQS
- SES
- CodeCommit
- Cognito
- CloudFormation
Lambda scales out automatically, and runs concurrently (as the events occur), default limit is 1000 concurrent executions.
Lambda can be tied to a VPC, security group(s), and IAM role.
API Gateway is used to trigger Lambdas as a response to HTTP requests.
Each HTTP request translates into one lambda. In other words, 1 lambda function is not shared between multiple requests. This is 100% stateless by design.
Lambda supports Node, C#, Java, Python
Lambda free tier is 1M requests, and 20c per 1M requests thereafter.
For billing, lambda execution time is rounded up to the nearest 100ms, and memory is rounded to 128MB.
Lambda can only run for 5 min max (recently 15 min), and max memory is 3008MB
A failed asynchronous invocation (like SNS) is retried twice with a delay in between (so a total of 3 attempts). A synchronous invocation is not retried automatically; the error (for example a 429 when throttled) is returned to the caller.
Lambda based systems can get pretty complex when it comes to debugging/troubleshooting. AWS X-Ray helps with that.
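The billing notes above can be sketched as a small calculation. This is only an illustration using the figures from these notes (2018 pricing, 100ms rounding); the function names are made up, and it covers the request charge only, not the duration (GB-second) charge.

```python
import math

# Figures taken from the notes above (2018 pricing); treat them as assumptions.
FREE_REQUESTS = 1_000_000          # free tier requests per month
PRICE_PER_MILLION_REQUESTS = 0.20  # USD per 1M requests after the free tier
BILLING_STEP_MS = 100              # duration is rounded up to this step


def billed_duration_ms(actual_ms: float) -> int:
    """Round an execution time up to the next 100 ms step."""
    return math.ceil(actual_ms / BILLING_STEP_MS) * BILLING_STEP_MS


def request_charge(requests: int) -> float:
    """Monthly request charge in USD, after subtracting the free tier."""
    billable = max(0, requests - FREE_REQUESTS)
    return billable / 1_000_000 * PRICE_PER_MILLION_REQUESTS
```

So a function that runs for 101ms is billed as 200ms, and 3M requests in a month cost about 40c in request charges.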

Route 53

Route 53 is Amazon’s DNS Service, allowing domain name mapping to EC2s, Load Balancers and S3 buckets.
Route53 is a global service, just like IAM.
ipv4 has 32 bit space, ipv6 is 128 bits
The last label in any DNS name is the top level domain name (.com, .gov, .in); the label before it is the second level domain name (for example, the co in .co.in)
The domain names are registered via domain name registrars (amazon, godaddy, wix, etc.) with InterNIC which maintains the whois database.
SOA record (Start of Authority) has the info for TTL (seconds), zone admin, zone server.
NS records are the name server records.
The resolver (usually at the ISP) asks the top level domain servers for the NS records of the domain, which point to its name servers. The resolver then contacts one of those name servers, which is authoritative for the zone (described by the SOA record) and returns the A (Address) record with the IP address.
Cnames are used to convert one domain to another - like aliases.
Alias records are unique to Route 53, they’re just like cnames.
A cname cannot be used for naked domain names. That’s why AWS came up with alias records for route53.
Naked Domain Name == Zone Apex Record, is a domain name without the www
Alias Record Set == CNAME record set, which is created for an AWS Resource. It is only supported by A and AAAA (ipv6) DNS record types. Alias target can be ALB, CLB, NLB, Cloudfront Distribution, S3 website.
Routing Policies
- Simple : This is the default routing policy. No intelligence, just a simple resolution to a resource like a web server LB.
- Weighted : Split traffic by assigning weights. Can be used for DR tests, canaries.
- Latency : Route traffic based on the lowest latency for the end user location. Latency Resource Record sets are needed in route53 for this.
- Failover : For active/passive (usually DR) setup. This utilizes healthchecks on the primary site.
- Geolocation : Send traffic to localized servers based on the user’s geo location.
- Multivalue Answer : Works like a load balancer, where multiple targets are (optionally) health-checked and are returned as multiple IPs randomly.
- Geoproximity : Send traffic to the nearest resource based on the client’s location. Needs Route53 traffic flow enabled.
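To make the Weighted policy concrete, here is a minimal sketch of how a weight split behaves. It is a local simulation, not the Route 53 API; the record names and the `pick_weighted` helper are hypothetical.

```python
import random


def pick_weighted(records: dict[str, int], rng: random.Random) -> str:
    """Pick one target with probability weight / sum(weights)."""
    targets = list(records)
    weights = [records[t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


# Hypothetical canary setup: 90% of lookups go to blue, 10% to green.
canary = {"blue.example.com": 90, "green.example.com": 10}
rng = random.Random(42)
answers = [pick_weighted(canary, rng) for _ in range(10_000)]
green_share = answers.count("green.example.com") / len(answers)
```

With these weights roughly one lookup in ten resolves to the green target, which is what makes weighted records handy for canaries and DR tests.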
There is a soft limit of 50 domain names.
Be sure to read this

RDS

Relational Database Service for OLTP
6 Database Engines - Aurora, MySQL, MariaDB, Oracle, MS SQL Server, Postgres.
Non Relational Databases or NoSQL Databases have a Collections (Tables), Documents (Rows) and Key-Value Pairs (Fields). The documents may be nested. The structure of the document is not fixed (schemaless). DynamoDB is Amazon’s NoSQL database.
Data Warehousing is used for BI. It is used to perform complex operations on complex data sets, which are very data intensive.
OLTP - Online Transaction Processing, typically small writes and reads, but happen very frequently.
OLAP - Online Analytical Processing, is like Data Warehousing. Very different architecture and infrastructure than OLTP. AWS has Redshift as the DW database.
AWS always gives an instance (service) endpoint which is a DNS address for the DB instance, never an IP.
The DB Security Group needs to allow inbound traffic on the database port (3306 for MySQL) from the security group of the EC2 instance that is trying to establish a connection.
There are 2 types of backups - Automated backups and Database snapshots. Automated backups are retained for 1-35 days; snapshots are kept until you delete them.
Automated backups take a full daily backup and store the transaction logs. For recovery, AWS chooses the most recent daily backup and applies the transaction logs. This allows for point in time recovery within the retention window.
Automated backups are enabled by default, and stored in S3 (free storage). The backup runs during a defined window, during which storage IO may be suspended. They are deleted if the RDS instance is deleted.
DB snapshots are manual. They survive the RDS instance deletion.
The restored RDS instance will have a new DNS and will be a new RDS instance.
Encryption at rest is supported for all RDS DB engines.
Encryption uses AWS KMS. If the RDS instance is encrypted, the backup, snapshots and replicas are also encrypted. Encryption has to be defined at instance creation time. For an existing DB, take a snapshot, copy it with encryption enabled, and restore the encrypted copy to create a new, encrypted instance. (Just like what we’d do with EBS)
Snapshots can be copied across regions
Multi AZ means a copy of a database (standby) in a different AZ which is replicated synchronously. This is for DR only. The instance automatically fails over to the standby in another AZ in the event of a failure. The DNS endpoint stays the same and now resolves to the IP of the standby database.
Read Replicas are for performance (not the standbys) - They are read-only copies of the master, which are replicated asynchronously. This is ideal for read-heavy workloads. They are not available for MS SQL Server and Oracle. They’re used for scale-out. Up to 5 read replicas are possible.
Read replicas can be promoted to be their own databases, but this breaks replication.
Read replicas can be in a completely different region.
Read replicas follow the encryption of the master: an encrypted read replica of an unencrypted master (or the reverse) is not supported.

Elasticache

Elasticache - Managed in-memory storage. Supported engines are Memcached and Redis.
Used to improve performance of read-heavy applications by providing low latency access.
Memcached is not multiAZ, Redis is. Redis also supports Master-Slave replication.
Memcached cluster can be scaled out just like an ASG
Redis supports rich data structures like lists, hashes, sets and provides persistence, pub-sub, and multi-AZ with failover just like RDS.

DynamoDB

Amazon’s NoSQL database
You pay for read and write provisioned capacity + the storage.
Data is replicated across 3 data centers
Reads can be eventually consistent or strongly consistent
The largest record size is 400KB
Read Capacity = 1 unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads per second. So if the items are full size (400KB) then we’d need a read capacity of 100 to read 1 such item in a second with strong consistency. Eventually consistent reads cost half as much, so the same throughput (1 full size record to be read in a second) needs only 50 units.
Write Capacity = Number of items which are 1 KB in size that can be written in a second. So just like the above case, if we have 1 full size item (400KB) that needs to be written, then we’d need a write capacity of 400 to be able to write it in 1 second. If we need to write 2 such items in a second, then it’d be 800 (2 items x 400 units).
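The capacity arithmetic can be checked with a short sketch. It assumes the documented unit definitions (one read unit = one strongly consistent read per second of up to 4 KB, or two eventually consistent reads; one write unit = one write per second of up to 1 KB); the function names are made up for illustration.

```python
import math


def read_capacity_units(item_kb: float, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """Read units needed to read `reads_per_sec` items of `item_kb` each."""
    units_per_read = math.ceil(item_kb / 4)      # 4 KB per read unit
    total = units_per_read * reads_per_sec
    # Eventually consistent reads cost half of strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb: float, writes_per_sec: int) -> int:
    """Write units needed to write `writes_per_sec` items of `item_kb` each."""
    return math.ceil(item_kb) * writes_per_sec   # 1 KB per write unit
```

For a full size 400KB item this gives 100 read units (strongly consistent), 50 read units (eventually consistent), and 400 write units per item written per second.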
A local secondary index can only be created at table creation time.
A local secondary index has the same primary partition key as the main table and can have a different sort key.
Global secondary index is the one where the index primary key can be different than the primary partition key of the main table, and of course the sort key can be different as well.
There is a 10GB limit per item collection (sum of size of all items that share the same partition key value, in the table plus its local secondary indexes)
LSI share the provisioned read write capacity of the main table.
GSI need their own read write capacity, so think of GSI as another table of its own.
Changes to DynamoDB tables can be streamed using DynamoDB streams. These streams live in shards for 24 hours (like Kinesis).

Redshift

Amazon’s OLAP service, fully managed data warehouse
Can be single node with 160GB data
To scale, use multi node where there is a leader node with compute nodes to do the work (up to 128 compute nodes)
Redshift organizes the data based on columns (column based system).
You’re not charged for a leader node, only the compute nodes.
Redshift is only available in 1 AZ
Can be snapshotted and copied to another AZ
Redshift Spectrum allows running SQL on exabytes of unstructured data in S3. No ETL needed.
Supports AES256 for data encryption at rest
Redshift attempts to maintain at least 3 copies of the data.
Redshift can asynchronously replicate data to S3 in another region for DR
Backup retention is same as RDS - 1 to 35 days

VPC

A virtual datacenter in the cloud (Virtual Private Cloud).
Number of IPs in a CIDR notation is 2^(32-N). So, a /32 is 2^(0) which is 1.
/16 is the largest VPC, and smallest is /28.
A region has a soft limit of 5 VPCs.
A VPC is divided into subnets, where 1 subnet can only be in 1 AZ (a subnet cannot spread across AZs).
Route Tables control traffic between subnets.
Internet Gateways (1 per VPC) are used to provide internet (in+out) to a subnet by adding a route to IGW in that subnet’s route table.
We get a /16 default VPC in each region, where all /20 subnets are public.
VPCs can be peered (even between different accounts and regions). The peering is in a star configuration where there is no transitive connectivity.
NACLs (Network Access Control Lists) sit at the subnet level, and are stateless.
A VPC can be on a dedicated tenancy, where all the instances that are launched in this VPC will use dedicated hardware.
When we create a new VPC, it will create a default NACL, default security group, default route table (called main), but no subnets or gateways.
The default NACLs allows all inbound and outbound traffic, default SG allows all inbound within the same security group and allows all outbound to anywhere, default route table (main) allows all traffic within the VPC.
We lose 5 IP addresses per subnet. These are first 4 and last 1 (.0 network address, .1, .2 and .3 are reserved and .255 is the broadcast).
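As a quick check of the CIDR arithmetic, here is a small sketch using Python's standard `ipaddress` module. The `usable_ips` helper and the example CIDR blocks are illustrative only.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr: str) -> int:
    """Addresses left in a subnet after AWS reserves its five."""
    network = ipaddress.ip_network(cidr)
    # num_addresses equals 2 ** (32 - prefix_length) for IPv4.
    return network.num_addresses - AWS_RESERVED_PER_SUBNET
```

A /24 subnet has 2^(32-24) = 256 addresses, so 251 are usable; the smallest allowed /28 has 16 addresses and only 11 usable ones.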
For outbound internet access for private subnets, we need to route the traffic to either NAT gateway or NAT instance.
NAT instances are legacy, and are created from a community AMI. They’re placed in a public subnet with a security group that allows HTTP traffic inbound and all traffic outbound. Remember to turn off the source/dest check on the instance. By default all the EC2s only allow traffic that either originates or terminates at them.
NAT instances are difficult to scale up and out using the traditional ASG setup.
Use NAT Gateways. They’re managed by AWS and are highly available within their AZ. They’re also created in a public subnet. They do not sit behind a security group. They scale automatically up to 10Gbps.
An egress-only internet gateway is similar to a NAT gateway, except it’s for ipv6.
It is a good practice to create 1 NAT Gateway per AZ for AZ failure isolation.
NACLs sit at the subnet level (which sits at the AZ level). There can only be 1 NACL per subnet, but multiple subnets can be associated with a NACL.
NACLs are stateless, so you’d need to explicitly allow outbound traffic when you enable inbound traffic on a port.
A newly created (custom) NACL denies all inbound and outbound traffic (unlike the default NACL, which allows all).
The rules are evaluated in the increasing order (AWS recommends increments of 100) and are first match exit.
The * rule in a NACL rule set is the final catch-all deny, applied when no numbered rule matches.
Ephemeral ports are super important - they’re ports from 1024-65535, which are used as short lived ports for the client. The client picks these ports to expect the response on. They need to be opened up for outbound traffic.
The changes to NACLs take effect immediately.
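The rule evaluation order described above can be sketched as follows. This is a simplified model that matches on port only (real NACL rules also match protocol and CIDR); the `Rule` class and `evaluate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int      # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules: list[Rule], port: int) -> bool:
    """Return True if traffic on `port` is allowed, first match wins."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # the implicit "*" rule: deny when nothing matched


rules = [
    Rule(200, 0, 65535, allow=False),   # broad deny, evaluated second
    Rule(100, 443, 443, allow=True),    # HTTPS allow, evaluated first
]
```

Here port 443 is allowed because rule 100 matches before rule 200 is ever looked at, while port 22 falls through to the deny in rule 200.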
The internet facing ALBs need to be in at least 2 AZs, and both of the subnets have to be public.
VPC flow logs allow capturing IP traffic going to/from the network interfaces in the VPC. They are stored as cloudwatch logs. They’re created at VPC, Subnet, or ENI level.
Flow logs can include peered VPC only if the peered VPC is in the same account.
VPC endpoints allow access to AWS services (S3, SQS, SNS..) via the AWS backbone, bypassing the internet.
We get 5 Elastic IP Addresses per region (soft limit). These are static IPs that can be detached and attached to another resource. For example, NAT Gateway gets an EIP.

SQS

A queueing service that acts as a buffer for messages between producers and consumers. Oldest AWS service. Used for decoupled architecture.
SQS supports encryption at rest.
Standard Queue -
- At least once delivery.
- Higher throughput
- No guaranteed order
FIFO Queue -
- More $$
- Less Throughput (300 TPS)
- Exactly Once Delivery
- Retains order
SQS is a pull based system
256 KB per message
Default retention period is 4 days but can be maxed at 14 days, min is 1 minute.
Default visibility timeout is 30 seconds, can be maxed at 12 hours.
Visibility timeout is the amount of time the message is hidden (in flight) from other consumers.
SQS Long polling is a good way to save costs, as it holds the connection open until a message shows up or the timeout happens. The max is 20 seconds. (Receive message wait time)
Delivery Delay can be set up on the queue (0-15 mins)
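The visibility timeout is easier to see in a toy model. The sketch below is an in-memory simulation with an explicit clock, not the SQS API; `ToyQueue` and its methods are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    visibility_timeout: int = 30                    # seconds, the SQS default
    _messages: dict = field(default_factory=dict)   # body -> time it is visible again

    def send(self, body: str) -> None:
        self._messages[body] = 0

    def receive(self, now: int):
        """Return one visible message and hide it for the visibility timeout."""
        for body, visible_at in self._messages.items():
            if visible_at <= now:
                self._messages[body] = now + self.visibility_timeout
                return body
        return None

    def delete(self, body: str) -> None:
        """A consumer deletes the message once it has been processed."""
        self._messages.pop(body, None)
```

A message received at t=0 is invisible to other consumers until t=30. If the consumer has not deleted it by then, it becomes visible again, which is why standard queues are at-least-once.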

SWF

Simple WorkFlow service that makes it easy to coordinate tasks between machines and humans with a task oriented API.
For example, this is what drives order fulfillment in the amazon.com shopping experience.
Workflows can last for as long as 1 year
SWF Starter is the actor that starts the workflow
SWF Decider is a program that controls the coordination of tasks.
SWF Activity Worker is the code that executes to perform that task.
SWF assigns a task to only one worker (no duplication at all)
SWF Domain is the container that contains the related workflows, their tasks, etc.

SNS

Simple notification service
Push based, to send notifications, or act as a trigger for some other processing.
SNS supports Email, Email-JSON, HTTP/S, SMS, Lambda, Application, and SQS as transport protocols.
All messages published are stored redundantly across multiple AZs, and can be encrypted at rest (very recently announced).
An SNS topic acts as an endpoint for pushing messages to consumers.

API Gateway

A service that provides managed, secure HTTP interface to a lambda function (or LB, or EC2)
Scales Automatically
Supports API response caching, so that during the TTL the requests never hit the backend.
Supports Throttling to ensure the back end is not flooded with requests (like the database).
CORS would need to be enabled on the API Gateway, so pages from another domain can access the API Gateway Endpoints.

Kinesis

A service that allows ingesting, storing and processing streaming data.
Three Services
- Kinesis Streams : Stores streaming data for 1-7 days in shards. Consumers pull data from the shards. Then they can send this data to be stored in DynamoDB, S3, redshift, etc. Capacity of the stream is the sum of the capacity of the shards.
- Kinesis Firehose : No need for shards (so no retention). Data can be (optionally) analyzed with lambda and stored in S3, ES cluster, redshift (via S3)
- Kinesis Analytics : Allows SQL queries to be run on Kinesis Streams as well as Kinesis Firehose.
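Since the capacity of a stream is the sum of its shards, sizing is simple arithmetic. The sketch below assumes the per-shard limits of 1 MB/s in and 2 MB/s out; those numbers are not in these notes, so verify them against the Kinesis documentation. The function names are made up.

```python
import math

# Assumed per-shard limits (not from the notes above).
SHARD_WRITE_MB_PER_SEC = 1
SHARD_READ_MB_PER_SEC = 2


def stream_capacity(shards: int) -> tuple[int, int]:
    """Total (write, read) capacity in MB/s for a stream."""
    return shards * SHARD_WRITE_MB_PER_SEC, shards * SHARD_READ_MB_PER_SEC


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Smallest shard count that covers both the write and the read rate."""
    for_writes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    for_reads = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    return max(for_writes, for_reads, 1)
```

Under these assumptions a producer pushing 5 MB/s needs at least 5 shards, regardless of how little the consumers read.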

Well Architected Framework Pillars

Security

Data protection at rest and transit, Privilege Management, Infrastructure Protection, Detective Controls
Apply security at all levels (NACL, Security Groups, WAF, IAM policies)
Enable traceability (Cloudtrail, GuardDuty, Config, Cloudwatch)
Focus on security (Hardened AMIs, KMS encryption, S3 versioning, MFA deletes, IAM Password policies and MFA)
Automate Security (Cloudwatch alarms)
Shared responsibility model
- AWS is responsible for security of the cloud (AWS Global Infrastructure, Physical Infrastructure)
- Customer is responsible for security in the cloud (AMIs, Data Encryption, O/S, IAM, Application Data)

Reliability

Test recovery procedures (chaos engineering)
Automated recovery (ELB, multi-AZ, Route53)
Scale horizontally (ELB)
Stop guessing capacity (Autoscale) and know the service limits
Assume failures
Change Management in AWS
Backup, recovery, RPO, RTO
IaC (Infrastructure as Code), Fault Injection Queries of Aurora

Performance Efficiency

Evolve the platform as AWS evolves theirs
Use managed services for PaaS and IaaS
Go Global
Use Serverless
Experiment often
Across Storage, Network, Database and Compute - pick the right options across the stack.
Focus on reducing latency across the stack, and make it predictable.

Cost Optimization

Reduce the cost to run infrastructure
Transparent expenses (use tags, budgets, billing alerts, consolidated billing)
Use managed services
Pay for what you use, make resources idle when not in use (compute, autoscale, lambda)
Economies of Scale
Do not overlook data xfer charges

Operational Excellence

Perform operations with code
Apply monitoring and collect metrics
Make incremental changes
Prepare - Maintain runbooks and playbooks (cloudformation)
Change Visibility and Configuration Management - AWS Config, Cloudwatch, Tagging
Focus on No downtime deployments, focus on CI/CD
Have an automated rollback plan before making changes

AWS Organizations

AWS allows linking and managing multiple accounts together, centrally.
There is a paying account (root) and other accounts linked to it.
Consolidated billing aggregates expenses across accounts per service and you’re sent one, consolidated bill.
Volume pricing discount as multiple account usage adds up for lower pricing tier.
Reserved instances that are unused in one linked account can be applied to matching usage in another linked account
No resources should be deployed in the paying (root) account.
There is a soft limit of 20 linked accounts.
The organizations allow using SCP (Service Control Policy) which can be used to control the AWS services that the linked accounts can use.
SCP will override IAM: an action that the SCP does not allow is denied even if an IAM policy allows it.
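One way to think about SCPs is that the effective permissions in a linked account are the overlap of what the SCP allows and what IAM allows. The sketch below is a deliberately simplified model (service names as plain strings, no explicit denies or conditions); `effective_permissions` is a hypothetical helper.

```python
def effective_permissions(iam_allowed: set[str], scp_allowed: set[str]) -> set[str]:
    """Services usable in a linked account: allowed by IAM AND by the SCP."""
    return iam_allowed & scp_allowed


# Hypothetical example: IAM grants three services, the SCP permits only two.
iam = {"s3", "ec2", "redshift"}
scp = {"s3", "ec2"}
usable = effective_permissions(iam, scp)
```

Here `redshift` is unusable even though IAM allows it, and an SCP alone grants nothing: a service the SCP permits still needs an IAM policy.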

Cloudformation Structure

Resources - Define the resources to be created
Parameters - Parameters taken in to create the resources
Mappings - Used to map key values in the template
Outputs - Return the resources created after running the template
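The four sections fit together as in the minimal sketch below, written as a Python dict and serialized to JSON. The logical names (`InstanceType`, `RegionMap`, `WebServer`, `InstanceId`) and the AMI ID are placeholders, not values from a real account.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    # Parameters: values supplied when the stack is created.
    "Parameters": {
        "InstanceType": {"Type": "String", "Default": "t2.micro"},
    },
    # Mappings: static lookup table, here region -> AMI (placeholder ID).
    "Mappings": {
        "RegionMap": {"us-east-1": {"AMI": "ami-00000000"}},
    },
    # Resources: what actually gets created.
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": {"Ref": "InstanceType"},
                "ImageId": {
                    "Fn::FindInMap": ["RegionMap", {"Ref": "AWS::Region"}, "AMI"]
                },
            },
        },
    },
    # Outputs: values returned once the stack is created.
    "Outputs": {
        "InstanceId": {"Value": {"Ref": "WebServer"}},
    },
}

template_json = json.dumps(template, indent=2)
```

Only the Resources section is mandatory; the other three are optional helpers around it.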

Whitepapers

FAQs, APIs/CLIs and Documentation

Finally

Before you schedule the exam, measure your confidence in these areas in particular -

Ins and outs of VPC, EBS and S3.
Highly available, fault tolerant and cost effective architectures (and a combination of these)
Disaster Recovery