BECE355L - AWS for Cloud Computing
Module III: AWS Database Services
Course Contents: Module 3: AWS Database services; Module 4: AWS Security and compliance; Module 5: AWS Architectural Best practices; Module 6: AWS Operational Experiences; Module 7: AWS networking and content delivery
• AWS Lambda is very helpful when you know how to write code but do not know how to provision the underlying infrastructure in AWS. AWS Lambda scales applications up rapidly when there is sudden incoming traffic and scales down to zero when the incoming traffic subsides.
How AWS Lambda works? Topic 1: AWS Lambda
• AWS Lambda is an event-driven, serverless
computing platform provided by Amazon as a
part of Amazon Web Services. It is a computing
service that runs code in response to events and
automatically manages the computing resources
required by that code.
• We can trigger Lambda from over 200 AWS services, which means Lambda has incredible, natural integrations with AWS resources and is also able to integrate with SaaS applications.
• Run code without provisioning or managing infrastructure. Simply write and upload code as a .zip file or container image. Automatically respond to code execution requests at any scale, from a dozen events per day to hundreds of thousands per second.
• Save costs by paying only for the compute time you use, billed per millisecond, instead of provisioning infrastructure upfront for peak capacity. We call this pay-as-you-go.
• Optimize code execution time and performance with the right function memory size. Respond to high demand in
double-digit milliseconds with Provisioned Concurrency.
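To make the event-driven model concrete, here is a minimal sketch of a Lambda handler for the Python runtime. The event field and the greeting logic are hypothetical; the actual payload shape depends on which of the 200+ services triggers the function.

    import json

    def lambda_handler(event, context):
        """Entry point invoked by the Lambda runtime for each event.

        'event' carries the trigger payload (for example, an API Gateway request
        or an S3 notification); 'context' exposes runtime metadata such as the
        remaining execution time.
        """
        # Hypothetical payload field; real events depend on the triggering service.
        name = event.get("name", "world")

        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"Hello, {name}!"}),
        }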
Key Features and Integration: Amazon DynamoDB Topic 2: Amazon DynamoDB
• Configuring Key Features: Configure core features such as global tables for multi-region replication,
encryption at rest for security, and on-demand capacity mode for automatic scaling based on demand.
• Data Management: Use NoSQL Workbench to design and visualize tables, and PartiQL for SQL-like
querying. Enable point-in-time recovery to back up and restore data as needed.
• Integrating with AWS Services: DynamoDB integrates with other AWS services seamlessly. Export
data to Amazon S3 for backup. Use AWS Glue Elastic Views for materialized views and Amazon Kinesis
Data Streams for real-time processing. Monitor activities and performance using AWS CloudTrail and
Amazon CloudWatch.
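As a rough illustration of these features, the following boto3 sketch creates a table in on-demand capacity mode and queries it with PartiQL. The table name, key schema, and customer ID are assumptions for the example.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Create a table in on-demand (PAY_PER_REQUEST) capacity mode.
    dynamodb.create_table(
        TableName="Orders",                       # hypothetical table name
        AttributeDefinitions=[
            {"AttributeName": "CustomerId", "AttributeType": "S"},
            {"AttributeName": "OrderId", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "CustomerId", "KeyType": "HASH"},   # partition key
            {"AttributeName": "OrderId", "KeyType": "RANGE"},     # sort key
        ],
        BillingMode="PAY_PER_REQUEST",            # on-demand capacity mode
    )
    dynamodb.get_waiter("table_exists").wait(TableName="Orders")

    # SQL-like access with PartiQL.
    result = dynamodb.execute_statement(
        Statement="SELECT * FROM Orders WHERE CustomerId = ?",
        Parameters=[{"S": "C-1001"}],
    )
    print(result["Items"])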
Auto Scaling: Amazon DynamoDB Topic 2: Amazon DynamoDB
• Amazon DynamoDB auto scaling uses the AWS Application Auto Scaling service to automatically adjust provisioned throughput capacity on your behalf in response to actual traffic patterns.
• This lets a table or global secondary index increase its provisioned read and write capacity to accommodate sudden increases in traffic without throttling.
• When the workload falls, Application Auto Scaling reduces the throughput so that you won't be charged for unused provisioned capacity.
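A minimal boto3 sketch of wiring a DynamoDB table to Application Auto Scaling is shown below. The table name, capacity bounds, and 70% utilization target are illustrative assumptions, not prescribed values.

    import boto3

    autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

    # Register the table's read capacity as a scalable target (hypothetical table name).
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # Target-tracking policy: keep consumed read capacity near 70% of provisioned capacity.
    autoscaling.put_scaling_policy(
        PolicyName="OrdersReadScaling",
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    )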
Use cases: Amazon DynamoDB Topic 2: Amazon DynamoDB
Develop software applications
• Build internet-scale applications supporting user-content metadata and caches that require
high concurrency and connections for millions of users and millions of requests per second.
Create media metadata stores
• Scale throughput and concurrency for media and entertainment workloads such as real-time
video streaming and interactive content, and deliver lower latency with multi-Region
replication across AWS Regions.
Deliver seamless retail experiences
• Use design patterns for deploying shopping carts, workflow engines, inventory tracking, and
customer profiles. DynamoDB supports high-traffic, extreme-scaled events and can handle
millions of queries per second.
Scale gaming platforms
• Focus on driving innovation with no operational overhead. Build out your game platform
with player data, session history, and leaderboards for millions of concurrent users.
• What is Amazon DynamoDB used for?
• Amazon DynamoDB is a NoSQL database designed for high-performance
applications at any scale. It offers high read/write throughput with single-
digit millisecond performance and limitless scalability across multiple
regions, supporting horizontal scaling and fully managed automation.
• What are the advantages of Amazon DynamoDB?
• Amazon DynamoDB's key advantages include being a fully managed, scale-
to-zero serverless database with single-digit millisecond performance and up
to 99.999% availability. It provides consistent performance at scale, built-in
security, durability, and reliability for global applications.
• What are the main benefits of using Amazon DynamoDB?
• Amazon DynamoDB is a fully managed, serverless NoSQL database
offering limitless scalability, active-active data replication for multi-region
resiliency, and consistent single-digit millisecond response times. It is easy
to start and use, ideal for demanding applications.
Amazon ECS (Elastic
Container Service)
Introduction: Amazon ECS Topic 3: Amazon ECS (Elastic Container Service)
1. Container:
• Definition: A container is a lightweight, standalone, and executable software package that
includes everything needed to run a piece of software, including the code, runtime, system
tools, libraries, and settings.
• Key Feature: Containers are isolated from each other and the host system, which ensures
that the software will run consistently across different environments.
• Example: Think of a container as a virtualized package that can run your application on any
machine without worrying about the underlying hardware or operating system.
2. Docker:
• Definition: Docker is a platform that allows you to develop, ship, and run applications inside
containers. It simplifies the process of building and deploying containers.
• Key Feature: Docker provides tools to create, manage, and deploy containers efficiently.
• Example: Docker allows developers to package their application into a Docker container on
their laptop, test it, and then deploy the same container to production without worrying about
differences in environments.
Introduction: Amazon ECS Topic 3: Amazon ECS (Elastic Container Service)
• Amazon Elastic Container Service (ECS), also known as Amazon EC2 Container Service,
is a managed service that allows users to run Docker-based applications packaged as
containers across a cluster of EC2 instances.
• Running a few containers on a single EC2 instance is simple, but running these applications on a cluster of instances and managing that cluster is an administratively heavy process.
• With the ECS Fargate launch type (a serverless compute engine that lets you run containers without managing the underlying infrastructure), the load and responsibility of managing the EC2 cluster are transferred to AWS, and you can focus on application development rather than on managing your cluster architecture.
• AWS Fargate is the AWS service that allows ECS to run containers without having to
manage and provision the resources required for running these applications.
• It deeply integrates with the AWS environment to provide an easy-to-use solution for
running container workloads in the cloud and on premises with advanced security features
using Amazon ECS Anywhere.
How Elastic Container Service Works? Topic 3: Amazon ECS (Elastic Container Service)
• Developers' Role: Developers begin by defining their applications, including specifying the required resources such as compute power, storage, and networking. They also provide container images that encapsulate the application and its dependencies.
• Operators' Role: Operators then create and customize scaling and capacity rules to manage how
applications should scale based on demand. They also set up monitoring and logging to observe
application performance and health.
• Amazon ECS Management: Amazon ECS takes over to manage the application lifecycle. It handles:
• Configuration: Automatically integrates with AWS services such as Elastic Load Balancing for
traffic distribution, AWS Secrets Manager for sensitive data management, and Amazon Elastic File
System for shared storage.
• Deployment and Scaling: Deploys applications using various compute options:
• AWS Fargate: A serverless compute engine that abstracts away the underlying infrastructure
management.
• Amazon EC2: Provides instances optimized for various workloads, including EC2 Graviton and Intel-
based options.
• Amazon ECS Anywhere: Extends ECS to on-premises servers.
• AWS Outposts, Local Zones, and Wavelength: Offers deployment flexibility for local and edge
computing needs.
• Container Images: Container images are built and stored using Amazon Elastic Container Registry
(ECR) or other container registries. ECS uses these images to launch and run containers as specified by
the developers.
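The following boto3 sketch ties these pieces together: it registers a task definition that points at an image in ECR and runs it on Fargate. The cluster name, image URI, role ARN, subnet, and security group IDs are placeholders.

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    # Register a task definition for a container image stored in ECR (image URI is a placeholder).
    ecs.register_task_definition(
        family="web-app",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="256",       # 0.25 vCPU
        memory="512",    # 512 MiB
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
        containerDefinitions=[
            {
                "name": "web",
                "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest",
                "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
                "essential": True,
            }
        ],
    )

    # Launch one copy of the task on Fargate; networking identifiers are placeholders.
    ecs.run_task(
        cluster="demo-cluster",
        taskDefinition="web-app",
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
    )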
Use cases Topic 3: Amazon ECS (Elastic Container Service)
Microservices Architecture
• Use Case: An e-commerce platform with different services like user
management, payment processing, product catalog, and order management.
• How ECS Helps: Each service can be deployed as a separate container,
allowing independent scaling, updates, and management. ECS enables
orchestrating these containers, ensuring smooth communication between
services and automatic scaling based on demand.
Batch Processing
• Use Case: A financial institution needs to run intensive batch processing jobs
like data transformation, report generation, or fraud detection.
• How ECS Helps: Batch jobs can be containerized and run on ECS. The
service can automatically scale up the number of containers to process large
data sets quickly and scale down when the tasks are completed, optimizing
resource usage.
Use cases Topic 3: Amazon ECS (Elastic Container Service)
Web Application Hosting
• Use Case: A media company needs to host a high-traffic website with
dynamic content and media streaming capabilities.
• How ECS Helps: The web application and media streaming services can be
containerized and deployed on ECS. ECS ensures that the application scales
to handle traffic spikes and balances the load across multiple containers.
Gaming Backend Services
• Use Case: A gaming company needs a scalable backend to handle
matchmaking, player data storage, and real-time analytics.
• How ECS Helps: ECS can run backend services in containers that
automatically scale based on player activity. This ensures that the gaming
experience is smooth and responsive even during peak times.
Introduction: Amazon S3 Glacier Topic 3: Amazon S3 Glacier
• Amazon S3 Glacier storage classes are designed for data archiving, offering high performance, flexible retrieval, and low-cost storage. They provide virtually unlimited scalability and 99.999999999% (11 nines) data durability, with options for fast access and minimal cost.
• You can choose from three S3 Glacier archive storage classes based on access
needs and cost:
• S3 Glacier Instant Retrieval: For archive data that needs immediate access (e.g., medical images); the lowest-cost archive storage that still offers millisecond retrieval.
• S3 Glacier Flexible Retrieval: For data not needing immediate access but
flexible retrieval (e.g., backups), offering retrieval in minutes or free bulk
retrievals in 5-12 hours.
• S3 Glacier Deep Archive: For long-term storage (e.g., compliance archives)
with the lowest cost and retrieval within 12 hours.
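For illustration, objects can be written directly into these classes through the S3 API, as in the sketch below; the bucket, keys, and payload bytes are placeholders.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-archive-bucket"   # placeholder bucket name

    # Instant Retrieval: lowest-cost archive class that still retrieves in milliseconds.
    s3.put_object(Bucket=BUCKET, Key="medical/scan-2024.dcm",
                  Body=b"...image bytes...", StorageClass="GLACIER_IR")

    # Flexible Retrieval: minutes to hours, with free bulk retrievals.
    s3.put_object(Bucket=BUCKET, Key="backups/weekly.tar.gz",
                  Body=b"...backup bytes...", StorageClass="GLACIER")

    # Deep Archive: lowest cost, retrieval within about 12 hours.
    s3.put_object(Bucket=BUCKET, Key="compliance/ledger-2015.csv",
                  Body=b"...archive bytes...", StorageClass="DEEP_ARCHIVE")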
How does it work Topic 3: Amazon S3 Glacier
• The Amazon S3 Glacier storage classes are purpose-built for data archiving,
providing you with the highest performance, most retrieval flexibility, and the
lowest cost archive storage in the cloud. You can now choose from three
archive storage classes optimized for different access patterns and storage
duration.
How does it work Topic 3: Amazon S3 Glacier
• S3 Glacier Instant Retrieval offers up to 68% lower storage costs compared to S3 Standard-
IA for long-lived data accessed quarterly.
• It is ideal for rarely accessed, performance-sensitive use cases like image hosting, medical
imaging, and news media. It provides 99.999999999% (11 nines) durability and 99.9%
availability, with data stored redundantly across multiple AWS Availability Zones.
• Storage costs are lower than S3 Standard, but retrieval costs are slightly higher.
How does it work Topic 3: Amazon S3 Glacier
• S3 Glacier Flexible Retrieval offers up to 10% lower storage costs compared to S3 Glacier
Instant Retrieval for archive data accessed 1-2 times per year.
• It provides flexible retrieval options, balancing cost and access times from minutes to hours,
with free bulk retrievals.
• This class is ideal for backups, disaster recovery, and offsite storage, offering 99.999999999% (11 nines) durability and 99.99% availability, with data stored redundantly across multiple AWS Availability Zones.
How does it work Topic 3: Amazon S3 Glacier
• S3 Glacier Deep Archive offers the lowest storage cost, up to 75% less than S3 Glacier
Flexible Retrieval, for data accessed less than once per year.
• At $0.00099 per GB-month, it’s a cost-effective alternative to on-premises tape. Suitable for long-term data retention (7-10 years), it provides 99.999999999% (11 nines) durability and 99.99% availability, with data stored across multiple AWS Availability Zones.
AWS Glacier Terminology Topic 3: Amazon S3 Glacier
1. Vaults: Vaults are virtual
containers that are used to store
data. Vaults in AWS Glacier are
similar to buckets in S3.
• Each Vault has its specific access
policies (Vault lock/access policies).
Thus providing you with more
control over who has what kind of
access to your data.
• Vaults are region-specific.
• 2. Archives: Archives are the fundamental entity type stored in Vaults. Archives in AWS Glacier are similar to objects in S3. You have virtually unlimited storage capacity on AWS Glacier and hence can store an unlimited number of archives in a vault.
AWS Glacier Terminology Topic 3: Amazon S3 Glacier
• 3. Vault Access Policies: In addition to the basic IAM controls, AWS Glacier offers vault access policies that give managers and administrators more granular control over their data.
• Each vault has its own set of vault access policies.
• A user action is evaluated against both the IAM policy and the vault access policy; failing one of the two alone does not automatically make the user unauthorized.
• 4. Vault Lock Policies: Vault lock policies are
exactly like Vault access policies but once set,
they cannot be changed.
• Specific to each vault.
• This helps you with data compliance controls.
For example- Your business administrators
might want some highly confidential data to be
only accessible to the root user of the account.
Vault lock policy for such a use case can be
written for the required vaults.
Features of AWS Glacier Topic 3: Amazon S3 Glacier
• Given the extremely cheap storage provided by AWS Glacier, it doesn’t provide as many features as AWS S3, and access to data in AWS Glacier is a comparatively slow process.
• Just like S3, AWS Glacier can essentially store all kinds of data types and objects.
• Durability: AWS Glacier, just like Amazon S3, is designed for 99.999999999% (11 nines) durability, so the probability of losing data stored in one of these services is vanishingly small. AWS Glacier replicates data across multiple Availability Zones to provide this high durability.
• Data Retrieval Time: Data retrieval from AWS Glacier can take anywhere from 1-5 minutes (high-cost retrieval) to 5-12 hours (cheap data retrieval).
Features of AWS Glacier Topic 3: Amazon S3 Glacier
• AWS Glacier Console: The AWS Glacier dashboard is not as intuitive and friendly as AWS S3. The Glacier console can only be used to create vaults. Data transfer to and from AWS Glacier can only be done via some kind of code. This functionality is provided via the AWS Glacier API or the AWS SDKs.
Features of AWS Glacier Topic 3: Amazon S3 Glacier
• Security:
• AWS Glacier automatically encrypts
your data using the AES-256
algorithm and manages its keys for
you.
• Apart from normal IAM controls
AWS Glacier also has resource
policies (vault access policies and
vault lock policies) that can be used
to manage access to your Glacier
vaults.
• Infinite Storage Capacity: AWS Glacier is designed to offer virtually unlimited storage capacity.
Data Transfer In Glacier Topic 3: Amazon S3 Glacier
• 1. Data Upload
• Data can be uploaded to AWS Glacier by creating a vault from the
Glacier console and using one of the following methods:
• Write code that uses AWS Glacier SDK (Software Development Kit) to
upload data.
• Write code that uses AWS Glacier API to upload data.
• S3 Lifecycle policies: S3 lifecycle policies can be set to transition S3 objects to AWS Glacier after some time. This can be used to back up old and infrequently accessed data stored in S3 (a minimal sketch of such a lifecycle rule is shown below).
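Here is the lifecycle-rule sketch referenced above, using boto3; the bucket name, prefix, and transition days are illustrative assumptions.

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-archive-bucket",            # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-old-logs",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [
                        # Move objects to Glacier Flexible Retrieval after 90 days...
                        {"Days": 90, "StorageClass": "GLACIER"},
                        # ...and to Deep Archive after one year.
                        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )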
• 2. Data Transfer Between Regions
• AWS Glacier is a region-specific service. Data in one region can be transferred to another from the AWS console. The cost of such a data transfer is $0.02 per GB.
Data Transfer In Glacier Topic 3: Amazon S3 Glacier
• 3. Data Retrieval
• As mentioned before, AWS Glacier is a backup and data archive service, given its low cost of storage,
AWS Glacier data is not readily available for consumption.
• Data retrieval from Glacier can only be done via some sort of code, using AWS Glacier SDK or the
Glacier API.
• Data Retrieval in AWS Glacier is of three types:
• Expedited:
• This mode of data retrieval is only suggested for urgent requirements of data.
• A single expedited retrieval request can only be used to retrieve at most 250 MB of data.
• This data is then provided to you within 1-5 minutes.
• The cost of expedited retrieval is $0.03 per GB and $0.01 per request.
• Standard:
• This data retrieval mode can be used for any size of data, full or partial archive.
• This data is then provided to you within 3-5 hours.
• The cost of standard retrieval is $0.01 per GB and $0.05 per 1,000 requests.
• Bulk:
• This data retrieval mode is suggested for mass retrieval of data (petabytes of data).
• It is the cheapest data retrieval option offered by AWS Glacier.
• This data is then provided to you within 5-12 hours.
• The cost of bulk retrieval is $0.0025 per GB and $0.025 per 1,000 requests.
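For illustration, initiating and collecting a Standard-tier archive retrieval with boto3 might look like the following sketch; the vault name and archive ID are placeholders.

    import boto3

    glacier = boto3.client("glacier", region_name="us-east-1")

    # The archive ID is returned when the archive is uploaded; "-" means the
    # account that owns the credentials.
    job = glacier.initiate_job(
        accountId="-",
        vaultName="my-vault",
        jobParameters={
            "Type": "archive-retrieval",
            "ArchiveId": "EXAMPLE_ARCHIVE_ID",
            "Tier": "Standard",     # or "Expedited" / "Bulk"
        },
    )

    # Once the job completes (typically hours for Standard), download the output.
    status = glacier.describe_job(accountId="-", vaultName="my-vault", jobId=job["jobId"])
    if status["Completed"]:
        output = glacier.get_job_output(accountId="-", vaultName="my-vault", jobId=job["jobId"])
        with open("restored-archive.bin", "wb") as f:
            f.write(output["body"].read())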
Benefits Topic 3: Amazon S3 Glacier
• Lower Cost: Glacier is intended to be Amazon’s most affordable storage class. Compared to other Amazon storage services, this allows an organization to store large amounts of data at a lower cost.
• Managed archive database: An organization does not have to maintain its own archival database. Administrative duties, including hardware and capacity planning, are handled by AWS.
• Durability: Because data is distributed across at least three AWS Availability Zones at any one time, Glacier can restore it more easily in the event that a copy is lost in one of them.
• Scalability: Businesses are able to adjust the amount of data they store as needed, and have a choice between bulk, standard, and expedited retrievals.
Data Transfer In Glacier Topic 3: Amazon S3 Glacier
• Use Case: Amazon Glacier is for long-term archival of infrequently accessed data, while Amazon S3 is for frequently accessed data with low-latency requirements.
• Data Transfer Acceleration: Both Amazon Glacier and Amazon S3 support AWS Transfer Acceleration for faster data uploads.
AMAZON KINESIS DATA STREAMS
Amazon Kinesis is a serverless streaming data service that simplifies the
capture, processing and storage of data streams at any scale
This architecture demonstrates
how Amazon Kinesis can ingest
and process large volumes of
real-time data securely and
reliably across multiple
availability zones. It supports a
variety of use cases from simple
data archiving to complex real-
time analytics, machine learning,
and large-scale batch processing.
The modular nature of Kinesis
allows it to be integrated into
diverse systems to provide
timely and actionable insights.
Features of Amazon Kinesis
• Cost-efficient: Like other AWS services, Amazon Kinesis follows the pay-as-you-go model: you pay for the service according to your usage rather than a flat price, so you pay only for what you use.
• Integrate with other AWS services: Amazon Kinesis integrates with other AWS services, such as Amazon DynamoDB, Amazon Redshift, and other services that deal with large amounts of data.
• Availability: You can access it from anywhere at any time; all you need is a reliable internet connection.
• Real-time processing: It allows you to work on data that must reflect changes instantaneously. This is the most advantageous feature of Kinesis, because real-time processing becomes important when you are dealing with such a huge amount of data.
TERMINOLOGIES - AMAZON KINESIS
• Kinesis Data Streams
• A Kinesis data stream is a set of shards (pieces). Each shard has a sequence of data records. Each data
record has a sequence number that is assigned by Kinesis Data Streams
• Data Record
• A data record is the unit of data stored in a Kinesis data stream. Data records are composed of a sequence number, a partition key, and a data blob, which is an immutable sequence of bytes. Kinesis Data Streams does not inspect, interpret, or change the data in the blob in any way. A data blob can be up to 1 MB.
• Capacity Mode
• A data stream capacity mode determines how capacity is managed and how you are charged for the
usage of your data stream. Currently, in Kinesis Data Streams, you can choose between an on demand
mode and a provisioned mode for your data streams.
• With the on-demand mode, Kinesis Data Streams automatically manages the shards in order to
provide the necessary throughput. You are charged only for the actual throughput that you use and
Kinesis Data Streams automatically accommodates your workloads’ throughput needs as they ramp up
or down.
• With the provisioned mode, you must specify the number of shards for the data stream. The total
capacity of a data stream is the sum of the capacities of its shards. You can increase or decrease the
number of shards in a data stream as needed and you are charged for the number of shards at an hourly
rate.
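A small boto3 sketch of creating a stream in each capacity mode is shown below; the stream names and shard count are illustrative assumptions.

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # On-demand mode: shards are managed automatically and you pay for the throughput used.
    kinesis.create_stream(
        StreamName="clickstream",                     # hypothetical stream name
        StreamModeDetails={"StreamMode": "ON_DEMAND"},
    )

    # Provisioned mode: you choose the shard count and pay per shard at an hourly rate.
    kinesis.create_stream(
        StreamName="clickstream-provisioned",
        ShardCount=4,
        StreamModeDetails={"StreamMode": "PROVISIONED"},
    )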
TERMINOLOGIES - AMAZON KINESIS DATA STREAMS
• Retention Period
• The retention period is the length of time that data records are accessible after they are
added to the stream. A stream’s retention period is set to a default of 24 hours after
creation. You can increase the retention period up to 8760 hours (365 days) using the
IncreaseStreamRetentionPeriod operation, and decrease the retention period down to a
minimum of 24 hours using the DecreaseStreamRetentionPeriod operation. Additional
charges apply for streams with a retention period set to more than 24 hours
• Producer
• Producers put records into Amazon Kinesis Data Streams. For example, a web server
sending log data to a stream is a producer
• Consumer
• Consumers get records from Amazon Kinesis Data Streams and process them. These consumers are known as Amazon Kinesis Data Streams applications.
Amazon Kinesis Data Streams Application
• An Amazon Kinesis Data Streams application is a consumer of a stream that
commonly runs on a fleet of EC2 instances.
• There are two types of consumers that you can develop: shared fan-out
consumers and enhanced fan-out consumers
AMAZON KINESIS DATA STREAMS
The producers continually push data to Kinesis Data Streams, and the consumers process the
data in real time. Consumers (such as a custom application running on Amazon EC2 or an
Amazon Data Firehose delivery stream) can store their results using an AWS service such as
Amazon DynamoDB, Amazon Redshift, or Amazon S3
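The put-and-get flow described above can be sketched with boto3 as follows. The stream name and record fields are assumptions, and a production consumer would normally use the Kinesis Client Library rather than reading a single shard directly.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    STREAM = "clickstream"   # hypothetical stream name

    # Producer: put a record; records with the same partition key land on the same shard.
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({"user": "u-42", "action": "add_to_cart"}).encode(),
        PartitionKey="u-42",
    )

    # Consumer: read records from one shard, starting at the oldest available record.
    shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]

    records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
    for record in records:
        print(json.loads(record["Data"]))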
Benefits of Amazon Kinesis Data Streams
• Although we can use Kinesis Data Streams to solve a variety of streaming data
problems, a common use is the real-time aggregation of data followed by
loading the aggregate data into a data warehouse or map-reduce cluster.
• Data is put into Kinesis data streams, which ensures durability and elasticity.
• The delay between the time a record is put into the stream and the time it can be
retrieved (put-to-get delay) is typically less than 1 second. In other words, a
Kinesis Data Streams application can start consuming the data from the stream
almost immediately after the data is added.
• The managed service aspect of Kinesis Data Streams relieves you of the
operational burden of creating and running a data intake pipeline. You can create
streaming map-reduce–type applications. The elasticity of Kinesis Data Streams
enables you to scale the stream up or down, so that you never lose data records
before they expire.
Benefits of Amazon Kinesis Data Streams (Contd…)
• Multiple Kinesis Data Streams applications can consume data from a stream, so
that multiple actions, like archiving and processing, can take place concurrently
and independently
Eg: Two applications can read data from the same stream. The first application
calculates running aggregates and updates an Amazon DynamoDB table, and the
second application compresses and archives data to a data store like Amazon
Simple Storage Service (Amazon S3). The DynamoDB table with running
aggregates is then read by a dashboard for up-to-the-minute reports. The Kinesis
Client Library enables fault-tolerant consumption of data from streams and
provides scaling support for Kinesis Data Streams applications.
AMAZON KINESIS VIDEO STREAMS
Amazon Kinesis Video Streams is the Kinesis service tailored to video: it securely captures, processes, and stores video streams from connected devices at any scale for playback, analytics, and machine learning.
USE CASE EXAMPLE – INDUSTRIAL ROBOTICS
Amazon Kinesis Data Streams Vs Video Streams
• Amazon Kinesis Data Streams
A serverless data streaming service that can capture, process, and store data
streams at any scale. It's highly available, durable, and low latency.
• Amazon Kinesis Video Streams
A data streaming service that's tailored to video streaming. It can capture,
process, and store video streams from various sources, and supports both live
video streaming and archival storage. It can manage thousands of video streams
simultaneously
Amazon Kinesis Data Streams Vs Video Streams
• Here are some other differences between the two services:
• Use cases
Kinesis Data Streams can be used for a variety of applications, while Kinesis
Video Streams is designed for video streaming, analytics, and machine learning.
• Integration
Kinesis Data Streams integrates with other AWS services. Kinesis Video Streams
works well with machine learning services for tasks like facial recognition or
motion detection.
• Features
Kinesis Data Streams offers on-demand and provisioned capacity mode, and
dedicated throughput per consumer. Kinesis Video Streams is designed for cost-
effective, efficient ingestion and storage of time-encoded data
Use cases of Amazon Kinesis
• Create Real time applications
• Build apps for application monitoring, fraud detection, and live leaderboards
• Analyze data and emit the results to any data store or application
• Evolve from batch to real time analytics
• Perform real-time analytics on data that has been traditionally analyzed using batch processing
• Get the latest information without delay
• Analyze IoT device data
• Process streaming data from IoT devices, and then use the data to programmatically send real-
time alerts and respond when a sensor exceeds certain operating thresholds
• Build Video Analytics applications
• Securely stream video from camera-equipped devices
• Use streams for video playback, security monitoring, face detection, ML, and other analytics
Amazon Redshift
Amazon Redshift - Introduction
• Amazon Redshift is a fully managed, petabyte (1 PB = 1,024 TB)-scale data warehouse
service in the cloud, designed to enable fast, scalable, and cost-effective data analysis.
• Launched by Amazon Web Services (AWS) in 2013, it has quickly become a cornerstone for
organizations looking to perform complex queries on large datasets without the burden of
managing physical hardware.
• Designed for online analytical processing (OLAP) of petabyte-scale data, making it ideal for
big data analytics, business intelligence, and reporting.
• Redshift shines in its ability to handle huge volumes of data — capable of processing
structured and unstructured data in the range of exabytes (10^18 bytes). However, the service
can also be used for large-scale data migrations.
• Redshift helps to gather valuable insights from a large amount of data. With the easy-to-use
interface of AWS, you can start a new cluster in a couple of minutes, and you don’t have to
worry about managing infrastructure.
• Customers can use the Redshift for just $0.25 per hour with no commitments or upfront costs
and scale to a petabyte or more for $1,000 per terabyte per year.
• RedShift is an enterprise-level,
petabyte-scale and fully managed data
warehousing service.
• Clusters
The core infrastructure component of an Amazon Redshift data warehouse is a
cluster. A cluster is composed of one or more compute nodes. If a cluster is
provisioned with two or more compute nodes, an additional leader node
coordinates the compute nodes and handles external communication. Your client
application interacts directly only with the leader node. The compute nodes are
transparent to external applications.
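For illustration, a cluster with one leader node and two compute nodes could be provisioned with boto3 roughly as follows; the identifiers, node type, and credentials are placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    redshift.create_cluster(
        ClusterIdentifier="analytics-cluster",   # placeholder identifier
        ClusterType="multi-node",
        NodeType="ra3.xlplus",
        NumberOfNodes=2,                         # 2+ compute nodes implies an additional leader node
        DBName="analytics",
        MasterUsername="awsuser",
        MasterUserPassword="ChangeMe123!",       # placeholder credential
        PubliclyAccessible=False,
    )

    # Wait until the cluster (leader plus compute nodes) is available.
    redshift.get_waiter("cluster_available").wait(ClusterIdentifier="analytics-cluster")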
Data warehouse system architecture
• Leader Node
The leader node manages communications with client programs and all communication
with compute nodes. It parses and develops execution plans to carry out database
operations, in particular, the series of steps necessary to obtain results for complex
queries. Based on the execution plan, the leader node compiles code, distributes the
compiled code to the compute nodes, and assigns a portion of the data to each
compute node. The leader node distributes SQL statements to the compute nodes only
when a query references tables that are stored on the compute nodes. All other queries
run exclusively on the leader node.
• Compute Node
The leader node compiles code for individual elements of the execution plan and
assigns the code to individual compute nodes. The compute nodes run the compiled
code and send intermediate results back to the leader node for final aggregation. Each
compute node has its own dedicated CPU and memory, which are determined by the
node type. As your workload grows, you can increase the compute capacity of a cluster
by increasing the number of nodes, upgrading the node type, or both.
Data warehouse system architecture
• Redshift managed storage
Data warehouse data is stored in a separate storage tier Redshift Managed Storage (RMS).
RMS provides the ability to scale your storage to petabytes using Amazon S3 storage. RMS
lets you scale and pay for computing and storage independently, so that you can size your
cluster based only on your computing needs. It automatically uses high-performance SSD-
based local storage as tier-1 cache. It also takes advantage of optimizations, such as data
block temperature, data block age, and workload patterns to deliver high performance
while scaling storage automatically to Amazon S3 when needed without requiring any
action.
• Node Slices
A compute node is partitioned into slices. Each slice is allocated a portion of the node's
memory and disk space, where it processes a portion of the workload assigned to the
node. The leader node manages distributing data to the slices and apportions the
workload for any queries or other database operations to the slices. The slices then work
in parallel to complete the operation.
Data warehouse system architecture
• Data Distribution
Key distribution: distributes data based on a column value, ensuring that rows with the same value are stored together, optimizing joins.
Even distribution: evenly distributes rows across all slices, which is useful for uniform data distribution.
All distribution: copies data to all nodes, ideal for small tables that are frequently joined with others (a minimal sketch of creating tables with each style is shown below).
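Here is the sketch referenced above, creating tables with each distribution style through the Redshift Data API; the cluster identifier, database, user, and table definitions are assumptions for the example.

    import boto3

    # The Redshift Data API runs SQL without managing database connections.
    data_api = boto3.client("redshift-data", region_name="us-east-1")

    def run(sql):
        # Statements run asynchronously; describe_statement can be used to poll status.
        return data_api.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="analytics",
            DbUser="awsuser",
            Sql=sql,
        )

    # KEY distribution: rows with the same customer_id land on the same slice, speeding up joins.
    run("""CREATE TABLE orders (
               order_id    BIGINT,
               customer_id BIGINT,
               amount      DECIMAL(10,2))
           DISTSTYLE KEY DISTKEY (customer_id);""")

    # EVEN distribution: rows are spread round-robin across slices.
    run("CREATE TABLE clickstream (event_id BIGINT, url VARCHAR(2048)) DISTSTYLE EVEN;")

    # ALL distribution: a full copy on every node, good for small dimension tables.
    run("CREATE TABLE countries (code CHAR(2), name VARCHAR(64)) DISTSTYLE ALL;")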
• Encryption:
Data Encryption at Rest: Redshift supports encryption of data at rest using AWS Key Management
Service (KMS) or customer-managed keys.
• Network Isolation:
Amazon VPC Integration: Redshift clusters can be deployed within an Amazon Virtual Private
Cloud (VPC), providing network isolation and enhanced security.
PrivateLink Support: Allows secure and private communication between Redshift and other AWS
services without using the public internet
Security Features
• Access Control:
AWS Identity and Access Management (IAM): Redshift integrates with IAM to provide
fine-grained access control to data and resources, allowing you to define roles and
permissions based on the principle of least privilege.
Database User Management: Supports creating users and roles within Redshift with varying
levels of access to the database and its objects.
• Audit Logging:
Database Audit Logs: Redshift can generate audit logs that record database activity,
including connection attempts, queries run, and changes to database objects.
Integration with CloudTrail: Audit logs can be sent to AWS CloudTrail for long-term storage
and detailed analysis
Use cases
• Amazon Redshift is used when the data to be analyzed is humongous. The data has to be at
least at a petabyte scale (10^15 bytes) for Redshift to be a viable solution. The MPP
technology used by Redshift can be leveraged only at that scale. Beyond the size of data,
there are some specific use cases that warrant its use.
• Real-time analytics
• Many companies need to make decisions based on real-time data and often need to
implement solutions quickly too. Take Uber for example, based on historical and current
data, Uber has to make decisions quickly. It has to decide surge pricing, where to send
drivers, what route to take, expected traffic, and a whole host of data.
• Thousands of such decisions have to be made every minute for a company like Uber with
operations across the globe. The current stream of data and historical data has to be
processed in order to make those decisions and ensure smooth operations. Such instances can
use Redshift as the MPP technology to make accessing and processing data faster
Use cases
• Combining multiple data sources
There are occasions where structured data, semi-structured data, and/or unstructured data
have to be processed to gain insights. Traditional business intelligence tools lack the
capability to handle the varied structures of data from different sources. Amazon Redshift is a
potent tool in such use cases.
• Business intelligence
The data of an organization needs to be handled by a lot of different people. All of them are
not necessarily data scientists and will not be familiar with the programming tools used by
engineers.
They can rely on detailed reports and information dashboards that have an easy-to-use
interface. Highly functional dashboards and automatic report creation can be built using
Redshift. It can be used with tools like Amazon Quicksight and also third-party tools created
by AWS partners.
Use cases
• Log analysis
• Behavior analytics is a powerful source for useful insights. Behavior analytics provide
information on how a user uses an application, how they interact with it, the duration of
use, their clicks, sensor data, and a plethora of other data.
• The data can be collected from multiple sources — including a web application used on a
desktop, mobile phone, or tablet — and can be aggregated and analyzed to gain insight into
user behavior. This coalescing of complex datasets and computing data can be done using
Redshift.
• Redshift can also be used for traditional data warehousing. But solutions like the S3 data
lake would likely be better suited for that. Redshift can be used to perform operations on
data in S3, and save the output in S3 or Redshift
Use cases
• Improve financial and demand forecasts
• Ingests hundreds of megabytes of data per second so you can query data in near real time and build
low latency analytics applications for fraud detection, live leaderboards, and IoT.
• Optimize business intelligence
• Build insight-driven reports and dashboards using Amazon Redshift and BI tools such as Amazon
QuickSight, Tableau, Microsoft PowerBI, or others.
• Accelerate Machine Learning in SQL
• Use SQL to build, train, and deploy ML models for many use cases, including predictive analytics, classification, regression, and more, to support advanced analytics on large amounts of data.
• Monetize data
• Build applications on top of all your data across databases, data warehouses, and data lakes.
• Seamlessly and securely share and collaborate on data to create more value for your customers, monetize your data as a service, and unlock new revenue streams.
• Combine data with third party datasets
• Whether it's market data, social media analytics, weather data or more, subscribe to and combine third
party data in AWS Data Exchange with your data in Amazon Redshift, without hassling over licensing
and onboarding processes and moving the data to the warehouse.
Benefits of AWS Redshift
• Speed. With the use of MPP technology, the speed of delivering output on large data sets is unparalleled. Few other cloud service providers can match this speed at the cost at which AWS provides the service.
• Data Encryption. Amazon provides the facility for data encryption for any part of Redshift operation.
You as the user can decide which operations need encryption and those that do not need encryption.
Data encryption provides an added layer of security.
• Use familiar tools. Redshift is based on PostgreSQL. All the SQL queries work with it. Additionally,
you are free to choose any SQL, ETL (Extract, Transform, Load), and Business Intelligence (BI) tools
you are familiar with. There is no requirement to use the tools provided by Amazon.
• Intelligent Optimization. For a large data set, there would be a number of ways to query data with the
same parameters. The different commands will have different levels of data utilization. AWS Redshift
provides tools and information to improve queries. It will also provide tips to improve the database
automatically. These can be utilized for an even faster operation that is less intensive on resources.
• Automate repetitive tasks. Redshift has provisions by which you can automate tasks that have to be done repeatedly. This could be administrative tasks like generating daily, weekly, or monthly reports, resource and cost auditing, or regular maintenance tasks to clean up data. You can automate all of these with the provisions offered by Redshift.
• Concurrent Scaling. AWS Redshift will scale up automatically to support increasing concurrent
workloads.
Example
Real-time User Activity Streaming:
•Amazon Kinesis Data Streams collects real-time data on user actions, such as browsing history, clicks, add-to-
cart actions, and purchases. This real-time stream is processed for:
• Real-time analytics and personalized recommendations.
• Updating inventory levels in near real-time.
• Triggering alerts for low stock or fraud detection.
• Recovery Point Objective (RPO) is the maximum acceptable amount of time since
the last data recovery point. This objective determines what is considered an
acceptable loss of data between the last recovery point and the interruption of
service and is defined by the organization.
Disaster recovery options in the cloud
• Disaster recovery strategies available to you within AWS can be broadly categorized into four
approaches, ranging from the low cost and low complexity of making backups to more complex
strategies using multiple active Regions
• Active/passive strategies use an active site (such as an AWS Region) to host the workload and serve
traffic
• The passive site (such as a different AWS Region) is used for recovery
• The passive site does not actively serve traffic until a failover event is triggered
Disaster recovery options in the cloud
• For a disaster event based on disruption or loss of one physical data center for a well-architected,
highly available workload, you may only require a backup and restore approach to disaster recovery
• If your definition of a disaster goes beyond the disruption or loss of a physical data center to that of
a Region or if you are subject to regulatory requirements that require it, then you should consider
Pilot Light, Warm Standby, or Multi-Site Active/Active
• When choosing your strategy, and the AWS resources to implement it, keep in mind that within
AWS, we commonly divide services into the data plane and the control plane
• The data plane is responsible for delivering real-time service while control planes are used to
configure the environment
• For maximum resiliency, you should use only data plane operations as part of your failover
operation
• This is because the data planes typically have higher availability design goals than the control
planes
Backup and restore
• Backup and restore is a suitable approach for mitigating against data loss or corruption. This
approach can also be used to mitigate against a regional disaster by replicating data to other AWS
Regions, or to mitigate lack of redundancy for workloads deployed to a single Availability Zone.
• In addition to data, you must redeploy the infrastructure, configuration, and application code in the
recovery Region
Pilot Light
• With the pilot light approach, you replicate your data from one Region to another and
provision a copy of your core workload infrastructure.
• Resources required to support data replication and backup, such as databases and object
storage, are always on.
• Other elements, such as application servers, are loaded with application code and
configurations, but are "switched off" and are only used during testing or when disaster
recovery failover is invoked.
• In the cloud, you have the flexibility to deprovision resources when you do not need them,
and provision them when you do.
• A best practice for “switched off” is to not deploy the resource, and then create the
configuration and capabilities to deploy it (“switch on”) when needed.
• Unlike the backup and restore approach, your core infrastructure is always available and you
always have the option to quickly provision a full scale production environment by
switching on and scaling out your application servers.
Pilot Light (Contd...)
A pilot light approach minimizes the ongoing cost of disaster recovery by minimizing the
active resources, and simplifies recovery at the time of a disaster because the core
infrastructure requirements are all in place. This recovery option requires you to change your
deployment approach. You need to make core infrastructure changes to each Region and
deploy workload (configuration, code) changes simultaneously to each Region.
Shared Responsibility Model for Resiliency
• Resiliency is a shared responsibility between AWS and you, the customer. It is
important that you understand how disaster recovery and availability, as part of
resiliency, operate under this shared model.
Shared Responsibility Model for Resiliency
• AWS responsibility “Resiliency of the Cloud”
• AWS is responsible for resiliency of the infrastructure that runs all of the services offered in the
AWS Cloud. This infrastructure comprises the hardware, software, networking, and facilities that
run AWS Cloud services. AWS uses commercially reasonable efforts to make these AWS Cloud services available, ensuring service availability meets or exceeds AWS Service Level Agreements (SLAs).
• Customer responsibility “Resiliency in the Cloud”
• Customer responsibility will be determined by the AWS Cloud services that are selected. This
determines the amount of configuration work you must perform as part of your resiliency
responsibilities. For example, a service such as Amazon Elastic Compute Cloud (Amazon EC2)
requires the customer to perform all of the necessary resiliency configuration and management
tasks. Customers that deploy Amazon EC2 instances are responsible for deploying EC2 instances
across multiple locations (such as AWS Availability Zones), implementing self-healing using
services like Amazon EC2 Auto Scaling, as well as using resilient workload architecture best
practices for applications installed on the instances. For managed services, such as Amazon S3 and
Amazon DynamoDB, AWS operates the infrastructure layer, the operating system, and platforms,
and customers access the endpoints to store and retrieve data
Restoring and testing backups