MIE1628 Big Data Analytics Lecture6

The document discusses core data concepts in Azure including batch and streaming data, relational and non-relational data, data analytics, and Azure Data Factory. It defines these terms, compares different data types and use cases, and describes roles and tools related to working with data in Azure.

Uploaded by

Viola Song
Copyright
© All Rights Reserved

Cloud-based

Data Analytics
Lecture 6

What are Core Data
Concepts in Azure?
Objectives
• Identify how data is defined and stored
• Describe and differentiate batch and
streaming data
• Describe and differentiate roles and
responsibilities
• Identify the characteristics of relational data
and non-relational data
• Describe and differentiate data use cases
• Describe and differentiate data analytics
• Describe Azure Data Factory
What is data?
• A collection of facts, numbers, descriptions, and objects, stored in a
structured, semi-structured, or unstructured way.
Batch Data / Streaming Data
Batch Data vs Streaming Data
Roles and Responsibilities
• Explore data job roles
• Explore common tasks and tools for data job roles
Roles in Data

Database Administrator
• Database Management
• Implements Data Security
• Backups
• User Access
• Monitors Performance

Data Engineer
• Data Pipelines and processes
• Data Ingestion storage
• Prepare data for Analysis
• Prepare data for analytical processing

Data Analyst
• Provides insights into the data
• Visual Reporting
• Modeling Data for Analysis
• Combines data for visualization and analysis
Common Tools – Database Administrator
What is Azure Data Studio?
What is SQL Server Management Studio?
Azure portal to manage Azure SQL Database
Common Tools – Data Engineering
Common Tools – Data Analyst
Data Visualization Tools
Describe Concepts of Relational Data
Understand the characteristics of relational
data
Main characteristics of a relational database
• All data is tabular. Entities are modeled as tables, each instance of an
entity is a row in the table, and each property is defined as a column.
• All rows in the same table have the same set of columns.
• A table can contain any number of rows.
• A primary key uniquely identifies each row in a table. No two rows
can share the same primary key.
• A foreign key references rows in another, related table. For each value
in the foreign key column, there should be a row with the same value
in the corresponding primary key column in the other table.
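The primary key and foreign key rules above can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than an Azure service, and the Customers and Orders tables and their columns are hypothetical names chosen for illustration.

```python
import sqlite3


def build_demo_db():
    """Create two related tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
    )
    conn.execute(
        """CREATE TABLE Orders (
               OrderID    INTEGER PRIMARY KEY,
               CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
               ProductID  INTEGER NOT NULL)"""
    )
    conn.execute("INSERT INTO Customers VALUES (1, 'Ada')")
    conn.execute("INSERT INTO Orders VALUES (100, 1, 7)")
    return conn


def try_insert(conn, sql):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_demo_db()
# Same primary key as an existing row: rejected.
duplicate_key_ok = try_insert(conn, "INSERT INTO Customers VALUES (1, 'Bob')")
# Foreign key 99 has no matching row in Customers: rejected.
orphan_order_ok = try_insert(conn, "INSERT INTO Orders VALUES (101, 99, 7)")
# Foreign key 1 matches an existing customer: accepted.
valid_order_ok = try_insert(conn, "INSERT INTO Orders VALUES (102, 1, 8)")
print(duplicate_key_ok, orphan_order_ok, valid_order_ok)
```

Only the third insert should be accepted, because it is the only one that respects both key rules.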
Relational database use cases
• Relational databases are commonly used in ecommerce systems, but
one of the major use cases for using relational databases is Online
Transaction Processing (OLTP).

• Examples of OLTP applications that use relational databases are:


• Banking solutions
• Online retail applications
• Flight reservation systems
• Many online purchasing applications.
Explore relational data structures - Index

• An index helps you search for data in a table.
• You can create many indexes on a table. So, if you also wanted to find all orders for a
specific product, then creating another index on the Product ID column in the Orders
table would be useful.
• An index might consume additional storage space, and each time you insert, update, or
delete data in a table, the indexes for that table must be maintained.
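As a rough illustration of the Product ID example, the sketch below creates an index on a hypothetical Orders table in SQLite and compares the query plan before and after. The table layout, the index name idx_orders_product, and the sample rows are assumptions; the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, ProductID INTEGER)"
)
# Made-up sample rows, only there so the table is not empty.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(i, i % 10, i % 25) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's description of how it would find orders for one product."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT OrderID FROM Orders WHERE ProductID = 7"
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no suitable index yet, so the table is scanned
conn.execute("CREATE INDEX idx_orders_product ON Orders (ProductID)")
plan_after = query_plan(conn)   # the plan should now mention idx_orders_product

print(plan_before)
print(plan_after)
```

The index speeds up this lookup, but as noted above it takes extra storage and has to be maintained on every insert, update, or delete.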
Explore relational data structures – Clustered
Index

In database management systems that support them, a table
can only have a single clustered index.
Explore relational data structures – Views
Relational Data Services
Understand IaaS and PaaS
Azure IaaS Service
Azure PaaS Services
Describe Concepts of Non-Relational Data
Characteristics of Non-Relational Data
• A key aspect of non-relational databases is that they enable you to store
data in a very flexible manner.
• Non-relational databases don't impose a schema on data. Instead, they
focus on the data itself rather than how to structure it.
• This approach means that you can store information in a natural format,
that mirrors the way in which you would consume, query and use it.
• In a non-relational system, you store the information for entities in
collections or containers rather than relational tables.
• Two entities in the same collection can have a different set of fields rather
than a regular set of columns found in a relational table.
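A minimal sketch of the idea, using plain Python dictionaries in place of a real non-relational service: two entities sit in the same collection but carry different fields. The customers collection, its field names, and the find helper are all hypothetical.

```python
# Two entities in one collection; neither is forced into a shared set of columns.
customers = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {
        "id": 2,
        "name": "Bob",
        "phones": ["555-0100", "555-0101"],
        "loyalty": {"tier": "gold"},
    },
]


def field_names(collection):
    """List the fields each entity actually carries."""
    return [sorted(doc.keys()) for doc in collection]


def find(collection, **criteria):
    """Return entities whose fields match every criterion; missing fields never match."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


print(field_names(customers))
print(find(customers, name="Bob"))
```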
Characteristics of Non-Relational Data

• More advanced non-relational systems support indexing, in a similar manner to an index
in a relational database. Queries can then use the index to identify and fetch data based
on non-key fields.
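The sketch below shows one way an index on a non-key field can work, assuming a simple in-memory collection keyed by id. The devices collection, its fields, and build_index are hypothetical; real non-relational systems manage their indexes internally.

```python
from collections import defaultdict

# Hypothetical collection: the key is the entity id, the value is the entity.
devices = {
    "d1": {"type": "thermostat", "city": "Toronto"},
    "d2": {"type": "camera", "city": "Toronto"},
    "d3": {"type": "thermostat"},  # no city field at all
}


def build_index(collection, field):
    """Map each value of a non-key field to the keys of the entities that hold it."""
    index = defaultdict(list)
    for key, doc in collection.items():
        if field in doc:  # entities without the field are simply not indexed
            index[doc[field]].append(key)
    return index


city_index = build_index(devices, "city")
# Fetch by a non-key field without scanning every entity.
toronto_devices = [devices[key] for key in city_index["Toronto"]]
print(toronto_devices)
```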
Key/Value Stores and Graph Databases
Identify non-relational database use cases
• A relational database restructures the data into a fixed format that is
designed to answer specific queries. When data needs to be ingested very
quickly, or the query is unknown and unconstrained, a relational database
can be less suitable than a non-relational database.

• Non-relational databases are highly suitable for the following scenarios:


• IoT and telematics.
• Retail and marketing.
• Gaming.
• Web and mobile applications.
Azure Non-Relational Data Services

Azure Table Storage


Azure Non-Relational Data Services
Concepts of Data Analytics
Transactional vs Analytical Data Stores
Analytical System
Data Processing
What is ELT and ETL?

The data processing mechanism can take two approaches to retrieving the ingested data, processing
this data to transform it and generate models, and then saving the transformed data and models.
These approaches are known as ETL and ELT.
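The difference between the two approaches is mainly where the transformation happens. The sketch below is one way to picture it, with a plain dictionary standing in for the target data store; the extract and transform functions and the sample rows are hypothetical.

```python
def extract():
    """Pretend to read raw rows from a source system (values arrive as text)."""
    return [
        {"order_id": "1", "amount": "10.50"},
        {"order_id": "2", "amount": "4.25"},
    ]


def transform(rows):
    """Convert the raw text fields into typed values."""
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in rows
    ]


def run_etl(store):
    """ETL: transform first, so only the cleaned data lands in the target store."""
    store["orders"] = transform(extract())
    return store


def run_elt(store):
    """ELT: load the raw data first, then transform it inside the target store."""
    store["raw_orders"] = extract()
    store["orders"] = transform(store["raw_orders"])
    return store


etl_store = run_etl({})
elt_store = run_elt({})
print(sorted(etl_store))
print(sorted(elt_store))
```

Both runs end with the same orders data; only the ELT run also keeps the raw copy in the target store.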
Advantages of ETL and ELT
Explore data analytics
Recap
Azure Data Factory
What is Azure Data Factory?
https://azure.microsoft.com/en-gb/services/data-factory/
Azure Data Factory Features
What is Azure Data Factory?

Copy Data Transform


Data Factory Components
Copy Data Transform

1 Linked Services – How and what to connect to. Like the SSIS connection manager.

SQLDBLinkedService

ConnectionString: Server=MyServer;Database=myDataBase
UserName: “Admin”
Password: ***************
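Data Factory stores a linked service as a JSON definition. The sketch below builds one for the SQLDBLinkedService example as a Python dictionary. The overall name/properties/typeProperties shape and the "AzureSqlDatabase" type string are assumptions about the definition format and should be checked against the Data Factory documentation; sql_linked_service is a hypothetical helper.

```python
import json


def sql_linked_service(name, server, database):
    """Build a linked service definition (assumed shape) for a SQL database."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlDatabase",  # assumed type name
            "typeProperties": {
                # Credentials are deliberately left out; keep secrets out of source files.
                "connectionString": f"Server={server};Database={database}"
            },
        },
    }


definition = sql_linked_service("SQLDBLinkedService", "MyServer", "myDataBase")
print(json.dumps(definition, indent=2))
```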
Data Factory Components
Copy Data Transform

1 Linked Services

2 Data Sets – Where is my data? What format? What file path/table do I need?

dbo.DimOrder

/RAW/Orders/2020/01/01/Orders.csv
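The file path above follows a year/month/day folder layout. A small sketch of how such a path could be derived from a date; raw_orders_path and its default arguments are hypothetical, taken from the example path.

```python
from datetime import date


def raw_orders_path(day, root="/RAW/Orders", filename="Orders.csv"):
    """Build a date-partitioned file path: root/YYYY/MM/DD/filename."""
    return f"{root}/{day:%Y/%m/%d}/{filename}"


print(raw_orders_path(date(2020, 1, 1)))  # /RAW/Orders/2020/01/01/Orders.csv
```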
Data Factory Components
Copy Data Transform

1 Linked Services

2 Data Sets
3 Activities – What do we want to happen? With what conditions?

Databricks Notebook Activity
notebookPath: /Playground/Playing
baseParameters: Testing
libraries[ jar]: dbfs:/lib1.jar
linkedServiceName: BricksOfData01
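The notebook activity settings above can be written out as a JSON activity definition. The sketch below assembles one as a Python dictionary. The "DatabricksNotebook" type string and the nesting under typeProperties are assumptions about the definition format; the activity name "RunPlayground" and the parameter key "mode" are hypothetical, since the slide only shows the value "Testing".

```python
import json


def databricks_notebook_activity(name, notebook_path, base_parameters, jar_paths, linked_service):
    """Build a Databricks Notebook activity definition (assumed shape)."""
    return {
        "name": name,
        "type": "DatabricksNotebook",  # assumed type name
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "baseParameters": base_parameters,
            # One entry per library; here every library is a jar file.
            "libraries": [{"jar": path} for path in jar_paths],
        },
    }


activity = databricks_notebook_activity(
    name="RunPlayground",                 # hypothetical activity name
    notebook_path="/Playground/Playing",
    base_parameters={"mode": "Testing"},  # "mode" is a hypothetical key
    jar_paths=["dbfs:/lib1.jar"],
    linked_service="BricksOfData01",
)
print(json.dumps(activity, indent=2))
```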
Data Factory Components
Copy Data Transform

1 Linked Services

2 Data Sets

3 Activities

4 Pipelines – What groups of work do I want to do?

Execute Pipeline Activity
Data Factory Components
Copy Data Transform

1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers – How are we going to tell our pipeline(s) to execute?
• Manual via UI
• Tumbling Window Trigger - AKA Time Slices (figure: Loading 2019, 2020)
• Schedule Trigger - Every 1 minute. UTC.
• Event-based - {Path} Created, {Path} Deleted
• Logic App Calls
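A tumbling window splits a time range into fixed-size, back-to-back slices that never overlap. The sketch below shows only the slicing logic, not the Data Factory trigger itself; tumbling_windows is a hypothetical helper.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Split [start, end) into contiguous, non-overlapping windows of equal size."""
    if size <= timedelta(0):
        raise ValueError("window size must be positive")
    windows = []
    window_start = start
    # Only complete windows are emitted; a partial window at the end is skipped.
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


slices = tumbling_windows(datetime(2020, 1, 1), datetime(2020, 1, 2), timedelta(hours=6))
for window_start, window_end in slices:
    print(window_start, "to", window_end)
```

Each window ends exactly where the next one begins, which is what lets a pipeline run load one time slice at a time.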
Data Factory Control Flow Components
Copy Data Transform

1 ✓ Linked Services
2 ✓ Data Sets
3 ✓ Activities
4 ✓ Pipelines
5 ✗ Triggers
Data Factory Control Flow Components
Copy Data Transform

1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers

Expression Builder
Parameters
@{………}
System Variables
• Collection
• Conversion
• Date
• Logical
• Math
• String
Azure Data Factory - UI
Integration Runtimes
1 Azure Integration Runtime
• Flexible Region
• Data Movements
• Activity Orchestration

2 SSIS Integration Runtime
• Specified Region
• SSIS Package Execution

3 Self Hosted Integration Runtime
• Virtual Machine
• Gateway Access
• Activity Orchestration
Azure IR
Provisioning via PSH
Single Hosted IR
On Premises Azure
Multiple Hosted IRs (Failover & Load Balancing)
On Premises Azure
Hosted IR Linked to Multiple Data Factories
On Premises Azure
Using a Hosted IR with ExpressRoute
On Premises Azure
Recap: Data Factory
1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers

1 Azure Integration Runtime
2 SSIS Integration Runtime
3 Self Hosted Integration Runtime
Data Factory Key Activities
Lookup Activity
Get Metadata to Support Other Control Flow Activities
Dataset → Lookup → Single Value or Many Values [array]

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
ForEach Activity
Scaling Out Control Flow Activities

Lookup → Many Values [array] → ForEach
• IsSequential: true - items [0], [1], [2], [3] … [i] are processed one after another
• Otherwise - items [0] [1] [2] [3] [4] [5] [6] … [i] are processed in parallel
• Inner activities (Copy Data, Do Stuff) reference the current item with @item().
• Batch Count Default: 20
• Batch Count Max: 50

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
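The sequential and parallel modes can be pictured with a small local sketch. It is an analogy built on Python threads, not the Data Factory engine; for_each, its arguments, and the table names are hypothetical, and the batch count limits simply mirror the default of 20 and maximum of 50 listed above.

```python
from concurrent.futures import ThreadPoolExecutor


def for_each(items, activity, is_sequential=False, batch_count=20):
    """Run activity once per item, one at a time or up to batch_count at once."""
    if not 1 <= batch_count <= 50:
        raise ValueError("batch_count must be between 1 and 50")
    if is_sequential:
        # One item after another, in order.
        return [activity(item) for item in items]
    # Parallel mode: at most batch_count items are in flight at any moment.
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        # map() returns results in input order even if items finish out of order.
        return list(pool.map(activity, items))


def copy_data(table_name):
    """Stand-in for an inner activity that acts on the current item."""
    return f"copied {table_name}"


tables = ["dbo.DimOrder", "dbo.DimCustomer", "dbo.DimProduct"]  # hypothetical lookup output
sequential_results = for_each(tables, copy_data, is_sequential=True)
parallel_results = for_each(tables, copy_data, batch_count=2)
print(sequential_results)
print(parallel_results)
```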
Custom Activity
Extend Data Factory with Custom Code

Reference Objects
Datasets: []
Linked Services: []

Custom

Linked Services
Azure Batch ???
Azure Blob Storage

https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
Recap: Data Factory – Useful Activities
1 Lookup – Get value(s) from lots of places to support other activities.
2 ForEach – Iteration of other activities, sequentially or in parallel.
3 Custom – Extensibility - code executed by Azure Batch compute pools.
Data Factory Data Flows

Wrangling Data Flow
Mapping Data Flow

1 Linked Services
2 Data Sets

3 Activities

4 Pipelines

5 Triggers
Mapping Data Flows
What is a Mapping Data Flow?
Mapping Data Flow
Azure Databricks

Data Factory Portal Interface

Data Factory

https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
Mapping Data Flows – Settings & Concepts
Mapping Data Flows – Transformations
Multicast
Merge Join

Script Component
Mapping Data Flows – Expression Builder
Mapping Data Flows – Debug Mode

General Purpose cluster


Wrangling Data Flows
What is a Wrangling Data Flow?

Power BI Desktop

Data Factory Portal


What is a Wrangling Data Flow?
Wrangling Data Flow
Azure Databricks

Data Factory Portal

Data Factory
Data Flow - Cluster Configuration
Azure Databricks

• General Purpose
• Memory Optimised
• Compute Optimised
Recap: Data Factory – Data Flows
1 Mapping – Similar to SSIS Data Flows in appearance.
2 Wrangling – Similar to Power BI Power Query.
3 Data Flow Cluster Config – Via the Data Factory Azure IR
Data Factory DevOps – CI/CD
Recap
Lookup
ForEach

Copy Data

Custom

DoStuff

1. Orchestrator of our Control Flow operations – with scale out Activities.
2. Orchestrator of our Data Flow transformations – using cloud native services.
3. The scheduler of solutions – using a variety of Pipeline Triggers.
References
• Azure Account

• Labs for Azure Practice


• Azure Data Factory Documentation

• Practice Module:
• 01 - Create a virtual machine in the portal
• 02 - Create a Web App
• 05 - Create blob storage

• Important - Delete all the resources after each practice session
