MIE1628 Big Data Analytics Lecture6
Data Analytics
Lecture 6
What are Core Data
Concepts in Azure?
Objectives
• Identify how data is defined and stored
• Describe and differentiate batch and
streaming data
• Describe and differentiate roles and
responsibilities
• Identify the characteristics of relational data
and non-relational data
• Describe and differentiate data use cases
• Describe and differentiate data analytics
• Describe Azure Data Factory
What is data?
• A collection of facts, numbers, descriptions, and objects, stored in a
structured, semi-structured, or unstructured way.
Batch Data / Streaming Data
Batch Data vs Streaming Data
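As a rough illustration of the batch versus streaming distinction, the sketch below processes the same readings twice: once as a collected batch, and once item by item as they arrive. The function names and the sample readings are hypothetical and are not part of the lecture material.

```python
from typing import Iterable, Iterator, List


def batch_total(readings: List[float]) -> float:
    """Batch: wait until all readings are collected, then process them together."""
    return sum(readings)


def streaming_totals(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each reading arrives."""
    running = 0.0
    for value in readings:
        running += value
        yield running


readings = [2.0, 3.5, 1.5]
batch_result = batch_total(readings)              # one result after the whole batch
stream_results = list(streaming_totals(readings)) # one result per arriving reading
print(batch_result)
print(stream_results)
```

The batch version gives a single answer once the data is complete, while the streaming version always has an up-to-date partial answer.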
Roles and Responsibilities
• Explore data job roles
• Explore common tasks and tools for data job roles
Roles in Data
Database Administrator
• Database Management
• Implements Data Security
• Backups
• User Access
• Monitors Performance
Data Engineer
• Data Pipelines and processes
• Data Ingestion storage
• Prepare data for Analysis
• Prepare data for analytical processing
Data Analyst
• Provides insights into the data
• Visual Reporting
• Modeling Data for Analysis
• Combines data for visualization and analysis
Common Tools – Database Administrator
What is Azure Data Studio?
What is SQL Server Management Studio?
Azure portal to manage Azure SQL Database
Common Tools – Data Engineering
Common Tools – Data Analyst
Data Visualization Tools
Describe Concepts of Relational Data
Understand the characteristics of relational
data
Main characteristics of a relational database
• All data is tabular. Entities are modeled as tables, each instance of an
entity is a row in the table, and each property is defined as a column.
• All rows in the same table have the same set of columns.
• A table can contain any number of rows.
• A primary key uniquely identifies each row in a table. No two rows
can share the same primary key.
• A foreign key references rows in another, related table. For each value
in the foreign key column, there should be a row with the same value
in the corresponding primary key column in the other table.
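The characteristics above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not an Azure SQL Database example: the customers and orders tables and their columns are hypothetical, and SQLite only enforces foreign keys once the PRAGMA shown is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Each entity is a table; each property is a column.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(customer_id),"
    " amount REAL NOT NULL)"
)

# Each instance of an entity is a row.
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")


def try_insert(sql: str) -> bool:
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A second row with primary key 1 is rejected.
duplicate_key_ok = try_insert("INSERT INTO customers VALUES (1, 'Grace')")
# An order pointing at a customer that does not exist is rejected.
missing_parent_ok = try_insert("INSERT INTO orders VALUES (101, 99, 10.0)")
print(duplicate_key_ok, missing_parent_ok)
```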
Relational database use cases
• Relational databases are commonly used in ecommerce systems. One of
their major use cases is Online Transaction Processing (OLTP).
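An OLTP workload is made up of many small transactions that must either complete fully or not at all. The sketch below illustrates that idea with sqlite3; the accounts table, the transfer function, and the balances are hypothetical examples chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " account_id INTEGER PRIMARY KEY,"
    " balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(src: int, dst: int, amount: float) -> bool:
    """Move money between accounts as one transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back together with the failed debit


first_ok = transfer(1, 2, 30.0)    # enough funds
second_ok = transfer(1, 2, 500.0)  # would make the balance negative
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
print(first_ok, second_ok, balances)
```

The failed transfer leaves both balances exactly as they were after the first one, which is the all-or-nothing behaviour OLTP systems rely on.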
Data processing can take one of two approaches to retrieving the ingested data, transforming it and
generating models, and then saving the transformed data and models. These approaches are known as
ETL (extract, transform, load) and ELT (extract, load, transform).
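The difference between the two approaches is mainly where the transformation happens. In the sketch below, plain Python lists and dictionaries stand in for the source and the target store; the row layout and function names are hypothetical.

```python
from typing import Any, Dict, List

raw_rows: List[Dict[str, str]] = [
    {"order_id": "1", "amount": " 10.50 "},
    {"order_id": "2", "amount": "4.25"},
]


def transform(rows: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Clean the raw text fields into typed values."""
    return [
        {"order_id": int(r["order_id"]), "amount": float(r["amount"].strip())}
        for r in rows
    ]


def etl(rows: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """ETL: transform in flight, so only cleaned rows reach the target store."""
    target: List[Dict[str, Any]] = []
    target.extend(transform(rows))
    return target


def elt(rows: List[Dict[str, str]]) -> Dict[str, List[Dict[str, Any]]]:
    """ELT: load the raw rows first, then transform inside the target store."""
    store: Dict[str, List[Dict[str, Any]]] = {"raw": list(rows)}
    store["curated"] = transform(store["raw"])
    return store


etl_target = etl(raw_rows)
elt_store = elt(raw_rows)
print(etl_target)
print(sorted(elt_store))
```

With ELT the raw copy stays available in the target, so the transformation can be rerun or changed later without going back to the source.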
Advantages of ETL and ELT
Explore data analytics
Recap
Azure Data Factory
What is Azure Data Factory?
https://azure.microsoft.com/en-gb/services/data-factory/
Azure Data Factory Features
What is Azure Data Factory?
1 Linked Services – How and what to connect to. Like the SSIS connection manager.
SQLDBLinkedService
ConnectionString: Server=MyServer;Database=myDataBase
UserName: “Admin”
Password: ***************
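A linked service is stored by Data Factory as a JSON definition. The sketch below builds a dictionary in roughly that shape for the slide's SQLDBLinkedService. The "AzureSqlDatabase" type and the property layout are assumptions based on the general form of Data Factory definitions, and the helper function is hypothetical. Credentials are left out on purpose.

```python
import json


def sql_linked_service(name: str, server: str, database: str) -> dict:
    """Build a linked-service-style definition (assumed layout, no credentials)."""
    connection_string = f"Server={server};Database={database}"
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlDatabase",  # assumed connector type
            "typeProperties": {"connectionString": connection_string},
        },
    }


definition = sql_linked_service("SQLDBLinkedService", "MyServer", "myDataBase")
print(json.dumps(definition, indent=2))
```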
Data Factory Components
Copy Data Transform
2 Data Sets – Where is my data? What format? What file path/table do I need?
dbo.DimOrder
/RAW/Orders/2020/01/01/Orders.csv
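The file path above follows a year/month/day folder layout. A small sketch of how such a path could be derived from a date is shown below; the function name is hypothetical and the layout is taken directly from the example path.

```python
from datetime import date


def raw_orders_path(day: date) -> str:
    """Build the date-partitioned path for one day's Orders file."""
    return f"/RAW/Orders/{day:%Y/%m/%d}/Orders.csv"


example_path = raw_orders_path(date(2020, 1, 1))
print(example_path)
```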
3 Activities – What do we want to happen? With what conditions?
Databricks Notebook Activity
notebookPath: /Playground/Playing
baseParameters: Testing
libraries[jar]: dbfs:/lib1.jar
linkedServiceName: BricksOfData01
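The notebook activity settings above could be written out as a pipeline activity definition along the following lines. This is a sketch: the activity name and the "mode" key under baseParameters are hypothetical (the slide only shows the placeholder "Testing"), and the overall layout is an assumption based on the usual form of Data Factory activity JSON.

```python
import json

notebook_activity = {
    "name": "RunPlaygroundNotebook",  # hypothetical activity name
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "BricksOfData01",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Playground/Playing",
        "baseParameters": {"mode": "Testing"},  # hypothetical key for the slide's value
        "libraries": [{"jar": "dbfs:/lib1.jar"}],
    },
}

print(json.dumps(notebook_activity, indent=2))
```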
4 Pipelines
5 Triggers
• Event-based
• Logic App Calls
Data Factory Control Flow Components
1 ✓ Linked Services
2 ✓ Data Sets
3 ✓ Activities
4 ✓ Pipelines
5 ✗ Triggers
Expression Builder
Parameters
@{………}
System Variables
• Collection
• Conversion
• Date
• Logical
• Math
• String
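To give a feel for what the @{ } interpolation syntax does, here is a toy Python stand-in that substitutes pipeline parameters into a string. It is not the Data Factory expression engine and handles only parameter references; the interpolate function and the folder and fileName parameters are hypothetical.

```python
import re
from typing import Dict

# Matches only references of the form @{pipeline().parameters.NAME}
PARAM_REF = re.compile(r"@\{pipeline\(\)\.parameters\.(\w+)\}")


def interpolate(template: str, parameters: Dict[str, str]) -> str:
    """Replace each parameter reference in the template with its value."""
    return PARAM_REF.sub(lambda match: parameters[match.group(1)], template)


resolved = interpolate(
    "/RAW/@{pipeline().parameters.folder}/@{pipeline().parameters.fileName}",
    {"folder": "Orders", "fileName": "Orders.csv"},
)
print(resolved)
```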
Azure Data Factory - UI
Integration Runtimes
1 Azure Integration Runtime
• Flexible Region
• Data Movements (Movement Hours)
• Activity Orchestration
3 Self Hosted Integration Runtime
• Specified Region
• Virtual Machine
• Gateway Access
• Activity Orchestration
Provisioning via PSH
Hosted IR topologies (each diagram spans On Premises and Azure)
• Single Hosted IR
• Multiple Hosted IRs (Failover & Load Balancing)
• Hosted IR Linked to Multiple Data Factories
• Using a Hosted IR with Express Route
Recap: Data Factory
1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers
Integration Runtimes
1 Azure Integration Runtime
2 SSIS Integration Runtime
3 Self Hosted Integration Runtime
Data Factory Key Activities
Lookup Activity
Get Metadata to Support Other Control Flow Activities
• Input: Dataset
• Output: Single Value or Many Values [array]
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
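The single-value versus many-values behaviour can be emulated in a few lines of Python. This is a local stand-in, not a call to Data Factory: the lookup function is hypothetical, and the firstRow, value, and count output keys are assumed to mirror the activity's output shape.

```python
from typing import Any, Dict, List


def lookup(rows: List[Dict[str, Any]], first_row_only: bool = True) -> Dict[str, Any]:
    """Return either the first row or the whole result set as an array."""
    if first_row_only:
        return {"firstRow": rows[0]} if rows else {}
    return {"count": len(rows), "value": rows}


tables = [{"tableName": "dbo.DimOrder"}, {"tableName": "dbo.DimCustomer"}]  # sample rows

single = lookup(tables)                      # single value
many = lookup(tables, first_row_only=False)  # many values as an array
print(single)
print(many["count"])
```

The array form is what a later iteration activity would consume, one item at a time.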
ForEach Activity
Scaling Out Control Flow Activities
• IsSequential: true
• Input: [array] from Lookup
• Iterates over items [0], [1], [i]
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
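Sequential versus parallel iteration can be sketched locally with a thread pool. The for_each function below is a hypothetical stand-in for the activity; is_sequential echoes the IsSequential setting, and batch_count is an assumed name for the cap on parallel runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def for_each(
    items: List[str],
    activity: Callable[[str], str],
    is_sequential: bool = True,
    batch_count: int = 4,
) -> List[str]:
    """Run the activity once per item, one at a time or several at once."""
    if is_sequential:
        return [activity(item) for item in items]
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        # map() returns results in input order even when runs overlap
        return list(pool.map(activity, items))


def copy_table(table_name: str) -> str:
    """Hypothetical inner activity: pretend to copy one table."""
    return f"copied {table_name}"


tables = ["dbo.DimOrder", "dbo.DimCustomer", "dbo.DimDate"]  # sample items
sequential_runs = for_each(tables, copy_table)
parallel_runs = for_each(tables, copy_table, is_sequential=False)
print(sequential_runs)
```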
Custom Activity
Extend Data Factory with Custom Code
Reference Objects
Datasets: []
Linked Services: []
Custom
Linked Services
Azure Batch ???
Azure Blob Storage
https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
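Custom code run by this activity typically needs to read the referenced datasets and linked services at run time. The sketch below assumes they are handed to the code as JSON files named activity.json, linkedServices.json, and datasets.json in the working folder; treat those file names as an assumption to check against the documentation linked above. The loader function itself is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Assumed names of the files placed next to the custom code at run time
REFERENCE_FILES = ("activity.json", "linkedServices.json", "datasets.json")


def load_reference_objects(working_dir: str = ".") -> Dict[str, Optional[Any]]:
    """Load each reference file if present; missing files map to None."""
    loaded: Dict[str, Optional[Any]] = {}
    for file_name in REFERENCE_FILES:
        path = Path(working_dir) / file_name
        loaded[file_name] = json.loads(path.read_text()) if path.exists() else None
    return loaded


reference_objects = load_reference_objects()
print(sorted(reference_objects))
```

Returning None for a missing file keeps the script runnable on a local machine, where none of these files exist.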
Recap: Data Factory – Useful Activities
1 Lookup – Get value(s) from lots of places to support other activities.
2 ForEach – Iteration of other activities, sequentially or in parallel.
3 Custom – Extensibility: code executed by Azure Batch compute pools.
Data Factory Data Flows
• Wrangling Data Flow
• Mapping Data Flow
Mapping Data Flows
What is a Mapping Data Flow?
Diagram labels: Data Factory, Mapping Data Flow, Azure Databricks
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
Mapping Data Flows – Settings & Concepts
Mapping Data Flows – Transformations
Multicast
Merge Join
Script Component
Mapping Data Flows – Expression Builder
Mapping Data Flows – Debug Mode
Power BI Desktop
Data Factory
Data Flow - Cluster Configuration
Azure Databricks
• General Purpose
• Memory Optimised
• Compute Optimised
Recap: Data Factory – Data Flows
1 Mapping – Similar to SSIS Data Flows in appearance.
2 Wrangling – Similar to Power BI Power Query.
Copy Data
Custom
DoStuff
• Practice Module:
• 01 - Create a virtual machine in the portal
• 02 - Create a Web App
• 05 - Create blob storage