Azure Data Factory

The document discusses cloud computing, Microsoft Azure, extract-transform-load (ETL) processes, extract-load-transform (ELT), types of data extraction including logical and physical extraction, types of data loading including initial, incremental, and full refresh loads, and Azure Data Factory (ADF). ADF is a cloud-based data integration service that allows users to create data workflows for moving and transforming data at scale. It can extract data from various sources, transform it, and load it into destinations. Pipelines are composed of activities and data flows that are programmed in ADF to integrate and process data.


What is Cloud Computing?

Cloud computing is a technology that provides access to various computing resources over the internet. All you need to do is use your computer or mobile device to connect to your cloud service provider through the internet. Once connected, you get access to computing resources, which may include serverless computing, virtual machines, storage, and various other services.

What is Microsoft Azure?


Azure is a cloud computing platform and an online portal that allows you to access and manage cloud services and resources provided by Microsoft. These services and resources include storage for your data and services to transform it, depending on your requirements. To get access to these resources and services, all you need is an active internet connection and the ability to connect to the Azure portal.

Extract, transform, and load (ETL) process


Extract, transform, and load (ETL) is a data pipeline used to collect data from various
sources. It then transforms the data according to business rules, and it loads the
data into a destination data store.

The data transformation that takes place usually involves various operations, such as filtering,
sorting, aggregating, joining data, cleaning data, deduplicating, and validating data.

Often, the three ETL phases are run in parallel to save time. For example, while data is being extracted, a transformation process can work on data already received and prepare it for loading, and a loading process can begin working on the prepared data rather than waiting for the entire extraction process to complete.
Extract, load, and transform (ELT)
Extract, load, and transform (ELT) differs from ETL solely in where the transformation
takes place. In the ELT pipeline, the transformation occurs in the target data store.
Instead of using a separate transformation engine, the processing capabilities of the
target data store are used to transform data. This simplifies the architecture by
removing the transformation engine from the pipeline. Another benefit to this
approach is that scaling the target data store also scales the ELT pipeline
performance. However, ELT only works well when the target system is powerful
enough to transform the data efficiently.

ETL Vs ELT
 ETL loads data first into a staging server and then into the target system, whereas ELT loads data directly into the target system.
 The ETL model is used for on-premises, relational, structured data, while ELT is used for scalable cloud data sources, both structured and unstructured.
 Comparing ELT vs. ETL, ETL is mainly used for small amounts of data, whereas ELT is used for large amounts of data.

Types of Data Extraction


Coming back to data extraction, there are two types of data extraction: Logical and Physical
extraction.

Logical Extraction
The most commonly used data extraction method is logical extraction, which is further classified into two categories:
Full Extraction
In this method, data is completely extracted from the source system. The source data is provided as is, and no additional logical information is needed on the source system. Since it is a complete extraction, there is no need to track the source system for changes.
For example, exporting a complete table in the form of a flat file.
Incremental Extraction
In incremental extraction, the changes in the source data need to be tracked since the last successful extraction. Only these changed rows are retrieved and loaded. There are various ways to detect changes in the source system, for example a specific column in the source system that holds the last-changed timestamp. You can also create a change table in the source system, which keeps track of the changes to the source data. It can also be done via logs if redo logs are available for the RDBMS sources. Another method for tracking changes is by implementing triggers in the source database.
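As a rough illustration of the difference, here is a minimal Python sketch of full versus watermark-based incremental extraction, using an in-memory SQLite table; the table name, columns, and timestamps are hypothetical.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-02-01T00:00:00")],
)

# Full extraction: pull everything; no change tracking is needed.
full_rows = conn.execute("SELECT * FROM orders").fetchall()

# Incremental extraction: pull only rows changed since the last successful run
# (the watermark), then advance the watermark for the next run.
last_extracted = "2024-01-15T00:00:00"
changed_rows = conn.execute(
    "SELECT * FROM orders WHERE last_modified > ?", (last_extracted,)
).fetchall()
new_watermark = datetime.now(timezone.utc).isoformat()

print(len(full_rows), len(changed_rows))  # 2 rows in full, 1 row incrementally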

Physical Extraction
Physical extraction has two methods: Online and Offline extraction:
Online Extraction
In this method, the extraction job connects directly to the source system and extracts the source data.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system. Common structures in offline extraction include:
 Flat file: data in a generic format
 Dump file: a database-specific file
 Remote extraction from database transaction logs

Data extraction methods


Extraction jobs may be scheduled, or analysts may extract data on demand as
dictated by business needs and analysis goals. There are three primary types of
data extraction, listed here from most basic to most complex:

Update notification
The easiest way to extract data from a source system is to have that system issue a
notification when a record has been changed. Most databases provide an
automation mechanism for this so that they can support database replication
(change data capture or binary logs), and many SaaS applications provide
webhooks, which offer conceptually similar functionality. An important note about
change data capture is that it can provide the ability to analyze data in real time or
near-real time.
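To make the update-notification idea concrete, here is a minimal Python sketch of a webhook receiver; the endpoint, port, and payload shape are illustrative assumptions rather than any particular product's API.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ChangeNotificationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The source system (or SaaS webhook) posts a small JSON event here
        # whenever a record changes.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        # A real pipeline would enqueue the changed record id for extraction.
        print("record changed:", event.get("record_id"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ChangeNotificationHandler).serve_forever()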
Incremental extraction
Some data sources are unable to provide notification that an update has occurred,
but they are able to identify which records have been modified and provide an
extract of those records. During subsequent ETL steps, the data extraction code
needs to identify and propagate changes. One drawback of the incremental extraction technique is that it may not be able to detect deleted records in the source data, because there's no way to see a record that's no longer there.

Full extraction
The first time you replicate any source, you must do a full extraction. Some data
sources have no way to identify data that has been changed, so reloading a whole
table may be the only way to get data from that source. Because full extraction
involves high volumes of data, which can put a load on the network, it’s not the best
option if you can avoid it.

Types of Loading
1. Initial Load: For the very first time loading all the data warehouse tables.
2. Incremental Load: Periodically applying ongoing changes as per the
requirement. After the data is loaded into the data warehouse database,
verify the referential integrity between the dimensions and the fact tables
to ensure that all records belong to the appropriate records in the other
tables. The DBA must verify that each record in the fact table is related to
one record in each dimension table that will be used in combination with
that fact table.
3. Full Refresh: Deleting the contents of a table and reloading it with fresh
data.

 Full load: an entire data dump that takes place the first time a data source is loaded into the warehouse. With an overwhelming amount of data being moved at once, it is much easier for data to get lost within the big move.

 Incremental load: This is where you move new data in intervals. The last extract date is stored so that only records added after this date are loaded. Incremental loads are more likely to encounter problems because they must be managed as individual batches rather than one big group.

Incremental loads come in two flavors that vary based on the volume of data
you’re loading:

o Streaming incremental load – better for loading small data volumes


o Batch incremental load – better for loading large data volumes

Full load vs. incremental load:
- Rows synced: a full load moves all rows in the source data; an incremental load moves new and updated records only.
- Time: a full load takes more time (slower); an incremental load takes less time (faster).
- Difficulty: a full load is low difficulty; an incremental load is high difficulty, because the ETL must be checked for new/updated rows and recovery from an issue is harder.
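A minimal Python sketch of the two load styles against an in-memory SQLite "warehouse" table; the table, rows, and last extract date are hypothetical.

import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

source_rows = [(1, "Alice", "2024-03-01"), (2, "Bob", "2024-03-05")]

# Full load / full refresh: wipe the table and reload everything from the source.
wh.execute("DELETE FROM dim_customer")
wh.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", source_rows)

# Incremental load: only apply rows newer than the stored last extract date.
last_extract_date = "2024-03-02"
new_rows = [row for row in source_rows if row[2] > last_extract_date]
wh.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", new_rows)

print(wh.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])  # 2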

Azure Data Factory (ADF)


Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.

Azure Data Factory (ADF) is a data pipeline orchestrator and ETL tool that is part of the Microsoft Azure cloud ecosystem. ADF can pull data from the outside world (FTP, Amazon S3, Oracle, and many more), transform it, filter it, enhance it, and move it along to another destination.

Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

Getting ADF to do real work for you involves the following layers of technology, listed from the highest level of abstraction that you interact with down to the software closest to the data:


 Pipeline, the graphical user interface where you place widgets and draw data
paths
 Activity, a graphical widget that does something to your data
 Source and Sink, the parts of an activity that specify where data is coming from
and going to
 Data Set, an explicitly defined set of data that ADF can operate on
 Linked Service, the connection information that allows ADF to access a specific
outside data resource
 Integration Runtime, a glue/gateway layer that lets ADF talk to software outside
of itself
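As a rough sketch of how these layers reference each other, the JSON documents ADF stores for a linked service, a dataset, and a pipeline are shown below as Python dictionaries. The names (BlobLinkedService, RawCsvDataset, IngestPipeline, StagedDataset) are illustrative, not part of any real factory.

linked_service = {
    "name": "BlobLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        # Connection information lives in the linked service.
        "typeProperties": {"connectionString": "<connection string or Key Vault reference>"},
    },
}

dataset = {
    "name": "RawCsvDataset",
    "properties": {
        "type": "DelimitedText",
        # The dataset points at a specific location reachable through the linked service.
        "linkedServiceName": {"referenceName": "BlobLinkedService", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {"type": "AzureBlobStorageLocation", "container": "raw", "folderPath": "inbound"}
        },
    },
}

pipeline = {
    "name": "IngestPipeline",
    "properties": {
        # The pipeline groups activities; this one has a single Copy activity
        # whose source and sink are datasets.
        "activities": [
            {
                "name": "CopyRawFiles",
                "type": "Copy",
                "inputs": [{"referenceName": "RawCsvDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "StagedDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}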
Pipeline
An ADF pipeline is the top-level concept that you work with most directly. Pipelines are composed of activities and data flow arrows. You program ADF by creating pipelines. You get work done by running pipelines, either manually or via automatic triggers. You look at the results of your work by monitoring pipeline execution.

For example, a pipeline might take inbound data from an initial Data Lake folder, move it to cold archive storage, get a list of the files, loop over each file, copy those files to an unzipped working folder, and then apply an additional filter by file type.

A data factory might have one or more pipelines. A pipeline is a logical grouping of
activities that performs a unit of work. Together, the activities in a pipeline perform a
task. For example, a pipeline can contain a group of activities that ingests data from
an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the
data.

The benefit of this is that the pipeline allows you to manage the activities as a set
instead of managing each one individually. The activities in a pipeline can be chained
together to operate sequentially, or they can operate independently in parallel.
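A small sketch of how chaining is expressed: an activity runs after another by declaring a dependsOn entry, and activities without dependencies can start in parallel. The activity names are illustrative and the definitions are skeletal.

activities = [
    {"name": "CopyData", "type": "Copy"},
    {
        # Runs only after CopyData succeeds (sequential).
        "name": "TransformWithHive",
        "type": "HDInsightHive",
        "dependsOn": [{"activity": "CopyData", "dependencyConditions": ["Succeeded"]}],
    },
    # No dependsOn, so this can start in parallel with CopyData.
    {"name": "CallNotificationApi", "type": "WebActivity"},
]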

Activity
Activities represent a processing step in a pipeline. There is a CopyData activity to
move data, a ForEach activity to loop over a file list, a Filter activity that chooses a
subset of files, etc. Most activities have a source and a sink. Data Factory supports
three types of activities: data movement activities, data transformation activities, and
control activities.

The activity is the task we performed on our data. We use activity inside the Azure
Data Factory pipelines. ADF pipelines are a group of one or more activities.

Types of Activities

Activities in Azure Data Factory can be broadly categorized as:

1- Data Movement Activities

2- Data Transformation Activities

3- Control Activities

DATA MOVEMENT ACTIVITIES :

1- Copy Activity: It simply copies the data from a source location to a destination location. Azure supports multiple data store locations such as Azure Storage, Azure databases, NoSQL, files, etc.

DATA TRANSFORMATION ACTIVITIES:

1- Data Flow: With data flows, you first design a data transformation workflow to transform or move data, and then use the Data Flow activity in a pipeline to run it. There are two types of data flows: mapping and wrangling data flows.

MAPPING DATA FLOW: It provides a platform to graphically design data transformation logic. You don't need to write code. Once your data flow is complete, you can use it as an activity in ADF pipelines.

WRANGLING DATA FLOW: It provides a platform to use Power Query, familiar from Microsoft Excel, in Azure Data Factory. You can also use Power Query M functions in the cloud.

NOTE: HDInsight is a fully managed cloud service.

2- Hive Activity: An HDInsight activity that executes Hive queries on a Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.

3- Pig Activity: An HDInsight activity that executes Pig queries on a Windows/Linux-based HDInsight cluster. It is used to analyze large datasets.

4- MapReduce: An HDInsight activity that executes MapReduce programs on a Windows/Linux-based HDInsight cluster. It is used for processing and generating large datasets with a parallel, distributed algorithm on a cluster.

5- Hadoop Streaming: An HDInsight activity that executes a Hadoop Streaming program on a Windows/Linux-based HDInsight cluster. It lets you write mappers and reducers as executable scripts in any language, such as Python or C++.

6- Spark: An HDInsight activity that executes a Spark program on a Windows/Linux-based HDInsight cluster. It is used for large-scale data processing.

7- Stored Procedure: In a Data Factory pipeline, you can use the Stored Procedure activity to invoke a SQL Server stored procedure. You can use the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server Database, etc.

8- U-SQL: It executes a U-SQL script on an Azure Data Lake Analytics cluster. U-SQL is a big data query language that provides the benefits of SQL.

9- Custom Activity: In a custom activity, you can create your own data processing logic that is not provided by Azure. You can configure a .NET activity or an R activity that will run on the Azure Batch service or an Azure HDInsight cluster.

10- Databricks Notebook: It runs your Databricks notebook in an Azure Databricks workspace, on Apache Spark.

11- Databricks Python Activity: This activity runs your Python files on an Azure Databricks cluster.

12- Databricks Jar Activity: A Spark JAR is launched on your Azure Databricks cluster when the Databricks Jar Activity is used in a pipeline.

13- Azure Functions: Azure Functions is an Azure compute service that lets you write code logic and run it in response to events without provisioning any infrastructure. A linked service connection is required to launch an Azure Function; the linked service can then be used with an activity that specifies the Azure Function you want to run. It stores your code in Storage and keeps the logs in Application Insights. Key points of Azure Functions:

1- It is a serverless service.

2- Multiple languages are available: C#, Java, JavaScript, Python and PowerShell.

3- It is a pay-as-you-go model.


14- ML Studio (classic) activities: Batch Execution and Update Resource: With Azure Data Factory and Synapse Analytics you can quickly build pipelines that employ a published Machine Learning Studio (classic) web service for predictive analytics. The Batch Execution activity in a pipeline invokes the Machine Learning Studio (classic) web service to make predictions on data in batch.

Control Flow Activities:

1- Append Variable Activity: It adds a value to an existing array variable.

2- Execute Pipeline Activity: It allows one Azure Data Factory pipeline to call another pipeline.

3- Filter Activity: It allows you to apply different filters on your input dataset.

4- ForEach Activity: It provides the functionality of a for-each loop that executes for multiple iterations. The ForEach activity defines a repeating control flow in your pipeline; it is used to iterate over a collection and execute the specified activities in a loop.

5- Get Metadata Activity: It is used to retrieve the metadata of any data in a Data Factory, typically of files and folders. You need to specify the type of metadata you require: childItems, columnCount, contentMD5, exists, itemName, itemType, lastModified, size, structure, created, etc.

6- If Condition Activity: It provides the same functionality as an if statement: it executes one set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
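A sketch of what an If Condition activity definition can look like, using ADF expression syntax; the parameter name env and the inner activity names are illustrative.

if_condition_activity = {
    "name": "CheckEnvironment",
    "type": "IfCondition",
    "typeProperties": {
        # The expression must evaluate to true or false.
        "expression": {"value": "@equals(pipeline().parameters.env, 'prod')", "type": "Expression"},
        "ifTrueActivities": [{"name": "CopyToProd", "type": "Copy"}],
        "ifFalseActivities": [{"name": "CopyToTest", "type": "Copy"}],
    },
}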

7- Lookup Activity: It reads and returns the content of data sources such as files, tables, or databases, or the result set of a query or stored procedure. Lookup Activity can be used to read or look up a record, table name, or value from any external source. This output can then be referenced by succeeding activities.
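A sketch of a Lookup activity and of how a succeeding activity can reference its output through an ADF expression; the dataset, query, and activity names are illustrative.

lookup_activity = {
    "name": "LookupConfig",
    "type": "Lookup",
    "typeProperties": {
        "source": {"type": "AzureSqlSource", "sqlReaderQuery": "SELECT TOP 1 TableName FROM etl_config"},
        "dataset": {"referenceName": "ConfigDataset", "type": "DatasetReference"},
        "firstRowOnly": True,
    },
}

# A succeeding activity can consume the looked-up value, for example:
table_name_expression = "@activity('LookupConfig').output.firstRow.TableName"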

8- Set Variable Activity: It is used to set the value of an existing variable of type
String, Array, etc.

9- Switch Activity: It is a switch statement that executes a set of activities based on matching cases.
10- Until Activity: It is the same as a do-until loop: it executes a set of activities until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity.

11- Validation Activity: It continues pipeline execution only once a reference dataset exists, meets the specified criteria, or a timeout is reached. It is used to validate the input dataset.

12- Wait Activity: It just waits for the specified time before moving ahead to the
next activity. You can specify the number of seconds.

13- Web Activity: It is used to make a call to REST APIs. You can use it for different
use cases such as ADF pipeline execution. Used to call a custom REST endpoint from
a pipeline. You can pass datasets and linked services to be consumed and accessed
by the activity.

14- Webhook Activity: Call an endpoint and pass a callback URL using the webhook
activity. Before moving on to the next activity, the pipeline waits for the callback to
be executed.

Copy activity properties for blob storage as a source

- type (Required: Yes): The type property under storeSettings must be set to AzureBlobStorageReadSettings.

Locate the files to copy:
- OPTION 1: static path (Required: No): Copy from the container or folder/file path specified in the dataset. If you want to copy all blobs from a container or folder, additionally specify wildcardFileName as *.
- OPTION 2: blob prefix - prefix (Required: No): Prefix for the blob name under the given container configured in a dataset to filter source blobs. Blobs whose names start with container_in_dataset/this_prefix are selected. It utilizes the service-side filter for Blob storage, which provides better performance than a wildcard filter. When you use prefix and choose to copy to a file-based sink while preserving hierarchy, note that the sub-path after the last "/" in the prefix is preserved. For example, if you have source container/folder/subfolder/file.txt and configure prefix as folder/sub, then the preserved file path is subfolder/file.txt.
- OPTION 3: wildcard - wildcardFolderPath (Required: No): The folder path with wildcard characters under the given container configured in a dataset to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.
- OPTION 3: wildcard - wildcardFileName (Required: Yes): The file name with wildcard characters under the given container and folder path (or wildcard folder path) to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.
- OPTION 4: a list of files - fileListPath (Required: No): Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset. When you're using this option, do not specify a file name in the dataset. See more examples in File list examples.

Additional settings:
- recursive (Required: No): Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath.
- deleteFilesAfterCompletion (Required: No): Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is only valid in the binary files copy scenario. The default value: false.
- modifiedDatetimeStart (Required: No): Files are filtered based on the attribute: last modified. The files will be selected if their last modified time is greater than or equal to modifiedDatetimeStart and less than modifiedDatetimeEnd. The time is applied to a UTC time zone in the format of "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected. This property doesn't apply when you configure fileListPath.
- modifiedDatetimeEnd (Required: No): Same as above.
- enablePartitionDiscovery (Required: No): For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true.
- partitionRootPath (Required: No): When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default: when you use the file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset; when you use the wildcard folder filter, the partition root path is the sub-path before the first wildcard; when you use prefix, the partition root path is the sub-path before the last "/". For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27": if you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns, month and day, with the values "08" and "27" respectively, in addition to the columns inside the files; if the partition root path is not specified, no extra column is generated.
- maxConcurrentConnections (Required: No): The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.
Copy activity properties for blob storage as a sink

- copyBehavior (Required: No): Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
  - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
  - FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
  - MergeFiles: Merges all files from the source folder to one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
- blockSizeInMB (Required: No): Specify the block size, in megabytes, used to write data to block blobs. Learn more about block blobs. The allowed value is between 4 MB and 100 MB. By default, the service automatically determines the block size based on your source store type and data. For nonbinary copy into Blob storage, the default block size is 100 MB, so it can fit (at most) 4.95 TB of data. It might not be optimal when your data is not large, especially when you use the self-hosted integration runtime with poor network connections that result in operation timeouts or performance issues. You can explicitly specify a block size, while ensuring that blockSizeInMB * 50000 is big enough to store the data. Otherwise, the Copy activity run will fail.
- metadata (Required: No): Set custom metadata when copying to the sink. Each object under the metadata array represents an extra column. The name defines the metadata key name, and the value indicates the data value of that key. If the preserve attributes feature is used, the specified metadata will union/overwrite the source file metadata. Allowed data values are:
  - $$LASTMODIFIED: a reserved variable that indicates to store the source files' last modified time. Applies to file-based sources with binary format only.
  - Expression
  - Static value
Folder and file filter examples (wildcard filters)

This section describes the resulting behavior of the folder path and file name with wildcard filters. Assume the following source folder structure in each case:

container
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

- folderPath: container/Folder*, fileName: (empty, use default), recursive: false. Retrieved files: File1.csv and File2.json.
- folderPath: container/Folder*, fileName: (empty, use default), recursive: true. Retrieved files: File1.csv, File2.json, File3.csv, File4.json and File5.csv.
- folderPath: container/Folder*, fileName: *.csv, recursive: false. Retrieved files: File1.csv.
- folderPath: container/Folder*, fileName: *.csv, recursive: true. Retrieved files: File1.csv, File3.csv and File5.csv.

File6.csv is never retrieved, because AnotherFolderB does not match the Folder* wildcard.

File list examples (list of files)

This section describes the resulting behavior of using a file list path in the Copy activity source. Assume that you have the following source folder structure and want to copy File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv:

container
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:
File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Configuration:
- In the dataset: Container: container; Folder path: FolderA
- In the Copy activity source: File list path: container/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.

Some recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. The source folder structure in each case is:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

- recursive = true, copyBehavior = preserveHierarchy: The target folder, Folder1, is created with the same structure as the source: Folder1 contains File1, File2, and Subfolder1 with File3, File4, and File5.
- recursive = true, copyBehavior = flattenHierarchy: The target folder, Folder1, is created with all five files at the first level, each with an autogenerated name (autogenerated names for File1, File2, File3, File4, and File5).
- recursive = true, copyBehavior = mergeFiles: The target folder, Folder1, is created with one file: the contents of File1 + File2 + File3 + File4 + File5 are merged into one file with an autogenerated file name.
- recursive = false, copyBehavior = preserveHierarchy: The target folder, Folder1, is created with File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
- recursive = false, copyBehavior = flattenHierarchy: The target folder, Folder1, is created with autogenerated names for File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
- recursive = false, copyBehavior = mergeFiles: The target folder, Folder1, is created with one file: the contents of File1 + File2 are merged into one file with an autogenerated file name. Subfolder1 with File3, File4, and File5 is not picked up.

Data Integration Units


A Data Integration Unit is a measure that represents the power (a combination of
CPU, memory, and network resource allocation) of a single unit within the service.
Data Integration Unit only applies to Azure integration runtime, but not self-hosted
integration runtime.

I have absolutely no idea what 1 DIU actually is, but it doesn’t really matter.
What matters is that the more DIUs you specify, the more power you throw
at the copy data activity. And the more power you throw at the copy data
activity, the more you pay for it.

The allowed DIUs to empower a copy activity run is between 2 and 256. If not
specified or you choose "Auto" on the UI, the service dynamically applies the optimal
DIU setting based on your source-sink pair and data pattern. The following table lists
the supported DIU ranges and default behavior in different copy scenarios:

Copy scenario: Between file stores
- Supported DIU range: Copy from or to a single file: 2-4. Copy from and to multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files and choose to preserve hierarchy, the max effective DIU is 16; when you choose to merge files, the max effective DIU is 4.
- Default DIUs determined by the service: Between 4 and 32, depending on the number and size of the files.

Copy scenario: From file store to non-file store
- Supported DIU range: Copy from a single file: 2-4. Copy from multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files, the max effective DIU is 16.
- Default DIUs determined by the service: Copy into Azure SQL Database or Azure Cosmos DB: between 4 and 16, depending on the sink tier (DTUs/RUs) and source file pattern. Copy into Azure Synapse Analytics using PolyBase or the COPY statement: 2. Other scenarios: 4.

Copy scenario: From non-file store to file store
- Supported DIU range: Copy from partition-option-enabled data stores (including Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by the service: Copy from REST or HTTP: 1. Copy from Amazon Redshift using UNLOAD: 2. Other scenarios: 4.

Copy scenario: Between non-file stores
- Supported DIU range: Copy from partition-option-enabled data stores (including Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by the service: Copy from REST or HTTP: 1. Other scenarios: 4.

Degree of Copy Parallelism

Copy parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source. It sets the maximum number of connections the activity is allowed to use to read from the data source or write to the sink.
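A sketch of where these two knobs sit in a Copy activity definition; both are normally left to Auto, and the values here are illustrative.

copy_type_properties = {
    "source": {"type": "SqlSource"},
    "sink": {"type": "DelimitedTextSink"},
    "dataIntegrationUnits": 32,  # power (CPU/memory/network) applied to the copy run
    "parallelCopies": 8,         # maximum concurrent reads/writes against source and sink
}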

Data consistency verification


When you move data from source to destination store, the copy activity provides an option
for you to do additional data consistency verification to ensure the data is not only
successfully copied from source to destination store, but also verified to be consistent
between source and destination store. Once inconsistent files have been found during the
data movement, you can either abort the copy activity or continue to copy the rest by
enabling fault tolerance setting to skip inconsistent files. You can get the skipped file names
by enabling session log setting in copy activity.

The verification includes file size check and checksum verification for binary files, and
row count verification for tabular data.

 Data consistency verification is supported by all the connectors except FTP,


SFTP, HTTP, Snowflake, Office 365 and Azure Databricks Delta Lake.
 Data consistency verification is not supported in staging copy scenario.
 When copying binary files, data consistency verification is only available when
'PreserveHierarchy' behavior is set in copy activity.
 When copying multiple binary files in single copy activity with data consistency
verification enabled, you have an option to either abort the copy activity or
continue to copy the rest by enabling fault tolerance setting to skip inconsistent
files.
 When copying a table in single copy activity with data consistency verification
enabled, copy activity fails if the number of rows read from the source is
different from the number of rows copied to the destination plus the number of
incompatible rows that were skipped.

Fault Tolerance
When you copy data from a source to a destination store, the copy activity provides a certain level of fault tolerance to prevent interruption from failures in the middle of data movement. For example, suppose you are copying millions of rows from source to destination, a primary key has been created in the destination database, but the source database does not have any primary keys defined. If duplicated rows are copied from the source to the destination, you will hit a PK violation failure on the destination database. At this moment, the copy activity offers you two ways to handle such errors:

 You can abort the copy activity once any failure is encountered.
 You can continue to copy the rest by enabling fault tolerance to skip the incompatible
data. For example, skip the duplicated row in this case. In addition, you can log the
skipped data by enabling session log within copy activity. You can refer to session log
in copy activity for more details.
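A sketch of how skipping incompatible rows and session logging can be switched on in the Copy activity JSON; the property names follow the fault tolerance and session log settings described here, and the linked service name and log path are illustrative assumptions.

fault_tolerant_copy_type_properties = {
    "source": {"type": "SqlSource"},
    "sink": {"type": "SqlSink"},
    # Skip rows that would fail (for example, primary key violations) instead of aborting.
    "enableSkipIncompatibleRow": True,
    # Write the skipped rows/files to a session log in a storage account.
    "logSettings": {
        "enableCopyActivityLog": True,
        "copyActivityLogSettings": {"logLevel": "Warning", "enableReliableLogging": False},
        "logLocationSettings": {
            "linkedServiceName": {"referenceName": "LogStorageLinkedService", "type": "LinkedServiceReference"},
            "path": "copylogs/pipeline-runs",
        },
    },
}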

Copying binary files

The service supports the following fault tolerance scenarios when copying binary
files. You can choose to abort the copy activity or continue to copy the rest in the
following scenarios:

1. The files to be copied by the service are being deleted by other applications at the
same time.
2. Some particular folders or files do not allow the service access because ACLs of those
files or folders require higher permission level than the configured connection
information.
3. One or more files are not verified to be consistent between source and destination
store if you enable data consistency verification setting.

Supported scenarios

Copy activity supports three scenarios for detecting, skipping, and logging
incompatible tabular data:

 Incompatibility between the source data type and the sink native type.
For example: Copy data from a CSV file in Blob storage to a SQL database with
a schema definition that contains three INT type columns. The CSV file rows
that contain numeric data, such as 123,456,789 are copied successfully to the
sink store. However, the rows that contain non-numeric values, such as 123,456,
abc are detected as incompatible and are skipped.

 Mismatch in the number of columns between the source and the sink.

For example: Copy data from a CSV file in Blob storage to a SQL database with
a schema definition that contains six columns. The CSV file rows that contain six
columns are copied successfully to the sink store. The CSV file rows that contain
more than six columns are detected as incompatible and are skipped.

 Primary key violation when writing to SQL Server/Azure SQL


Database/Azure Cosmos DB.

For example: Copy data from a SQL server to a SQL database. A primary key is
defined in the sink SQL database, but no such primary key is defined in the
source SQL server. The duplicated rows that exist in the source cannot be
copied to the sink. Copy activity copies only the first row of the source data into
the sink. The subsequent source rows that contain the duplicated primary key
value are detected as incompatible and are skipped.

Enable Logging
When you select this option, you can log copied files, skipped files and skipped rows.

 Storage Connection Name: The linked service of Azure Storage or Azure Data Lake Storage Gen2 used to store the log of copy activity jobs. Note that blob storage accounts with hierarchical namespaces enabled are not supported in logging settings.

 Log level:
1. "Info" logs all the copied files, skipped files and skipped rows.
2. "Warning" logs skipped files and skipped rows only.

 Reliable logging: When it's true, a Copy activity in reliable mode flushes logs immediately once each file is copied to the destination. When copying many files with reliable logging enabled, you should expect the throughput to be impacted, since double write operations are required for each file copied: one request goes to the destination store and another to the log storage store.

 Best effort: A Copy activity in best-effort mode flushes logs in batches of records within a period of time, so copy throughput is much less impacted. The completeness and timeliness of logging isn't guaranteed in this mode, since the last batch of log events may not have been flushed to the log file when a Copy activity fails. In that scenario, you'll see that a few files copied to the destination aren't logged.

 Folder path: The path of the log files. Specify the path where you want to store the log files. If you don't provide a path, the service creates a container for you.

Enable staging
Specify whether to copy data via an interim staging store. Enable staging only for the scenarios where it is beneficial, e.g. loading data into Azure Synapse Analytics via PolyBase, loading data to/from Snowflake, or loading data from Amazon Redshift via UNLOAD or from HDFS via DistCp.

For example, suppose you are copying data from a source to a sink, but you don't want to copy directly; instead, you want to land the data in temporary storage (the staging store) and copy it from there into the final destination. Instead of creating two Copy activities (source to staging, then staging to sink), you can enable staging on a single Copy activity. Once all the data has been copied into the sink, the data in the staging store is deleted.
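A sketch of enabling staged copy on a single Copy activity; the sink type, linked service name, and path are illustrative.

staged_copy_type_properties = {
    "source": {"type": "SqlSource"},
    "sink": {"type": "SqlDWSink"},  # e.g. Azure Synapse Analytics loaded via PolyBase
    "enableStaging": True,
    "stagingSettings": {
        "linkedServiceName": {"referenceName": "StagingBlobLinkedService", "type": "LinkedServiceReference"},
        "path": "stagingcontainer/path",
        "enableCompression": True,
    },
}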

 Staging account linked service: Specify the name of an Azure Storage linked service that refers to the Storage instance you use as an interim staging store. You cannot use Storage with a shared access signature to load data into Azure Synapse Analytics or SQL Pool via PolyBase; you can use it in all other scenarios.

 Storage path: Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location.

 Enable compression: Specifies whether data should be compressed before it's copied to the destination. This setting reduces the volume of data being transferred.

Preserve

Copy activity supports preserving the following attributes during data copy:

 All the customer-specified metadata
 And the following five data store built-in system properties: contentType, contentLanguage (except for Amazon S3), contentEncoding, contentDisposition, cacheControl.
Add additional columns during copy activity for SQL Server

In addition to copying data from source data store to sink, you can also configure to
add additional data columns to copy along to sink. For example:

 When copying from a file-based source, store the relative file path as an additional column to trace which file the data comes from.
 Duplicate the specified source column as another column.
 Add a column with ADF expression, to attach ADF system variables like pipeline
name/pipeline ID, or store other dynamic value from upstream activity's output.
 Add a column with static value to meet your downstream consumption need.

You can find the following configuration on copy activity source tab. You can also
map those additional columns in copy activity schema mapping as usual by using
your defined column names.

NOTE:- This feature works with the latest dataset model. If you don't see this option
from the UI, try creating a new dataset.

To configure it programmatically, add the additionalColumns property in your copy


activity source:

- additionalColumns (Required: No): Add additional data columns to copy to the sink. Each object under the additionalColumns array represents an extra column. The name defines the column name, and the value indicates the data value of that column. Allowed data values are:
  - $$FILEPATH: a reserved variable that indicates to store the source files' relative path to the folder path specified in the dataset. Applies to file-based sources.
  - $$COLUMN:<source_column_name>: a reserved variable pattern that indicates to duplicate the specified source column as another column.
  - Expression
  - Static value
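A sketch of the additionalColumns setting on a Copy activity source, combining the four allowed value types; the column names and static value are illustrative.

copy_source_with_additional_columns = {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {"name": "source_file", "value": "$$FILEPATH"},                 # relative file path
        {"name": "customer_id_copy", "value": "$$COLUMN:customer_id"},  # duplicate a source column
        {"name": "pipeline_name", "value": {"value": "@pipeline().Pipeline", "type": "Expression"}},
        {"name": "load_source", "value": "crm_export"},                 # static value
    ],
}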
Copy activity for SQL Server as a source

- type (Required: Yes): The type property of the copy activity source must be set to SqlSource.
- sqlReaderQuery (No): Use the custom SQL query to read data. An example is select * from MyTable.
- sqlReaderStoredProcedureName (No): The name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure.
- storedProcedureParameters (No): Parameters for the stored procedure. Allowed values are name or value pairs. The names and casing of parameters must match the names and casing of the stored procedure parameters.
- isolationLevel (No): Specifies the transaction locking behavior for the SQL source. The allowed values are: ReadCommitted, ReadUncommitted, RepeatableRead, Serializable, Snapshot. If not specified, the database's default isolation level is used. Refer to the isolation level reference below for more details.
- partitionOptions (No): Specifies the data partitioning options used to load data from SQL Server. Allowed values are: None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from SQL Server is controlled by the parallelCopies setting on the copy activity.
  1. PhysicalPartitionsOfTable: When using physical partitions, ADF automatically determines the partition column and mechanism based on your physical table definition.
  2. DynamicRange: When using a query with parallelism enabled, the range partition parameter (?AdfDynamicRangePartitionCondition) is needed. Sample query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition
- partitionSettings (No): Specify the group of settings for data partitioning. Applies when the partition option isn't None. Under partitionSettings:
  - partitionColumnName (No): Specify the name of the source column of integer or date/datetime type (int, smallint, bigint, date, smalldatetime, datetime, datetime2, or datetimeoffset) that will be used by range partitioning for parallel copy. If not specified, the index or the primary key of the table is auto-detected and used as the partition column. Applies when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfDynamicRangePartitionCondition into the WHERE clause. For an example, see the Parallel copy from SQL database section.
  - partitionUpperBound (No): The maximum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Applies when the partition option is DynamicRange. For an example, see the Parallel copy from SQL database section.
  - partitionLowerBound (No): The minimum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Applies when the partition option is DynamicRange. For an example, see the Parallel copy from SQL database section.

Note the following points:

 If sqlReaderQuery is specified for SqlSource, the copy activity runs this query against
the SQL Server source to get the data. You also can specify a stored procedure by
specifying sqlReaderStoredProcedureName and storedProcedureParameters if the
stored procedure takes parameters.

 When using stored procedure in source to retrieve data, note if your stored procedure
is designed as returning different schema when different parameter value is passed in,
you may encounter failure or see unexpected result when importing schema from UI or
when copying data to SQL database with auto table creation.
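A sketch of a SQL Server source configured for parallel copy with dynamic range partitioning; the table, column, and bounds are illustrative, and the exact JSON property spelling (partitionOption/partitionSettings) is an assumption based on typical copy activity definitions.

sql_server_source = {
    "type": "SqlSource",
    # The placeholder is replaced by the service with a range predicate per partition.
    "sqlReaderQuery": "SELECT * FROM dbo.Sales WHERE ?AdfDynamicRangePartitionCondition",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "SaleId",
        "partitionLowerBound": "1",
        "partitionUpperBound": "1000000",
    },
}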

Isolation levels
- Chaos (16): The pending changes from more highly isolated transactions cannot be overwritten.
- ReadCommitted (4096): Shared locks are held while the data is being read to avoid dirty reads, but the data can be changed before the end of the transaction, resulting in non-repeatable reads or phantom data.
- ReadUncommitted (256): A dirty read is possible, meaning that no shared locks are issued and no exclusive locks are honored.
- RepeatableRead (65536): Locks are placed on all data that is used in a query, preventing other users from updating the data. Prevents non-repeatable reads, but phantom rows are still possible.
- Serializable (1048576): A range lock is placed on the DataSet, preventing other users from updating or inserting rows into the dataset until the transaction is complete.
- Snapshot (16777216): Reduces blocking by storing a version of data that one application can read while another is modifying the same data. Indicates that from one transaction you cannot see changes made in other transactions, even if you requery.
- Unspecified (-1): A different isolation level than the one specified is being used, but the level cannot be determined.

Remarks
The IsolationLevel values are used by a .NET data provider when performing a
transaction.

The IsolationLevel remains in effect until explicitly changed, but it can be changed at
any time. The new value is used at execution time, not parse time. If changed during
a transaction, the expected behavior of the server is to apply the new locking level to
all statements remaining.

When using OdbcTransaction, if you do not set OdbcTransaction.IsolationLevel or


you set it to Unspecified, the transaction executes according to the isolation level
determined by the driver in use.

For more on isolation levels, see: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-ver16
Copy activity for SQL Server as a sink

- type: The type property of the copy activity sink must be set to SqlSink.
- preCopyScript: Specifies a SQL query for the copy activity to run before writing data into SQL Server. It's invoked only once per copy run. You can use this property to clean up preloaded data.
- tableOption: Specifies whether to automatically create the sink table if it does not exist, based on the source schema. Auto table creation is not supported when the sink specifies a stored procedure. Allowed values are: none (default), autoCreate.
- sqlWriterStoredProcedureName: The name of the stored procedure that defines how to apply source data to a target table. This stored procedure is invoked per batch. For operations that run only once and have nothing to do with source data, for example delete or truncate, use the preCopyScript property. See the example in Invoke a stored procedure from a SQL sink.
- storedProcedureTableTypeParameterName: The parameter name of the table type specified in the stored procedure.
- sqlWriterTableType: The table type name to be used in the stored procedure. The copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data that's being copied with existing data.
- storedProcedureParameters: Parameters for the stored procedure. Allowed values are name and value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters.
- writeBatchSize: Number of rows to insert into the SQL table per batch. Allowed values are integers for the number of rows. By default, the service dynamically determines the appropriate batch size based on the row size.
- writeBatchTimeout: The wait time for the batch insert operation to complete before it times out. Allowed values are for the timespan. An example is "00:30:00" for 30 minutes. If no value is specified, the timeout defaults to "02:00:00".
- maxConcurrentConnections: The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.
- writeBehavior: Specifies the write behavior for the copy activity to load data into a SQL Server database. The allowed values are Insert and Upsert. By default, the service uses Insert to load data.
- upsertSettings: Specifies the group of settings for write behavior. Applies when the writeBehavior option is Upsert. Under upsertSettings:
  - useTempDB: Specifies whether to use a global temporary table or a physical table as the interim table for upsert. By default, the service uses a global temporary table as the interim table (value is true). If you write a large amount of data into the SQL database, uncheck this and specify a schema name under which Data Factory will create a staging table to load upstream data and automatically clean it up upon completion. Make sure the user has create table permission in the database and alter permission on the schema. If not specified, a global temp table is used as staging.
  - interimSchemaName: Specifies the interim schema for creating the interim table if a physical table is used. Note: the user needs permission to create and delete tables. By default, the interim table shares the same schema as the sink table. Applies when the useTempDB option is false. Specify a schema name under which Data Factory will create a staging table to load upstream data and automatically clean it up upon completion. Make sure you have create table permission in the database and alter permission on the schema.
  - keys: Specifies the column names for unique row identification. Either a single key or a series of keys can be used. If not specified, the primary key is used. Choose which column is used to determine whether a row from the source matches a row from the sink.
- Bulk insert table lock: Use this to improve copy performance during a bulk insert operation on a table with no index from multiple clients.
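A sketch of a SQL Server sink configured for upsert using the properties above; the key column, schema name, and batch size are illustrative.

sql_server_sink = {
    "type": "SqlSink",
    "writeBehavior": "upsert",
    "upsertSettings": {
        "useTempDB": False,              # use a physical interim table instead of a global temp table
        "interimSchemaName": "staging",  # schema where the interim table is created
        "keys": ["CustomerId"],          # column(s) used to match source rows to sink rows
    },
    "writeBatchSize": 10000,
    "tableOption": "autoCreate",
}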
Delimited text format
Property Description Required
Type The type property of the dataset must be set to DelimitedText. Yes
location Location settings of the file(s). Each file-based connector has its own location type and Yes
supported properties under location.
columnDelimiter The character(s) used to separate columns in a file. No
The default value is comma ,. When the column delimiter is defined as empty string,
which means no delimiter, the whole line is taken as a single column.
Currently, column delimiter as empty string is only supported for mapping data flow but
not Copy activity.
rowDelimiter (Required: No)
For Copy activity, the single character or "\r\n" used to separate rows in a file. The default
value is any of the following values on read: ["\r\n", "\r", "\n"]; on write: "\r\n". "\r\n" is
only supported in the copy command.
For Mapping data flow, the single or two characters used to separate rows in a file. The
default value is any of the following values on read: ["\r\n", "\r", "\n"]; on write: "\n".
When the row delimiter is set to no delimiter (empty string), the column delimiter must be
set to no delimiter (empty string) as well, which means the entire content is treated as a
single value.
Currently, a row delimiter of empty string is only supported for mapping data flow, not the
Copy activity.

quoteChar (Required: No)
The single character used to quote column values if they contain the column delimiter.
The default value is the double quote ".
When quoteChar is defined as an empty string, there is no quote character, column values
are not quoted, and escapeChar is used to escape the column delimiter and itself.

escapeChar (Required: No)
The single character used to escape quotes inside a quoted value.
The default value is the backslash \.
When escapeChar is defined as an empty string, quoteChar must be set to an empty string
as well; in that case, make sure no column values contain delimiters.

firstRowAsHeader (Required: No)
Specifies whether to treat/make the first row a header line containing the column names.
Allowed values are true and false (default).
When first row as header is false, note that UI data preview and Lookup activity output
auto-generate column names as Prop_{n} (starting from 0), the Copy activity requires explicit
mapping from source to sink and locates columns by ordinal (starting from 1), and mapping
data flow lists and locates columns with the name Column_{n} (starting from 1).

nullValue (Required: No)
Specifies the string representation of a null value.
The default value is an empty string.

encodingName (Required: No)
The encoding type used to read/write text files.
Allowed values are as follows: "UTF-8", "UTF-8 without BOM", "UTF-16", "UTF-16BE", "UTF-
32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030",
"JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437",
"IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860",
"IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141",
"IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148",
"IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3",
"ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
"ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251",
"WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255",
"WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258".
Note that mapping data flow doesn't support UTF-7 encoding.

compressionCodec (Required: No)
The compression codec used to read/write text files.
Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. The default
is no compression.
Note that the Copy activity currently doesn't support "snappy" and "lz4", and mapping data
flow doesn't support "ZipDeflate", "TarGzip" and "Tar".
Note that when using the Copy activity to decompress ZipDeflate/TarGzip/Tar file(s) and
write to a file-based sink data store, by default files are extracted to the folder
<path specified in dataset>/<folder named as source compressed file>/. Use
preserveZipFileNameAsFolder/preserveCompressionFileNameAsFolder on the Copy activity
source to control whether to preserve the name of the compressed file(s) as the folder
structure.

compressionLevel (Required: No)
The compression ratio. Allowed values are Optimal or Fastest.
- Fastest: The compression operation should complete as quickly as possible, even if the
resulting file is not optimally compressed.
- Optimal: The compression operation should compress optimally, even if the operation
takes a longer time to complete. For more information, see the Compression Level topic.
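
To see how these settings fit together, here is a minimal sketch of a DelimitedText dataset
definition, written as a Python dictionary that mirrors the dataset JSON. The dataset name,
linked service name, container, and folder path are illustrative assumptions, not values
taken from this document.

```python
# Sketch of DelimitedText dataset properties, as a Python dict mirroring the
# ADF dataset JSON. Names and paths are hypothetical, for illustration only.
delimited_text_dataset = {
    "name": "CsvInputDataset",                        # hypothetical dataset name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobLinkedService",   # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",                 # hypothetical container
                "folderPath": "vehicles",             # hypothetical folder
            },
            "columnDelimiter": ",",
            "rowDelimiter": "\r\n",
            "quoteChar": "\"",
            "escapeChar": "\\",
            "firstRowAsHeader": True,
            "nullValue": "",
            "encodingName": "UTF-8",
            "compressionCodec": "gzip",
            "compressionLevel": "Optimal",
        },
    },
}
```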

Linked services
Linked services are much like connection strings, which define the connection
information that's needed for Data Factory to connect to external resources.

A linked service tells ADF how to reach the particular data store or compute resource you
want to operate on. To access a specific Azure storage account, you create a linked service
for it and include access credentials. To read from or write to another storage account, you
create another linked service. To allow ADF to operate on an Azure SQL database, the
linked service states the Azure subscription, server name, database name, and credentials.
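
As a rough sketch of the idea, an Azure Blob Storage linked service definition might look
like the following, again written as a Python dict that mirrors the JSON. The service name,
storage account, and the choice of connection-string authentication are assumptions for
illustration.

```python
# Sketch of an Azure Blob Storage linked service definition (illustrative names).
blob_linked_service = {
    "name": "MyBlobLinkedService",                    # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # A connection string is one common option; SAS tokens, service
            # principals, or managed identities are alternatives.
            "connectionString": "DefaultEndpointsProtocol=https;"
                                "AccountName=mystorageacct;AccountKey=<account-key>",
        },
    },
}
# An Azure SQL Database linked service would instead carry the server name,
# database name, and credentials, as described above.
```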

Datasets
Datasets represent data structures within the data stores, which simply point to or
reference the data you want to use in your activities as inputs or outputs.

A data set makes a linked service more specific; it describes the folder you are using
within a storage container, or the table within a database, etc.

The data set in this screenshot points to one directory in one container in one Azure
storage account. (The container and directory names are set in the Parameters tab.)
Note how the data set references a linked service. Note also that this data set
specifies that the data is zipped, which allows ADF to automatically unzip the data as
you read it.
Source and Sink
A source and a sink are, as their names imply, places data comes from and goes to.
Sources and sinks are built on data sets. ADF is mostly concerned with moving data
from one place to another, often with some kind of transformation along the way, so
it needs to know where to move the data.

It is important to understand that the distinction between data sets and sources/sinks is
somewhat blurry. A data set defines a particular collection of data, but a source or sink
can redefine that collection. For example, suppose DataSet1 is defined as the folder
/Vehicles/GM/Trucks/. When a source uses DataSet1, it can take that collection as-is
(the default), narrow the set to /Vehicles/GM/Trucks/Silverado/, or expand it to
/Vehicles/.
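
As a hedged illustration of a source redefining the collection, the sketch below shows a
Copy activity source that narrows the hypothetical DataSet1 folder with a wildcard path.
The store settings shown are the common blob read settings, used here as an assumed example.

```python
# Sketch: a Copy activity source that narrows DataSet1 (defined on
# /Vehicles/GM/Trucks/) down to the Silverado subfolder via a wildcard path.
copy_source = {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": True,
        "wildcardFolderPath": "Vehicles/GM/Trucks/Silverado",  # narrows the dataset
        "wildcardFileName": "*.csv",                           # hypothetical file filter
    },
}
```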

Integration Runtime (IR)


An Integration Runtime (IR) is the compute infrastructure used by Azure Data
Factory to provide data integration capabilities such as Data Flows and Data
Movement. It has access to resources in either public networks, or hybrid scenarios
(public and private networks).

Integration Runtimes are specified in each Linked Service, under Connections.

There are 3 types to choose from.


1. Azure Integration Runtime is managed by Microsoft. All the patching,
scaling and maintenance of the underlying infrastructure is taken care of. The
IR can only access data stores and services in public networks.

2. Self-hosted Integration Runtimes use infrastructure and hardware managed by you.
You’ll need to address all the patching, scaling and maintenance. The IR can access
resources in both public and private networks.
3. Azure-SSIS Integration Runtimes are VMs running the SSIS engine which
allow you to natively execute SSIS packages. They are managed by
Microsoft. As a result, all the patching, scaling and maintenance is taken care
of. The IR can access resources in both public and private networks.
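
Because the integration runtime is specified per linked service, a linked service that
should run through a self-hosted IR references it under connectVia. A minimal sketch
follows, with hypothetical server and IR names.

```python
# Sketch: a linked service routed through a self-hosted integration runtime.
on_prem_sql_linked_service = {
    "name": "OnPremSqlLinkedService",                 # hypothetical name
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=myserver;Database=mydb;Integrated Security=True",
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",        # hypothetical IR name
            "type": "IntegrationRuntimeReference",
        },
    },
}
```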

Azure Key Vault
Azure Key Vault is a service for securely storing and accessing secrets such as connection
strings, passwords, certificates, and keys. In Data Factory, a linked service can reference
secrets stored in Key Vault rather than embedding credentials directly in its definition.

Triggers
Azure Data Factory triggers determine when a pipeline execution will be fired, based on
the trigger type and the criteria defined in that trigger.

Azure Data Factory allows you to assign multiple triggers to a single pipeline, or to
execute multiple pipelines with a single trigger; the exception is the tumbling window
trigger, which is tied to a single pipeline.

There are three main types of Azure Data Factory Triggers:


1. Schedule trigger: executes the pipeline on a wall-clock schedule. You specify the
reference time zone used for the trigger's start and end dates, when the pipeline will be
executed, how frequently it will be executed, and optionally an end date for the trigger.

Schedule triggers and pipelines have a many-to-many relationship: one schedule trigger
can execute many pipelines, and one pipeline can be executed by many schedule triggers.

In the New Azure Data Factory Trigger window, provide a meaningful name for the trigger
that reflects its type and usage, the type of the trigger (Schedule here), the start date
for the schedule trigger, the time zone that will be used in the schedule, optionally the
end date of the trigger, and the frequency of the trigger, which can be configured to fire
every specific number of minutes or hours. You can also choose whether or not to activate
the trigger immediately after you publish it.

Even if you choose a start time in the past, the trigger will only start at the first
valid future execution time after it has been published, as shown below:
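
A minimal sketch of a schedule trigger definition, written as a Python dict mirroring the
trigger JSON; the trigger name, pipeline name, and recurrence values are illustrative
assumptions.

```python
# Sketch: a schedule trigger that runs one pipeline every day at 06:00 UTC.
schedule_trigger = {
    "name": "DailyScheduleTrigger",                   # hypothetical trigger name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T06:00:00Z",  # assumed start time
                "timeZone": "UTC",
            }
        },
        # Many-to-many: this list could reference several pipelines.
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesDataPipeline",  # hypothetical pipeline
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```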

2. Tumbling window trigger: executes the pipeline on a periodic interval and retains the
pipeline state. It is fired at a periodic time interval from a specified start time, while
retaining state. You can think of it as a series of fixed-sized, non-overlapping, and
contiguous time intervals. It can be used to execute a single pipeline run for each
specified time slice or time window.

The tumbling window trigger fits when working with time-based data, where each data slice
has the same size. It can pass the start time and end time of each time window into your
query, in order to return the data between that start and end time.

A common use case is copying data from a database into a data lake and storing the data in
separate files or folders for each hour or each day. In that case, you define a tumbling
window trigger for every 1 hour or for every 24 hours. The tumbling window trigger passes
the start and end time of each time window into the database query, which then returns all
data between that start and end time. Finally, the data is saved in separate files or
folders for each hour or each day.

This even works for dates in the past, so you can use it to easily backfill or load
historical data.

Tumbling window triggers and pipelines have a one-to-one relationship, because of the
tight integration between the time windows in the trigger and how they are used in the
pipeline.

When creating a tumbling window trigger, you need to provide a meaningful name for the
trigger, the trigger type (Tumbling window here), the start date and optionally the end
date in UTC, and the trigger calling frequency. Advanced options let you configure a delay
(to wait a certain time after the window start time before executing the pipeline), limit
the maximum number of concurrent tumbling windows running in parallel, set the retry count
and interval, and define dependencies to ensure that the trigger starts only after
another tumbling window trigger has completed successfully, as shown below:
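
A minimal sketch of a tumbling window trigger, again mirroring the trigger JSON as a
Python dict; note how the window start and end times can be passed into (assumed) pipeline
parameters so a query can filter on them. All names and values here are illustrative.

```python
# Sketch: an hourly tumbling window trigger passing its window boundaries
# into two hypothetical pipeline parameters, windowStart and windowEnd.
tumbling_window_trigger = {
    "name": "HourlyWindowTrigger",                    # hypothetical trigger name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",      # a past start enables backfill
            "delay": "00:05:00",                      # wait 5 min after window start
            "maxConcurrency": 4,                      # parallel windows limit
            "retryPolicy": {"count": 2, "intervalInSeconds": 300},
        },
        # One-to-one: a tumbling window trigger references a single pipeline.
        "pipeline": {
            "pipelineReference": {
                "referenceName": "HourlyLoadPipeline",  # hypothetical pipeline
                "type": "PipelineReference",
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        },
    },
}
```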
3. Event-based trigger: responds to a blob-related event, such as creating or deleting a
blob file, in an Azure Blob Storage account.

Event triggers can execute one or more pipelines when events happen. You use them when you
need to execute a pipeline when something happens, instead of at specific times.

Event triggers and pipelines have a many-to-many relationship: one event trigger can
execute many pipelines, and one pipeline can be executed by many event triggers.

When creating an ADF event-based trigger, you will be asked to provide a meaningful name
for the trigger, the type of the trigger (Event here), the subscription that hosts the
Azure Storage account, the storage account name, the container name, and a filter for the
blob files you are interested in: whether the trigger fires when a blob is created,
deleted, or both, whether to ignore empty blob files, and a blob path that can begin with
a folder path and/or end with a file name or extension, as shown below:

Once you select a path, you can confirm that it has been configured correctly from the
data preview page:
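
A minimal sketch of an event-based (blob events) trigger; the storage account scope,
container, path filters, and pipeline name are assumptions for illustration.

```python
# Sketch: a blob events trigger fired when .csv blobs are created in a container.
blob_event_trigger = {
    "name": "NewCsvBlobTrigger",                      # hypothetical trigger name
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            # Placeholder resource ID of the storage account being watched.
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                     "Microsoft.Storage/storageAccounts/<account>",
            "blobPathBeginsWith": "/input/blobs/",    # container 'input' (assumed)
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": True,
            "events": ["Microsoft.Storage.BlobCreated"],
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessNewFilePipeline",  # hypothetical pipeline
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```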

After creating the triggers, you can review, edit or enable them from the Triggers page,
where you can see that the three triggers are disabled and not yet connected to any
pipeline, as shown below:
Make sure to publish the created triggers in order to be able to use them to execute the
ADF pipelines.

Difference between Schedule trigger and Tumbling window trigger

 A schedule trigger can only trigger future-dated loads, whereas tumbling window
triggers can be configured to initiate both past- and future-dated loads.
 Schedule triggers and pipelines have a many-to-many relationship, whereas a
tumbling window trigger has a one-to-one relationship with a pipeline and can
only reference a single pipeline.
 A tumbling window trigger has a self-dependency property, which means the
trigger doesn't proceed to the next window until the preceding window has
completed successfully.

Adding triggers to pipelines


Once you have created your triggers, open the pipeline that you want to trigger.
From here, you can trigger now or click add trigger, then New/Edit:
This opens the add triggers pane, where you can select the trigger:

In the triggers tab, you can now see that the trigger has a pipeline attached to it,
and you can click to activate it:

Now, let us disconnect the Schedule trigger from the pipeline by clicking Add Trigger on the
pipeline page and removing it, then connect the Tumbling window trigger instead. Monitoring
the pipeline execution from the Azure Data Factory Monitor page, you will see that the
pipeline is executed based on the tumbling window trigger settings, as shown below:
The case is different with the event trigger, where the trigger will not be executed at a specific
time. Instead, it will be executed when a new blob file is added to the Azure Storage account or
deleted from that account, based on the trigger configuration.

In the event trigger that we created previously, we configured it to be fired when a new blob file
is created in the storage account. If we upload a new blob file to our container, as shown below:

You will see that the pipeline is executed automatically, because the trigger's firing
criteria have been met, as shown below:
Resource
A resource is simply an Azure service, such as App Service, Azure Storage, or Azure Active
Directory. Whenever you create a new resource, you are actually creating an instance of an
Azure service.

You can think of a resource as a service in your Azure subscription: as soon as you
purchase a service in Azure, you create a resource, and if you are done with the service
and delete it, the resource is deleted as well. Azure uses the resource to store all the
configuration you have made to that service.

Every Azure resource can be represented as a JSON template, which is a simple file of
properties and values. Four common properties exist across all resources: type,
apiVersion, name, and location.
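
As a rough illustration of those four common properties, a single resource entry in an
ARM-style template might look like the sketch below; the name, API version, and location
are assumed values.

```python
# Sketch: the four properties common to every resource, shown for a
# hypothetical storage account entry, expressed as a Python dict.
resource = {
    "type": "Microsoft.Storage/storageAccounts",
    "apiVersion": "2023-01-01",                       # assumed API version
    "name": "mystorageacct",                          # hypothetical name
    "location": "eastus",                             # assumed region
    # Resource-specific settings would follow, for example:
    "sku": {"name": "Standard_LRS"},
    "kind": "StorageV2",
}
```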

Resource Groups
An Azure resource group is a collection of resources: the container in which multiple
Azure services reside.

Every Azure resource must be located in a resource group. The resource group makes it
easier to manage the life cycle of all the services it contains in one place: you can
deploy, update and delete those services together.

You should use resource groups to logically group related resources. Resource groups are
created as a utility to manage other resources.

How to group resources to our resource group
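
One way to group resources is programmatically. The following is a minimal sketch,
assuming the azure-identity and azure-mgmt-resource Python packages, of creating a
resource group and listing the resources it contains; the subscription ID, group name,
and location are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholders: supply your own subscription ID.
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Create (or update) a resource group; new resources are then deployed into it.
resource_group = client.resource_groups.create_or_update(
    "adf-demo-rg",               # hypothetical resource group name
    {"location": "eastus"},
)

# List every resource currently grouped in it.
for resource in client.resources.list_by_resource_group("adf-demo-rg"):
    print(resource.name, resource.type)
```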


Azure Storage
The Azure Storage platform is Microsoft's cloud storage solution for modern data
storage scenarios. Azure Storage offers highly available, massively scalable, durable,
and secure storage for a variety of data objects in the cloud. Azure Storage data
objects are accessible from anywhere in the world over HTTP or HTTPS via a REST
API.
An Azure Storage Account is a secure account, which provides you access to services
in Azure Storage. The storage account is like an administrative container, and within
that, we can have several services like blobs, files, queues, tables, disks, etc.

Azure Storage data services


The Azure Storage platform includes the following data services:
1. Azure Blobs: A massively scalable object store for text and binary data.
Also includes support for big data analytics through Data Lake Storage
Gen2.
Blob storage is designed for:
 Serving images or documents directly to a browser.
 Storing files for distributed access.
 Streaming video and audio.
 Writing to log files.
 Storing data for backup and restore, disaster recovery, and archiving.
 Storing data for analysis by an on-premises or Azure-hosted service.

2. Azure Files: Managed file shares for cloud or on-premises deployments.


Azure Files is mainly used when we want a shared drive between two servers or shared
across users.
3. Azure Queues: A messaging store for reliable messaging between
application components.

It is a queue service; a more advanced version of the queue service, the Service Bus
queue, is also available in Azure.

o It is a service for storing large numbers of messages in the cloud that can be
accessed from anywhere in the world using HTTP or HTTPS.
o A queue contains a set of messages. The queue name must be all lowercase.
o A single queue message can be up to 64 KB in size, and a message can remain in
the queue for a maximum of 7 days.
o The URL format is http://<storage account>.queue.core.windows.net/<queue>
o When a message is retrieved from the queue, it stays invisible for 30 seconds by
default. A message must be explicitly deleted from the queue to avoid being picked
up by another application. (A minimal code sketch of working with a queue follows
this list of services.)
4. Azure Tables: A NoSQL store for schemaless storage of structured data.
Azure Table storage is used for storing a large amount of structured data. This service
is a NoSQL data storage, which accepts authenticated calls from inside and outside of
the Azure cloud. It is ideal for storing structured and non-relational data.

Typical uses of Table storage include:

o Table storage is used for storing TBs of structured data capable of serving
web-scale applications.
o It is used for storing datasets that don't require complex joins, foreign keys, or
stored procedures and can be denormalized for fast access.
o It is used for quickly querying data using a clustered index.
o There are two ways of accessing the data: using the OData protocol, or using LINQ
queries with WCF Data Services and the .NET libraries.

5. Azure Disks: Block-level storage volumes for Azure VMs.
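
Relating to the Azure Queues service described above, here is a minimal sketch assuming
the azure-storage-queue Python package; the connection string and queue name are
placeholders.

```python
from azure.storage.queue import QueueClient

# Placeholders: supply a real connection string; queue names must be lowercase.
queue = QueueClient.from_connection_string("<connection-string>", queue_name="orders")

# Create the queue (raises if it already exists; wrap in try/except when re-running).
queue.create_queue()

# Send a message of up to 64 KB.
queue.send_message("process-order-42")

# Received messages stay invisible to other consumers for the visibility timeout
# (30 seconds by default); delete them once processed so they are not picked up again.
for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)
```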

Access Tiers
There are four types of access tiers available:
1. Premium Storage (preview): It provides high-performance hardware for data
that is accessed frequently.
2. Hot storage: It is optimized for storing data that is accessed frequently.
3. Cool Storage: It is optimized for storing data that is infrequently accessed and
stored for at least 30 days.
4. Archive Storage: It is optimized for storing files that are rarely accessed and
stored for a minimum of 180 days with flexible latency needs (on the order of
hours).
Types of performance tiers

1. Standard performance: This tier is backed by magnetic drives and provides a low cost
per GB. It is best suited for bulk storage or infrequently accessed data, and can be used
for storing blobs, files, queues, and Azure virtual machine disks.
2. Premium storage performance: This tier is backed by solid-state drives and offers
consistent, low-latency performance. It can only be used with Azure virtual machine disks
and is best for I/O-intensive workloads such as databases. (Every virtual machine disk is
stored in a storage account, so if we are attaching a disk we go for premium storage, but
if we are using a storage account specifically to store blobs, we go for standard
performance.)

Data Redundancy
Azure Storage Replication is used for the durability of the data. It copies our data to
stay protected from planned and unplanned events, ranging from transient hardware
failure, network or power outages, and massive natural disasters to man-made
vulnerabilities.

Azure creates copies of our data and stores them in different places, based on the
replication strategy.

1. LRS (Locally Redundant Storage): If we go with locally redundant storage, the data is
stored within a single data center. If that data center or the region goes down, the data
will be lost.
2. ZRS (Zone-Redundant Storage): The data is replicated across data centers (availability
zones) within the region. The data remains available even if one data center goes down,
because it has already been copied to another data center within the region. However, if
the region itself is lost, you will not be able to access the data.
3. GRS (Geo-Redundant Storage): To protect our data against region-wide failures, we can
go for geo-redundant storage. In this case, the data is also replicated to the paired
region within the same geography.
4. RA-GRS (Read-Access Geo-Redundant Storage): If we also want read-only access to the
data that is copied to the other region, we can go for this redundancy option.

Each of these options provides a different level of durability.
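
The performance tier and redundancy option are chosen together through the storage account
SKU. Below is a minimal sketch, assuming the azure-identity and azure-mgmt-storage Python
packages, of creating an account with standard performance and geo-redundant storage; the
subscription ID, resource group, account name, and location are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, "<subscription-id>")

# The SKU name encodes performance tier + redundancy, e.g. Standard_LRS,
# Standard_ZRS, Standard_GRS, Standard_RAGRS, Premium_LRS.
params = StorageAccountCreateParameters(
    sku=Sku(name="Standard_GRS"),        # standard performance, geo-redundant
    kind="StorageV2",
    location="eastus",                   # assumed region
)

# Long-running operation: begin_create returns a poller.
poller = storage_client.storage_accounts.begin_create(
    "adf-demo-rg",                       # hypothetical resource group
    "adfdemostorage",                    # hypothetical, globally unique lowercase name
    params,
)
account = poller.result()
print(account.name, account.sku.name)
```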

Cloud Computing Service Models


1. IaaS (Infrastructure as a Service)
Infrastructure as a service (IaaS) is a type of cloud computing service that
offers essential compute, storage, and networking resources on demand, on a
pay-as-you-go basis. IaaS is one of the four types of cloud services, along with
software as a service (SaaS), platform as a service (PaaS), and serverless.

Migrating your organization’s infrastructure to an IaaS solution helps you reduce
maintenance of on-premises data centers, save money on hardware
costs, and gain real-time business insights. IaaS solutions give you the
flexibility to scale your IT resources up and down with demand. They also help
you quickly provision new applications and increase the reliability of your
underlying infrastructure.

IaaS lets you bypass the cost and complexity of buying and managing physical
servers and datacenter infrastructure. Each resource is offered as a separate
service component, and you only pay for a particular resource for as long as
you need it. A cloud computing service provider like Azure manages the
infrastructure, while you purchase, install, configure, and manage your own
software—including operating systems, middleware, and applications.

Common IaaS business scenarios


Lift-and-shift migration
This is the fastest and least expensive method of migrating an application or
workload to the cloud. Without refactoring your underlying architecture, you
can increase the scale and performance, enhance the security, and reduce the
costs of running an application or workload.

Test and development


Your team can quickly set up and dismantle test and development
environments, bringing new applications to market faster. IaaS makes it quick
and economical to scale dev/test environments up and down.

Storage, backup, and recovery


Your organization avoids the capital outlay for storage and the complexity of
storage management, which typically requires a skilled staff to manage data
and meet legal and compliance requirements. IaaS is useful for handling
unpredictable demand and steadily growing storage needs. It also can simplify
planning and management of backup and recovery systems.

Web apps
IaaS provides all the infrastructure to support web apps, including storage,
web and application servers, and networking resources. Your organization can
quickly deploy web apps on IaaS and easily scale infrastructure up and down
when demand for the apps is unpredictable.

High-performance computing
High-performance computing on supercomputers, computer grids, or
computer clusters helps solve complex problems involving millions of
variables or calculations. Examples include protein folding and earthquake
simulations, climate and weather predictions, financial modeling, and product
design evaluations.

Advantages of IaaS
Reduces capital expenditures and optimizes costs
IaaS eliminates the cost of configuring and managing a physical datacenter,
which makes it a cost-effective choice for migrating to the cloud. The pay-as-
you-go subscription models used by IaaS providers help you reduce hardware
costs and maintenance and enable your IT team to focus on core business.

Increases scale and performance of IT workloads


IaaS lets you scale globally and accommodate spikes in resource demand.
That way, you can deliver IT resources to employees from anywhere in the
world faster and enhance application performance.

Increases stability, reliability, and supportability


With IaaS, there's no need to maintain and upgrade software and hardware or
troubleshoot equipment problems. With the appropriate agreement in place,
the service provider assures that your infrastructure is reliable and meets
service-level agreements (SLAs).

Improves business continuity and disaster recovery


Achieving high availability, business continuity, and disaster recovery is
expensive because it requires a significant amount of technology and staff. But
with the right SLA in place, IaaS helps to reduce this cost. It also helps you
access applications and data as usual during a disaster or outage.

Enhances security
With the appropriate service agreement, a cloud service provider can offer
better security for your applications and data than the security you would
attain in house.
Helps you innovate and get new apps to users faster
With IaaS, once you've decided to launch a new product or initiative, the
necessary computing infrastructure can be ready in minutes or hours, rather
than in days or weeks. And because you don't need to set up the underlying
infrastructure, IaaS lets you deliver your apps to users faster.

2. PaaS (Platform as a Service)


Platform as a service (PaaS) is a complete development and deployment
environment in the cloud, with resources that enable you to deliver everything
from simple cloud-based apps to sophisticated, cloud-enabled enterprise
applications. You purchase the resources you need from a cloud service
provider on a pay-as-you-go basis and access them over a secure Internet
connection.

Like IaaS, PaaS includes infrastructure (servers, storage, and networking) but
also middleware, development tools, business intelligence (BI) services,
database management systems, and more. PaaS is designed to support the
complete web application lifecycle: building, testing, deploying, managing,
and updating.

PaaS allows you to avoid the expense and complexity of buying and managing
software licenses, the underlying application infrastructure and middleware,
container orchestrators such as Kubernetes, or the development tools and
other resources. You manage the applications and services you develop, and
the cloud service provider typically manages everything else.

Common PaaS scenarios


Development framework. PaaS provides a framework that developers can
build upon to develop or customize cloud-based applications. Similar to the
way you create an Excel macro, PaaS lets developers create applications using
built-in software components. Cloud features such as scalability, high-
availability, and multi-tenant capability are included, reducing the amount of
coding that developers must do.

Analytics or business intelligence. Tools provided as a service with PaaS
allow organizations to analyze and mine their data, finding insights and
patterns and predicting outcomes to improve forecasting, product design
decisions, investment returns, and other business decisions.

Additional services. PaaS providers may offer other services that enhance
applications, such as workflow, directory, security, and scheduling.
Advantages of PaaS
By delivering infrastructure as a service, PaaS offers the same advantages as
IaaS. But its additional features—middleware, development tools, and other
business tools—give you more advantages:

Cut coding time. PaaS development tools can cut the time it takes to code
new apps with pre-coded application components built into the platform,
such as workflow, directory services, security features, search, and so on.

Add development capabilities without adding staff. Platform as a Service
components can give your development team new capabilities without needing to add
staff with the required skills.

Develop for multiple platforms, including mobile, more easily. Some
service providers give you development options for multiple platforms, such
as computers, mobile devices, and browsers, making cross-platform apps
quicker and easier to develop.

Use sophisticated tools affordably. A pay-as-you-go model makes it
possible for individuals or organizations to use sophisticated development
software and business intelligence and analytics tools that they could not
afford to purchase outright.

Support geographically distributed development teams. Because the
development environment is accessed over the Internet, development teams
can work together on projects even when team members are in remote
locations.

Efficiently manage the application lifecycle. PaaS provides all of the
capabilities that you need to support the complete web application lifecycle:
building, testing, deploying, managing, and updating within the same
integrated environment.

3. SaaS (Software as a Service)


Software as a service (SaaS) allows users to connect to and use cloud-based
apps over the Internet. Common examples are email, calendaring, and office
tools (such as Microsoft Office 365).

SaaS provides a complete software solution that you purchase on a pay-as-you-go
basis from a cloud service provider. You rent the use of an app for
your organization, and your users connect to it over the Internet, usually with
a web browser. All of the underlying infrastructure, middleware, app software,
and app data are located in the service provider’s data center. The service
provider manages the hardware and software, and with the appropriate
service agreement, will ensure the availability and the security of the app and
your data as well. SaaS allows your organization to get quickly up and running
with an app at minimal upfront cost.

Common SaaS scenarios


If you’ve used a web-based email service such as Outlook, Hotmail, or Yahoo!
Mail, then you’ve already used a form of SaaS. With these services, you log
into your account over the Internet, often from a web browser. The email
software is located on the service provider’s network, and your messages are
stored there as well. You can access your email and stored messages from a
web browser on any computer or Internet-connected device.

The previous examples are free services for personal use. For organizational
use, you can rent productivity apps, such as email, collaboration, and
calendaring; and sophisticated business applications such as customer
relationship management (CRM), enterprise resource planning (ERP), and
document management. You pay for the use of these apps by subscription or
according to the level of use.

Advantages of SaaS
Gain access to sophisticated applications. To provide SaaS apps to users,
you don’t need to purchase, install, update, or maintain any hardware,
middleware, or software. SaaS makes even sophisticated enterprise
applications, such as ERP and CRM, affordable for organizations that lack the
resources to buy, deploy, and manage the required infrastructure and
software themselves.

Pay only for what you use. You also save money because the SaaS service
automatically scales up and down according to the level of usage.

Use free client software. Users can run most SaaS apps directly from their
web browser without needing to download and install any software, although
some apps require plugins. This means that you don’t need to purchase and
install special software for your users.

Mobilize your workforce easily. SaaS makes it easy to “mobilize” your
workforce because users can access SaaS apps and data from any Internet-
connected computer or mobile device. You don’t need to worry about
developing apps to run on different types of computers and devices because
the service provider has already done so. In addition, you don’t need to bring
special expertise onboard to manage the security issues inherent in mobile
computing. A carefully chosen service provider will ensure the security of your
data, regardless of the type of device consuming it.

Access app data from anywhere. With data stored in the cloud, users can
access their information from any Internet-connected computer or mobile
device. And when app data is stored in the cloud, no data is lost if a user’s
computer or device fails.
