Azure Data Factory
The data transformation that takes place usually involves various operations, such as filtering,
sorting, aggregating, joining data, cleaning data, deduplicating, and validating data.
Often, the three ETL phases are run in parallel to save time. For example, while data is being
extracted, a transformation process could be working on data already received and prepare
it for loading, and a loading process can begin working on the prepared data, rather than
waiting for the entire extraction process to complete.
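As a rough illustration of this overlap (not part of the original text), the sketch below runs the three phases concurrently with queues, assuming an in-memory source and a print-only loader, so each stage works on whatever the previous stage has already produced.

```python
import queue
import threading

# Hypothetical in-memory "source" standing in for a real extraction target.
SOURCE_ROWS = [{"id": i, "name": f" item {i} "} for i in range(10)]

extract_q = queue.Queue()
load_q = queue.Queue()
DONE = object()  # sentinel signalling the end of the stream

def extract():
    # Extract rows one by one and hand them to the transform stage immediately.
    for row in SOURCE_ROWS:
        extract_q.put(row)
    extract_q.put(DONE)

def transform():
    # Clean data as it arrives instead of waiting for extraction to finish.
    while (row := extract_q.get()) is not DONE:
        load_q.put({"id": row["id"], "name": row["name"].strip().upper()})
    load_q.put(DONE)

def load():
    # Load prepared rows as soon as they are ready (here: just print them).
    while (row := load_q.get()) is not DONE:
        print("loaded", row)

threads = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```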
Extract, load, and transform (ELT)
Extract, load, and transform (ELT) differs from ETL solely in where the transformation
takes place. In the ELT pipeline, the transformation occurs in the target data store.
Instead of using a separate transformation engine, the processing capabilities of the
target data store are used to transform data. This simplifies the architecture by
removing the transformation engine from the pipeline. Another benefit to this
approach is that scaling the target data store also scales the ELT pipeline
performance. However, ELT only works well when the target system is powerful
enough to transform the data efficiently.
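A minimal ELT sketch, using SQLite purely as a stand-in for the target data store (an assumption for illustration only): raw rows are landed as-is, and the transformation is then expressed as SQL that the target itself executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target data store

# Load: land the extracted data untransformed in a raw table.
conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.5), ("west", 7.25), (None, 3.0)],
)

# Transform: push the work down to the target store's own SQL engine.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM raw_sales
    WHERE region IS NOT NULL          -- cleaning happens inside the target
    GROUP BY region
    """
)

print(conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
```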
ETL vs. ELT
ETL loads data first into a staging server and then into the target system, whereas ELT
loads data directly into the target system.
The ETL model is used for on-premises, relational, and structured data, while ELT is used
for scalable cloud-based structured and unstructured data sources.
Comparing ELT with ETL, ETL is mainly used for small amounts of data, whereas ELT is
used for large amounts of data.
Logical Extraction
The most commonly used data extraction method is Logical Extraction which is further
classified into two categories:
Full Extraction
In this method, data is completely extracted from the source system. The source data is
provided as is, and no additional logical information is necessary on the source system. Since
it is a complete extraction, there is no need to track the source system for changes.
For example, exporting a complete table in the form of a flat file.
Incremental Extraction
In incremental extraction, the changes in the source data need to be tracked since the last
successful extraction. Only these changes in the data are retrieved and loaded. There are
various ways to detect changes in the source system, for example a specific column in the
source system that holds the last-changed timestamp. You can also create a change table in
the source system, which keeps track of the changes in the source data. It can also be done
via logs if redo logs are available for the RDBMS sources. Another method for tracking changes
is by implementing triggers in the source database.
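A sketch of timestamp-based incremental extraction, assuming a source table with a last_modified column and a stored watermark (the table, column, and sample values are hypothetical): only rows changed since the last successful extraction are read, and the watermark is then advanced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01T10:00:00"), (2, "2024-01-02T09:30:00"), (3, "2024-01-03T12:00:00")],
)

watermark = "2024-01-01T23:59:59"  # last successful extraction time, normally persisted

# Extract only the rows changed since the last successful extraction.
changed = conn.execute(
    "SELECT id, last_modified FROM orders WHERE last_modified > ? ORDER BY last_modified",
    (watermark,),
).fetchall()
print("changed rows:", changed)

# Advance the watermark so the next run picks up only newer changes.
if changed:
    watermark = changed[-1][1]
print("new watermark:", watermark)
```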
Physical Extraction
Physical extraction has two methods: Online and Offline extraction:
Online Extraction
In this process, the extraction process directly connects to the source system and extracts the
source data.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the
original source system. You can consider the following common structure in offline
extraction:
Flat file: a file in a generic format
Dump file: a database-specific file
Remote extraction from database transaction logs
Update notification
The easiest way to extract data from a source system is to have that system issue a
notification when a record has been changed. Most databases provide an
automation mechanism for this so that they can support database replication
(change data capture or binary logs), and many SaaS applications provide
webhooks, which offer conceptually similar functionality. An important note about
change data capture is that it can provide the ability to analyze data in real time or
near-real time.
Incremental extraction
Some data sources are unable to provide notification that an update has occurred,
but they are able to identify which records have been modified and provide an
extract of those records. During subsequent ETL steps, the data extraction code
needs to identify and propagate changes. One drawback of the incremental extraction
technique is that it may not be able to detect deleted records in the source data,
because there's no way to see a record that's no longer there.
Full extraction
The first time you replicate any source, you must do a full extraction. Some data
sources have no way to identify data that has been changed, so reloading a whole
table may be the only way to get data from that source. Because full extraction
involves high volumes of data, which can put a load on the network, it’s not the best
option if you can avoid it.
Types of Loading
1. Initial Load: For the very first time loading all the data warehouse tables.
2. Incremental Load: Periodically applying ongoing changes as per the
requirement. After the data is loaded into the data warehouse database,
verify the referential integrity between the dimensions and the fact tables
to ensure that all records belong to the appropriate records in the other
tables. The DBA must verify that each record in the fact table is related to
one record in each dimension table that will be used in combination with
that fact table.
3. Full Refresh: Deleting the contents of a table and reloading it with fresh
data.
Full load: an entire data dump that takes place the first time a data source is
loaded into the warehouse. Because an overwhelming amount of data is moved
at once, it is much easier for data to get lost within the big move.
Incremental load: This is where you are moving new data in intervals. The
last extract date is stored so that only records added after this date are
loaded. Incremental loads are more likely to encounter problems due to the
nature of having to manage them as individual batches rather than one big
group.
Incremental loads come in two flavors that vary based on the volume of data
you're loading. In terms of difficulty, a full load is low while an incremental load is
high: the ETL must be checked for new and updated rows, and recovery from an
issue is harder.
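As a loading-side sketch (assuming a SQLite sink and an id primary key, both illustrative), a full refresh empties and reloads the table, while an incremental load applies only the new or changed rows, here via an upsert keyed on the primary key.

```python
import sqlite3

sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")

def full_refresh(rows):
    # Full refresh: delete the contents of the table and reload it with fresh data.
    sink.execute("DELETE FROM dim_customer")
    sink.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def incremental_load(changed_rows):
    # Incremental load: apply only rows added or changed since the last extract,
    # using an upsert keyed on the primary key.
    sink.executemany(
        "INSERT INTO dim_customer VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        changed_rows,
    )

full_refresh([(1, "Alice"), (2, "Bob")])
incremental_load([(2, "Bobby"), (3, "Carol")])  # one update, one new row
print(sink.execute("SELECT * FROM dim_customer ORDER BY id").fetchall())
```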
Azure Data Factory (ADF) is a cloud-based ETL and data integration service that allows you
to create data-driven workflows. It is a data pipeline orchestrator and ETL tool that is part of
the Microsoft Azure cloud ecosystem. ADF can pull data from the outside world (FTP,
Amazon S3, Oracle, and many more), transform it, filter it, enhance it, and move it to another
data store.
Getting ADF to do real work for you involves several layers of technology, listed from the
highest level of abstraction that you interact with down to the software that actually moves
the data. At the top sit pipelines, which are composed of activities and data flow arrows. You
program ADF by creating pipelines. You get work done by running pipelines, either manually
or via automatic triggers. You look at the results of your work by monitoring pipeline
execution.
For example, a pipeline might take inbound data from an initial Data Lake folder, move it to
cold archive storage, get a list of the files, loop over each file, and copy those files to another
location.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of
activities that performs a unit of work. Together, the activities in a pipeline perform a
task. For example, a pipeline can contain a group of activities that ingests data from
an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the
data.
The benefit of this is that the pipeline allows you to manage the activities as a set
instead of managing each one individually. The activities in a pipeline can be chained
together to operate sequentially, or they can operate independently in parallel.
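To make the pipeline/activity structure concrete, here is a heavily abridged sketch of roughly what an ADF pipeline definition looks like, expressed as a Python dictionary mirroring the JSON (the names and the exact set of properties are illustrative, not a complete or authoritative schema). The dependsOn entry is what chains the second activity to run only after the copy succeeds.

```python
# Rough shape of an ADF pipeline definition (illustrative only; real definitions
# are JSON documents with more required properties than shown here).
pipeline = {
    "name": "IngestAndPartition",        # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlob",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobInputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "StagingDataset", "type": "DatasetReference"}],
            },
            {
                "name": "PartitionWithHive",
                "type": "HDInsightHive",
                # Runs only after CopyFromBlob succeeds, i.e. the activities are chained.
                "dependsOn": [
                    {"activity": "CopyFromBlob", "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

print(len(pipeline["properties"]["activities"]), "activities in the pipeline")
```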
Activity
Activities represent a processing step in a pipeline. There is a CopyData activity to
move data, a ForEach activity to loop over a file list, a Filter activity that chooses a
subset of files, etc. Most activities have a source and a sink. Data Factory supports
three types of activities: data movement activities, data transformation activities, and
control activities.
An activity is a task performed on your data. Activities are used inside Azure
Data Factory pipelines; an ADF pipeline is a group of one or more activities.
Types of Activities
1- Data Movement Activities
2- Data Transformation Activities
3- Control Activities
Data Movement Activities:
1- Copy Activity: It simply copies the data from Source location to destination
location. Azure supports multiple data store locations such as Azure Storage, Azure
DBs, NoSQL, Files, etc.
Data Transformation Activities:
1- Data Flow: In a data flow, you first design a data transformation workflow to
transform or move data, and then use the Data Flow activity to run it via mapping data
flows. There are two types of data flows: mapping and wrangling data flows.
Wrangling data flow: It provides a platform to use Power Query, familiar from
Microsoft Excel, inside Azure Data Factory. You can also use Power Query M functions
in the cloud.
7- Stored Procedure: In a Data Factory pipeline, you can use the Stored Procedure
activity to invoke a SQL Server stored procedure. You can use the following data
stores: Azure SQL Database, Azure Synapse Analytics, SQL Server database, etc.
8- U-SQL: It executes a U-SQL script on an Azure Data Lake Analytics cluster. U-SQL is
a big data query language that provides the benefits of SQL.
9- Custom Activity: In a custom activity, you can create your own data processing
logic that is not provided by Azure. You can configure a .NET activity or an R activity
that will run on the Azure Batch service or an Azure HDInsight cluster.
11- Databricks Python Activity: This activity runs your Python files on an Azure
Databricks cluster.
12- Databricks Jar Activity: A Spark Jar is launched on your Azure Databricks cluster
when the Azure Databricks Jar Activity is used in a pipeline.
13- Azure Functions: It is an Azure compute service that allows us to write code logic
and run it based on events without provisioning any infrastructure. A linked service
connection is required to launch an Azure Function; the linked service can then be
used with an activity that specifies the Azure Function you want to run. It stores your
code in Storage and keeps the logs in Application Insights. Key points of Azure
Functions are:
1- It is a serverless service.
2- Multiple languages are available: C#, Java, JavaScript, Python, and PowerShell.
Control Activities:
2- Execute Pipeline Activity: It allows one Azure Data Factory pipeline to call
another pipeline.
3- Filter Activity: It allows you to apply different filters on your input dataset.
4- For Each Activity: It provides the functionality of a for each loop that executes for
multiple iterations. ForEach Activity defines a repeating control flow in your pipeline.
This activity is used to iterate over a collection and executes specified activities in a
loop.
5- Get Metadata Activity: It is used to retrieve the metadata of any data in a Data
Factory, typically the metadata of files and folders. You need to provide the type of
metadata you require: childItems, columnCount, contentMD5, exists, itemName,
itemType, lastModified, size, structure, created, etc.
7- Lookup Activity: It reads and returns the content of multiple data sources such as
files, tables, or databases. It can also return the result set of a query or stored
procedure. The Lookup activity can be used to read or look up a record, table name,
or value from any external source. This output can then be referenced by succeeding
activities.
8- Set Variable Activity: It is used to set the value of an existing variable of type
String, Array, etc.
9- Switch Activity: It is a Switch statement that executes the set of activities based
on matching cases.
10- Until Activity: It is the same as a do-until loop. It executes a set of activities until
the condition associated with the activity evaluates to true. You can specify a timeout
value for the Until activity.
12- Wait Activity: It just waits for the specified time before moving ahead to the
next activity. You can specify the number of seconds.
13- Web Activity: It is used to call REST APIs, for example a custom REST endpoint,
from a pipeline. You can use it for different use cases, such as triggering ADF pipeline
execution. You can pass datasets and linked services to be consumed and accessed
by the activity.
14- Webhook Activity: Call an endpoint and pass a callback URL using the webhook
activity. Before moving on to the next activity, the pipeline waits for the callback to
be executed.
OPTION 3: wildcard - wildcardFileName (required: yes)
The file name with wildcard characters under the given container and folder path (or
wildcard folder path) to filter source files.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a
single character). Use ^ to escape if your file name has a wildcard or this escape
character inside. See more examples in Folder and file filter examples.
OPTION 4: a list of files - fileListPath (required: no)
Indicates to copy a given file set. Point to a text file that includes a list of files you
want to copy, one file per line, which is the relative path to the path configured in the
dataset.
When you're using this option, do not specify a file name in the dataset. See more
examples in File list examples.
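The * and ? semantics above are essentially glob rules; the short sketch below uses Python's fnmatch as a stand-in (not ADF's own matcher) to show which hypothetical file names a wildcardFileName pattern would select.

```python
from fnmatch import fnmatch

files = ["sales_2023.csv", "sales_2024.csv", "sales_2024.json", "sale1.csv"]

# '*' matches zero or more characters.
print([f for f in files if fnmatch(f, "sales_*.csv")])  # ['sales_2023.csv', 'sales_2024.csv']

# fnmatch's '?' matches exactly one character; note that ADF's '?' also
# allows zero characters, which this stand-in does not reproduce.
print([f for f in files if fnmatch(f, "sale?.csv")])    # ['sale1.csv']
```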
Additional settings:
recursive (required: no)
Indicates whether the data is read recursively from the subfolders or only from the
specified folder. Note that when recursive is set to true and the sink is a file-based
store, an empty folder or subfolder isn't copied or created at the sink.
Allowed values are true (default) and false.
This property doesn't apply when you configure fileListPath.
deleteFilesAfterCompletion (required: no)
Indicates whether the binary files will be deleted from the source store after
successfully moving to the destination store. The file deletion is per file, so when the
copy activity fails, you will see that some files have already been copied to the
destination and deleted from the source, while others still remain in the source store.
This property is only valid in the binary files copy scenario. The default value: false.
modifiedDatetimeStart (required: no)
Files are filtered based on the attribute: last modified. The files will be selected if their
last modified time is greater than or equal to modifiedDatetimeStart and less than
modifiedDatetimeEnd. The time is applied to a UTC time zone in the format of
"2018-12-01T05:00:00Z".
The properties can be NULL, which means no file attribute filter will be applied to the
dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd
is NULL, the files whose last modified attribute is greater than or equal to the datetime
value will be selected. When modifiedDatetimeEnd has a datetime value but
modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the
datetime value will be selected.
This property doesn't apply when you configure fileListPath.
modifiedDatetimeEnd (required: no)
Same as above.
enablePartitionDiscovery (required: no)
For files that are partitioned, specify whether to parse the partitions from the file path
and add them as additional source columns.
Allowed values are false (default) and true.
partitionRootPath (required: no)
When partition discovery is enabled, specify the absolute root path in order to read
partitioned folders as data columns.
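The start/end boundary semantics described above (start inclusive, end exclusive, NULL meaning no bound) can be mimicked with a small filter like this sketch, which works on hypothetical file metadata rather than a real store.

```python
from datetime import datetime, timezone

# Hypothetical file listing with last-modified timestamps (UTC).
files = {
    "a.csv": datetime(2018, 11, 30, 23, 0, tzinfo=timezone.utc),
    "b.csv": datetime(2018, 12, 1, 5, 0, tzinfo=timezone.utc),
    "c.csv": datetime(2018, 12, 2, 5, 0, tzinfo=timezone.utc),
}

def select_files(modified_start=None, modified_end=None):
    """Keep files whose last-modified time is >= start and < end.
    A None bound means that side of the filter is not applied."""
    selected = []
    for name, modified in files.items():
        if modified_start is not None and modified < modified_start:
            continue
        if modified_end is not None and modified >= modified_end:
            continue
        selected.append(name)
    return selected

print(select_files(modified_start=datetime(2018, 12, 1, 5, 0, tzinfo=timezone.utc)))
# ['b.csv', 'c.csv'] -- the start bound is inclusive, and no end means no upper bound
```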
The Folder and file filter examples show, for different combinations of folderPath,
fileName, and recursive, which files from a sample source folder structure are
retrieved.
Data Integration Units (DIU)
A DIU (Data Integration Unit) is a measure that represents the power (a combination
of CPU, memory, and network resource allocation) of a single unit in the service, but
the exact definition doesn't really matter. What matters is that the more DIUs you
specify, the more power you throw at the copy data activity. And the more power you
throw at the copy data activity, the more you pay for it.
The allowed DIU value for a copy activity run is between 2 and 256. If not specified,
or if you choose "Auto" on the UI, the service dynamically applies the optimal DIU
setting based on your source-sink pair and data pattern. The following list gives the
supported DIU ranges and default behavior in different copy scenarios:
Between file stores: the effective range depends on the number and size of the files.
For example, if you copy data from a folder with 4 large files and choose to preserve
hierarchy, the max effective DIU is 16; when you choose to merge files, the max
effective DIU is 4.
From file store to non-file store:
- Supported DIU range: copy from a single file: 2-4; copy from multiple files: 2-256
depending on the number and size of the files. For example, if you copy data from a
folder with 4 large files, the max effective DIU is 16.
- Default DIUs determined by the service: copy into Azure SQL Database or Azure
Cosmos DB: between 4 and 16 depending on the sink tier (DTUs/RUs) and source file
pattern; copy into Azure Synapse Analytics using PolyBase or COPY statement: 2;
other scenarios: 4.
From non-file store to file store:
- Supported DIU range: copy from partition-option-enabled data stores (including
Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance,
Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when
writing to a folder, and 2-4 when writing to one single file. Note that each source data
partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by the service: copy from REST or HTTP: 1; copy from
Amazon Redshift using UNLOAD: 2; other scenarios: 4.
Between non-file stores:
- Supported DIU range: copy from partition-option-enabled data stores (including
Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance,
Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when
writing to a folder, and 2-4 when writing to one single file. Note that each source data
partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by the service: copy from REST or HTTP: 1; other
scenarios: 4.
Degree of copy parallelism: parallelism on a single Copy activity just uses more
threads to concurrently copy partitions of data from the same data source.
Maximum concurrent connections: the maximum number of connections allowed for
reading from the data source or writing data to the sink data store.
Data consistency verification: the verification includes a file size check and checksum
verification for binary files, and row count verification for tabular data.
Fault Tolerance
When you copy data from a source to a destination store, the copy activity provides a
certain level of fault tolerance to prevent interruption from failures in the middle of
the data movement. For example, suppose you are copying millions of rows from a
source to a destination store, where a primary key has been created in the destination
database but the source database does not have any primary keys defined. If you
happen to copy duplicated rows from the source to the destination, you will hit a PK
violation failure on the destination database. In this situation, the copy activity offers
you two ways to handle such errors:
You can abort the copy activity once any failure is encountered.
You can continue to copy the rest by enabling fault tolerance to skip the incompatible
data. For example, skip the duplicated row in this case. In addition, you can log the
skipped data by enabling session log within copy activity. You can refer to session log
in copy activity for more details.
The service supports the following fault tolerance scenarios when copying binary
files. You can choose to abort the copy activity or continue to copy the rest in the
following scenarios:
1. The files to be copied by the service are being deleted by other applications at the
same time.
2. Some particular folders or files do not allow the service access because ACLs of those
files or folders require higher permission level than the configured connection
information.
3. One or more files are not verified to be consistent between source and destination
store if you enable data consistency verification setting.
Supported scenarios
Copy activity supports three scenarios for detecting, skipping, and logging
incompatible tabular data:
Incompatibility between the source data type and the sink native type.
For example: Copy data from a CSV file in Blob storage to a SQL database with
a schema definition that contains three INT type columns. The CSV file rows
that contain numeric data, such as 123,456,789 are copied successfully to the
sink store. However, the rows that contain non-numeric values, such as 123,456,
abc are detected as incompatible and are skipped.
Mismatch in the number of columns between the source and the sink.
For example: Copy data from a CSV file in Blob storage to a SQL database with
a schema definition that contains six columns. The CSV file rows that contain six
columns are copied successfully to the sink store. The CSV file rows that contain
more than six columns are detected as incompatible and are skipped.
Primary key violation when writing to a relational sink.
For example: Copy data from a SQL server to a SQL database. A primary key is
defined in the sink SQL database, but no such primary key is defined in the
source SQL server. The duplicated rows that exist in the source cannot be
copied to the sink. Copy activity copies only the first row of the source data into
the sink. The subsequent source rows that contain the duplicated primary key
value are detected as incompatible and are skipped.
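A simplified sketch of the skip-incompatible-rows behavior for the primary-key case (a SQLite sink and toy rows, purely for illustration): rows that violate the sink's primary key are skipped and written to a session-log-style list instead of aborting the whole copy.

```python
import sqlite3

sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")

source_rows = [(1, "a"), (2, "b"), (2, "duplicate"), (3, "c")]  # contains a duplicate key
session_log = []  # stands in for the skipped-rows session log

for row in source_rows:
    try:
        sink.execute("INSERT INTO target VALUES (?, ?)", row)
    except sqlite3.IntegrityError as err:
        # Fault tolerance: skip the incompatible row and log it, don't abort the copy.
        session_log.append({"row": row, "error": str(err)})

print(sink.execute("SELECT * FROM target ORDER BY id").fetchall())
print("skipped:", session_log)
```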
Enable Logging
When selecting this option, you can log copied files, skipped files, and skipped rows.
logLevel:
1. "Info" will log all the copied files, skipped files, and skipped rows.
2. "Warning" will log skipped files and skipped rows only.
Reliable logging :- When it’s true, a Copy activity in reliable mode will flush
logs immediately once each file is copied to the destination. When copying
many files with reliable logging mode enabled in the Copy activity, you should
expect the throughput would be impacted, since double write operations are
required for each file copied. One request goes to the destination store and
another to the log storage store.
Best effort :- A Copy activity in best effort mode will flush logs with a batch of
records within a period of time, and the copy throughput will be much less
impacted. The completeness and timeliness of logging isn't guaranteed in this
mode, since there is a possibility that the last batch of log events hasn't
been flushed to the log file when a Copy activity fails. In this scenario, you'll
see that a few files copied to the destination aren't logged.
Folder path :- The path of the log files. Specify the path where you want to
store the log files. If you don't provide a path, the service creates a container
for you.
Enable staging
Specify whether to copy data via an interim staging store. Enable staging only for the
beneficial scenarios, e.g. load data into Azure Synapse Analytics via PolyBase, load
data to/from Snowflake, load data from Amazon Redshift via UNLOAD or from HDFS
via DistCp, etc.
For example: you are copying data from source to sink, but instead of copying directly
you want to land the data first in a temporary store called staging and then copy it
from there into the final destination. Normally that would require two copy activities,
one from source to staging and a second from staging to sink; instead of creating two
copy activities, you enable staging on a single copy activity. Once all the data has
been copied into the sink database, the data in the staging store is deleted.
Storage path :- Specify the Blob storage path that you want to contain the
staged data. If you do not provide a path, the service creates a container to
store temporary data. Specify a path only if you use Storage with a shared
access signature, or you require temporary data to be in a specific location.
Preserve
Copy activity supports preserving the following attributes during data copy:
Add additional columns during copy
In addition to copying data from the source data store to the sink, you can also
configure additional data columns to be copied along to the sink. For example:
When copying from a file-based source, store the relative file path as an additional
column to trace which file the data came from.
Duplicate the specified source column as another column.
Add a column with ADF expression, to attach ADF system variables like pipeline
name/pipeline ID, or store other dynamic value from upstream activity's output.
Add a column with static value to meet your downstream consumption need.
You can find the following configuration on copy activity source tab. You can also
map those additional columns in copy activity schema mapping as usual by using
your defined column names.
NOTE:- This feature works with the latest dataset model. If you don't see this option
from the UI, try creating a new dataset.
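A hedged sketch of what the additional-columns configuration roughly looks like on the copy activity source, written as a Python dictionary mirroring the JSON (the column names and static values here are illustrative; only the additionalColumns property name and the reserved $$FILEPATH value follow the documented setting).

```python
# Illustrative copy activity "source" section with extra columns added to the data.
copy_source = {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        # Trace which file each row came from (reserved value for file-based sources).
        {"name": "source_file", "value": "$$FILEPATH"},
        # Attach an ADF system variable via an expression.
        {"name": "pipeline_name", "value": {"value": "@pipeline().Pipeline", "type": "Expression"}},
        # A static value for downstream consumption.
        {"name": "environment", "value": "dev"},
    ],
}

for col in copy_source["additionalColumns"]:
    print(col["name"])
```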
If sqlReaderQuery is specified for SqlSource, the copy activity runs this query against
the SQL Server source to get the data. You also can specify a stored procedure by
specifying sqlReaderStoredProcedureName and storedProcedureParameters if the
stored procedure takes parameters.
When using a stored procedure in the source to retrieve data, note that if your stored
procedure is designed to return a different schema when a different parameter value is
passed in, you may encounter a failure or see an unexpected result when importing the
schema from the UI or when copying data to the SQL database with auto table creation.
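For reference, a rough sketch of a SqlSource that uses a stored procedure instead of a query, again in dictionary form of the JSON (the procedure name and parameters are made up, and the property shapes are abridged).

```python
# Illustrative copy activity source for SQL Server using a stored procedure.
sql_source = {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "usp_GetChangedOrders",   # hypothetical procedure
    "storedProcedureParameters": {
        "CutoffDate": {"value": "2024-01-01T00:00:00Z"},
        "Region": {"value": "east", "type": "String"},
    },
}

# If sqlReaderQuery were specified instead, the service would run that query
# against the source; the two options are alternatives.
print(sorted(sql_source["storedProcedureParameters"]))
```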
Isolation levels
Chaos (16): The pending changes from more highly isolated transactions cannot be
overwritten.
ReadCommitted (4096): Shared locks are held while the data is being read to avoid
dirty reads, but the data can be changed before the end of the transaction, resulting
in non-repeatable reads or phantom data.
ReadUncommitted (256): A dirty read is possible, meaning that no shared locks are
issued and no exclusive locks are honored.
RepeatableRead (65536): Locks are placed on all data that is used in a query,
preventing other users from updating the data. Prevents non-repeatable reads, but
phantom rows are still possible.
Serializable (1048576): A range lock is placed on the DataSet, preventing other users
from updating or inserting rows into the dataset until the transaction is complete.
Snapshot (16777216): Reduces blocking by storing a version of data that one
application can read while another is modifying the same data. Indicates that from
one transaction you cannot see changes made in other transactions, even if you
requery.
Unspecified (-1): A different isolation level than the one specified is being used, but
the level cannot be determined.
Remarks
The IsolationLevel values are used by a .NET data provider when performing a
transaction.
The IsolationLevel remains in effect until explicitly changed, but it can be changed at
any time. The new value is used at execution time, not parse time. If changed during
a transaction, the expected behavior of the server is to apply the new locking level to
all statements remaining.
Use TempDB: If you write a large amount of data into the SQL database, uncheck this
and specify a schema name under which Data Factory will create a staging table to
load upstream data and automatically clean it up upon completion. Make sure the
user has create table permission in the database and alter permission on the schema.
If not specified, a global temp table is used as staging.
interimSchemaName: Specify the interim schema for creating the interim table if a
physical table is used. Note: the user needs permission to create and delete tables. By
default, the interim table shares the same schema as the sink table. Applies when the
useTempDB option is false. In other words, specify a schema name under which Data
Factory will create a staging table to load upstream data and automatically clean it up
upon completion, and make sure you have create table permission in the database
and alter permission on the schema.
Keys: Specify the column names for unique row identification. Either a single key or a
series of keys can be used. If not specified, the primary key is used. This determines
which column is used to decide whether a row from the source matches a row from
the sink.
Bulk insert table lock: Use this to improve copy performance during a bulk insert
operation on a table with no index from multiple clients.
Delimited text format
Type (required: yes): The type property of the dataset must be set to DelimitedText.
location (required: yes): Location settings of the file(s). Each file-based connector has
its own location type and supported properties under location.
columnDelimiter (required: no): The character(s) used to separate columns in a file.
The default value is comma (,). When the column delimiter is defined as an empty
string, which means no delimiter, the whole line is taken as a single column. Currently,
column delimiter as empty string is only supported for mapping data flow but not
Copy activity.
rowDelimiter (required: no): For Copy activity, the single character or "\r\n" used to
separate rows in a file. The default value is any of the following values on read:
["\r\n", "\r", "\n"]; on write: "\r\n". "\r\n" is only supported in the copy command.
For Mapping data flow, the single or two characters used to separate rows in a file.
The default value is any of the following values on read: ["\r\n", "\r", "\n"]; on write:
"\n". When the row delimiter is set to no delimiter (empty string), the column
delimiter must be set as no delimiter (empty string) as well, which means to treat the
entire content as a single value. Currently, row delimiter as empty string is only
supported for mapping data flow but not Copy activity.
quoteChar (required: no): The single character to quote column values if a value
contains the column delimiter. The default value is double quotes ("). When
quoteChar is defined as an empty string, it means there is no quote char and the
column value is not quoted, and escapeChar is used to escape the column delimiter
and itself.
escapeChar (required: no): The single character to escape quotes inside a quoted
value. The default value is backslash (\). When escapeChar is defined as an empty
string, the quoteChar must be set as an empty string as well, in which case make sure
all column values don't contain delimiters.
firstRowAsHeader (required: no): Specifies whether to treat/make the first row as a
header line with names of columns. Allowed values are true and false (default). When
first row as header is false, note that UI data preview and lookup activity output
auto-generate column names as Prop_{n} (starting from 0), copy activity requires
explicit mapping from source to sink and locates columns by ordinal (starting from 1),
and mapping data flow lists and locates columns with name as Column_{n} (starting
from 1).
nullValue (required: no): Specifies the string representation of the null value. The
default value is an empty string.
encodingName (required: no): The encoding type used to read/write text files.
Allowed values are as follows: "UTF-8", "UTF-8 without BOM", "UTF-16", "UTF-16BE",
"UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312",
"GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273",
"IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857",
"IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140",
"IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146",
"IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1",
"ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7",
"ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874",
"WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253",
"WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257",
"WINDOWS-1258". Note that mapping data flow doesn't support UTF-7 encoding.
compressionCodec (required: no): The compression codec used to read/write text
files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4.
The default is not compressed. Note that currently Copy activity doesn't support
"snappy" and "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip"
and "Tar". Note that when using the copy activity to decompress
ZipDeflate/TarGzip/Tar file(s) and write to a file-based sink data store, by default files
are extracted to the folder <path specified in dataset>/<folder named as source
compressed file>/. Use preserveZipFileNameAsFolder/
preserveCompressionFileNameAsFolder on the copy activity source to control whether
to preserve the name of the compressed file(s) as folder structure.
compressionLevel (required: no): The compression ratio. Allowed values are Optimal
or Fastest.
- Fastest: The compression operation should complete as quickly as possible, even if
the resulting file is not optimally compressed.
- Optimal: The compression operation should be optimally compressed, even if the
operation takes a longer time to complete. For more information, see the Compression
Level topic.
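The delimiter, quote, and header settings above map closely onto what any CSV parser does; the sketch below uses Python's csv module (not ADF itself) to show how a column delimiter, a quote character, and a first-row header interact on a tiny sample.

```python
import csv
import io

# Sample content: comma column delimiter, double-quote quoteChar, first row as header.
raw = 'id,name,comment\n1,"Smith, Jane","said ""hi"""\n2,Lee,ok\n'

reader = csv.DictReader(io.StringIO(raw), delimiter=",", quotechar='"')
for row in reader:
    # Quoted values may contain the column delimiter; here a doubled quote acts as the
    # escape, whereas the dataset's escapeChar setting plays the analogous role in ADF.
    print(row["id"], "|", row["name"], "|", row["comment"])
```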
Linked services
Linked services are much like connection strings, which define the connection
information that's needed for Data Factory to connect to external resources.
A linked service tells ADF how to see the particular data or computers you want to
operate on. To access a specific Azure storage account, you create a linked service for
it and include access credentials. To read/write another storage account, you create
another linked service. To allow ADF to operate on an Azure SQL database, your
linked service will state the Azure subscription, server name, database name, and
credentials.
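For illustration, a trimmed-down sketch of roughly what an Azure Blob Storage linked service definition looks like, in dictionary form of the JSON (the name and the placeholder connection string are not real, and properties are abridged).

```python
# Rough shape of a linked service definition (illustrative, not a complete schema).
linked_service = {
    "name": "LS_MyBlobStorage",                      # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Connection information ADF needs to reach the external resource.
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}

print(linked_service["properties"]["type"])
```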
Datasets
Datasets represent data structures within the data stores, which simply point to or
reference the data you want to use in your activities as inputs or outputs.
A data set makes a linked service more specific; it describes the folder you are using
within a storage container, or the table within a database, etc.
For example, a data set might point to one directory in one container in one Azure
storage account (the container and directory names could be set in the Parameters
tab). Note how the data set references a linked service. Note also that a data set can
specify that the data is zipped, which allows ADF to automatically unzip the data as
you read it.
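And a matching sketch of a dataset that makes the linked service above more specific by pointing at one container and folder and declaring the compression (again in dictionary form, with illustrative names and abridged properties).

```python
# Rough shape of a dataset definition that references the linked service above.
dataset = {
    "name": "DS_ZippedCsvFolder",                    # hypothetical name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_MyBlobStorage",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "inbound",              # could instead be set via parameters
                "folderPath": "daily",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
            # Declaring the compression lets ADF unzip the data as it reads it.
            "compressionCodec": "ZipDeflate",
        },
    },
}

print(dataset["properties"]["linkedServiceName"]["referenceName"])
```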
Source and Sink
A source and a sink are, as their names imply, places data comes from and goes to.
Sources and sinks are built on data sets. ADF is mostly concerned with moving data
from one place to another, often with some kind of transformation along the way, so
it needs to know where to move the data.
It is important to understand that there is a mushy distinction between data sets and
sources/sinks. A data set defines a particular collection of data, but a source or sink
can redefine the collection. For example, suppose DataSet1 is defined as the folder
/Vehicles/GM/Trucks/. When a source uses DataSet1, it can take that collection as-is
(the default), or narrow the set to /Vehicles/GM/Trucks/Silverado/ or expand it to
/Vehicles/.
Triggers
Azure Data Factory triggers determine when a pipeline execution will be fired,
based on the trigger type and criteria defined in that trigger.
Azure Data Factory allows you to assign multiple triggers to execute a single pipeline
or execute multiple pipelines using a single trigger, except for the tumbling window
trigger.
1. Schedule trigger :- executes the pipeline on a wall-clock schedule.
In the New Azure Data Factory Trigger window, provide a meaningful name
for the trigger that reflects the trigger type and usage; the type of the trigger,
which is Schedule here; the start date for the schedule trigger; the time zone
that will be used in the schedule; optionally the end date of the trigger; the
frequency of the trigger, with the ability to configure the trigger frequency to be
called every specific number of minutes or hours; and whether or not
to activate the trigger immediately after you publish it.
Even if you choose a start time in the past, the trigger will only start at the first
future valid execution time after it has been published.
2. Tumbling window trigger :- fires on a periodic interval while retaining state,
processing one time window per run.
A common use case is when you want to copy data from a database into a
data lake, and store data in separate files or folders for each hour or for each
day. In that case, you define a tumbling window trigger for every 1 hour or for
every 24 hours. The tumbling window trigger can pass the start and end time
for each time window into the database query, which then returns all data
between that start and end time. Finally, the data is saved in separate files or
folders for each hour or each day.
This even works for dates in the past, so you can use it to easily backfill or
load historical data.
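A sketch of roughly what such a tumbling window trigger definition looks like, in dictionary form of the JSON (names and values are illustrative), passing the window start and end into pipeline parameters that the database query can then use.

```python
# Rough shape of a tumbling window trigger definition (illustrative only).
tumbling_trigger = {
    "name": "HourlyBackfillTrigger",                 # hypothetical name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",     # a past start time enables backfill
            "maxConcurrency": 4,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "CopyHourlySlice", "type": "PipelineReference"},
            "parameters": {
                # Each run receives its own window boundaries for the source query.
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        },
    },
}

print(tumbling_trigger["properties"]["typeProperties"]["frequency"])
```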
When creating the Tumbling window trigger, you need to provide a meaningful
name for that trigger, the trigger type, which is Tumbling window here, the
start date and optionally the end date in UTC, the trigger calling frequency,
with the ability to configure advanced options such as the delay, to wait a
certain time after the window start time before executing the pipeline, limit the
max concurrent tumbling windows running in parallel and retry count and
interval, and define dependencies to ensure that the trigger will start after
another tumbling window trigger has completed successfully.
3. Event-based trigger :- responds to a blob-related event, such as
creating or deleting a blob file, in Azure Blob Storage.
Event triggers can execute one or more pipelines when events happen. You
use them when you need to execute a pipeline when something happens,
instead of at specific times.
After creating the triggers, you can review, edit, or enable them from the Triggers
page, where you can see that the triggers are disabled and not yet connected to any
pipeline.
Make sure to publish the created triggers in order to be able to use them to execute
the ADF pipelines.
A schedule trigger can only trigger future dated loads. But tumbling window
triggers can be configured to initiate past and future dated loads.
Schedule triggers and pipelines have a many-to-many relationship, whereas a
tumbling window trigger has a one-to-one relationship with a pipeline and can
only reference a single pipeline.
Tumbling window trigger has a self-dependency property which means the
trigger shouldn't proceed to the next window until the preceding window is
successfully completed.
In the Triggers tab, you can now see that the trigger has a pipeline attached to it, and
you can click to activate it.
Now, disconnect the Schedule trigger from the pipeline by clicking Add Trigger on the
pipeline page and removing it, then connect the Tumbling Window trigger instead. If
you monitor the pipeline execution from the Azure Data Factory monitor page, you
will see that the pipeline is executed based on the tumbling window trigger settings.
The case is different with the event trigger, where the trigger will not be executed at a specific
time. Instead, it will be executed when a new blob file is added to the Azure Storage account or
deleted from that account, based on the trigger configuration.
In the event trigger that we created previously, we configured it to be fired when a new blob
file is created in the storage account. If we upload a new blob file to our container, the
pipeline will be executed automatically, because the trigger firing criteria has been met.
Resource
A resource is nothing but an Azure service, such as App Service, Azure Storage,
Azure Active Directory, etc. Whenever you create a new resource, you are
actually creating an Azure service.
You can think of a resource as a service in Azure: as soon as you purchase a
service in Azure, you create a resource, and if you are done with the service and
delete it, the resource is also deleted. Azure uses these resources to save all the
configurations that you made to your service.
Every resource in Azure can be represented as a JSON template, which is a simple
file that has properties and values. There are four common properties across all
resources: type, apiVersion, name, and location.
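As an illustration of those four common properties, here is a minimal, hypothetical resource entry in the shape used by Azure Resource Manager templates (the API version, names, and extra settings are placeholders).

```python
# Minimal shape of a resource as it appears in an ARM template (illustrative).
resource = {
    "type": "Microsoft.Storage/storageAccounts",   # the service being created
    "apiVersion": "2023-01-01",                    # placeholder API version
    "name": "mystorageaccount001",                 # hypothetical resource name
    "location": "eastus",
    "sku": {"name": "Standard_LRS"},
    "kind": "StorageV2",
}

# The four properties common to every resource:
for key in ("type", "apiVersion", "name", "location"):
    print(key, "=", resource[key])
```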
Resource Groups
An Azure resource group is a collection of resources; the resource group is the
container in which multiple Azure services reside.
Every Azure service must be located in a resource group. The resource group
gives better flexibility to manage the life cycle of all the services located in it in
one place: you can deploy, update, and delete these services together.
You should use resource groups to logically group related resources. Resource
groups are created as a utility to manage other resources.
3. Azure Queues: A messaging service for storing large numbers of messages.
It is a queue service, but there is a more advanced version of the queue service
available in Azure, which is the Service Bus queue.
o It is a service for storing a large number of messages in the cloud that can be
accessed from anywhere in the world using HTTP and HTTPS.
o A queue contains a set of messages. Queue names must be all lowercase.
o A single queue message can be up to 64 KB in size. A message can remain in
the queue for a maximum time of 7 days.
o The URL format is http://<storage account>.queue.core.windows.net/<queue>
o When a message is retrieved from the queue, it stays invisible for 30
seconds. A message needs to be explicitly deleted from the queue to avoid
getting picked up by another application.
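A brief sketch of working with a queue using the azure-storage-queue Python library (assuming that package is installed and a valid connection string is available; the queue name is made up): received messages become invisible for the visibility timeout and must be deleted explicitly.

```python
from azure.storage.queue import QueueClient  # pip install azure-storage-queue

# Hypothetical connection string and queue name (queue names must be lowercase).
queue = QueueClient.from_connection_string("<connection-string>", queue_name="orders")

queue.send_message("process-order-42")

# Received messages stay invisible to other consumers for the visibility timeout,
# then reappear unless they are explicitly deleted.
for msg in queue.receive_messages(visibility_timeout=30):
    print("processing", msg.content)
    queue.delete_message(msg)  # remove it so no other application picks it up
```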
4. Azure Tables: A NoSQL store for schemaless storage of structured data.
Azure Table storage is used for storing a large amount of structured data. This service
is a NoSQL data storage, which accepts authenticated calls from inside and outside of
the Azure cloud. It is ideal for storing structured and non-relational data.
o Table storage is used for storing TBs of structured data capable of serving
web-scale applications.
o It is used for storing datasets that don't require complex joins, foreign keys, or
stored procedures and can be denormalized for fast access.
o It is used for quickly querying data using a clustered index.
o There are two ways of accessing data, one is using the OData protocol, and
the other is LINQ queries with WCF Data Services with .NET Libraries.
Access Tiers
There are four types of access tiers available:
1. Premium Storage (preview): It provides high-performance hardware for data
that is accessed frequently.
2. Hot storage: It is optimized for storing data that is accessed frequently.
3. Cool Storage: It is optimized for storing data that is infrequently accessed and
stored for at least 30 days.
4. Archive Storage: It is optimized for storing files that are rarely accessed and
stored for a minimum of 180 days with flexible latency needs (on the order of
hours).
Types of performance tiers
Data Redundancy
Azure Storage Replication is used for the durability of the data. It copies our data to
stay protected from planned and unplanned events, ranging from transient hardware
failure, network or power outages, and massive natural disasters to man-made
vulnerabilities.
Azure creates multiple copies of our data and stores them at different places, based on the
replication strategy.
IaaS (Infrastructure as a Service)
IaaS lets you bypass the cost and complexity of buying and managing physical
servers and datacenter infrastructure. Each resource is offered as a separate
service component, and you only pay for a particular resource for as long as
you need it. A cloud computing service provider like Azure manages the
infrastructure, while you purchase, install, configure, and manage your own
software, including operating systems, middleware, and applications.
Web apps
IaaS provides all the infrastructure to support web apps, including storage,
web and application servers, and networking resources. Your organization can
quickly deploy web apps on IaaS and easily scale infrastructure up and down
when demand for the apps is unpredictable.
High-performance computing
High-performance computing on supercomputers, computer grids, or
computer clusters helps solve complex problems involving millions of
variables or calculations. Examples include protein folding and earthquake
simulations, climate and weather predictions, financial modeling, and product
design evaluations.
Advantages of IaaS
Reduces capital expenditures and optimizes costs
IaaS eliminates the cost of configuring and managing a physical datacenter,
which makes it a cost-effective choice for migrating to the cloud. The pay-as-
you-go subscription models used by IaaS providers help you reduce hardware
costs and maintenance and enable your IT team to focus on core business.
Enhances security
With the appropriate service agreement, a cloud service provider can offer
better security for your applications and data than the security you would
attain in house.
Helps you innovate and get new apps to users faster
With IaaS, once you've decided to launch a new product or initiative, the
necessary computing infrastructure can be ready in minutes or hours, rather
than in days or weeks. And because you don't need to set up the underlying
infrastructure, IaaS lets you deliver your apps to users faster.
PaaS (Platform as a Service)
PaaS allows you to avoid the expense and complexity of buying and managing
software licenses, the underlying application infrastructure and middleware,
container orchestrators such as Kubernetes, or the development tools and
other resources. You manage the applications and services you develop, and
the cloud service provider typically manages everything else.
Additional services. PaaS providers may offer other services that enhance
applications, such as workflow, directory, security, and scheduling.
Advantages of PaaS
By delivering infrastructure as a service, PaaS offers the same advantages as
IaaS. But its additional features—middleware, development tools, and other
business tools—give you more advantages:
Cut coding time. PaaS development tools can cut the time it takes to code
new apps with pre-coded application components built into the platform,
such as workflow, directory services, security features, search, and so on.
SaaS (Software as a Service)
Many SaaS services for personal use are free. For organizational
use, you can rent productivity apps, such as email, collaboration, and
calendaring, and sophisticated business applications such as customer
relationship management (CRM), enterprise resource planning (ERP), and
document management. You pay for the use of these apps by subscription or
according to the level of use.
Advantages of SaaS
Gain access to sophisticated applications. To provide SaaS apps to users,
you don’t need to purchase, install, update, or maintain any hardware,
middleware, or software. SaaS makes even sophisticated enterprise
applications, such as ERP and CRM, affordable for organizations that lack the
resources to buy, deploy, and manage the required infrastructure and
software themselves.
Pay only for what you use. You also save money because the SaaS service
automatically scales up and down according to the level of usage.
Use free client software. Users can run most SaaS apps directly from their
web browser without needing to download and install any software, although
some apps require plugins. This means that you don’t need to purchase and
install special software for your users.
Access app data from anywhere. With data stored in the cloud, users can
access their information from any Internet-connected computer or mobile
device. And when app data is stored in the cloud, no data is lost if a user’s
computer or device fails.