Azure Data Factory

The document provides comprehensive documentation for Azure Data Factory, detailing its features, functionalities, and usage for data integration and transformation workflows. It includes sections on quickstarts, tutorials, concepts, how-to guides, scenarios, and reference materials, covering various data sources and activities. Azure Data Factory is positioned as a cloud-based service for orchestrating data movement and transformation, enabling businesses to leverage both on-premises and cloud data effectively.

Contents

Data Factory Documentation


Switch to version 1 documentation
Overview
Introduction to Data Factory
What's New in Azure Data Factory
Compare current version to version 1
Quickstarts
Create data factory - User interface (UI)
Create data factory - Copy data tool
Create data factory - Azure CLI
Create data factory - Azure PowerShell
Create data factory - .NET
Create data factory - Python
Create data factory - REST
Create data factory - ARM template
Create data flow
Tutorials
List of tutorials
Copy and ingest data
From Azure Blob Storage to Azure SQL Database
Copy data tool
User interface (UI)
.NET
From a SQL Server database to Azure Blob Storage
Copy data tool
User interface (UI)
Azure PowerShell
From Amazon Web Services S3 to Azure Data Lake Storage
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen1
From Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2
From Azure SQL Database to Azure Synapse Analytics
From SAP BW to Azure Data Lake Storage Gen2
From Office 365 to Azure Blob storage
Multiple tables in bulk
User interface (UI)
Azure PowerShell
Incrementally load data
From one Azure SQL Database table
User interface (UI)
Azure PowerShell
From multiple SQL Server database tables
User interface (UI)
Azure PowerShell
Using change tracking information in SQL Server
User interface (UI)
Azure PowerShell
Using CDC in Azure SQL MI
User interface (UI)
New files by last modified date
New files by time partitioned file name
Build a copy pipeline using managed VNet and private endpoints
Transform data
Transform data with mapping data flows
Best practices for landing data in the lake with ADLS Gen2
Dynamically set column names
Transform data in the lake with Delta Lake
Transform data with mapping data flows
Mapping data flow video tutorials
Prepare data with wrangling
Using external services
HDInsight Spark
User interface (UI)
Azure PowerShell
Databricks Notebook
User interface (UI)
Hive transformation in virtual network
User interface (UI)
Azure PowerShell
Build mapping dataflow pipeline using managed VNet and private endpoints
Control Flow
User interface (UI)
.NET
Run SSIS packages in Azure
User interface (UI)
Azure PowerShell
Join virtual network
Lineage
Push Data Factory lineage data to Azure Purview
End-to-end labs
Data integration using data factory and data share
Managed virtual network
Access on-premises SQL Server
Access SQL Managed Instance
Samples
Code samples
Azure PowerShell
Concepts
Pipelines and activities
Linked services
Datasets
Pipeline execution and triggers
Integration runtime
Data flows
Transform data with mapping data flows
Mapping data flow overview
Debug mode
Schema drift
Column patterns
Data flow monitoring
Data flow performance
Manage data flow canvas
Expression builder
Expression language
Prepare data with Power Query data wrangling
Data wrangling overview
Supported functions
Roles and permissions
Naming rules
Data redundancy
How-to guides
Author
Visually author data factories
Iterative development and debugging
Management hub
Source control
Continuous integration and delivery
Automated publishing for CI/CD
Connectors
Connector overview
Amazon Marketplace Web Service
Amazon Redshift
Amazon S3
Amazon S3 Compatible Storage
Avro format
Azure Blob Storage
Azure Cognitive Search
Azure Cosmos DB SQL API
Azure Cosmos DB's API for MongoDB
Azure Data Explorer
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Azure Database for MariaDB
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Databricks Delta Lake
Azure File Storage
Azure SQL Database
Azure SQL Managed Instance
Azure Synapse Analytics
Azure Table Storage
Binary format
Cassandra
Common Data Model format
Concur
Couchbase
DB2
Dataverse
Delimited text format
Delta format
Drill
Dynamics 365
Dynamics AX
Dynamics CRM
Excel format
File System
FTP
GitHub
Google AdWords
Google BigQuery
Google Cloud Storage
Greenplum
HBase
HDFS
Hive
HTTP
HubSpot
Impala
Informix
Jira
JSON format
Magento
MariaDB
Marketo
Microsoft Access
MongoDB
MongoDB (legacy)
MongoDB Atlas
MySQL
Netezza
OData
ODBC
Office 365
Oracle
Oracle Cloud Storage
Oracle Eloqua
Oracle Responsys
Oracle Service Cloud
ORC format
Parquet format
PayPal
Phoenix
PostgreSQL
Presto
QuickBooks Online
REST
Salesforce
Salesforce Service Cloud
Salesforce Marketing Cloud
SAP Business Warehouse Open Hub
Load SAP BW data
SAP Business Warehouse MDX
SAP Cloud for Customer
SAP ECC
SAP HANA
SAP Table
ServiceNow
SFTP
SharePoint Online List
Shopify
Snowflake
Spark
SQL Server
Square
Sybase
Teradata
Vertica
Web Table
Xero
XML format
Zoho
Move data
Copy data using copy activity
Monitor copy activity
Delete files using Delete activity
Copy data tool
Metadata driven copy data
Format and compression support
Copy activity performance
Performance and scalability guide
Troubleshoot performance
Performance features
Preserve metadata and ACLs
Schema and type mapping
Fault tolerance
Data consistency verification
Copy activity log
Format and compression support (legacy)
Transform data
Execute Data Flow activity
Execute Power Query activity
Azure Function activity
Custom activity
Databricks Jar activity
Databricks Notebook activity
Databricks Python activity
Data Explorer Command activity
Data Lake U-SQL activity
HDInsight Hive activity
HDInsight MapReduce activity
HDInsight Pig activity
HDInsight Spark activity
HDInsight Streaming activity
Machine Learning Execute Pipeline activity
Machine Learning Studio (classic) Batch Execution activity
Machine Learning Studio (classic) Update Resource activity
Stored Procedure activity
Compute linked services
Control flow
Append Variable activity
Execute Pipeline activity
Filter activity
For Each activity
Get Metadata activity
If Condition activity
Lookup activity
Set Variable activity
Switch activity
Until activity
Validation activity
Wait activity
Web activity
Webhook activity
Data flow transformations
Transformation overview
Aggregate
Alter row
Conditional split
Derived column
Exists
Filter
Flatten
Join
Lookup
New branch
Parse
Pivot
Rank
Select
Sink
Sort
Source
Surrogate key
Union
Unpivot
Window
Parameterize
Parameterizing linked services
Global parameters
Expression Language
System variables
Parameterizing mapping data flows
How to parameterize
Security
Data movement security considerations
Data access strategies
Azure integration runtime IP addresses
Store credentials in Azure Key Vault
Use Azure Key Vault secrets in pipeline activities
Encrypt credentials for self-hosted integration runtime
Managed identity for Data Factory
Encrypt data factory with customer managed key
Managed virtual network
Azure private link for Data Factory
Azure security baseline
Monitor and manage
Monitor visually
Monitor with Azure Monitor
Monitor SSIS with Azure Monitor
Monitor with SDKs
Monitor integration runtime
Monitor Azure-SSIS integration runtime
Reconfigure Azure-SSIS integration runtime
Copy or clone a data factory
Create integration runtime
Azure integration runtime
Self-hosted integration runtime
Create and configure a self-hosted integration runtime
Self-hosted integration runtime auto-update and expire notification
Shared self-hosted integration runtime
Automation scripts of self-hosted integration runtime
Run Self-Hosted Integration Runtime in Windows container
Azure-SSIS integration runtime
Run SSIS packages in Azure
Run SSIS packages in Azure from SSDT
Run SSIS packages with Azure SQL Managed Instance Agent
Run SSIS packages with Azure-enabled dtexec
Run SSIS packages with Execute SSIS Package activity
Run SSIS packages with Stored Procedure activity
Schedule Azure-SSIS integration runtime
Join Azure-SSIS IR to a virtual network
Configure Self-Hosted IR as a proxy for Azure-SSIS IR
Enable Azure AD authentication for Azure-SSIS IR
Connect to data with Windows Authentication
Save files and connect to file shares
Provision Enterprise Edition for Azure-SSIS IR
Built-in and preinstalled components on Azure-SSIS IR
Customize setup for Azure-SSIS IR
Install licensed components for Azure-SSIS IR
Configure high performance for Azure-SSIS IR
Configure disaster recovery for Azure-SSIS IR
Clean up SSISDB logs automatically
Use Azure SQL Managed Instance with Azure-SSIS IR
Migrate SSIS jobs with SSMS
Manage packages with Azure-SSIS IR package store
Create triggers
Create a schedule trigger
Create a tumbling window trigger
Create a tumbling window trigger dependency
Create a storage event trigger
Create a custom event trigger
Reference trigger metadata in pipeline
Data Catalog and Governance
Connect a Data Factory to Azure Purview
Discover and explore data in ADF using Purview
Scenarios
Data migration for data lake & EDW
Why Azure Data Factory
Migrate data from AWS S3 to Azure
Migrate data from on-premises Hadoop cluster to Azure
Migrate data from on-premises Netezza server to Azure
Azure Machine Learning
Data ingestion
Transformation using mapping data flow
Process fixed-width text files
Error row handling
Azure SQL DB to Azure Cosmos DB
Dedupe and null check with snippets
Process data from aml models using data flow
SSIS migration from on-premises
SSIS migration overview
SSISDB migration to Azure SQL Managed Instance
Templates
Overview of templates
Copy files from multiple containers
Copy new files by LastModifiedDate
Bulk copy from database
Bulk copy from files to database
Delta copy from database
Migrate data from Amazon S3 to Azure Storage
Move files
Transformation with Azure Databricks
Understanding pricing
Data flow reserved capacity overview
Data flow understand reservation charges
Plan and manage costs
Pricing examples
Troubleshooting guides
Data Factory UX
Activities
Connectors
Pipeline Triggers
Data Flows
Data flows overview
Connector and format
Continuous Integration and Deployment
Security and access control
Self-hosted Integration Runtimes
Azure-SSIS Integration Runtime
Package Execution in Azure-SSIS IR
Diagnose connectivity in Azure-SSIS IR
Reference
Data flow script
.NET
PowerShell
REST API
Resource Manager template
Python
Azure Policy built-ins
Azure CLI
Resources
Whitepapers
FAQ
Service updates
Blog
Ask a question - Microsoft Q&A question page
Ask a question - Stack Overflow
Request a feature
Pricing
Availability by region
Support options
Limits
Introduction to Azure Data Factory
5/4/2021 • 10 minutes to read

NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.

What is Azure Data Factory?


In the world of big data, how is existing data leveraged in business? Is it possible to enrich data that's generated
in the cloud by using reference data from on-premises data sources or other disparate data sources?
For example, a gaming company collects logs that are produced by games in the cloud. It wants to analyze these
logs to gain insights into customer preferences, demographics, usage behavior, and so on. The company also
wants to identify up-sell and cross-sell opportunities, develop compelling new features to drive business
growth, and provide a better experience to customers.
To analyze these logs, the company needs to use the reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. Therefore, the company
wants to ingest log data from the cloud data store and reference data from the on-premises data store.
Next they want to process the data by using Hadoop in the cloud (Azure HDInsight). They want to publish the
result data into a cloud data warehouse such as Azure Synapse Analytics or an on-premises data store such as
SQL Server. The company wants this workflow to run once a week.
The company needs a platform where they can create a workflow that can ingest data from both on-premises
and cloud data stores. The company also needs to be able to transform or process data by using existing
compute services such as Hadoop, and publish the results to an on-premises or cloud data store for BI
applications to consume.

Azure Data Factory is the platform for these kinds of scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure Synapse Analytics for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform rather than a traditional Extract-
Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by
adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.

How does it work?


The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps:

Connect and collect


Enterprises have data of various types that are located in disparate sources. The first step in building an
information production system is to connect to all the required sources of data and processing. These sources
include SaaS services, file shares, FTP, and web services. Then move the data as needed to a centralized location
for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. It is expensive and hard to integrate and maintain such systems.
These systems also often lack the enterprise grade monitoring, alerting, and controls that a fully managed
service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis.
For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data
Lake Analytics compute service. Or, collect data in Azure blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform it by using compute services
such as HDInsight Hadoop, Spark, Data Lake Analytics, or Machine Learning. You want to reliably produce
transformed data on a maintainable and controlled schedule to feed production environments with trusted data.
Publish
Deliver transformed data from the cloud to on-premises sources such as SQL Server. Alternatively, keep it in
your cloud storage sources for consumption by BI and analytics tools and other applications.

Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you
can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a
Hive query on an HDInsight cluster to partition the data. The benefit of this is that the pipeline allows you to
manage the activities as a set instead of each one individually. For example, you can deploy and schedule the
pipeline, instead of scheduling independent activities.
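To illustrate, a rough, abbreviated sketch of what such a pipeline definition can look like in Data Factory version 1 JSON follows; the names are hypothetical, the schedule settings are simplified, and required fields such as activity policies and the Hive script linked service are omitted for brevity.

{
  "name": "PartitionGameLogsPipeline",
  "properties": {
    "description": "Ingest game logs from an Azure blob, then partition them with a Hive query on HDInsight",
    "activities": [
      {
        "name": "IngestLogsFromBlob",
        "type": "Copy",
        "inputs": [ { "name": "RawGameLogsBlob" } ],
        "outputs": [ { "name": "StagedGameLogs" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      },
      {
        "name": "PartitionWithHive",
        "type": "HDInsightHive",
        "inputs": [ { "name": "StagedGameLogs" } ],
        "outputs": [ { "name": "PartitionedGameLogs" } ],
        "linkedServiceName": "HDInsightLinkedService",
        "typeProperties": { "scriptPath": "scripts/partitionlogs.hql" }
      }
    ],
    "start": "2021-05-01T00:00:00Z",
    "end": "2021-05-08T00:00:00Z"
  }
}

Deploying, pausing, or scheduling this definition acts on both activities together, which is the benefit of grouping them into a pipeline.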
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example,
you can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive
activity. A Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data
Factory supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory
supports the following data stores:

Category | Data store | Supported as a source | Supported as a sink
Azure | Azure Blob storage | ✓ | ✓
Azure | Azure Cosmos DB (SQL API) | ✓ | ✓
Azure | Azure Data Lake Storage Gen1 | ✓ | ✓
Azure | Azure SQL Database | ✓ | ✓
Azure | Azure Synapse Analytics | ✓ | ✓
Azure | Azure Cognitive Search Index | | ✓
Azure | Azure Table storage | ✓ | ✓
Databases | Amazon Redshift | ✓ |
Databases | DB2* | ✓ |
Databases | MySQL* | ✓ |
Databases | Oracle* | ✓ | ✓
Databases | PostgreSQL* | ✓ |
Databases | SAP Business Warehouse* | ✓ |
Databases | SAP HANA* | ✓ |
Databases | SQL Server* | ✓ | ✓
Databases | Sybase* | ✓ |
Databases | Teradata* | ✓ |
NoSQL | Cassandra* | ✓ |
NoSQL | MongoDB* | ✓ |
File | Amazon S3 | ✓ |
File | File System* | ✓ | ✓
File | FTP | ✓ |
File | HDFS* | ✓ |
File | SFTP | ✓ |
Others | Generic HTTP | ✓ |
Others | Generic OData | ✓ |
Others | Generic ODBC* | ✓ |
Others | Salesforce | ✓ |
Others | Web Table (table from HTML) | ✓ |
For more information, see Move data by using Copy Activity.


Data transformation activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.

Data transformation activity | Compute environment
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Spark | HDInsight [Hadoop]
Azure Machine Learning Studio (classic) activities: Batch Execution and Update Resource | Azure VM
Stored Procedure | Azure SQL, Azure Synapse Analytics, or SQL Server
Data Lake Analytics U-SQL | Azure Data Lake Analytics
DotNet | HDInsight [Hadoop] or Azure Batch


NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from
Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
See Run R Script using Azure Data Factory.

For more information, see the Data transformation activities article.


Custom .NET activities
Create a custom .NET activity if you need to move data to or from a data store that Copy Activity doesn't support
or if you need to transform data by using your own logic. For details about how to create and use a custom
activity, see Use custom activities in an Azure Data Factory pipeline.
Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data
structures within the data stores. These structures point to or reference the data you want to use in your
activities (such as inputs or outputs).
For example, an Azure blob dataset specifies the blob container and folder in the Azure blob storage from which
the pipeline should read the data. Or an Azure SQL table dataset specifies the table to which the output data is
written by the activity.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the
data source and a dataset represents the structure of the data.
For example, an Azure Storage-linked service specifies a connection string with which to connect to the Azure
Storage account. An Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two reasons in Data Factory:
To represent a data store that includes, but isn't limited to, a SQL Server database, Oracle database, file
share, or Azure blob storage account. See the Data movement activities section for a list of supported
data stores.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsightHive activity runs on an HDInsight Hadoop cluster. See the Data transformation activities
section for a list of supported compute environments.
Relationship between Data Factory entities

Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores. It also lets you process data by using compute services in
other regions or in an on-premises environment. It also allows you to monitor and manage workflows by using
both programmatic and UI mechanisms.
Data Factory is available in only West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall,
then a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For an example, let's assume that your compute environments such as Azure HDInsight cluster and Azure
Machine Learning are located in the West Europe region. You can create and use an Azure Data Factory instance
in North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a
few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the
job on your computing environment does not change.

Get started with creating a pipeline


You can use one of these tools or APIs to create data pipelines in Azure Data Factory:
Visual Studio
PowerShell
.NET API
REST API
Azure Resource Manager template
To learn how to build data factories with data pipelines, follow the step-by-step instructions in the following
tutorials:

Tutorial | Description
Move data between two cloud data stores | Create a data factory with a pipeline that moves data from blob storage to SQL Database.
Transform data by using Hadoop cluster | Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.
Move data between an on-premises data store and a cloud data store by using Data Management Gateway | Build a data factory with a pipeline that moves data from a SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.
What is Azure Data Factory?
7/16/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory and Azure Synapse Analytics


In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage
systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful
insights to analysts, data scientists, or business decision makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of
raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these
complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in
the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics,
and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new
features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. The company wants to
utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud data
store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight),
and publish the transformed data into a cloud data warehouse such as Azure Synapse Analytics to easily build a
report on top of it. They want to automate this workflow, and monitor and manage it on a daily schedule. They
also want to execute it when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration
service that allows you to create data-driven workflows for orchestrating data movement and transforming data
at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can
ingest data from disparate data stores. You can build complex ETL processes that transform data visually with
data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL
Database.
Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business
intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into
meaningful data stores and data lakes for better business decisions.
How does it work?
Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data
engineers.
A visual guide in the online documentation provides a high-level overview of the Data Factory architecture; a high-resolution version of the diagram is available there.
Connect and collect
Enterprises have data of various types that are located in disparate sources on-premises, in the cloud, structured,
unstructured, and semi-structured, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data and
processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next
step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. It's expensive and hard to integrate and maintain such systems. In
addition, they often lack the enterprise-grade monitoring, alerting, and the controls that a fully managed service
can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can
collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics
compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
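As a concrete illustration of the Copy Activity described above, here is a minimal, hedged sketch of a copy activity in the JSON format used by the current version of Data Factory; the activity and dataset names are hypothetical, and format-specific settings are left at their defaults.

{
  "name": "CopyLogsToDataLake",
  "type": "Copy",
  "description": "Move raw log files from Blob storage into a staging location for later processing",
  "inputs": [ { "referenceName": "RawLogsBlobDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "DataLakeStagingDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "DelimitedTextSink" }
  }
}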
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the collected data by using
ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs
that execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for executing your
transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine
Learning.
CI/CD and publish
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows
you to incrementally develop and deliver your ETL processes before publishing the finished product. After the
raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse,
Azure SQL Database, Azure CosmosDB, or whichever analytics engine your business users can point to from
their business intelligence tools.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from
refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has
built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health
panels on the Azure portal.

Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of the following key components:
Pipelines
Activities
Datasets
Linked services
Data Flows
Integration Runtimes
These components work together to provide the platform on which you can compose data-driven workflows
with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of
activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition
the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one
individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate
independently in parallel.
Mapping data flows
Create and manage graphs of data transformation logic that you can use to transform any-sized data. You can
build up a reusable library of data transformation routines and execute those processes in a scaled-out manner
from your ADF pipelines. Data Factory will execute your logic on a Spark cluster that spins up and spins down
when you need it. You won't ever have to manage or maintain clusters.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from
one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an
Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data
movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you want
to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the
data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked service
specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset
specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, a SQL Server database, Oracle database, file
share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
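To make the relationship between a linked service and a dataset concrete, here is a minimal sketch in the current version's JSON format; the connection string is a placeholder, and the names reuse the storage container and folder from the quickstart later in this document.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "description": "How to connect: the storage account connection information",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}

{
  "name": "InputDataset",
  "properties": {
    "type": "DelimitedText",
    "description": "What the data is: the input folder of the adftutorial container in that account",
    "linkedServiceName": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "adftutorial", "folderPath": "input" }
    }
  }
}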
Integration Runtime
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a
compute service. An integration runtime provides the bridge between the activity and linked services. It's
referenced by the linked service or activity, and provides the compute environment where the activity either
runs or is dispatched from. This way, the activity can be performed in the closest possible region to the
target data store or compute service, in the most performant way, while meeting security and compliance needs.
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There
are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the
trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
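A hedged sketch of how a pipeline-level parameter is declared and then consumed through the expression language follows; the pipeline, dataset, and parameter names are illustrative, and the referenced input dataset is assumed to declare a matching folderPath parameter.

{
  "name": "IngestByFolderPipeline",
  "properties": {
    "parameters": {
      "sourceFolder": { "type": "String", "defaultValue": "input" }
    },
    "activities": [
      {
        "name": "CopyFolder",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "InputDataset",
            "type": "DatasetReference",
            "parameters": { "folderPath": "@pipeline().parameters.sourceFolder" }
          }
        ],
        "outputs": [ { "referenceName": "OutputDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}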
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
Variables
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with
parameters to enable passing values between pipelines, data flows, and other activities.
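For example, a pipeline can declare a variable and populate it with a Set Variable activity; a small sketch with illustrative names:

{
  "name": "TrackRunStatePipeline",
  "properties": {
    "variables": {
      "lastRunMessage": { "type": "String" }
    },
    "activities": [
      {
        "name": "RecordStartMessage",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "lastRunMessage",
          "value": "@concat('Run started at ', utcnow())"
        }
      }
    ]
  }
}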

Next steps
Here are important next step documents to explore:
Dataset and linked services
Pipelines and activities
Integration runtime
Mapping Data Flows
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
What's New in Azure Data Factory
7/15/2021 • 2 minutes to read

Azure Data Factory receives improvements on an ongoing basis. To stay up to date with the most recent
developments, this article provides you with information about:
The latest releases
Known issues
Bug fixes
Deprecated functionality
Plans for changes
This page will be updated monthly, so revisit it regularly.

June 2021

Service category | Service improvements | Details
Data Movement | New user experience with Azure Data Factory Copy Data Tool | Redesigned Copy Data Tool is now available with improved data ingestion experience. Learn more
Data Movement | MongoDB and MongoDB Atlas are supported as both source and sink | This improvement supports copying data between any supported data store and MongoDB or MongoDB Atlas database. Learn more
Data Movement | Always Encrypted is supported for Azure SQL Database, Azure SQL Managed Instance, and SQL Server connectors as both source and sink | Always Encrypted is available in Azure Data Factory for Azure SQL Database, Azure SQL Managed Instance, and SQL Server connectors for copy activity. Learn more
Data Movement | Setting custom metadata is supported in copy activity when sinking to ADLS Gen2 or Azure Blob | When writing to ADLS Gen2 or Azure Blob, copy activity supports setting custom metadata or storing the source file's last modified info as metadata. Learn more
Data Flow | SQL Server is now supported as a source and sink in data flows | SQL Server is now supported as a source and sink in data flows. Follow the link for instructions on how to configure your networking using the Azure Integration Runtime managed VNET feature to talk to your on-premises and cloud VM-based SQL Server instances. Learn more
Data Flow | Data flow cluster quick reuse is now enabled by default for all new Azure Integration Runtimes | ADF is happy to announce the general availability of the popular data flow quick start-up reuse feature. All new Azure Integration Runtimes will now have quick reuse enabled by default. Learn more
Data Flow | Power Query activity in ADF public preview | You can now build complex field mappings to your Power Query sink using Azure Data Factory data wrangling. The sink is now configured in the pipeline in the Power Query (Preview) activity to accommodate this update. Learn more
Data Flow | Updated data flows monitoring UI in Azure Data Factory | Azure Data Factory has a new update for the monitoring UI to make it easier to view your data flow ETL job executions and quickly identify areas for performance tuning. Learn more
SQL Server Integration Services (SSIS) | Run any SQL anywhere in 3 simple steps with SSIS in Azure Data Factory | This post provides 3 simple steps to run any SQL statements/scripts anywhere with SSIS in Azure Data Factory: 1. Prepare your Self-Hosted Integration Runtime/SSIS Integration Runtime. 2. Prepare an Execute SSIS Package activity in an Azure Data Factory pipeline. 3. Run the Execute SSIS Package activity on your Self-Hosted Integration Runtime/SSIS Integration Runtime. Learn more

More information
Blog - Azure Data Factory
Stack Overflow forum
Twitter
Videos
Compare Azure Data Factory with Data Factory
version 1
3/5/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory and Azure Synapse Analytics


This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see
Introduction to Data Factory. For an introduction to Data Factory version 1, see Introduction to Azure Data
Factory.

Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.

Feature | Version 1 | Current version
Datasets | A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data. Availability defines the processing window slicing model for the dataset (for example, hourly, daily, and so on). | Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets.
Linked services | Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources. | Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage.
Pipelines | A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines. | Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as "workflow units" that you schedule separately via triggers. Pipelines do not have "windows" of time execution in the current version of Data Factory. The Data Factory V1 concepts of startTime, endTime, and isPaused are no longer present in the current version of Data Factory. For more information, see Pipeline execution and triggers and Pipelines and activities.
Activities | Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported. | In the current version of Data Factory, activities still are defined actions within a pipeline. The current version of Data Factory introduces new control flow activities. You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version.
Hybrid data movement and activity dispatch | Now called Integration Runtime, Data Management Gateway supported moving data between on-premises and cloud. | Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1. The Azure-SSIS Integration Runtime in the current version of Data Factory also supports deploying and running SQL Server Integration Services (SSIS) packages in the cloud. For more information, see Integration runtime in Azure Data Factory.
Parameters | NA | Parameters are key-value pairs of read-only configuration settings that are defined in pipelines. You can pass arguments for the parameters when you are manually running the pipeline. If you are using a scheduler trigger, the trigger can pass values for the parameters too. Activities within the pipeline consume the parameter values.
Expressions | Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties. | In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory.
Pipeline runs | NA | A single instance of a pipeline execution. For example, say you have a pipeline that executes at 8 AM, 9 AM, and 10 AM. There would be three separate runs of the pipeline (pipeline runs) in this case. Each pipeline run has a unique pipeline run ID. The pipeline run ID is a GUID that uniquely defines that particular pipeline run. Pipeline runs are typically instantiated by passing arguments to parameters that are defined in the pipelines.
Activity runs | NA | An instance of an activity execution within a pipeline.
Trigger runs | NA | An instance of a trigger execution. For more information, see Triggers.
Scheduling | Scheduling is based on pipeline start/end times and dataset availability. | Scheduler trigger or execution via external scheduler. For more information, see Pipeline execution and triggers.
The following sections provide more information about the capabilities of the current version.

Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
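A trimmed sketch of the dependsOn shape follows; the activity names are placeholders, and the type-specific properties of each activity are omitted for brevity.

{
  "activities": [
    {
      "name": "CopyRawData",
      "type": "Copy"
    },
    {
      "name": "TransformWithHive",
      "type": "HDInsightHive",
      "dependsOn": [
        { "activity": "CopyRawData", "dependencyConditions": [ "Succeeded" ] }
      ]
    }
  ]
}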
Branching activities
In the current version, you can branch activities within a pipeline. The If Condition activity provides the same
functionality that an if statement provides in programming languages. It evaluates a set of activities when the
condition evaluates to true and another set of activities when the condition evaluates to false. For examples
of branching activities, see the Branching and chaining activities tutorial.
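A hedged sketch of an If Condition activity that branches on the output of an upstream copy activity; all names are illustrative, and the rowsCopied path assumes a standard copy activity output.

{
  "name": "CheckRowsCopied",
  "type": "IfCondition",
  "dependsOn": [ { "activity": "CopyRawData", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "expression": {
      "value": "@greater(activity('CopyRawData').output.rowsCopied, 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "ProcessNewRows",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "ProcessPipeline", "type": "PipelineReference" }
        }
      }
    ],
    "ifFalseActivities": [
      { "name": "NothingToDo", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
    ]
  }
}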
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-
demand or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more
information, see Pipelines and triggers.
Custom state passing
Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value . By using this feature, you can build workflows where values
can pass through activities.
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
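A rough sketch of a ForEach activity that copies a list of tables, assuming the pipeline declares a tableList array parameter and the source dataset declares a tableName parameter; all names are hypothetical.

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SourceTableDataset",
            "type": "DatasetReference",
            "parameters": { "tableName": "@item()" }
          }
        ],
        "outputs": [ { "referenceName": "StagingBlobDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}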

Trigger-based flows
Pipelines can be triggered on demand, by events (for example, a blob posted to a container), or by wall-clock time. The pipelines and
triggers article has detailed information about triggers.
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is "delta loads," in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and
control flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial:
Incremental copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.

Control activity | Description
ForEach activity | Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Web activity | Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup activity | Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities.
Get metadata activity | Retrieves the metadata of any data in Azure Data Factory.
Wait activity | Pauses the pipeline for a specified period of time.

Deploy SSIS packages to Azure


If you want to move your SSIS workloads to the cloud, create a data factory by using the
current version and provision an Azure-SSIS Integration Runtime.
The Azure-SSIS Integration Runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to
running your SSIS packages in the cloud. After you provision Azure-SSIS Integration Runtime, you can use the
same tools that you have been using to deploy SSIS packages to an on-premises SSIS environment.
For example, you can use SQL Server Data Tools or SQL Server Management Studio to deploy SSIS packages to
this runtime on Azure. For step-by-step instructions, see the tutorial Deploy SQL Server integration services
packages to Azure.

Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have "windows" of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.
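For reference, a minimal schedule trigger sketch in the current version's JSON format; the recurrence values, pipeline name, and parameter are illustrative only.

{
  "name": "DailyRunTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-06-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "IngestByFolderPipeline", "type": "PipelineReference" },
        "parameters": { "sourceFolder": "input" }
      }
    ]
  }
}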

Support for more data stores


The current version supports the copying of data to and from more data stores than V1. For a list of supported
data stores, see the following articles:
Version 1 - supported data stores
Current version - supported data stores

Support for on-demand Spark cluster


The current version supports the creation of an on-demand Azure HDInsight Spark cluster. To create an on-
demand Spark cluster, specify the cluster type as Spark in your on-demand, HDInsight linked service definition.
Then you can configure the Spark activity in your pipeline to use this linked service.
At runtime, when the activity is executed, the Data Factory service automatically creates the Spark cluster for
you. For more information, see the following articles:
Spark Activity in the current version of Data Factory
Azure HDInsight on-demand linked service
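A trimmed sketch of an on-demand HDInsight linked service with the cluster type set to Spark; only a few properties are shown, the values are placeholders, and in practice additional settings (subscription, cluster resource group, and service principal details) are required.

{
  "name": "OnDemandSparkLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "description": "The cluster is created when the Spark activity runs and is deleted after the time-to-live expires",
    "typeProperties": {
      "clusterType": "spark",
      "clusterSize": 2,
      "timeToLive": "00:15:00",
      "linkedServiceName": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" }
    }
  }
}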

Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code
in .NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.

SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and
monitor pipelines.
.NET SDK: The .NET SDK is updated in the current version.
PowerShell: The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK: This SDK is new in the current version.
REST API: The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.

Authoring experience
Authoring method | Version 2 | Version 1
Azure portal | Yes | No
Azure PowerShell | Yes | Yes
.NET SDK | Yes | Yes
REST API | Yes | Yes
Python SDK | Yes | No
Resource Manager template | Yes | Yes

Roles and permissions


The Data Factory version 1 Contributor role can be used to create and manage the current version of Data
Factory resources. For more info, see Data Factory Contributor.

Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.

Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts:
PowerShell, .NET, Python, REST API.
Quickstart: Create a data factory by using the Azure
Data Factory UI
7/7/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory and Azure Synapse Analytics


This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory. The pipeline
that you create in this data factory copies data from one folder to another folder in Azure Blob storage. To
transform data by using Azure Data Factory, see Mapping data flow.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select the "..." icon for
more options, and then select My permissions. If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers.
2. On the <Account name> - Containers page's toolbar, select Container.
3. In the New container dialog box, enter adftutorial for the name, and then select OK. The <Account
name> - Containers page is updated to include adftutorial in the list of containers.

Add an input folder and file for the blob container


In this section, you create a folder named input in the container you created, and then upload a sample file to
the input folder. Before you begin, open a text editor such as Notepad, and create a file named emp.txt with the
following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial.
2. On the adftutorial container page's toolbar, select Upload.
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading.
5. In the Upload to folder box, enter input.
6. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
7. Select the Close icon (an X) to close the Upload blob page.
Keep the adftutorial container page open. You use it to verify the output at the end of this quickstart.
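If you prefer to script the container creation and upload instead of clicking through the portal, the following Azure PowerShell sketch does the same thing. It assumes the Az.Storage module, your own resource group and storage account names, and the emp.txt file saved under C:\ADFv2QuickStartPSH.

# Get a storage context for the account (replace the placeholders).
$ctx = (Get-AzStorageAccount -ResourceGroupName "<your-resource-group>" `
    -Name "<yourstorageaccount>").Context

# Create the adftutorial container and upload emp.txt into the input folder.
New-AzStorageContainer -Name "adftutorial" -Context $ctx
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" `
    -Container "adftutorial" -Blob "input/emp.txt" -Context $ctx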
Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure portal.
3. From the Azure portal menu, select Create a resource .
4. Select Integration , and then select Data Factory .

5. On the Create Data Factory page, under Basics tab, select your Azure Subscription in which you
want to create the data factory.
6. For Resource Group , take one of the following steps:
a. Select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a new resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
7. For Region , select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory metadata
will be stored. The associated data stores (like Azure Storage and Azure SQL Database) and computes
(like Azure HDInsight) that Data Factory uses can run in other regions.
8. For Name , enter ADFTutorialDataFactory . The name of the Azure data factory must be globally
unique. If you see the following error, change the name of the data factory (for example,
<yourname>ADFTutorialDataFactory ) and try creating again. For naming rules for Data Factory
artifacts, see the Data Factory - naming rules article.

9. For Version , select V2 .


10. Select Next: Git configuration , and then select Configure Git later check box.
11. Select Review + create , and select Create after the validation is passed. After the creation is complete,
select Go to resource to navigate to the Data Factory page.
12. Select Open on the Open Azure Data Factory Studio tile to start the Azure Data Factory user interface
(UI) application on a separate browser tab.
NOTE
If you see that the web browser is stuck at "Authorizing", clear the Block third-party cookies and site data
check box. Or keep it selected, create an exception for login.microsoftonline.com , and then try to open the
app again.

Create a linked service


In this procedure, you create a linked service to link your Azure Storage account to the data factory. The linked
service has the connection information that the Data Factory service uses at runtime to connect to it.
1. On the Azure Data Factory UI page, open Manage tab from the left pane.
2. On the Linked services page, select +New to create a new linked service.
3. On the New Linked Service page, select Azure Blob Storage , and then select Continue .
4. On the New Linked Service (Azure Blob Storage) page, complete the following steps:
a. For Name , enter AzureStorageLinkedService .
b. For Storage account name , select the name of your Azure Storage account.
c. Select Test connection to confirm that the Data Factory service can connect to the storage account.
d. Select Create to save the linked service.
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset . These datasets are of type
AzureBlob . They refer to the Azure Storage linked service that you created in the previous section.
The input dataset represents the source data in the input folder. In the input dataset definition, you specify the
blob container (adftutorial ), the folder (input ), and the file (emp.txt ) that contain the source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition, you
specify the blob container (adftutorial ), the folder (output ), and the file to which the data is copied. Each run of
a pipeline has a unique ID associated with it. You can access this ID by using the system variable RunId . The
name of the output file is dynamically evaluated based on the run ID of the pipeline.
In the linked service settings, you specified the Azure Storage account that contains the source data. In the
source dataset settings, you specify where exactly the source data resides (blob container, folder, and file). In the
sink dataset settings, you specify where the data is copied to (blob container, folder, and file).
1. Select Author tab from the left pane.
2. Select the + (plus) button, and then select Dataset .

3. On the New Dataset page, select Azure Blob Storage , and then select Continue .
4. On the Select Format page, choose the format type of your data, and then select Continue . In this case,
select Binary to copy files as-is without parsing the content.
5. On the Set Properties page, complete the following steps:
a. Under Name , enter InputDataset .
b. For Linked service , select AzureStorageLinkedService .
c. For File path , select the Browse button.
d. In the Choose a file or folder window, browse to the input folder in the adftutorial container, select
the emp.txt file, and then select OK .
e. Select OK .

6. Repeat the steps to create the output dataset:


a. Select the + (plus) button, and then select Dataset .
b. On the New Dataset page, select Azure Blob Storage , and then select Continue .
c. On the Select Format page, choose the format type of your data, and then select Continue .
d. On the Set Properties page, specify OutputDataset for the name. Select
AzureStorageLinkedService as linked service.
e. Under File path , enter adftutorial/output . If the output folder doesn't exist, the copy activity creates
it at runtime.
f. Select OK .
Create a pipeline
In this procedure, you create and validate a pipeline with a copy activity that uses the input and output datasets.
The copy activity copies data from the file you specified in the input dataset settings to the file you specified in
the output dataset settings. If the input dataset specifies only a folder (not the file name), the copy activity copies
all the files in the source folder to the destination.
1. Select the + (plus) button, and then select Pipeline .
2. In the General panel under Properties , specify CopyPipeline for Name . Then collapse the panel by
clicking the Properties icon in the top-right corner.
3. In the Activities toolbox, expand Move & Transform . Drag the Copy Data activity from the Activities
toolbox to the pipeline designer surface. You can also search for activities in the Activities toolbox.
Specify CopyFromBlobToBlob for Name .

4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source Dataset .
5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset .
6. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the
pipeline has been successfully validated. To close the validation output, select the Validation button in the
top-right corner.
Debug the pipeline
In this step, you debug the pipeline before deploying it to Data Factory.
1. On the pipeline toolbar above the canvas, click Debug to trigger a test run.
2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings at the
bottom.

3. Confirm that you see an output file in the output folder of the adftutorial container. If the output folder
doesn't exist, the Data Factory service automatically creates it.

Trigger the pipeline manually


In this procedure, you deploy entities (linked services, datasets, pipelines) to Azure Data Factory. Then, you
manually trigger a pipeline run.
1. Before you trigger a pipeline, you must publish entities to Data Factory. To publish, select Publish all on
the top.

2. To trigger the pipeline manually, select Add Trigger on the pipeline toolbar, and then select Trigger
Now . On the Pipeline run page, select OK .

Monitor the pipeline


1. Switch to the Monitor tab on the left. Use the Refresh button to refresh the list.

2. Select the CopyPipeline link. You'll see the status of the copy activity run on this page.
3. To view details about the copy operation, select the Details (eyeglasses image) link. For details about the
properties, see Copy Activity overview.

4. Confirm that you see a new file in the output folder.


5. You can switch back to the Pipeline runs view from the Activity runs view by selecting the All
pipeline runs link.
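If you also want to check run status outside the portal, a small Azure PowerShell sketch follows. It assumes the Az.DataFactory module and uses placeholders for the resource group and data factory names you chose above.

# List pipeline runs from the last hour for the factory (replace the placeholders).
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<your-resource-group>" `
    -DataFactoryName "<your-data-factory>" `
    -LastUpdatedAfter (Get-Date).AddHours(-1) -LastUpdatedBefore (Get-Date)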
Trigger the pipeline on a schedule
This procedure is optional in this tutorial. You can create a scheduler trigger to schedule the pipeline to run
periodically (hourly, daily, and so on). In this procedure, you create a trigger to run every minute until the end
date and time that you specify.
1. Switch to the Author tab.
2. Go to your pipeline, select Add Trigger on the pipeline toolbar, and then select New/Edit .
3. On the Add Triggers page, select Choose trigger , and then select New .
4. On the New Trigger page, under End , select On Date , specify an end time a few minutes after the
current time, and then select OK .
A cost is associated with each pipeline run, so specify the end time only minutes apart from the start
time. Ensure that it's the same day. However, ensure that there's enough time for the pipeline to run
between the publish time and the end time. The trigger comes into effect only after you publish the
solution to Data Factory, not when you save the trigger in the UI.
5. On the New Trigger page, select the Activated check box, and then select OK .

6. Review the warning message, and select OK .


7. Select Publish all to publish changes to Data Factory.
8. Switch to the Monitor tab on the left. Select Refresh to refresh the list. You see that the pipeline runs
once every minute from the publish time to the end time.
Notice the values in the TRIGGERED BY column. The manual trigger run was from the step (Trigger
Now ) that you did earlier.
9. Switch to the Trigger runs view.
10. Confirm that an output file is created for every pipeline run until the specified end date and time in the
output folder.
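If you later want to script this step instead of using the UI, the trigger created above corresponds roughly to a schedule trigger definition like the one below. This is a hedged sketch, not part of the quickstart: the trigger name, file path, and start and end times are placeholders, and it assumes the Az.DataFactory module plus your own resource group and factory names.

# RunEveryMinute.json - a schedule trigger that runs CopyPipeline once a minute until the end time.
@'
{
    "name": "RunEveryMinute",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 1,
                "startTime": "2021-09-01T00:00:00Z",
                "endTime": "2021-09-01T00:10:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "CopyPipeline"
                }
            }
        ]
    }
}
'@ | Set-Content -Path '.\RunEveryMinute.json'

# Publish and start the trigger (replace the placeholders); -Force skips the confirmation prompt.
Set-AzDataFactoryV2Trigger -ResourceGroupName "<your-resource-group>" `
    -DataFactoryName "<your-data-factory>" -Name "RunEveryMinute" `
    -DefinitionFile ".\RunEveryMinute.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "<your-resource-group>" `
    -DataFactoryName "<your-data-factory>" -Name "RunEveryMinute" -Force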

Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn
about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy data

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this quickstart, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from a folder in Azure Blob storage to another folder.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers .
2. On the <Account name> - Containers page's toolbar, select Container .
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.

Add an input folder and file for the blob container


In this section, you create a folder named input in the container you created, and then upload a sample file to
the input folder. Before you begin, open a text editor such as Notepad , and create a file named emp.txt with the
following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading. The page now displays as shown:

5. In the Upload to folder box, enter input .


6. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
7. Select the Close icon (an X ) to close the Upload blob page.
Keep the adftutorial container page open. You use it to verify the output at the end of this quickstart.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure portal.
3. From the Azure portal menu, select Create a resource > Integration > Data Factory :
4. On the New data factory page, enter ADFTutorialDataFactory for Name .
The name of the Azure Data Factory must be globally unique. If you see the following error, change the
name of the data factory (for example, <yourname>ADFTutorialDataFactory ) and try creating again.
For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.
5. For Subscription , select your Azure subscription in which you want to create the data factory.
6. For Resource Group , use one of the following steps:
Select Use existing , and select an existing resource group from the list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. For Version , select V2 .
8. For Location , select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory metadata
will be stored. The associated data stores (like Azure Storage and Azure SQL Database) and computes
(like Azure HDInsight) that Data Factory uses can run in other regions.
9. Select Create .
10. After the creation is complete, you see the Data Factory page. Select Open on the Open Azure Data
Factory Studio tile to start the Azure Data Factory user interface (UI) application on a separate tab.
Start the Copy Data tool
1. On the home page of Azure Data Factory, select the Ingest tile to start the Copy Data tool.

2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type , then select
Next .
3. On the Source data store page, complete the following steps:
a. Click + Create new connection to add a connection.
b. Select the linked service type that you want to create for the source connection. In this tutorial, we
use Azure Blob Storage . Select it from the gallery, and then select Continue .
c. On the New connection (Azure Blob Storage) page, specify a name for your connection. Select
your Azure subscription from the Azure subscription list and your storage account from the
Storage account name list, test connection, and then select Create .
d. Select the newly created connection in the Connection block.
e. In the File or folder section, select Browse to navigate to the adftutorial/input folder, select the
emp.txt file, and then click OK .
f. Select the Binary copy checkbox to copy the file as-is, and then select Next .
4. On the Destination data store page, complete the following steps:
a. Select the AzureBlobStorage connection that you created in the Connection block.
b. In the Folder path section, enter adftutorial/output for the folder path.
c. Leave other settings as default and then select Next .
5. On the Settings page, specify a name for the pipeline and its description, then select Next to use other
default configurations.
6. On the Summary page, review all settings, and select Next .
7. On the Deployment complete page, select Monitor to monitor the pipeline that you created.
8. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select Refresh
to refresh the list. Click the link under Pipeline name to view activity run details or rerun the pipeline.

9. On the Activity runs page, select the Details link (eyeglasses icon) under the Activity name column for
more details about copy operation. For details about the properties, see Copy Activity overview.
10. To go back to the Pipeline Runs view, select the All pipeline runs link in the breadcrumb menu. To
refresh the view, select Refresh .
11. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the output
folder doesn't exist, the Data Factory service automatically creates it.
12. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked services,
datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a data factory by
using the Azure portal.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn
about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure Data Factory using
Azure CLI

This quickstart describes how to use Azure CLI to create an Azure Data Factory. The pipeline you create in this
data factory copies data from one folder to another folder in an Azure Blob Storage. For information on how to
transform data using Azure Data Factory, see Transform data in Azure Data Factory.
For an introduction to the Azure Data Factory service, see Introduction to Azure Data Factory.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Use the Bash environment in Azure Cloud Shell.

If you prefer, install the Azure CLI to run CLI reference commands.
If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish
the authentication process, follow the steps displayed in your terminal. For additional sign-in
options, see Sign in with the Azure CLI.
When you're prompted, install Azure CLI extensions on first use. For more information about
extensions, see Use extensions with the Azure CLI.
Run az version to find the version and dependent libraries that are installed. To upgrade to the
latest version, run az upgrade.

NOTE
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the contributor
or owner role, or an administrator of the Azure subscription. For more information, see Azure roles.

Prepare a container and test file


This quickstart uses an Azure Storage account, which includes a container with a file.
1. To create a resource group named ADFQuickStartRG , use the az group create command:

az group create --name ADFQuickStartRG --location eastus

2. Create a storage account by using the az storage account create command:

az storage account create --resource-group ADFQuickStartRG \
--name adfquickstartstorage --location eastus

3. Create a container named adftutorial by using the az storage container create command:
az storage container create --resource-group ADFQuickStartRG --name adftutorial \
--account-name adfquickstartstorage --auth-mode key

4. In the local directory, create a file named emp.txt to upload. If you're working in Azure Cloud Shell, you
can find the current working directory by using the echo $PWD Bash command. You can use standard
Bash commands, like cat , to create a file:

cat > emp.txt
This is text.

Use Ctrl+D to save your new file.


5. To upload the new file to your Azure storage container, use the az storage blob upload command:

az storage blob upload --account-name adfquickstartstorage --name input/emp.txt \
--container-name adftutorial --file emp.txt --auth-mode key

This command uploads to a new folder named input .

Create a data factory


To create an Azure data factory, run the az datafactory factory create command:

az datafactory factory create --resource-group ADFQuickStartRG \
--factory-name ADFTutorialFactory

IMPORTANT
Replace ADFTutorialFactory with a globally unique data factory name, for example, ADFTutorialFactorySP1127.

You can see the data factory that you created by using the az datafactory factory show command:

az datafactory factory show --resource-group ADFQuickStartRG \
--factory-name ADFTutorialFactory

Create a linked service and datasets


Next, create a linked service and two datasets.
1. Get the connection string for your storage account by using the az storage account show-connection-
string command:

az storage account show-connection-string --resource-group ADFQuickStartRG \
--name adfquickstartstorage --key primary

2. In your working directory, create a JSON file with this content, which includes your own connection string
from the previous step. Name the file AzureStorageLinkedService.json :
{
    "type": "AzureStorage",
    "typeProperties": {
        "connectionString": {
            "type": "SecureString",
            "value": "DefaultEndpointsProtocol=https;AccountName=adfquickstartstorage;AccountKey=K9F4Xk/EhYrMBIR98rtgJ0HRSIDU4eWQILLh2iXo05Xnr145+syIKNczQfORkQ3QIOZAd/eSDsvED19dAwW/tw==;EndpointSuffix=core.windows.net"
        }
    }
}

3. Create a linked service, named AzureStorageLinkedService , by using the az datafactory linked-service
create command:

az datafactory linked-service create --resource-group ADFQuickStartRG \
--factory-name ADFTutorialFactory --linked-service-name AzureStorageLinkedService \
--properties @AzureStorageLinkedService.json

4. In your working directory, create a JSON file with this content, named InputDataset.json :

{
    "linkedServiceName": {
        "type": "LinkedServiceReference",
        "referenceName": "AzureStorageLinkedService"
    },
    "annotations": [],
    "type": "Binary",
    "typeProperties": {
        "location": {
            "type": "AzureBlobStorageLocation",
            "fileName": "emp.txt",
            "folderPath": "input",
            "container": "adftutorial"
        }
    }
}

5. Create an input dataset named InputDataset by using the az datafactory dataset create command:

az datafactory dataset create --resource-group ADFQuickStartRG \
--dataset-name InputDataset --factory-name ADFTutorialFactory \
--properties @InputDataset.json

6. In your working directory, create a JSON file with this content, named OutputDataset.json :
{
    "linkedServiceName": {
        "type": "LinkedServiceReference",
        "referenceName": "AzureStorageLinkedService"
    },
    "annotations": [],
    "type": "Binary",
    "typeProperties": {
        "location": {
            "type": "AzureBlobStorageLocation",
            "fileName": "emp.txt",
            "folderPath": "output",
            "container": "adftutorial"
        }
    }
}

7. Create an output dataset named OutputDataset by using the az datafactory dataset create command:

az datafactory dataset create --resource-group ADFQuickStartRG \
--dataset-name OutputDataset --factory-name ADFTutorialFactory \
--properties @OutputDataset.json

Create and run the pipeline


Finally, create and run the pipeline.
1. In your working directory, create a JSON file with this content named Adfv2QuickStartPipeline.json :
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "InputDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputDataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}

2. Create a pipeline named Adfv2QuickStartPipeline by using the az datafactory pipeline create command:

az datafactory pipeline create --resource-group ADFQuickStartRG \
--factory-name ADFTutorialFactory --name Adfv2QuickStartPipeline \
--pipeline @Adfv2QuickStartPipeline.json

3. Run the pipeline by using the az datafactory pipeline create-run command:

az datafactory pipeline create-run --resource-group ADFQuickStartRG \
--name Adfv2QuickStartPipeline --factory-name ADFTutorialFactory

This command returns a run ID. Copy it for use in the next command.
4. Verify that the pipeline run succeeded by using the az datafactory pipeline-run show command:

az datafactory pipeline-run show --resource-group ADFQuickStartRG \
--factory-name ADFTutorialFactory --run-id 00000000-0000-0000-0000-000000000000

You can also verify that your pipeline ran as expected by using the Azure portal. For more information, see
Review deployed resources.

Clean up resources
All of the resources in this quickstart are part of the same resource group. To remove them all, use the az group
delete command:

az group delete --name ADFQuickStartRG

If you're using this resource group for anything else, instead, delete individual resources. For instance, to remove
the linked service, use the az datafactory linked-service delete command.
In this quickstart, you created the following JSON files:
AzureStorageLinkedService.json
InputDataset.json
OutputDataset.json
Adfv2QuickStartPipeline.json
Delete them by using standard Bash commands.

Next steps
Pipelines and activities in Azure Data Factory
Linked services in Azure Data Factory
Datasets in Azure Data Factory
Transform data in Azure Data Factory
Quickstart: Create an Azure Data Factory using
PowerShell

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This quickstart describes how to use PowerShell to create an Azure Data Factory. The pipeline you create in this
data factory copies data from one folder to another folder in an Azure blob storage. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Transform data using Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers .
2. On the <Account name> - Containers page's toolbar, select Container .
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.

Add an input folder and file for the blob container


In this section, you create a folder named input in the container you created, and then upload a sample file to
the input folder. Before you begin, open a text editor such as Notepad , and create a file named emp.txt with the
following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading. The page now displays as shown:

5. In the Upload to folder box, enter input .


6. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
7. Select the Close icon (an X ) to close the Upload blob page.
Keep the adftutorial container page open. You use it to verify the output at the end of this quickstart.
Azure PowerShell

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
WARNING
If you do not use the latest versions of PowerShell and the Data Factory module, you may run into deserialization errors while
running the commands.
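To check which versions are installed and update them, a quick sketch (assumes the modules were installed from the PowerShell Gallery):

# Show the installed Az and Az.DataFactory versions, then update them if they are out of date.
Get-InstalledModule -Name Az, Az.DataFactory | Format-Table Name, Version
Update-Module -Name Az, Az.DataFactory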

Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If you close
and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use to sign in
to the Azure portal:

Connect-AzAccount

3. Run the following command to view all the subscriptions for this account:

Get-AzSubscription

4. If you see multiple subscriptions associated with your account, run the following command to select the
subscription that you want to work with. Replace SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes,
and then run the command. For example: "ADFQuickStartRG" .

$resourceGroupName = "ADFQuickStartRG";


2. To create the Azure resource group, run the following command:

$ResGrp = New-AzResourceGroup $resourceGroupName -location 'East US'

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.

3. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.

$dataFactoryName = "ADFQuickStartFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:

$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
-Location $ResGrp.Location -Name $dataFactoryName

Note the following points:


The name of the Azure Data Factory must be globally unique. If you receive the following error, change
the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory : Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
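You can also list the regions that currently offer Data Factory directly from PowerShell; a minimal sketch (assumes the Az.Resources module):

# List the locations where the Microsoft.DataFactory/factories resource type is available.
(Get-AzResourceProvider -ProviderNamespace Microsoft.DataFactory).ResourceTypes |
    Where-Object { $_.ResourceTypeName -eq "factories" } |
    Select-Object -ExpandProperty Locations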

Create a linked service


Create linked services in a data factory to link your data stores and compute services to the data factory. In this
quickstart, you create an Azure Storage linked service that is used as both the source and sink stores. The linked
service has the connection information that the Data Factory service uses at runtime to connect to it.

TIP
In this quickstart, you use Account key as the authentication type for your data store, but you can choose other
supported authentication methods: SAS URI, Service Principal, and Managed Identity if needed. Refer to the corresponding
sections in this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key
Vault. Refer to this article for detailed illustrations.
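For example, if you keep the storage connection string in Azure Key Vault, the linked service created in step 1 below can reference the secret instead of embedding the account key. The following is only a sketch of that alternative: it assumes you have already created a Key Vault linked service named AzureKeyVaultLinkedService and stored the connection string in a secret named StorageConnectionString, neither of which is part of this quickstart.

# Alternative content for AzureStorageLinkedService.json that reads the connection string from Key Vault (sketch).
@'
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "StorageConnectionString"
            }
        }
    }
}
'@ | Set-Content -Path 'C:\ADFv2QuickStartPSH\AzureStorageLinkedService.json'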

1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2QuickStartPSH folder with
the following content. (Create the folder ADFv2QuickStartPSH if it does not already exist.)

IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before saving the
file.
{
"name": "AzureStorageLinkedService",
"properties": {
"annotations": [],
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>;EndpointSuffix=core.windows.net"
}
}
}

If you are using Notepad, select All files for the Save as type field in the Save as dialog box.
Otherwise, it may add a .txt extension to the file. For example, AzureStorageLinkedService.json.txt . If
you create the file in File Explorer before opening it in Notepad, you may not see the .txt extension
since the Hide extensions for known file types option is set by default. Remove the .txt extension
before proceeding to the next step.
2. In PowerShell , switch to the ADFv2QuickStartPSH folder.

Set-Location 'C:\ADFv2QuickStartPSH'

3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureStorageLinkedService .

Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "AzureStorageLinkedService" `
-DefinitionFile ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobStorageLinkedService

Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset . These datasets are of type
Binary . They refer to the Azure Storage linked service that you created in the previous section. The input dataset
represents the source data in the input folder. In the input dataset definition, you specify the blob container
(adftutorial ), the folder (input ), and the file (emp.txt ) that contain the source data. The output dataset
represents the data that's copied to the destination. In the output dataset definition, you specify the blob
container (adftutorial ), the folder (output ), and the file to which the data is copied.
1. Create a JSON file named InputDataset.json in the C:\ADFv2QuickStartPSH folder, with the following
content:
{
"name": "InputDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "emp.txt",
"folderPath": "input",
"container": "adftutorial"
}
}
}
}

2. To create the dataset: InputDataset , run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "InputDataset" `
-DefinitionFile ".\InputDataset.json"

Here is the sample output:

DatasetName : InputDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset

3. Repeat the steps to create the output dataset. Create a JSON file named OutputDataset.json in the
C:\ADFv2QuickStartPSH folder, with the following content:

{
"name": "OutputDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"folderPath": "output",
"container": "adftutorial"
}
}
}
}

4. Run the Set-AzDataFactoryV2Dataset cmdlet to create the OutputDataset .


Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -Name "OutputDataset" `
-DefinitionFile ".\OutputDataset.json"

Here is the sample output:

DatasetName : OutputDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset

Create a pipeline
In this procedure, you create a pipeline with a copy activity that uses the input and output datasets. The copy
activity copies data from the file you specified in the input dataset settings to the file you specified in the output
dataset settings.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH folder with
the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "InputDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputDataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}

2. To create the pipeline: Adfv2QuickStartPipeline , run the Set-AzDataFactoryV2Pipeline cmdlet.

$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"

Create a pipeline run


In this step, you create a pipeline run.
Run the Invoke-AzDataFactoryV2Pipeline cmdlet to create a pipeline run. The cmdlet returns the pipeline
run ID for future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name

Monitor the pipeline run


1. Run the following PowerShell script to continuously check the pipeline run status until it finishes copying
the data. Copy/paste the following script in the PowerShell window, and press ENTER.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId

if ($Run) {
if ( ($Run.Status -ne "InProgress") -and ($Run.Status -ne "Queued") ) {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output ("Pipeline is running...status: " + $Run.Status)
}

Start-Sleep -Seconds 10
}

Here is the sample output of pipeline run:

Pipeline is running...status: InProgress


Pipeline run finished. The status is: Succeeded

ResourceGroupName : ADFQuickStartRG
DataFactoryName : ADFQuickStartFactory
RunId : 00000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 8/27/2019 7:23:07 AM
Parameters : {}
RunStart : 8/27/2019 7:22:56 AM
RunEnd : 8/27/2019 7:23:07 AM
DurationInMs : 11324
Status : Succeeded
Message :

2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.

Write-Output "Activity run details:"
$Result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName -PipelineRunId $RunId `
-RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$Result

Write-Output "Activity 'Output' section:"
$Result.Output -join "`r`n"

Write-Output "Activity 'Error' section:"
$Result.Error -join "`r`n"
3. Confirm that you see the output similar to the following sample output of activity run result:

ResourceGroupName : ADFQuickStartRG
DataFactoryName : ADFQuickStartFactory
ActivityRunId : 00000000-0000-0000-0000-000000000000
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink, enableStaging}
Output : {dataRead, dataWritten, filesRead, filesWritten...}
LinkedServiceName :
ActivityRunStart : 8/27/2019 7:22:58 AM
ActivityRunEnd : 8/27/2019 7:23:05 AM
DurationInMs : 6828
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity 'Output' section:


"dataRead": 20
"dataWritten": 20
"filesRead": 1
"filesWritten": 1
"sourcePeakConnections": 1
"sinkPeakConnections": 1
"copyDuration": 4
"throughput": 0.01
"errors": []
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (Central US)"
"usedDataIntegrationUnits": 4
"usedParallelCopies": 1
"executionDetails": [
{
"source": {
"type": "AzureBlobStorage"
},
"sink": {
"type": "AzureBlobStorage"
},
"status": "Succeeded",
"start": "2019-08-27T07:22:59.1045645Z",
"duration": 4,
"usedDataIntegrationUnits": 4,
"usedParallelCopies": 1,
"detailedDurations": {
"queuingDuration": 3,
"transferDuration": 1
}
}
]

Activity 'Error' section:


"errorCode": ""
"message": ""
"failureType": ""
"target": "CopyFromBlobToBlob"

Review deployed resources


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies the emp.txt
file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page, select Refresh to see the output folder.
2. Select output in the folder list.
3. Confirm that the emp.txt is copied to the output folder.
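You can also confirm the copy from PowerShell rather than the portal; a small sketch (assumes the Az.Storage module and placeholders for your storage account's resource group and name):

# List the blobs under the output folder of the adftutorial container (replace the placeholders).
$ctx = (Get-AzStorageAccount -ResourceGroupName "<your-resource-group>" `
    -Name "<yourstorageaccount>").Context
Get-AzStorageBlob -Container "adftutorial" -Prefix "output/" -Context $ctx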

Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following command to
delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

NOTE
Dropping a resource group may take some time. Please be patient with the process.

If you want to delete just the data factory, not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline using
.NET SDK

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This quickstart describes how to use .NET SDK to create an Azure Data Factory. The pipeline you create in this
data factory copies data from one folder to another folder in an Azure blob storage. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Transform data using Spark.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers .
2. On the <Account name> - Containers page's toolbar, select Container .
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.

Add an input folder and file for the blob container


In this section, you create a folder named input in the container you created, and then upload a sample file to
the input folder. Before you begin, open a text editor such as Notepad , and create a file named emp.txt with the
following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services , then select Storage > Storage accounts . You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading. The page now displays as shown:

5. In the Upload to folder box, enter input .


6. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
7. Select the Close icon (an X ) to close the Upload blob page.
Keep the adftutorial container page open. You use it to verify the output at the end of this quickstart.
Visual Studio
The walkthrough in this article uses Visual Studio 2019. The procedures for Visual Studio 2013, 2015, or 2017
differ slightly.

Create an application in Azure Active Directory


From the sections in How to: Use the portal to create an Azure AD application and service principal that can
access resources, follow the instructions to do these tasks:
1. In Create an Azure Active Directory application, create an application that represents the .NET application you
are creating in this tutorial. For the sign-on URL, you can provide a dummy URL as shown in the article (
https://contoso.org/exampleapp ).
2. In Get values for signing in, get the application ID and tenant ID , and note down these values that you use
later in this tutorial.
3. In Certificates and secrets, get the authentication key , and note down this value that you use later in this
tutorial.
4. In Assign the application to a role, assign the application to the Contributor role at the subscription level so
that the application can create data factories in the subscription.
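If you would rather script these Azure AD steps than click through the portal, the following Azure PowerShell sketch outlines the idea. It is only a sketch: the display name is a placeholder, property names differ between Az module versions, and the generated client secret is returned on the service principal object (check your Az.Resources documentation for the exact property name).

# Create a service principal for the quickstart app (assumes the Az.Resources module).
$sp = New-AzADServicePrincipal -DisplayName "ADFv2QuickStartApp"

# Application (client) ID and tenant ID that Program.cs needs later.
$sp.AppId
(Get-AzContext).Tenant.Id

# Give the application the Contributor role on the subscription so it can create data factories.
New-AzRoleAssignment -ApplicationId $sp.AppId -RoleDefinitionName "Contributor" `
    -Scope ("/subscriptions/" + (Get-AzContext).Subscription.Id)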

Create a Visual Studio project


Next, create a C# .NET console application in Visual Studio:
1. Launch Visual Studio .
2. In the Start window, select Create a new project > Console App (.NET Framework) . .NET version 4.5.2
or above is required.
3. In Project name , enter ADFv2QuickStart .
4. Select Create to create the project.

Install NuGet packages


1. Select Tools > NuGet Package Manager > Package Manager Console .
2. In the Package Manager Console pane, run the following commands to install packages. For more
information, see the Microsoft.Azure.Management.DataFactory nuget package.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -IncludePrerelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs , include the following statements to add references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Rest.Serialization;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets the variables. Replace the placeholders with your
own values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory : Products
available by region. The data stores (Azure Storage, Azure SQL Database, and more) and computes
(HDInsight and others) used by data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "<the location of your resource group>";
string dataFactoryName =
"<specify the name of data factory to create. It must be globally unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// specify the container and input folder from which all files
// need to be copied to the output folder.
string inputBlobPath =
"<path to existing blob(s) to copy data from, e.g. containername/inputdir>";
// specify the container and output folder where the files are copied
string outputBlobPath =
"<the blob path to copy data to, e.g. containername/outputdir>";

// name of the Azure Storage linked service, blob dataset, and the pipeline
string storageLinkedServiceName = "AzureStorageLinkedService";
string blobDatasetName = "BlobDataset";
string pipelineName = "Adfv2QuickStartPipeline";

NOTE
For Sovereign clouds, you must use the appropriate cloud-specific endpoints for ActiveDirectoryAuthority and
ResourceManagerUrl (BaseUri). For example, in US Azure Gov you would use authority of https://login.microsoftonline.us
instead of https://login.microsoftonline.com, and use https://management.usgovcloudapi.net instead of
https://management.azure.com/, and then create the data factory management client. You can use Powershell to easily
get the endpoint Urls for various clouds by executing “Get-AzEnvironment | Format-List”, which will return a list of
endpoints for each cloud environment.

3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked service,
datasets, and a pipeline. You also use this object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.microsoftonline.com/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync(
"https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) {
SubscriptionId = subscriptionId };
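
If you're targeting a sovereign cloud, the same code can be pointed at the cloud-specific endpoints described in the note above. The following is a minimal sketch only: the endpoint values shown assume the Azure US Government cloud, and it assumes the BaseUri property that the Azure management SDK clients expose is used to override the Resource Manager endpoint.

// Sketch: authenticate and create the client against a sovereign cloud (Azure US Government shown)
var context = new AuthenticationContext("https://login.microsoftonline.us/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync(
    "https://management.usgovcloudapi.net/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred)
{
    SubscriptionId = subscriptionId,
    // Assumption: override the default Resource Manager endpoint for this cloud
    BaseUri = new Uri("https://management.usgovcloudapi.net")
};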

Create a data factory


Add the following code to the Main method that creates a data factory.
// Create a data factory
Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
Location = region,
Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);
Console.WriteLine(
SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings));

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState ==


"PendingCreation")
{
System.Threading.Thread.Sleep(1000);
}

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data factory. In
this Quickstart, you only need to create one Azure Storage linked service for both the copy source and the sink
store; it's named "AzureStorageLinkedService" in the sample.

// Create an Azure Storage linked service


Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(


new AzureStorageLinkedService
{
ConnectionString = new SecureString(
"DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey)
}
);
client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(
storageLinkedService, client.SerializationSettings));

Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, this Blob dataset
references the Azure Storage linked service you created in the previous step. The dataset takes a parameter
whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath"
pointing to where the data resides.
// Create an Azure Blob dataset
Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = new Expression { Value = "@{dataset().path}" },
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "path", new ParameterSpecification { Type = ParameterType.String } }
}
}
);
client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(
SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, this pipeline contains one activity and takes two parameters: the input blob path and the output
blob path. The values for these parameters are set when the pipeline is triggered or run. The copy activity refers to
the same blob dataset created in the previous step as both input and output. When the dataset is used as an input
dataset, the input path is specified; when it's used as an output dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.
This code also sets values of the inputPath and outputPath parameters specified in the pipeline with the actual
values of the source and sink blob paths.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> parameters = new Dictionary<string, object>
{
{ "inputPath", inputBlobPath },
{ "outputPath", outputBlobPath }
};
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(
resourceGroup, dataFactoryName, pipelineName, parameters: parameters
).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);
Monitor a pipeline run
1. Add the following code to the Main method to continuously check the status until it finishes copying the
data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(
resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress" || pipelineRun.Status == "Queued")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code to the Main method that retrieves copy activity run details, such as the size of the
data that's read or written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

RunFilterParameters filterParams = new RunFilterParameters(


DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating the data factory, linked service, datasets, pipeline, and pipeline run. It
then checks the pipeline run status. Wait until you see the copy activity run details with the size of the read/write
data. Then use tools such as Azure Storage Explorer to check that the blobs were copied to "outputBlobPath" from
"inputBlobPath" as you specified in the variables.
Sample output

Creating data factory SPv2Factory0907...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"value": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>",
"type": "SecureString"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
Creating pipeline Adfv2QuickStartPipeline...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"name": "CopyFromBlobToBlob"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 308d222d-3858-48b1-9e66-acd921feaa09
Checking pipeline run status...
Status: InProgress
Status: InProgress
Checking copy activity run details...
{
"dataRead": 331452208,
"dataWritten": 331452208,
"copyDuration": 23,
"throughput": 14073.209,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 23
}

Press any key to exit...

Verify the output


The pipeline automatically creates the output folder in the adftutorial blob container. Then, it copies the
emp.txt file from the input folder to the output folder.
1. In the Azure portal, on the adftutorial container page that you stopped at in the Add an input folder and file
for the blob container section above, select Refresh to see the output folder.
2. In the folder list, select output .
3. Confirm that the emp.txt is copied to the output folder.
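
If you'd rather verify the copy programmatically, the following is a minimal sketch using the Azure.Storage.Blobs NuGet package (an extra dependency not installed by this quickstart); it parses the container and folder from the outputBlobPath variable and lists the copied blobs.

// Sketch: list blobs under the output path (requires the Azure.Storage.Blobs package)
// using Azure.Storage.Blobs;
string[] outputParts = outputBlobPath.Split('/'); // e.g. "containername/outputdir"
var containerClient = new BlobContainerClient(
    "DefaultEndpointsProtocol=https;AccountName=" + storageAccount + ";AccountKey=" + storageKey,
    outputParts[0]);
foreach (var blob in containerClient.GetBlobs(prefix: outputParts[1]))
{
    Console.WriteLine("Copied blob: " + blob.Name);
}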

Clean up resources
To programmatically delete the data factory, add the following lines of code to the program:

Console.WriteLine("Deleting the data factory");


client.Factories.Delete(resourceGroup, dataFactoryName);

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline using
Python
5/28/2021 • 10 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this quickstart, you create a data factory by using Python. The pipeline in this data factory copies data from
one folder to another folder in Azure Blob storage.
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for
orchestrating and automating data movement and data transformation. Using Azure Data Factory, you can
create and schedule data-driven workflows, called pipelines.
Pipelines can ingest data from disparate data stores. Pipelines process or transform data by using compute
services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
Pipelines publish output data to data stores such as Azure Synapse Analytics for business intelligence (BI)
applications.

Prerequisites
An Azure account with an active subscription. Create one for free.
Python 3.6+.
An Azure Storage account.
Azure Storage Explorer (optional).
An application in Azure Active Directory. Create the application by following the steps in this link, using
Authentication Option 2 (application secret), and assign the application to the Contributor role by
following instructions in the same article. Make note of the following values as shown in the article to use
in later steps: Application (client) ID, client secret value, and tenant ID.

Create and upload an input file


1. Launch Notepad. Copy the following text and save it as a file named input.txt on your disk.

John|Doe
Jane|Doe

2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and input folder in the
container. Then, upload the input.txt file to the input folder.

Install the Python package


1. Open a terminal or command prompt with administrator privileges.
2. First, install the Python package for Azure management resources:

pip install azure-mgmt-resource


3. To install the Python package for Data Factory, run the following command:

pip install azure-mgmt-datafactory

The Python SDK for Data Factory supports Python 2.7 and 3.6+.
4. To install the Python package for Azure Identity authentication, run the following command:

pip install azure-identity

NOTE
The "azure-identity" package might have conflicts with "azure-cli" on some common dependencies. If you run into
any authentication issues, remove "azure-cli" and its dependencies, or use a clean machine without the
"azure-cli" package installed to make it work. For Sovereign clouds, you must use the appropriate cloud-specific constants.
Refer to Connect to all regions using Azure libraries for Python Multi-cloud | Microsoft Docs for instructions
on connecting with Python in Sovereign clouds.
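
As a sketch only (the endpoint values shown assume the Azure US Government cloud; substitute the constants that apply to your cloud), the credential and client created later in this quickstart could be pointed at a sovereign cloud like this:

# Sketch: authenticate against a sovereign cloud (Azure US Government shown)
from azure.identity import AzureAuthorityHosts, ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credentials = ClientSecretCredential(
    tenant_id='<tenant ID>',
    client_id='<Application (client) ID>',
    client_secret='<client secret value>',
    authority=AzureAuthorityHosts.AZURE_GOVERNMENT)
adf_client = DataFactoryManagementClient(
    credentials,
    '<subscription ID>',
    base_url='https://management.usgovcloudapi.net',
    credential_scopes=['https://management.usgovcloudapi.net/.default'])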

Create a data factory client


1. Create a file named datafactory.py. Add the following statements to add references to namespaces.

from azure.identity import ClientSecretCredential


from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

2. Add the following functions that print information.


def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)

def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")

def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))

3. Add the following code to the Main method that creates an instance of the DataFactoryManagementClient
class. You use this object to create the data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details. Set the subscription_id variable to the ID of your Azure
subscription. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products
available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight,
etc.) used by data factory can be in other regions.

def main():

# Azure subscription ID
subscription_id = '<subscription ID>'

# This program creates this resource group. If it's an existing resource group,
# comment out the code that creates the resource group
rg_name = '<resource group>'

# The data factory name. It must be globally unique.


df_name = '<factory name>'

# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ClientSecretCredential(client_id='<Application (client) ID>',
client_secret='<client secret value>', tenant_id='<tenant ID>')

# Specify the following for Sovereign clouds: import the right cloud constant and then use it to connect.
# from msrestazure.azure_cloud import AZURE_PUBLIC_CLOUD as CLOUD
# credentials = DefaultAzureCredential(authority=CLOUD.endpoints.active_directory,
#     tenant_id=tenant_id)

resource_client = ResourceManagementClient(credentials, subscription_id)


adf_client = DataFactoryManagementClient(credentials, subscription_id)

rg_params = {'location':'westus'}
df_params = {'location':'westus'}
Create a data factory
Add the following code to the Main method that creates a data factory. If your resource group already exists,
comment out the first create_or_update statement.

# create the resource group


# comment out if the resource group already exits
resource_client.resource_groups.create_or_update(rg_name, rg_params)

#Create a data factory


df_resource = Factory(location='westus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)
while df.provisioning_state != 'Succeeded':
df = adf_client.factories.get(rg_name, df_name)
time.sleep(1)

Create a linked service


Add the following code to the Main method that creates an Azure Storage linked service.
You create linked services in a data factory to link your data stores and compute services to the data factory. In
this quickstart, you only need to create one Azure Storage linked service as both the copy source and sink store,
named "AzureStorageLinkedService" in the sample. Replace <storageaccountname> and <storageaccountkey>
with the name and key of your Azure Storage account.

# Create an Azure Storage linked service


ls_name = 'storageLinkedService001'

# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString(value='DefaultEndpointsProtocol=https;AccountName=<account
name>;AccountKey=<account key>;EndpointSuffix=<suffix>')

ls_azure_storage =
LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_string))
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)

Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about properties
of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step.
# Create an Azure blob dataset (input)
ds_name = 'ds_in'
ds_ls = LinkedServiceReference(reference_name=ls_name)
blob_path = '<container>/<folder path>'
blob_filename = '<file name>'
ds_azure_blob = DatasetResource(properties=AzureBlobDataset(
linked_service_name=ds_ls, folder_path=blob_path, file_name=blob_filename))
ds = adf_client.datasets.create_or_update(
rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)

Create a dataset for sink Azure Blob


Add the following code to the Main method that creates an Azure blob dataset. For information about properties
of Azure Blob dataset, see Azure blob connector article.
You define a dataset that represents the sink (output) data in Azure Blob. This Blob dataset refers to the Azure
Storage linked service you created in the previous step.

# Create an Azure blob dataset (output)


dsOut_name = 'ds_out'
output_blobpath = '<container>/<folder path>'
dsOut_azure_blob = DatasetResource(properties=AzureBlobDataset(linked_service_name=ds_ls,
folder_path=output_blobpath))
dsOut = adf_client.datasets.create_or_update(
rg_name, df_name, dsOut_name, dsOut_azure_blob)
print_item(dsOut)

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity .

# Create a copy activity


act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=act_name,inputs=[dsin_ref], outputs=[dsOut_ref], source=blob_source,
sink=blob_sink)

#Create a pipeline with the copy activity

# Note1: To pass parameters to the pipeline, add them to the json string params_for_pipeline shown below
#        in the format { "ParameterName1": "ParameterValue1" } for each of the parameters needed in the pipeline.
# Note2: To pass parameters to a dataflow, create a pipeline parameter to hold the parameter name/value,
#        and then consume the pipeline parameter in the dataflow parameter in the format
#        @pipeline().parameters.parametername.

p_name = 'copyPipeline'
params_for_pipeline = {}

p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.

# Create a pipeline run


run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name, parameters={})

Monitor a pipeline run


To monitor the pipeline run, add the following code to the Main method:

# Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(
rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
filter_params = RunFilterParameters(
last_updated_after=datetime.now() - timedelta(1), last_updated_before=datetime.now() + timedelta(1))
query_response = adf_client.activity_runs.query_by_pipeline_run(
rg_name, df_name, pipeline_run.run_id, filter_params)
print_activity_run_details(query_response.value[0])
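
The code above waits a fixed 30 seconds and then checks the run once. As an optional variation (not part of the original sample), you can instead poll until the run reaches a terminal state, reusing the same pipeline_runs.get call:

# Optional sketch: poll the pipeline run until it reaches a terminal state
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
while pipeline_run.status in ('Queued', 'InProgress'):
    time.sleep(15)
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))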

Now, add the following statement to invoke the main method when the program is run:

# Start the main method


main()

Full script
Here is the full Python code:

from azure.identity import ClientSecretCredential


from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)

def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")

def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))

def main():

# Azure subscription ID
subscription_id = '<subscription ID>'

# This program creates this resource group. If it's an existing resource group,
# comment out the code that creates the resource group
rg_name = '<resource group>'

# The data factory name. It must be globally unique.


df_name = '<factory name>'

# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ClientSecretCredential(client_id='<service principal ID>', client_secret='<service
principal key>', tenant_id='<tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)

rg_params = {'location':'westus'}
df_params = {'location':'westus'}

# create the resource group


# comment out if the resource group already exits
resource_client.resource_groups.create_or_update(rg_name, rg_params)

# Create a data factory


df_resource = Factory(location='westus')
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print_item(df)
while df.provisioning_state != 'Succeeded':
df = adf_client.factories.get(rg_name, df_name)
time.sleep(1)

# Create an Azure Storage linked service


ls_name = 'storageLinkedService001'

# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString(value='DefaultEndpointsProtocol=https;AccountName=<account
name>;AccountKey=<account key>;EndpointSuffix=<suffix>')

ls_azure_storage =
LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_string))
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)

# Create an Azure blob dataset (input)


ds_name = 'ds_in'
ds_ls = LinkedServiceReference(reference_name=ls_name)
blob_path = '<container>/<folder path>'
blob_filename = '<file name>'
ds_azure_blob = DatasetResource(properties=AzureBlobDataset(
linked_service_name=ds_ls, folder_path=blob_path, file_name=blob_filename))
ds = adf_client.datasets.create_or_update(
rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)

# Create an Azure blob dataset (output)


dsOut_name = 'ds_out'
output_blobpath = '<container>/<folder path>'
dsOut_azure_blob = DatasetResource(properties=AzureBlobDataset(linked_service_name=ds_ls,
folder_path=output_blobpath))
dsOut = adf_client.datasets.create_or_update(
rg_name, df_name, dsOut_name, dsOut_azure_blob)
print_item(dsOut)

# Create a copy activity


act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=act_name, inputs=[dsin_ref], outputs=[
dsOut_ref], source=blob_source, sink=blob_sink)

# Create a pipeline with the copy activity


p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(
activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

# Create a pipeline run


run_response = adf_client.pipelines.create_run(rg_name, df_name, p_name, parameters={})

# Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(
rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
filter_params = RunFilterParameters(
last_updated_after=datetime.now() - timedelta(1), last_updated_before=datetime.now() + timedelta(1))
query_response = adf_client.activity_runs.query_by_pipeline_run(
rg_name, df_name, pipeline_run.run_id, filter_params)
print_activity_run_details(query_response.value[0])

# Start the main method


main()

Run the code


Build and start the application, then verify the pipeline execution.
The console prints the progress of creating the data factory, linked service, datasets, pipeline, and pipeline run. Wait
until you see the copy activity run details with the data read/written size. Then, use tools such as Azure Storage
Explorer to check that the blob was copied to the output path (output_blobpath) from the input path (blob_path)
as you specified in the variables.
Here is the sample output:
Name: <data factory name>
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>
Location: eastus
Tags: {}

Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/linkedservices/storageLinkedService

Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in

Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out

Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline

Pipeline run status: Succeeded


Datetime with no tzinfo will be considered UTC.
Datetime with no tzinfo will be considered UTC.

Activity run details

Activity run status: Succeeded


Number of bytes read: 18
Number of bytes written: 18
Copy duration: 4

Clean up resources
To delete the data factory, add the following code to the program:

adf_client.factories.delete(rg_name, df_name)

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
6/20/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in
the cloud for orchestrating and automating data movement and data transformation. Using Azure Data Factory,
you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores, process/transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning, and publish output data to data stores such as Azure Synapse
Analytics for business intelligence (BI) applications to consume.
This quickstart describes how to use REST API to create an Azure data factory. The pipeline in this data factory
copies data from one location to another location in an Azure blob storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure subscription. If you don't have a subscription, you can create a free trial account.
Azure Storage account. You use the blob storage as source and sink data store. If you don't have an
Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob Storage, create an input folder in the container, and upload some files to
the folder. You can use tools such as Azure Storage Explorer to connect to Azure Blob storage, create a blob
container, upload the input file, and verify the output file.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell. This
quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory following this instruction. Make note of the following
values that you use in later steps: application ID, clientSecrets, and tenant ID. Assign the application to
the Contributor role.

NOTE
For Sovereign clouds, you must use the appropriate cloud-specific endpoints for ActiveDirectoryAuthority and
ResourceManagerUrl (BaseUri). You can use PowerShell to easily get the endpoint URLs for various clouds by executing
"Get-AzEnvironment | Format-List", which returns a list of endpoints for each cloud environment.
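
For example, a minimal sketch (the AzureUSGovernment environment name is used as an assumption; pick the environment that applies to you) to capture those endpoints into variables:

# Sketch: look up the endpoints for a specific cloud environment
$environment = Get-AzEnvironment -Name AzureUSGovernment
$activeDirectoryAuthority = $environment.ActiveDirectoryAuthority   # use instead of https://login.microsoftonline.com
$resourceManagerUrl = $environment.ResourceManagerUrl               # use instead of https://management.azure.com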

Set global variables


1. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen it,
you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the following commands after replacing the placeholders with your own values, to set global
variables to be used in later steps.

$tenantID = "<your tenant ID>"


$appId = "<your application ID>"
$clientSecrets = "<your clientSecrets for the application>"
$subscriptionId = "<your subscription ID to create the factory>"
$resourceGroupName = "<your resource group to create the factory>"
$factoryName = "<specify the name of data factory to create. It must be globally unique.>"
$apiVersion = "2018-06-01"

Authenticate with Azure AD


Run the following commands to authenticate with Azure Active Directory (AAD):

$AuthContext =
[Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonline.com/${
tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -ArgumentList
($appId, $clientSecrets)
$result = $AuthContext.AcquireTokenAsync("https://management.core.windows.net/",
$cred).GetAwaiter().GetResult()
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}

Create a data factory


Run the following commands to create a data factory:
$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}?api-version=${apiVersion}"
$body = @"
{
"name": "$factoryName",
"location": "East US",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

Data factory name "ADFv2QuickStartDataFactory" is not available.

For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
Here is the sample response:

{
"name":"<dataFactoryName>",
"identity":{
"type":"SystemAssigned",
"principalId":"<service principal ID>",
"tenantId":"<tenant ID>"
},

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>",
"type":"Microsoft.DataFactory/factories",
"properties":{
"provisioningState":"Succeeded",
"createTime":"2019-09-03T02:10:27.056273Z",
"version":"2018-06-01"
},
"eTag":"\"0200c876-0000-0100-0000-5d6dcb930000\"",
"location":"East US",
"tags":{

}
}

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this quickstart, you only need to create one Azure Storage linked service as both the copy source and sink store,
named "AzureStorageLinkedService" in the sample.
Run the following commands to create a linked service named AzureStorageLinkedService:
Replace <accountName> and <accountKey> with the name and key of your Azure storage account before
executing the commands.

$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}/linkedservices/AzureStorageLinkedService?api-
version=${apiVersion}"
$body = @"
{
"name":"AzureStorageLinkedService",
"properties":{
"annotations":[

],
"type":"AzureBlobStorage",
"typeProperties":{
"connectionString":"DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name":"AzureStorageLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[

],
"type":"AzureBlobStorage",
"typeProperties":{
"connectionString":"DefaultEndpointsProtocol=https;AccountName=<accountName>;"
}
},
"etag":"07011a57-0000-0100-0000-5d6e14a20000"
}

Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, you create two
datasets: InputDataset and OutputDataset. They refer to the Azure Storage linked service that you created in the
previous section. The input dataset represents the source data in the input folder. In the input dataset definition,
you specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition, you
specify the blob container (adftutorial), the folder (output), and the file to which the data is copied.
Create InputDataset
$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}/datasets/InputDataset?api-version=${apiVersion}"
$body = @"
{
"name":"InputDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"emp.txt",
"folderPath":"input",
"container":"adftutorial"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/datasets/InputDataset",
"name":"InputDataset",
"type":"Microsoft.DataFactory/factories/datasets",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":"@{type=AzureBlobStorageLocation; fileName=emp.txt; folderPath=input;
container=adftutorial}"
}
},
"etag":"07011c57-0000-0100-0000-5d6e14b40000"
}

Create OutputDataset
$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}/datasets/OutputDataset?api-version=${apiVersion}"
$body = @"
{
"name":"OutputDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":"output",
"container":"adftutorial"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/datasets/OutputDataset",
"name":"OutputDataset",
"type":"Microsoft.DataFactory/factories/datasets",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":"@{type=AzureBlobStorageLocation; folderPath=output; container=adftutorial}"
}
},
"etag":"07013257-0000-0100-0000-5d6e18920000"
}

Create pipeline
In this example, this pipeline contains one Copy activity. The Copy activity refers to the "InputDataset" and the
"OutputDataset" created in the previous step as input and output.
$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}/pipelines/Adfv2QuickStartPipeline?api-version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "InputDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputDataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json

Here is the sample output:


{

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/pipelines/Adfv2QuickStartPipeline",
"name":"Adfv2QuickStartPipeline",
"type":"Microsoft.DataFactory/factories/pipelines",
"properties":{
"activities":[
"@{name=CopyFromBlobToBlob; type=Copy; dependsOn=System.Object[]; policy=;
userProperties=System.Object[]; typeProperties=; inputs=System.Object[]; outputs=System.Object[]}"
],
"annotations":[

]
},
"etag":"07012057-0000-0100-0000-5d6e14c00000"
}

Create pipeline run


In this step, you trigger a pipeline run. The pipeline run ID returned in the response body is used in the
monitoring API later.

$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/
Microsoft.DataFactory/factories/${factoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-
version=${apiVersion}"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId

Here is the sample output:

{
"runId":"04a2bb9a-71ea-4c31-b46e-75276b61bafc"
}
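
This quickstart's pipeline defines no parameters, so no parameter values need to be passed. If your pipeline did define parameters, a sketch of passing their values in the createRun request body might look like the following (the parameter names shown are hypothetical):

# Sketch: pass pipeline parameter values when creating the run (only if your pipeline defines them)
$runBody = @"
{
    "inputPath": "adftutorial/input",
    "outputPath": "adftutorial/output"
}
"@
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $runBody
$runId = $response.runId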

Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.

$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/pro
viders/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}?api-
version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"

if ( ($response.Status -eq "InProgress") -or ($response.Status -eq "Queued") ) {


Start-Sleep -Seconds 15
}
else {
$response | ConvertTo-Json
break
}
}

Here is the sample output:


{
"runId":"04a2bb9a-71ea-4c31-b46e-75276b61bafc",
"debugRunId":null,
"runGroupId":"04a2bb9a-71ea-4c31-b46e-75276b61bafc",
"pipelineName":"Adfv2QuickStartPipeline",
"parameters":{

},
"invokedBy":{
"id":"2bb3938176ee43439752475aa12b2251",
"name":"Manual",
"invokedByType":"Manual"
},
"runStart":"2019-09-03T07:22:47.0075159Z",
"runEnd":"2019-09-03T07:22:57.8862692Z",
"durationInMs":10878,
"status":"Succeeded",
"message":"",
"lastUpdated":"2019-09-03T07:22:57.8862692Z",
"annotations":[

],
"runDimension":{

},
"isLatest":true
}

2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.

$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/pro
viders/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}/queryActivityruns?api-
version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-
Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader
$response | ConvertTo-Json

Here is the sample output:


{
"value":[
{
"activityRunEnd":"2019-09-03T07:22:56.6498704Z",
"activityName":"CopyFromBlobToBlob",
"activityRunStart":"2019-09-03T07:22:49.0719311Z",
"activityType":"Copy",
"durationInMs":7577,
"retryAttempt":null,
"error":"@{errorCode=; message=; failureType=; target=CopyFromBlobToBlob}",
"activityRunId":"32951886-814a-4d6b-b82b-505936e227cc",
"iterationHash":"",
"input":"@{source=; sink=; enableStaging=False}",
"linkedServiceName":"",
"output":"@{dataRead=20; dataWritten=20; filesRead=1; filesWritten=1;
sourcePeakConnections=1; sinkPeakConnections=1; copyDuration=4; throughput=0.01;
errors=System.Object[]; effectiveIntegrationRuntime=DefaultIntegrationRuntime (Central US);
usedDataIntegrationUnits=4; usedParallelCopies=1; executionDetails=System.Object[]}",
"userProperties":"",
"pipelineName":"Adfv2QuickStartPipeline",
"pipelineRunId":"04a2bb9a-71ea-4c31-b46e-75276b61bafc",
"status":"Succeeded",
"recoveryStatus":"None",
"integrationRuntimeNames":"defaultintegrationruntime",
"executionDetails":"@{integrationRuntime=System.Object[]}"
}
]
}

Verify the output


Use Azure Storage Explorer to check that the file was copied from the input folder to the output folder of the
adftutorial container, as specified by the datasets you created earlier.
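
If you prefer PowerShell over Storage Explorer, the following is a minimal sketch with the Az.Storage cmdlets (substitute your own storage account name and key) that lists the copied blob:

# Sketch: list the blobs in the output folder with Az.Storage
$storageContext = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "output" -Context $storageContext | Select-Object Name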

Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Run the following command to delete only the data factory:

Remove-AzDataFactoryV2 -Name "<NameOfYourDataFactory>" -ResourceGroupName "<NameOfResourceGroup>"

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure Data Factory using
ARM template
7/7/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This quickstart describes how to use an Azure Resource Manager template (ARM template) to create an Azure
data factory. The pipeline you create in this data factory copies data from one folder to another folder in an
Azure blob storage. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Transform
data using Spark.
An ARM template is a JavaScript Object Notation (JSON) file that defines the infrastructure and configuration for
your project. The template uses declarative syntax. In declarative syntax, you describe your intended deployment
without writing the sequence of programming commands to create the deployment.

NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.

If your environment meets the prerequisites and you're familiar with using ARM templates, select the Deploy to
Azure button. The template will open in the Azure portal.

Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Create a file
Open a text editor such as Notepad, and create a file named emp.txt with the following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.)
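
If you prefer to create the folder and file from PowerShell instead of Notepad, a minimal sketch:

# Sketch: create the folder and the emp.txt sample file from PowerShell
New-Item -ItemType Directory -Path "C:\ADFv2QuickStartPSH" -Force | Out-Null
Set-Content -Path "C:\ADFv2QuickStartPSH\emp.txt" -Value "John, Doe`r`nJane, Doe"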

Review template
The template used in this quickstart is from Azure Quickstart Templates.

{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"metadata": {
"_generator": {
"name": "bicep",
"version": "0.4.1.14562",
"templateHash": "8367564219536411224"
}
},
"parameters": {
"dataFactoryName": {
"type": "string",
"defaultValue": "[format('datafactory{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Data Factory Name"
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location of the data factory."
}
},
"storageAccountName": {
"type": "string",
"defaultValue": "[format('storage{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"blobContainerName": {
"type": "string",
"defaultValue": "[format('blob{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
}
},
"functions": [],
"variables": {
"dataFactoryLinkedServiceName": "ArmtemplateStorageLinkedService",
"dataFactoryDataSetInName": "ArmtemplateTestDatasetIn",
"dataFactoryDataSetOutName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline"
},
"resources": [
{
"type": "Microsoft.Storage/storageAccounts",
"apiVersion": "2021-04-01",
"name": "[parameters('storageAccountName')]",
"location": "[parameters('location')]",
"sku": {
"name": "Standard_LRS"
},
"kind": "StorageV2"
},
{
"type": "Microsoft.Storage/storageAccounts/blobServices/containers",
"apiVersion": "2021-04-01",
"name": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories",
"apiVersion": "2018-06-01",
"name": "[parameters('dataFactoryName')]",
"location": "[parameters('location')]",
"identity": {
"type": "SystemAssigned"
}
},
{
"type": "Microsoft.DataFactory/factories/linkedservices",
"type": "Microsoft.DataFactory/factories/linkedservices",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "[format('DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}',
parameters('storageAccountName'), listKeys(resourceId('Microsoft.Storage/storageAccounts',
parameters('storageAccountName')), '2021-04-01').keys[0].value)]"
}
},
"dependsOn": [
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/datasets",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('dataFactoryDataSetInName'))]",
"properties": {
"linkedServiceName": {
"referenceName": "[variables('dataFactoryLinkedServiceName')]",
"type": "LinkedServiceReference"
},
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"folderPath": "input",
"fileName": "emp.txt"
}
}
},
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts/blobServices/containers',
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[0],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[1],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')
[2])]",
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/linkedservices', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/datasets",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('dataFactoryDataSetOutName'))]",
"properties": {
"linkedServiceName": {
"referenceName": "[variables('dataFactoryLinkedServiceName')]",
"type": "LinkedServiceReference"
},
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"folderPath": "output"
}
}
},
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts/blobServices/containers',
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[0],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[1],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')
[2])]",
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/linkedservices', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/pipelines",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('pipelineName'))]",
"properties": {
"activities": [
{
"name": "MyCopyActivity",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriterSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "[variables('dataFactoryDataSetInName')]",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "[variables('dataFactoryDataSetOutName')]",
"type": "DatasetReference"
}
]
}
]
},
"dependsOn": [
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/datasets', parameters('dataFactoryName'),
variables('dataFactoryDataSetInName'))]",
"[resourceId('Microsoft.DataFactory/factories/datasets', parameters('dataFactoryName'),
variables('dataFactoryDataSetOutName'))]"
]
}
]
}

The following Azure resources are defined in the template:

Microsoft.Storage/storageAccounts: Defines a storage account.
Microsoft.DataFactory/factories: Creates an Azure Data Factory.
Microsoft.DataFactory/factories/linkedServices: Creates an Azure Data Factory linked service.
Microsoft.DataFactory/factories/datasets: Creates an Azure Data Factory dataset.
Microsoft.DataFactory/factories/pipelines: Creates an Azure Data Factory pipeline.
More Azure Data Factory template samples can be found in the quickstart template gallery.

Deploy the template


1. Select the following image to sign in to Azure and open a template. The template creates an Azure Data
Factory account, a storage account, and a blob container.

2. Select or enter the following values.

Unless specified otherwise, use the default values to create the Azure Data Factory resources:
Subscription: Select an Azure subscription.
Resource group: Select Create new, enter a unique name for the resource group, and then select
OK.
Region: Select a location. For example, East US.
Data Factory Name: Use the default value.
Location: Use the default value.
Storage Account Name: Use the default value.
Blob Container: Use the default value.
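
As an alternative to deploying from the portal, you can deploy the same template with Azure PowerShell. The following is a sketch that assumes you've saved the template shown above locally as azuredeploy.json:

# Sketch: deploy the template with Azure PowerShell instead of the portal
$resourceGroupName = "<your resource group name>"
New-AzResourceGroup -Name $resourceGroupName -Location "East US" -Force
New-AzResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateFile ".\azuredeploy.json"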

Review deployed resources


1. Select Go to resource group.

2. Verify your Azure Data Factory is created.


a. Your Azure Data Factory name is in the format - datafactory<uniqueid>.

3. Verify your storage account is created.


a. The storage account name is in the format - storage<uniqueid>.

4. Select the storage account created and then select Containers.

a. On the Containers page, select the blob container you created. The blob container name is in the
format - blob<uniqueid>.

Upload a file
1. On the Containers page, select Upload.
2. In the right pane, select the Files box, and then browse to and select the emp.txt file that you created
earlier.
3. Expand the Advanced heading.
4. In the Upload to folder box, enter input.
5. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
6. Select the Close icon (an X) to close the Upload blob page.
Keep the container page open, because you can use it to verify the output at the end of this quickstart.
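
As an alternative to uploading through the portal, the following is a sketch with the Az.Storage cmdlets (substitute the generated storage account name, its key, and the generated blob container name):

# Sketch: upload emp.txt to the input folder with Az.Storage
$storageContext = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" -Container "<blobContainerName>" -Blob "input/emp.txt" -Context $storageContext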
Start Trigger
1. Navigate to the Data factories page, and select the data factory you created.
2. Select Open on the Open Azure Data Factory Studio tile.

3. Select the Author tab.

4. Select the pipeline created - ArmtemplateSampleCopyPipeline.
5. Select Add Trigger > Trigger Now.

6. In the right pane under Pipeline run, select OK.
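
You can also trigger the pipeline from PowerShell instead of the Studio. The following is a sketch; substitute your resource group and the generated data factory name:

# Sketch: trigger the pipeline from PowerShell
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -PipelineName "ArmtemplateSampleCopyPipeline"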


Monitor the pipeline

1. Select the Monitor tab .


2. You see the activity runs associated with the pipeline run. In this quickstart, the pipeline has only one
activity of type: Copy. As such, you see a run for that activity.

Verify the output file


The pipeline automatically creates an output folder in the blob container. Then, it copies the emp.txt file from the
input folder to the output folder.
1. In the Azure portal, on the Containers page, select Refresh to see the output folder.
2. Select output in the folder list.
3. Confirm that the emp.txt is copied to the output folder.

Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following command to
delete the entire resource group:
Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

If you want to delete just the data factory, and not the entire resource group, run the following command:

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Next steps
In this quickstart, you created an Azure Data Factory using an ARM template and validated the deployment. To
learn more about Azure Data Factory and Azure Resource Manager, continue on to the articles below.
Azure Data Factory documentation
Learn more about Azure Resource Manager
Get other Azure Data Factory ARM templates
Create Azure Data Factory Data Flow
7/7/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Mapping Data Flows in ADF provide a way to transform data at scale without any coding required. You can
design a data transformation job in the data flow designer by constructing a series of transformations. Start with
any number of source transformations followed by data transformation steps. Then, complete your data flow
with a sink to land your results in a destination.
Get started by first creating a new V2 Data Factory from the Azure portal. After creating your new factory, select
"Open" in the "Open Azure Data Factory Studio" tile to launch the Data Factory UI.

Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, select the "Pipeline templates" tile in the "Discover more" section of the homepage, and
select the Data Flow category from the template gallery.
You will be prompted to enter your Azure Blob Storage account information.

The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage accounts so that you can execute the samples.

Create new data flow


Use the Create Resource "plus sign" button in the ADF UI to create Data Flows.
Next steps
Begin building your data transformation with a source transformation.
Azure Data Factory tutorials
6/23/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Below is a list of tutorials to help explain and walk through a series of Data Factory concepts and scenarios.

Copy and ingest data


Copy data tool
Copy activity in pipeline
Copy data from on-premises to the cloud
Amazon S3 to ADLS Gen2
Incremental copy pattern overview
Incremental pattern with change tracking
Incremental SQL DB single table
Incremental SQL DB multiple tables
CDC copy pipeline with SQL MI
Copy from SQL DB to Synapse SQL Pools
Copy SAP BW to ADLS Gen2
Copy Office 365 to Azure Blob Store
Bulk copy multiple tables
Copy pipeline with managed VNet

Data flows
Data flow tutorial videos
Code-free data transformation at scale
Delta lake transformations
Data wrangling with Power Query
Data flows inside managed VNet
Best practices for lake data in ADLS Gen2
Dynamically set column names

External data services


Azure Databricks notebook activity
HDI Spark transformations
Hive transformations

Pipelines
Control flow

SSIS
SSIS integration runtime

Data share
Data integration with Azure Data Share

Data lineage
Azure Purview

Next steps
Learn more about Data Factory pipelines and data flows.
Copy data from Azure Blob storage to a SQL
Database by using the Copy Data tool
7/8/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure portal to create a data factory. Then you use the Copy Data tool to create a
pipeline that copies data from Azure Blob storage to a SQL Database.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account : Use Blob storage as the source data store. If you don't have an Azure Storage
account, see the instructions in Create a storage account.
Azure SQL Database : Use a SQL Database as the sink data store. If you don't have a SQL Database, see the
instructions in Create a SQL Database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL Database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad . Copy the following text and save it in a file named inputEmp.txt on your disk:

FirstName|LastName
John|Doe
Jane|Doe

2. Create a container named adfv2tutorial and upload the inputEmp.txt file to the container. You can use
the Azure portal or various tools like Azure Storage Explorer to perform these tasks.
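
A scripted alternative, if you prefer Azure PowerShell to the portal or Storage Explorer, might look like the following sketch; the account name and key are placeholders.

# A minimal sketch, assuming the Az.Storage module and inputEmp.txt in the current directory.
$ctx = New-AzStorageContext -StorageAccountName "<storage account name>" -StorageAccountKey "<storage account key>"

# Create the container used by this tutorial and upload the source file to it.
New-AzStorageContainer -Name "adfv2tutorial" -Context $ctx
Set-AzStorageBlobContent -File ".\inputEmp.txt" -Container "adfv2tutorial" -Blob "inputEmp.txt" -Context $ctx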
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL Database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Verify that the setting Allow Azure services and resources
to access this server is enabled for your server that's running SQL Database. This setting lets Data
Factory write data to your database instance. To verify and turn on this setting, go to logical SQL server >
Security > Firewalls and virtual networks > set the Allow Azure services and resources to access
this server option to ON.

NOTE
The option to Allow Azure services and resources to access this server enables network access to your
SQL Server from any Azure resource, not just those in your subscription. For more information, see Azure SQL
Server Firewall rules. Instead, you can use Private endpoints to connect to Azure PaaS services without using
public IPs.
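
If you prefer to enable the setting from the command line, the special firewall rule below (0.0.0.0 to 0.0.0.0) is the conventional way to represent "Allow Azure services" in scripts; treat this as a hedged sketch and substitute your own server and resource group names.

# A minimal sketch, assuming the Az.Sql module; the resource group and server names are placeholders.
New-AzSqlServerFirewallRule `
    -ResourceGroupName "<resource group name>" `
    -ServerName "<logical SQL server name>" `
    -FirewallRuleName "AllowAllWindowsAzureIps" `
    -StartIpAddress "0.0.0.0" `
    -EndIpAddress "0.0.0.0"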

Create a data factory


1. On the left menu, select Create a resource > Integration > Data Factory:
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name for your data factory must be globally unique. You might receive the following error message:
If you receive an error message about the name value, enter a different name for the data factory. For
example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory
artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version, select V2.
6. Under Location, select the location for the data factory. Only supported locations are displayed in the
drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Create.
8. After creation is finished, the Data Factory home page is displayed.
9. To launch the Azure Data Factory user interface (UI) in a separate tab, select Open on the Open Azure
Data Factory Studio tile.
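
As an alternative to the portal, you could create the factory with Azure PowerShell; a hedged sketch (the resource group and factory names are placeholders, and the factory name must still be globally unique):

# A minimal sketch, assuming the Az.DataFactory module and an existing resource group.
Set-AzDataFactoryV2 `
    -ResourceGroupName "<resource group name>" `
    -Name "<your data factory name>" `
    -Location "EastUS"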

Use the Copy Data tool to create a pipeline


1. On the home page of Azure Data Factory, select the Ingest tile to launch the Copy Data tool.

2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type, then select
Next.
3. On the Source data store page, complete the following steps:
a. Select + Create new connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue .
c. On the New connection (Azure Blob Storage) page, select your Azure subscription from the Azure
subscription list, and select your storage account from the Storage account name list. Test connection
and then select Create .
d. Select the newly created linked service as source in the Connection block.
e. In the File or folder section, select Browse to navigate to the adfv2tutorial folder, select the
inputEmp.txt file, then select OK .
f. Select Next to move to next step.
4. On the File format settings page, enable the checkbox for First row as header. Notice that the tool
automatically detects the column and row delimiters, and you can preview data and view the schema of
the input data by selecting Preview data button on this page. Then select Next .
5. On the Destination data store page, complete the following steps:
a. Select + Create new connection to add a connection.
b. Select Azure SQL Database from the gallery, and then select Continue.
c. On the New connection (Azure SQL Database) page, select your Azure subscription, server name,
and database name from the dropdown lists. Then select SQL authentication under Authentication
type, and specify the username and password. Test the connection and select Create.

d. Select the newly created linked service as sink, then select Next .
6. On the Destination data store page, select Use existing table and select the dbo.emp table. Then
select Next .
7. On the Column mapping page, notice that the second and the third columns in the input file are
mapped to the FirstName and LastName columns of the emp table. Adjust the mapping to make sure
that there is no error, and then select Next .
8. On the Settings page, under Task name , enter CopyFromBlobToSqlPipeline , and then select Next .

9. On the Summary page, review the settings, and then select Next.
10. On the Deployment page, select Monitor to monitor the pipeline (task).
11. On the Pipeline runs page, select Refresh to refresh the list. Select the link under Pipeline name to view
activity run details or rerun the pipeline.

12. On the "Activity runs" page, select the Details link (eyeglasses icon) under Activity name column for
more details about copy operation. To go back to the "Pipeline runs" view, select the All pipeline runs
link in the breadcrumb menu. To refresh the view, select Refresh .
13. Verify that the data is inserted into the dbo.emp table in your SQL Database.
14. Select the Author tab on the left to switch to the editor mode. You can update the linked services,
datasets, and pipelines that were created via the tool by using the editor. For details on editing these
entities in the Data Factory UI, see the Azure portal version of this tutorial.

Next steps
The pipeline in this sample copies data from Blob storage to a SQL Database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a database
in Azure SQL Database by using Azure Data
Factory
7/7/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data from Azure Blob storage to a database in Azure SQL Database. The configuration
pattern in this tutorial applies to copying from a file-based data store to a relational data store. For a list of data
stores supported as sources and sinks, see the supported data stores table.

NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Blob storage as a source data store. If you don't have a storage account,
see Create an Azure storage account for steps to create one.
Azure SQL Database . You use the database as a sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:

FirstName,LastName
John,Doe
Jane,Doe

2. Create a container named adftutorial in your Blob storage. Create a folder named input in this
container. Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure
Storage Explorer to do these tasks.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your database:

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON
for your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this
setting, go to logical SQL server > Overview > Set server firewall > set the Allow access to Azure
services option to ON.

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome . Currently, Data Factory UI is supported only in Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory.
3. On the Create Data Factory page, under the Basics tab, select the Azure Subscription in which you want
to create the data factory.
4. For Resource Group , take one of the following steps:
a. Select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a new resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Region , select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
6. Under Name, enter ADFTutorialDataFactory.
The name of the Azure data factory must be globally unique. If you receive an error message about the
name value, enter a different name for the data factory (for example, yournameADFTutorialDataFactory).
For naming rules for Data Factory artifacts, see Data Factory naming rules.
7. Under Version, select V2.
8. Select the Git configuration tab at the top, and select the Configure Git later check box.
9. Select Review + create , and select Create after the validation is passed.
10. After the creation is finished, you see the notice in the Notifications center. Select Go to resource to
navigate to the Data factory page.
11. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory UI in a
separate tab.

Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the home page, select Orchestrate .
2. In the General panel under Properties, specify CopyPipeline for Name. Then collapse the panel by
clicking the Properties icon in the top-right corner.
3. In the Activities tool box, expand the Move and Transform category, and drag and drop the Copy
Data activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for
Name .

Configure source

TIP
In this tutorial, you use Account key as the authentication type for your source data store, but you can choose other
supported authentication methods: SAS URI, Service Principal, and Managed Identity if needed. Refer to the corresponding
sections in this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key
Vault. Refer to this article for detailed illustrations.

1. Go to the Source tab. Select + New to create a source dataset.


2. In the New Dataset dialog box, select Azure Blob Storage , and then select Continue . The source data
is in Blob storage, so you select Azure Blob Storage for the source dataset.
3. In the Select Format dialog box, choose the format type of your data, and then select Continue .
4. In the Set Properties dialog box, enter SourceBlobDataset for Name. Select the checkbox for First
row as header. Under the Linked service text box, select + New.
5. In the New Linked Service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as
the name, and select your storage account from the Storage account name list. Test the connection, then select Create
to deploy the linked service.
6. After the linked service is created, you're taken back to the Set properties page. Next to File path,
select Browse.
7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select OK.
8. Select OK. It automatically navigates to the pipeline page. In the Source tab, confirm that
SourceBlobDataset is selected. To preview data on this page, select Preview data .
Configure sink

TIP
In this tutorial, you use SQL authentication as the authentication type for your sink data store, but you can choose other
supported authentication methods: Service Principal and Managed Identity if needed. Refer to corresponding sections in
this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key Vault. Refer to
this article for detailed illustrations.

1. Go to the Sink tab, and select + New to create a sink dataset.


2. In the New Dataset dialog box, input "SQL" in the search box to filter the connectors, select Azure SQL
Database , and then select Continue . In this tutorial, you copy data to a SQL database.
3. In the Set Properties dialog box, enter OutputSqlDataset for Name. From the Linked service
dropdown list, select + New. A dataset must be associated with a linked service. The linked service has
the connection string that Data Factory uses to connect to SQL Database at runtime. The dataset specifies
the container, folder, and the file (optional) to which the data is copied.
4. In the New Linked Service (Azure SQL Database) dialog box, take the following steps:
a. Under Name, enter AzureSqlDatabaseLinkedService.
b. Under Server name, select your SQL Server instance.
c. Under Database name, select your database.
d. Under User name, enter the name of the user.
e. Under Password, enter the password for the user.
f. Select Test connection to test the connection.
g. Select Create to deploy the linked service.
5. It automatically navigates to the Set Properties dialog box. In Table, select [dbo].[emp]. Then select
OK.
6. Go to the tab with the pipeline, and in Sink Dataset , confirm that OutputSqlDataset is selected.
You can optionally map the schema of the source to corresponding schema of destination by following Schema
mapping in copy activity.

Validate the pipeline


To validate the pipeline, select Validate from the tool bar.
You can see the JSON code associated with the pipeline by clicking Code on the upper right.

Debug and publish the pipeline


You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory or
your own Azure Repos Git repository.
1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output
tab at the bottom of the window.
2. Once the pipeline can run successfully, in the top toolbar, select Publish all . This action publishes entities
(datasets, and pipelines) you created to Data Factory.
3. Wait until you see the Successfully published message. To see notification messages, click the Show
Notifications on the top-right (bell button).

Trigger the pipeline manually


In this step, you manually trigger the pipeline you published in the previous step.
1. Select Trigger on the toolbar, and then select Trigger Now . On the Pipeline Run page, select OK .
2. Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger. You can
use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
3. To see activity runs associated with the pipeline run, select the CopyPipeline link under the PIPELINE
NAME column. In this example, there's only one activity, so you see only one entry in the list. For details
about the copy operation, select the Details link (eyeglasses icon) under the ACTIVITY NAME column.
Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select
Refresh .

4. Verify that two more rows are added to the emp table in the database.

Trigger the pipeline on a schedule


In this section, you create a schedule trigger for the pipeline. The trigger runs the pipeline on the specified
schedule, such as hourly or daily. Here you set the trigger to run every minute until the specified end datetime.
1. Go to the Author tab on the left above the monitor tab.
2. Go to your pipeline, click Trigger on the tool bar, and select New/Edit .
3. In the Add triggers dialog box, select + New for Choose trigger area.
4. In the New Trigger window, take the following steps:
a. Under Name, enter RunEveryMinute.
b. Update the Start date for your trigger. If the date is before the current datetime, the trigger will start to
take effect once the change is published.
c. Under Time zone, select from the drop-down list.
d. Set the Recurrence to Every 1 Minute(s).
e. Select the checkbox for Specify an end date, and update the End On part to be a few minutes past
the current datetime. The trigger is activated only after you publish the changes. If you set it to only a
couple of minutes apart, and you don't publish it by then, you don't see a trigger run.
f. For the Activated option, select Yes.
g. Select OK.
IMPORTANT
A cost is associated with each pipeline run, so set the end date appropriately.

5. On the Edit trigger page, review the warning, and then select Save . The pipeline in this example doesn't
take any parameters.
6. Click Publish all to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.

8. To switch from the Pipeline Runs view to the Trigger Runs view, select Trigger Runs on the left side
of the window.
9. You see the trigger runs in a list.
10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
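
If you later need to pause the trigger, or list its runs, without opening the UI, Azure PowerShell exposes the same operations. A hedged sketch using the RunEveryMinute trigger created above; the resource group and factory names are placeholders.

# A minimal sketch, assuming the Az.DataFactory module.
# Stop (pause) the schedule trigger so it no longer fires.
Stop-AzDataFactoryV2Trigger `
    -ResourceGroupName "<resource group name>" `
    -DataFactoryName "<data factory name>" `
    -Name "RunEveryMinute"

# List the trigger runs recorded in the last hour.
Get-AzDataFactoryV2TriggerRun `
    -ResourceGroupName "<resource group name>" `
    -DataFactoryName "<data factory name>" `
    -TriggerName "RunEveryMinute" `
    -TriggerRunStartedAfter (Get-Date).AddHours(-1) `
    -TriggerRunStartedBefore (Get-Date)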

Next steps
The pipeline in this sample copies data from one location to another location in Blob storage. You learned how
to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
5/6/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create a Data Factory pipeline that copies data from Azure Blob Storage to Azure SQL
Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational
data store. For a list of data stores supported as sources and sinks, see supported data stores and formats.
You take the following steps in this tutorial:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline that contains a Copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses the .NET SDK. You can use other mechanisms to interact with Azure Data Factory; refer to samples
under Quickstarts.
If you don't have an Azure subscription, create a free Azure account before you begin.

Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see Create a general-purpose storage account.
Azure SQL Database. You use the database as sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database.
Visual Studio. The walkthrough in this article uses Visual Studio 2019.
Azure SDK for .NET.
Azure Active Directory application. If you don't have an Azure Active Directory application, see the Create an
Azure Active Directory application section of How to: Use the portal to create an Azure AD application. Copy
the following values for use in later steps: Application (client) ID, authentication key, and Directory
(tenant) ID. Assign the application to the Contributor role by following the instructions in the same article (a
scripted alternative is sketched after this list).
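
If you would rather script the application setup than use the portal, something like the following Azure PowerShell sketch can create a service principal and grant it the Contributor role. The display name is a hypothetical placeholder, and the property that exposes the application ID (AppId here) has differed across Az module versions; retrieve the client secret and tenant ID separately, for example from the portal.

# A minimal sketch, assuming the Az.Resources module and sufficient Azure AD permissions.
$sp = New-AzADServicePrincipal -DisplayName "ADFv2TutorialApp"

# Grant the service principal Contributor rights at the subscription scope.
# In older Az versions the application ID property is named ApplicationId instead of AppId.
New-AzRoleAssignment -ApplicationId $sp.AppId -RoleDefinitionName "Contributor"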
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by creating a source blob and a sink SQL
table.
Create a source blob
First, create a source blob by creating a container and uploading an input text file to it:
1. Open Notepad. Copy the following text and save it locally to a file named inputEmp.txt.

John|Doe
Jane|Doe

2. Use a tool such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
Next, create a sink SQL table:
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

2. Allow Azure services to access SQL Database. Ensure that you allow access to Azure services in your
server so that the Data Factory service can write data to SQL Database. To verify and turn on this setting,
do the following steps:
a. Go to the Azure portal to manage your SQL server. Search for and select SQL servers.
b. Select your server.
c. Under the SQL server menu's Security heading, select Firewalls and virtual networks.
d. In the Firewall and virtual networks page, under Allow Azure services and resources to
access this server, select ON.

Create a Visual Studio project


Using Visual Studio, create a C# .NET console application.
1. Open Visual Studio.
2. In the Start window, select Create a new project.
3. In the Create a new project window, choose the C# version of Console App (.NET Framework) from the
list of project types. Then select Next .
4. In the Configure your new project window, enter a Project name of ADFv2Tutorial. For Location ,
browse to and/or create the directory to save the project in. Then select Create . The new project appears in
the Visual Studio IDE.

Install NuGet packages


Next, install the required library packages using the NuGet package manager.
1. In the menu bar, choose Tools > NuGet Package Manager > Package Manager Console .
2. In the Package Manager Console pane, run the following commands to install packages. For
information about the Azure Data Factory NuGet package, see Microsoft.Azure.Management.DataFactory.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -PreRelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


Follow these steps to create a data factory client.
1. Open Program.cs, then overwrite the existing using statements with the following code to add
references to namespaces.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Rest.Serialization;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add the following code to the Main method that sets variables. Replace the 14 placeholders with your
own values.
To see the list of Azure regions in which Data Factory is currently available, see Products available by
region. Under the Products drop-down list, choose Browse > Analytics > Data Factor y . Then in the
Regions drop-down list, choose the regions that interest you. A grid appears with the availability status
of Data Factory products for your selected regions.

NOTE
Data stores, such as Azure Storage and Azure SQL Database, and computes, such as HDInsight, that Data Factory
uses can be in other regions than what you choose for Data Factory.

// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";

string region = "<location to create the data factory in, such as East US>";
string dataFactoryName = "<name of data factory to create (must be globally unique)>";

// Specify the source Azure Blob information


string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
string inputBlobPath = "adfv2tutorial/";
string inputBlobName = "inputEmp.txt";

// Specify the sink Azure SQL Database information


string azureSqlConnString =
"Server=tcp:<your server name>.database.windows.net,1433;" +
"Database=<your database name>;" +
"User ID=<your username>@<your server name>;" +
"Password=<your password>;" +
"Trusted_Connection=False;Encrypt=True;Connection Timeout=30";
string azureSqlTableName = "dbo.emp";

string storageLinkedServiceName = "AzureStorageLinkedService";


string sqlDbLinkedServiceName = "AzureSqlDbLinkedService";
string blobDatasetName = "BlobDataset";
string sqlDatasetName = "SqlDataset";
string pipelineName = "Adfv2TutorialBlobToSqlCopy";

3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create a data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync(
"https://management.azure.com/", cc
).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


Add the following code to the Main method that creates a data factory.

// Create a data factory


Console.WriteLine("Creating a data factory " + dataFactoryName + "...");
Factory dataFactory = new Factory
{
Location = region,
Identity = new FactoryIdentity()
};

client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);


Console.WriteLine(
SafeJsonConvert.SerializeObject(dataFactory, client.SerializationSettings)
);

while (
client.Factories.Get(
resourceGroup, dataFactoryName
).ProvisioningState == "PendingCreation"
)
{
System.Threading.Thread.Sleep(1000);
}

Create linked services


In this tutorial, you create two linked services for the source and sink, respectively.
Create an Azure Storage linked service
Add the following code to the Main method that creates an Azure Storage linked service. For information about
supported properties and details, see Azure Blob linked service properties.
// Create an Azure Storage linked service
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(


new AzureStorageLinkedService
{
ConnectionString = new SecureString(
"DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey
)
}
);

client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings)
);

Create an Azure SQL Database linked service


Add the following code to the Main method that creates an Azure SQL Database linked service. For information
about supported properties and details, see Azure SQL Database linked service properties.

// Create an Azure SQL Database linked service


Console.WriteLine("Creating linked service " + sqlDbLinkedServiceName + "...");

LinkedServiceResource sqlDbLinkedService = new LinkedServiceResource(


new AzureSqlDatabaseLinkedService
{
ConnectionString = new SecureString(azureSqlConnString)
}
);

client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, sqlDbLinkedServiceName, sqlDbLinkedService
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(sqlDbLinkedService, client.SerializationSettings)
);

Create datasets
In this section, you create two datasets: one for the source, the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
supported properties and details, see Azure Blob dataset properties.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName
The blob format indicating how to parse the content: TextFormat and its settings, such as column delimiter
The data structure, including column names and data types, which map in this example to the sink SQL table
// Create an Azure Blob dataset
Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference {
ReferenceName = storageLinkedServiceName
},
FolderPath = inputBlobPath,
FileName = inputBlobName,
Format = new TextFormat { ColumnDelimiter = "|" },
Structure = new List<DatasetDataElement>
{
new DatasetDataElement { Name = "FirstName", Type = "String" },
new DatasetDataElement { Name = "LastName", Type = "String" }
}
}
);

client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, blobDatasetName, blobDataset
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings)
);

Create a dataset for sink Azure SQL Database


Add the following code to the Main method that creates an Azure SQL Database dataset. For information about
supported properties and details, see Azure SQL Database dataset properties.
You define a dataset that represents the sink data in Azure SQL Database. This dataset refers to the Azure SQL
Database linked service you created in the previous step. It also specifies the SQL table that holds the copied
data.

// Create an Azure SQL Database dataset


Console.WriteLine("Creating dataset " + sqlDatasetName + "...");
DatasetResource sqlDataset = new DatasetResource(
new AzureSqlTableDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = sqlDbLinkedServiceName
},
TableName = azureSqlTableName
}
);

client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, sqlDatasetName, sqlDataset
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(sqlDataset, client.SerializationSettings)
);

Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, this
pipeline contains one activity: CopyActivity , which takes in the Blob dataset as source and the SQL dataset as
sink. For information about copy activity details, see Copy activity in Azure Data Factory.
// Create a pipeline with copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToSQL",
Inputs = new List<DatasetReference>
{
new DatasetReference() { ReferenceName = blobDatasetName }
},
Outputs = new List<DatasetReference>
{
new DatasetReference { ReferenceName = sqlDatasetName }
},
Source = new BlobSource { },
Sink = new SqlSink { }
}
}
};

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);


Console.WriteLine(
SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings)
);

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(
resourceGroup, dataFactoryName, pipelineName
).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Monitor a pipeline run


Now insert the code to check pipeline run states and to get details about the copy activity run.
1. Add the following code to the Main method to continuously check the statuses of the pipeline run until it
finishes copying the data.

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(
resourceGroup, dataFactoryName, runResponse.RunId
);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}
2. Add the following code to the Main method that retrieves copy activity run details, such as the size of the
data that was read or written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

RunFilterParameters filterParams = new RunFilterParameters(


DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10)
);

ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(


resourceGroup, dataFactoryName, runResponse.RunId, filterParams
);

if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(queryResponse.Value.First().Output);
}
else
Console.WriteLine(queryResponse.Value.First().Error);

Console.WriteLine("\nPress any key to exit...");


Console.ReadKey();

Run the code


Build the application by choosing Build > Build Solution . Then start the application by choosing Debug >
Star t Debugging , and verify the pipeline execution.
The console prints the progress of creating a data factory, linked service, datasets, pipeline, and pipeline run. It
then checks the pipeline run status. Wait until you see the copy activity run details with the data read/written
size. Then, using tools such as SQL Server Management Studio (SSMS) or Visual Studio, you can connect to your
destination Azure SQL Database and check whether the destination table you specified contains the copied data.
Sample output

Creating a data factory AdfV2Tutorial...


{
"identity": {
"type": "SystemAssigned"
},
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
}
}
}
}
Creating linked service AzureSqlDbLinkedService...
{
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=
<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
}
Creating dataset BlobDataset...
{
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adfv2tutorial/",
"fileName": "inputEmp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|"
}
},
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
}
Creating dataset SqlDataset...
{
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "dbo.emp"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureSqlDbLinkedService"
}
}
}
Creating pipeline Adfv2TutorialBlobToSqlCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "BlobDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SqlDataset"
"referenceName": "SqlDataset"
}
],
"name": "CopyFromBlobToSQL"
}
]
}
}
Creating pipeline run...
Pipeline run ID: 1cd03653-88a0-4c90-aabc-ae12d843e252
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 18,
"dataWritten": 28,
"rowsCopied": 2,
"copyDuration": 2,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"usedDataIntegrationUnits": 2,
"billedDuration": 2
}

Press any key to exit...

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. You
learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline containing a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to cloud:
Copy data from on-premises to cloud
Copy data from a SQL Server database to Azure
Blob storage by using the Copy Data tool
7/13/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that copies data from a SQL Server database to Azure Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the
upper-right corner, and then select Permissions . If you have access to multiple subscriptions, select the
appropriate subscription. For sample instructions on how to add a user to a role, see Assign Azure roles using
the Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Blob storage (sink). You then create a table
named emp in your SQL Server database and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Quer y .
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

Azure storage account


In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose storage account, see Create a storage account for instructions to
create one. The pipeline in the data factory you that create in this tutorial copies data from the SQL Server
database (source) to this Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select All services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
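
Scripting this lookup is also possible; a hedged Azure PowerShell sketch with placeholder names:

# A minimal sketch, assuming the Az.Storage module; the resource group and account names are placeholders.
$keys = Get-AzStorageAccountKey -ResourceGroupName "<resource group name>" -Name "<storage account name>"
$keys[0].Value   # the key1 value used later in this tutorial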

Create a data factory


1. On the menu on the left, select Create a resource > Integration > Data Factory.
2. On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name of the data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.
3. Select the Azure subscription in which you want to create the data factory.
4. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version , select V2 .
6. Under Location , select the location for the data factory. Only locations that are supported are displayed
in the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other locations/regions.
7. Select Create .
8. After the creation is finished, you see the Data Factory page as shown in the image.
9. Select Open on the Open Azure Data Factory Studio tile to launch the Data Factory user interface in a
separate tab.

Use the Copy Data tool to create a pipeline


1. On the Azure Data Factory home page, select Ingest to launch the Copy Data tool.

2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type, and choose
Run once now under Task cadence or task schedule, then select Next.
3. On the Source data store page, select + Create new connection.
4. Under New connection, search for SQL Server, and then select Continue.
5. In the New connection (SQL Server) dialog box, under Name, enter SqlServerLinkedService.
Select + New under Connect via integration runtime. You must create a self-hosted integration
runtime, download it to your machine, and register it with Data Factory. The self-hosted integration
runtime copies data between your on-premises environment and the cloud.
6. In the Integration runtime setup dialog box, select Self-Hosted . Then select Continue .
7. In the Integration runtime setup dialog box, under Name, enter TutorialIntegrationRuntime. Then
select Create.
8. In the Integration runtime setup dialog box, select Click here to launch the express setup for
this computer. This action installs the integration runtime on your machine and registers it with Data
Factory. Alternatively, you can use the manual setup option to download the installation file, run it, and
use the key to register the integration runtime (a scripted way to create the runtime and retrieve this key
is sketched after this procedure).
9. Run the downloaded application. You see the status of the express setup in the window.
10. In the New Connection (SQL Server) dialog box, confirm that TutorialIntegrationRuntime is
selected under Connect via integration runtime. Then, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select the appropriate authentication.
e. Under User name, enter the name of a user with access to SQL Server.
f. Enter the Password for the user.
g. Test the connection and select Create.
11. On the Source data store page, ensure that the newly created SQL Server connection is selected in
the Connection block. Then in the Source tables section, choose EXISTING TABLES and select the
dbo.emp table in the list, and select Next. You can select any other table based on your database.
12. On the Apply filter page, you can preview data and view the schema of the input data by selecting the
Preview data button. Then select Next .
13. On the Destination data store page, select + Create new connection
14. In New connection , search and select Azure Blob Storage , and then select Continue .
15. On the New connection (Azure Blob Storage) dialog, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Connect via integration runtime, select TutorialIntegrationRuntime, and select Account
key under Authentication method.
c. Under Azure subscription, select your Azure subscription from the drop-down list.
d. Under Storage account name, select your storage account from the drop-down list.
e. Test connection and select Create.
16. In the Destination data store dialog, make sure that the newly created Azure Blob Storage
connection is selected in the Connection block. Then under Folder path , enter
adftutorial/fromonprem . You created the adftutorial container as part of the prerequisites. If the
output folder doesn't exist (in this case fromonprem ), Data Factory automatically creates it. You can also
use the Browse button to browse the blob storage and its containers/folders. If you do not specify any
value under File name , by default the name from the source would be used (in this case dbo.emp ).
17. On the File format settings dialog, select Next .
18. On the Settings dialog, under Task name, enter CopyFromOnPremSqlToAzureBlobPipeline, and
then select Next. The Copy Data tool creates a pipeline with the name you specify for this field.
19. On the Summary dialog, review values for all the settings, and select Next.
20. On the Deployment page, select Monitor to monitor the pipeline (task).
21. When the pipeline run completes, you can view the status of the pipeline you created.
22. On the "Pipeline runs" page, select Refresh to refresh the list. Select the link under Pipeline name to
view activity run details or rerun the pipeline.

23. On the "Activity runs" page, select the Details link (eyeglasses icon) under the Activity name column
for more details about copy operation. To go back to the "Pipeline runs" page, select the All pipeline
runs link in the breadcrumb menu. To refresh the view, select Refresh .
24. Confirm that you see the output file in the fromonprem folder of the adftutorial container.
25. Select the Author tab on the left to switch to the editor mode. You can update the linked services,
datasets, and pipelines created by the tool by using the editor. Select Code to view the JSON code
associated with the entity opened in the editor. For details on how to edit these entities in the Data
Factory UI, see the Azure portal version of this tutorial.
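
As a scripted alternative to steps 5 through 9 above, you can register the self-hosted integration runtime with Azure PowerShell and fetch its authentication key for the manual installer. A hedged sketch, with placeholder resource group and factory names:

# A minimal sketch, assuming the Az.DataFactory module.
Set-AzDataFactoryV2IntegrationRuntime `
    -ResourceGroupName "<resource group name>" `
    -DataFactoryName "<data factory name>" `
    -Name "TutorialIntegrationRuntime" `
    -Type SelfHosted `
    -Description "Self-hosted IR for the SQL Server to Blob tutorial"

# Retrieve the authentication key to paste into the integration runtime installer on your machine.
Get-AzDataFactoryV2IntegrationRuntimeKey `
    -ResourceGroupName "<resource group name>" `
    -DataFactoryName "<data factory name>" `
    -Name "TutorialIntegrationRuntime"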

Next steps
The pipeline in this sample copies data from a SQL Server database to Blob storage. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from a SQL Server database to Azure
Blob storage
7/7/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure Data Factory user interface (UI) to create a data factory pipeline that copies
data from a SQL Server database to Azure Blob storage. You create and use a self-hosted integration runtime,
which moves data between on-premises and cloud data stores.

NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data
Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select
your user name, and then select Permissions . If you have access to multiple subscriptions, select the
appropriate subscription. For sample instructions on how to add a user to a role, see Assign Azure roles using
the Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Blob storage (sink). You then create a table
named emp in your SQL Server database and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Quer y .

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')


INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

Azure storage account


In this tutorial, you use a general-purpose Azure storage account (specifically, Blob storage) as a destination/sink
data store. If you don't have a general-purpose Azure storage account, see Create a storage account. The
pipeline in the data factory that you create in this tutorial copies data from the SQL Server database (source) to
Blob storage (sink).
Get the storage account name and account key
You use the name and key of your storage account in this tutorial. To get the name and key of your storage
account, take the following steps:
1. Sign in to the Azure portal with your Azure user name and password.
2. In the left pane, select All services. Filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account if needed. Then select your storage account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Containers.

2. In the Containers window, select + Container to create a new one.


3. In the New container window, under Name , enter adftutorial . Then select Create .
4. In the list of containers, select the adftutorial container that you just created.
5. Keep the container window for adftutorial open. You use it to verify the output at the end of the
tutorial. Data Factory automatically creates the output folder in this container, so you don't need to create
one.
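
You can also create the container from PowerShell instead of the portal. The sketch below assumes the storage
account name and key that you noted in the previous step.

# Create the adftutorial blob container.
$ctx = New-AzStorageContext -StorageAccountName "<yourStorageAccount>" -StorageAccountKey "<yourAccountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx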

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open the Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported
only in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory:
3. On the New data factory page, under Name, enter ADFTutorialDataFactory.
The name of the data factory must be globally unique. If you see the following error message for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select the location for the data factory. Only locations that are supported are displayed
in the drop-down list. The data stores (for example, Storage and SQL Database) and computes (for
example, Azure HDInsight) used by Data Factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the Data Factory page as shown in the image:
10. Select Open on the Open Azure Data Factory Studio tile to launch the Data Factory UI in a separate
tab.

Create a pipeline
1. On the Azure Data Factory home page, select Orchestrate . A pipeline is automatically created for you.
You see the pipeline in the tree view, and its editor opens.

2. In the General panel under Properties, specify SQLServerToBlobPipeline for Name. Then collapse
the panel by clicking the Properties icon in the top-right corner.
3. In the Activities toolbox, expand Move & Transform. Drag and drop the Copy activity to the pipeline
design surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.
5. In the New Dataset dialog box, search for SQL Server. Select SQL Server, and then select Continue.
6. In the Set Properties dialog box, under Name, enter SqlServerDataset. Under Linked service, select
+ New . You create a connection to the source data store (SQL Server database) in this step.
7. In the New Linked Service dialog box, add Name as SqlServerLinkedService. Under Connect via
integration runtime , select +New . In this section, you create a self-hosted integration runtime and
associate it with an on-premises machine with the SQL Server database. The self-hosted integration
runtime is the component that copies data from the SQL Server database on your machine to Blob
storage.
8. In the Integration Runtime Setup dialog box, select Self-Hosted , and then select Continue .
9. Under name, enter TutorialIntegrationRuntime . Then select Create .
10. For Settings, select Click here to launch the express setup for this computer . This action installs
the integration runtime on your machine and registers it with Data Factory. Alternatively, you can use the
manual setup option to download the installation file, run it, and use the key to register the integration
runtime.
11. In the Integration Runtime (Self-hosted) Express Setup window, select Close when the process is
finished.
12. In the New linked service (SQL Server) dialog box, confirm that TutorialIntegrationRuntime is
selected under Connect via integration runtime. Then, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your SQL Server instance.
c. Under Database name , enter the name of the database with the emp table.
d. Under Authentication type , select the appropriate authentication type that Data Factory should use
to connect to your SQL Server database.
e. Under User name and Password , enter the user name and password. Use mydomain\myuser as user
name if needed.
f. Select Test connection . This step is to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
g. To save the linked service, select Create .
13. After the linked service is created, you're back to the Set properties page for the SqlServerDataset. Take
the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. Under Table name, select [dbo].[emp].
c. Select OK .
14. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.
15. Go to the Sink tab at the bottom of the Properties window, and select + New.
16. In the New Dataset dialog box, select Azure Blob Storage. Then select Continue.
17. In the Select Format dialog box, choose the format type of your data. Then select Continue.
18. In the Set Properties dialog box, enter AzureBlobDataset for Name. Next to the Linked service text
box, select + New.
19. In the New Linked Service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as the
name, and select your storage account from the Storage account name list. Test the connection, and then select
Create to deploy the linked service.
20. After the linked service is created, you're back to the Set properties page. Select OK.
21. Open the sink dataset. On the Connection tab, take the following steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. In File path, enter adftutorial/fromonprem for the Container/Directory part. If the output folder
doesn't exist in the adftutorial container, Data Factory automatically creates the output folder.
c. For the File part, select Add dynamic content .
d. Add @CONCAT(pipeline().RunId, '.txt'), and then select Finish. This expression names the output file
after the pipeline run ID, in the form <PipelineRunID>.txt.
22. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset , confirm
that AzureBlobDataset is selected.
23. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
validation output, select the >> icon.

24. To publish entities you created to Data Factory, select Publish all .
25. Wait until you see the Publishing completed pop-up. To check the status of publishing, select the Show
notifications link on the top of the window. To close the notification window, select Close .

Trigger a pipeline run


Select Add Trigger on the toolbar for the pipeline, and then select Trigger Now .
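
If you later want to trigger the same pipeline from a script instead of the UI, a call along the following lines
works; the resource group and data factory names are placeholders for the values you used when you created the
factory.

# Trigger the pipeline from PowerShell and keep the run ID for monitoring.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<yourResourceGroup>" `
    -DataFactoryName "<yourDataFactory>" -PipelineName "SQLServerToBlobPipeline"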

Monitor the pipeline run


1. Go to the Monitor tab. You see the pipeline that you manually triggered in the previous step.
2. To view activity runs associated with the pipeline run, select the SQLServerToBlobPipeline link under
PIPELINE NAME.

3. On the Activity runs page, select the Details (eyeglasses image) link to see details about the copy
operation. To go back to the Pipeline Runs view, select All pipeline runs at the top.

Verify the output


The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the [pipeline().RunId].txt file in the output folder.

Next steps
The pipeline in this sample copies data from a SQL Server database to Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Storage linked services.
Create SQL Server and Blob storage datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Tutorial: Copy data from a SQL Server database to
Azure Blob storage
3/5/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use Azure PowerShell to create a data-factory pipeline that copies data from a SQL Server
database to Azure Blob storage. You create and use a self-hosted integration runtime, which moves data
between on-premises and cloud data stores.

NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction
to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.

Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-
right corner, and then select Permissions . If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Assign Azure roles using the Azure
portal article.
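
Because the rest of this tutorial uses Azure PowerShell, you can also check your role assignments from a script. The
sketch below assumes you've already signed in with Connect-AzAccount; replace the sign-in name with your own account.

# List the roles assigned to your account in the selected subscription.
Get-AzRoleAssignment -SignInName "user@contoso.com" |
    Select-Object RoleDefinitionName, Scope
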
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Azure Blob storage (sink). You then create a
table named emp in your SQL Server database, and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Query.

CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO

Azure Storage account


In this tutorial, you use a general-purpose Azure storage account (specifically, Azure Blob storage) as a
destination/sink data store. If you don't have a general-purpose Azure storage account, see Create a storage
account. The pipeline in the data factory that you create in this tutorial copies data from the SQL Server
database (source) to this Azure Blob storage (sink).
Get storage account name and account key
You use the name and key of your Azure storage account in this tutorial. Get the name and key of your storage
account by doing the following:
1. Sign in to the Azure portal with your Azure username and password.
2. In the left pane, select More services, filter by using the Storage keyword, and then select Storage
accounts.

3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.

2. In the Blob service window, select Container.


3. In the New container window, in the Name box, enter adftutorial , and then select OK .

4. In the list of containers, select adftutorial .


5. Keep the container window for adftutorial open. You use it to verify the output at the end of the tutorial.
Data Factory automatically creates the output folder in this container, so you don't need to create one.
Windows PowerShell
Install Azure PowerShell

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through completion of this quickstart tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:

Connect-AzAccount

3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

Create a data factory


1. Define a variable for the resource group name that you'll use later in PowerShell commands. Copy the
following command to PowerShell, specify a name for the Azure resource group (enclosed in double
quotation marks; for example, "adfrg" ), and then run the command.

$resourceGroupName = "ADFTutorialResourceGroup"

2. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName -location 'East US'

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

3. Define a variable for the data factory name that you can use in PowerShell commands later. The name
must start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.

IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFTutorialFactory"

4. Define a variable for the location of the data factory:

$location = "East US"

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName


NOTE
The name of the data factory must be globally unique. If you receive the following error, change the name and try
again.

The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.

To create data-factory instances, the user account that you use to sign in to Azure must be assigned a contributor or
owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores
(Azure Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory
can be in other regions.

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with an on-premises machine with
the SQL Server database. The self-hosted integration runtime is the component that copies data from the SQL
Server database on your machine to Azure Blob storage.
1. Create a variable for the name of integration runtime. Use a unique name, and note the name. You use it
later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -Name $integrationRuntimeName `
    -Type SelfHosted -Description "selfhosted IR description"

Here is the sample output:

Name : ADFTutorialIR
Type : SelfHosted
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Description : selfhosted IR description
Id : /subscriptions/<subscription
ID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>/in
tegrationruntimes/<integrationRuntimeName>

3. To retrieve the status of the created integration runtime, run the following command:

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName `
    -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:


State : NeedRegistration
Version :
CreateTime : 9/10/2019 3:24:09 AM
AutoUpdate : On
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
InternalChannelEncryption :
Capabilities : {}
ServiceUrls : {eu.frontend.clouddatahub.net}
Nodes : {}
Links : {}
Name : <Integration Runtime name>
Type : SelfHosted
ResourceGroupName : <resourceGroup name>
DataFactoryName : <dataFactory name>
Description : selfhosted IR description
Id : /subscriptions/<subscription
ID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>/in
tegrationruntimes/<integrationRuntimeName>

4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data
Factory service in the cloud, run the following command. Copy one of the keys (excluding the quotation
marks) for registering the self-hosted integration runtime that you install on your machine in the next
step.

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName `
    -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}

Install the integration runtime


1. Download Azure Data Factory Integration Runtime on a local Windows machine, and then run the
installation.
2. In the Welcome to Microsoft Integration Runtime Setup wizard, select Next .
3. In the End-User License Agreement window, accept the terms and license agreement, and select Next .
4. In the Destination Folder window, select Next .
5. In the Ready to install Microsoft Integration Runtime window, select Install .
6. In the Completed the Microsoft Integration Runtime Setup wizard, select Finish .
7. In the Register Integration Runtime (Self-hosted) window, paste the key you saved in the previous
section, and then select Register .
8. In the New Integration Runtime (Self-hosted) Node window, select Finish .

9. When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the Register Integration Runtime (Self-hosted) window, select Launch Configuration
Manager .
11. When the node is connected to the cloud service, the following message is displayed:

12. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that integration runtime can connect to the SQL Server, select Test .

If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect
to your SQL Server instance.
Note all the preceding values for later use in this tutorial.
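
If you want to repeat the same connectivity check from a script on the self-hosted integration runtime machine, a
minimal sketch such as the following works in Windows PowerShell (where System.Data.SqlClient is available). The
server, database, and credential values are placeholders.

# Verify that the machine hosting the self-hosted IR can reach the SQL Server database.
$connectionString = "Server=<serverName>;Database=<databaseName>;User Id=<userName>;Password=<password>;"
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
try {
    $connection.Open()
    Write-Host "Connection succeeded."
}
catch {
    Write-Host "Connection failed: $($_.Exception.Message)"
}
finally {
    $connection.Close()
}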

Create linked services


To link your data stores and compute services to the data factory, create linked services in the data factory. In
this tutorial, you link your Azure storage account and SQL Server instance to the data factory. The linked services
have the connection information that the Data Factory service uses at runtime to connect to them.
Create an Azure Storage linked service (destination/sink)
In this step, you link your Azure storage account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2Tutorial folder with the
following code. If the ADFv2Tutorial folder does not already exist, create it.

IMPORTANT
Before you save the file, replace <accountName> and <accountKey> with the name and key of your Azure
storage account. You noted them in the Prerequisites section.
{
"name": "AzureStorageLinkedService",
"properties": {
"annotations": [],
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=
<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net"
}
}
}

2. In PowerShell, switch to the C:\ADFv2Tutorial folder.

Set-Location 'C:\ADFv2Tutorial'

3. To create the linked service, AzureStorageLinkedService, run the following
Set-AzDataFactoryV2LinkedService cmdlet:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName `
    -ResourceGroupName $ResourceGroupName -Name "AzureStorageLinkedService" `
    -File ".\AzureStorageLinkedService.json"

Here is a sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroup name>
DataFactoryName : <dataFactory name>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobStorageLinkedService

If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
Create and encrypt a SQL Server linked service (source)
In this step, you link your SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:

IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.

Using SQL authentication (sa):


{
"name":"SqlServerLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[

],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=False;data source=<serverName>;initial catalog=
<databaseName>;user id=<userName>;password=<password>"
},
"connectVia":{
"referenceName":"<integration runtime name> ",
"type":"IntegrationRuntimeReference"
}
}
}

Using Windows authentication:

{
"name":"SqlServerLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[

],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=True;data source=<serverName>;initial catalog=
<databaseName>",
"userName":"<username> or <domain>\\<username>",
"password":{
"type":"SecureString",
"value":"<password>"
}
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}

IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <servername>, <databasename>, <username>, and <password>
with the values of your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.

2. To encrypt the sensitive data (username, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted using Data Protection Application
Programming Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted
integration runtime node (local machine). The output payload can be redirected to another JSON file (in
this case, encryptedLinkedService.json) that contains encrypted credentials.

New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName `
    -ResourceGroupName $ResourceGroupName -IntegrationRuntimeName $integrationRuntimeName `
    -File ".\SQLServerLinkedService.json" > encryptedSQLServerLinkedService.json

3. Run the following command, which creates EncryptedSqlServerLinkedService:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName `
    -ResourceGroupName $ResourceGroupName -Name "EncryptedSqlServerLinkedService" `
    -File ".\encryptedSqlServerLinkedService.json"

Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server
instance at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial,
the emp table contains the source data.
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"name":"SqlServerDataset",
"properties":{
"linkedServiceName":{
"referenceName":"EncryptedSqlServerLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"SqlServerTable",
"schema":[

],
"typeProperties":{
"schema":"dbo",
"table":"emp"
}
}
}

2. To create the dataset SqlServerDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "SqlServerDataset" -File ".\SqlServerDataset.json"

Here is the sample output:


DatasetName : SqlServerDataset
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a dataset for Azure Blob storage (sink)


In this step, you define a dataset that represents data that will be copied to Azure Blob storage. The dataset is of
the DelimitedText type with an Azure Blob storage location. It refers to the Azure Storage linked service that you created earlier in this tutorial.
The linked service has the connection information that the data factory uses at runtime to connect to your Azure
storage account. This dataset specifies the folder in the Azure storage to which the data is copied from the SQL
Server database. In this tutorial, the folder is adftutorial/fromonprem, where adftutorial is the blob container
and fromonprem is the folder.
1. Create a JSON file named AzureBlobDataset.json in the C:\ADFv2Tutorial folder, with the following code:

{
"name":"AzureBlobDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"DelimitedText",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":"fromonprem",
"container":"adftutorial"
},
"columnDelimiter":",",
"escapeChar":"\\",
"quoteChar":"\""
},
"schema":[

]
},
"type":"Microsoft.DataFactory/factories/datasets"
}

2. To create the dataset AzureBlobDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "AzureBlobDataset" -File ".\AzureBlobDataset.json"

Here is the sample output:

DatasetName : AzureBlobDataset
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.DelimitedTextDataset
Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlServerSource and the sink type
is set to DelimitedTextSink.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:

{
"name":"SqlServerToBlobPipeline",
"properties":{
"activities":[
{
"name":"CopySqlServerToAzureBlobActivity",
"type":"Copy",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"source":{
"type":"SqlServerSource"
},
"sink":{
"type":"DelimitedTextSink",
"storeSettings":{
"type":"AzureBlobStorageWriteSettings"
},
"formatSettings":{
"type":"DelimitedTextWriteSettings",
"quoteAllText":true,
"fileExtension":".txt"
}
},
"enableStaging":false
},
"inputs":[
{
"referenceName":"SqlServerDataset",
"type":"DatasetReference"
}
],
"outputs":[
{
"referenceName":"AzureBlobDataset",
"type":"DatasetReference"
}
]
}
],
"annotations":[

]
}
}
2. To create the pipeline SQLServerToBlobPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "SQLServerToBlobPipeline" -File ".\SQLServerToBlobPipeline.json"

Here is the sample output:

PipelineName : SQLServerToBlobPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :

Create a pipeline run


Start a pipeline run for the SQLServerToBlobPipeline pipeline, and capture the pipeline run ID for future
monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -PipelineName 'SQLServerToBlobPipeline'

Monitor the pipeline run


1. To continuously check the run status of pipeline SQLServerToBlobPipeline, run the following script in
PowerShell, and print the final result:

while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName `
        -ResourceGroupName $resourceGroupName -PipelineRunId $runId `
        -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)

    if (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
        Write-Host "Pipeline run status: In Progress" -ForegroundColor "Yellow"
        Start-Sleep -Seconds 30
    }
    else {
        Write-Host "Pipeline 'SQLServerToBlobPipeline' run finished. Result:" -ForegroundColor "Yellow"
        $result
        break
    }
}

Here is the output of the sample run:


ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityRunId : 24af7cf6-efca-4a95-931d-067c5c921c25
ActivityName : CopySqlServerToAzureBlobActivity
ActivityType : Copy
PipelineRunId : 7b538846-fd4e-409c-99ef-2475329f5729
PipelineName : SQLServerToBlobPipeline
Input : {source, sink, enableStaging}
Output : {dataRead, dataWritten, filesWritten, sourcePeakConnections...}
LinkedServiceName :
ActivityRunStart : 9/11/2019 7:10:37 AM
ActivityRunEnd : 9/11/2019 7:10:58 AM
DurationInMs : 21094
Status : Succeeded
Error : {errorCode, message, failureType, target}
AdditionalProperties : {[retryAttempt, ], [iterationHash, ], [userProperties, {}], [recoveryStatus,
None]...}

2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:

Write-Host "Pipeline 'SQLServerToBlobPipeline' run result:" -foregroundcolor "Yellow"


($result | Where-Object {$_.ActivityName -eq "CopySqlServerToAzureBlobActivity"}).Output.ToString()

Here is the output of the sample run:

{
"dataRead":36,
"dataWritten":32,
"filesWritten":1,
"sourcePeakConnections":1,
"sinkPeakConnections":1,
"rowsRead":2,
"rowsCopied":2,
"copyDuration":18,
"throughput":0.01,
"errors":[

],
"effectiveIntegrationRuntime":"ADFTutorialIR",
"usedParallelCopies":1,
"executionDetails":[
{
"source":{
"type":"SqlServer"
},
"sink":{
"type":"AzureBlobStorage",
"region":"CentralUS"
},
"status":"Succeeded",
"start":"2019-09-11T07:10:38.2342905Z",
"duration":18,
"usedParallelCopies":1,
"detailedDurations":{
"queuingDuration":6,
"timeToFirstByte":0,
"transferDuration":5
}
}
]
}
Verify the output
The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the dbo.emp.txt file in the output folder.
1. In the Azure portal, in the adftutorial container window, select Refresh to see the output folder.
2. Select fromonprem in the list of folders.
3. Confirm that you see a file named dbo.emp.txt .
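
You can also verify the output from PowerShell by listing the blobs that the pipeline wrote. The sketch below
assumes the storage account name and key that you noted in the prerequisites.

# List the blobs written to the fromonprem folder of the adftutorial container.
$ctx = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "fromonprem/" -Context $ctx |
    Select-Object Name, Length, LastModified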

Next steps
The pipeline in this sample copies data from a SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Load data into Azure Data Lake Storage Gen2 with
Azure Data Factory
7/7/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob
storage. It allows you to interface with your data using both file system and object storage paradigms.
Azure Data Factory (ADF) is a fully managed cloud-based data integration service. You can use the service to
populate the lake with data from a rich set of on-premises and cloud-based data stores and save time when
building your analytics solutions. For a detailed list of supported connectors, see the table of Supported data
stores.
Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of
ADF, it can ingest data at a high throughput. For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon Web Services S3
service into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.

TIP
For copying data from Azure Data Lake Storage Gen1 into Gen2, refer to this specific walkthrough.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.
AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You
can use other data stores by following similar steps.

Create a data factory


1. On the left menu, select Create a resource > Integration > Data Factory:
2. In the New data factory page, provide values for the following fields:
Name : Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name YourDataFactoryName is not available", enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.

Load data into Azure Data Lake Storage Gen2


1. In the home page of Azure Data Factory, select the Ingest tile to launch the Copy Data tool.
2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select
Next .
3. In the Source data store page, click + Create new connection . Select Amazon S3 from the
connector gallery, and select Continue .

4. In the New linked service (Amazon S3) page, do the following steps:
a. Specify the Access Key ID value.
b. Specify the Secret Access Key value.
c. Click Test connection to validate the settings, then select Create .

d. You'll see that a new AmazonS3 connection is created. Select Next.
5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, and then select Choose .
6. Specify the copy behavior by checking the Recursively and Binary copy options. Select Next.

7. In the Destination data store page, click + Create new connection , and then select Azure Data
Lake Storage Gen2 , and select Continue .
8. In the New linked service (Azure Data Lake Storage Gen2) page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop-down
list.
b. Select Create to create the connection. Then select Next .

9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and
select Next. ADF will create the corresponding ADLS Gen2 file system and subfolders during the copy if they
don't exist.
10. In the Settings page, select Next to use the default settings.

11. In the Summary page, review the settings, and select Next.
12. On the Deployment page , select Monitor to monitor the pipeline (task).
13. When the pipeline run completes successfully, you see a pipeline run that is triggered by a manual trigger.
You can use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.

14. To see activity runs associated with the pipeline run, select the CopyFromAmazonS3ToADLS link under
the PIPELINE NAME column. For details about the copy operation, select the Details link (eyeglasses
icon) under the ACTIVITY NAME column. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configuration.
15. To refresh the view, select Refresh. Select All pipeline runs at the top to go back to the Pipeline Runs
view.
16. Verify that the data is copied into your Data Lake Storage Gen2 account.
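
As an optional scripted check, you can list the copied files with Azure PowerShell. The sketch below assumes you're
signed in with an account that has access to the storage account; the account name is a placeholder, and copyfroms3
is the file system created by the copy.

# List the files that the copy activity wrote to the Data Lake Storage Gen2 account.
$ctx = New-AzStorageContext -StorageAccountName "<gen2AccountName>" -UseConnectedAccount
Get-AzDataLakeGen2ChildItem -FileSystem "copyfroms3" -Recurse -Context $ctx |
    Select-Object Path, Length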

Next steps
Copy activity overview
Azure Data Lake Storage Gen2 connector
Load data into Azure Data Lake Storage Gen1 by
using Azure Data Factory
7/7/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale
repository for big data analytic workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and
ingestion speed. The data is captured in a single place for operational and exploratory analytics.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from your existing system and save time when building your analytics solutions.
Azure Data Factory offers the following benefits for loading data into Data Lake Storage Gen1:
Easy to set up : An intuitive 5-step wizard with no scripting required.
Rich data store support : Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant : Data is transferred over HTTPS or ExpressRoute. The global service presence
ensures that your data never leaves the geographical boundary.
High performance : Up to 1-GB/s data loading speed into Data Lake Storage Gen1. For details, see Copy
activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Amazon S3 into Data Lake
Storage Gen1. You can follow similar steps to copy data from other types of data stores.

NOTE
For more information, see Copy data to or from Data Lake Storage Gen1 by using Azure Data Factory.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Data Lake Storage Gen1 account: If you don't have a Data Lake Storage Gen1 account, see the instructions in
Create a Data Lake Storage Gen1 account.
Amazon S3: This article shows how to copy data from Amazon S3. You can use other data stores by following
similar steps.

Create a data factory


1. On the left menu, select Create a resource > Analytics > Data Factory:
2. In the New data factory page, provide values for the fields that are shown in the following image:

Name : Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSG1Demo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These
data stores include Azure Data Lake Storage Gen1, Azure Storage, Azure SQL Database, and so on.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:

Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.

Load data into Data Lake Storage Gen1


1. In the home page, select the Ingest tile to launch the Copy Data tool:

2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select
Next :

3. In the Source data store page, click + Create new connection :

Select Amazon S3 , and select Continue


4. In the Specify Amazon S3 connection page, do the following steps:
a. Specify the Access Key ID value.
b. Specify the Secret Access Key value.
c. Select Finish .

d. You will see a new connection. Select Next .


5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, select Choose , and then select Next :

6. Choose the copy behavior by selecting the Copy files recursively and Binary copy (copy files as-is)
options. Select Next :
7. In the Destination data store page, click + Create new connection , and then select Azure Data
Lake Storage Gen1 , and select Continue :

8. In the New Linked Service (Azure Data Lake Storage Gen1) page, do the following steps:
a. Select your Data Lake Storage Gen1 account for the Data Lake Store account name .
b. Specify the Tenant , and select Finish.
c. Select Next .

IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage
Gen1 account. Be sure to grant the MSI the proper permissions in Data Lake Storage Gen1 by following these
instructions.
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and
select Next :

10. In the Settings page, select Next :


11. In the Summary page, review the settings, and select Next:
12. In the Deployment page , select Monitor to monitor the pipeline (task):
13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to
view activity run details and to rerun the pipeline:

14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To
switch back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the
list.

15. To monitor the execution details for each copy activity, select the Details link under Actions in the
activity monitoring view. You can monitor details like the volume of data copied from the source to the
sink, data throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen1 account:
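
As an optional scripted check, you can list the copied files with Azure PowerShell. The sketch below assumes the
Az.DataLakeStore module is installed and that you're signed in with an account that can read the Data Lake Storage
Gen1 account; the account name is a placeholder.

# List the files that the copy activity wrote to the Data Lake Storage Gen1 account.
Get-AzDataLakeStoreChildItem -Account "<gen1AccountName>" -Path "/copyfroms3" |
    Select-Object Name, Length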

Next steps
Advance to the following article to learn about Data Lake Storage Gen1 support:
Azure Data Lake Storage Gen1 connector
Copy data from Azure Data Lake Storage Gen1 to
Gen2 with Azure Data Factory
7/7/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics that's built into Azure Blob
storage. You can use it to interface with your data by using both file system and object storage paradigms.
If you currently use Azure Data Lake Storage Gen1, you can evaluate Azure Data Lake Storage Gen2 by copying
data from Data Lake Storage Gen1 to Gen2 by using Azure Data Factory.
Azure Data Factory is a fully managed cloud-based data integration service. You can use the service to populate
the lake with data from a rich set of on-premises and cloud-based data stores and save time when you build
your analytics solutions. For a list of supported connectors, see the table of Supported data stores.
Azure Data Factory offers a scale-out, managed data movement solution. Because of the scale-out architecture
of Data Factory, it can ingest data at a high throughput. For more information, see Copy activity performance.
This article shows you how to use the Data Factory copy data tool to copy data from Azure Data Lake Storage
Gen1 into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data
stores.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Data Lake Storage Gen1 account with data in it.
Azure Storage account with Data Lake Storage Gen2 enabled. If you don't have a Storage account, create an
account.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory.
2. On the New data factory page, provide values for the fields that are shown in the following image:
Name : Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, use
the name yournameADFTutorialDataFactory. Create the data factory again. For the naming rules
for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list. You also can select the
Create new option and enter the name of a resource group. To learn about resource groups, see Use
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by the data factory can be in other locations and regions.
3. Select Create .
4. After creation is finished, go to your data factory. You see the Data Factory home page as shown in the
following image:
5. Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration application in
a separate tab.

Load data into Azure Data Lake Storage Gen2


1. On the home page, select the Ingest tile to launch the copy data tool.

2. On the Properties page, specify CopyFromADLSGen1ToGen2 for the Task name field. Select Next.
3. On the Source data store page, select + Create new connection .

4. Select Azure Data Lake Storage Gen1 from the connector gallery, and select Continue .
5. On the Specify Azure Data Lake Storage Gen1 connection page, follow these steps:
a. Select your Data Lake Storage Gen1 for the account name, and specify or validate the Tenant .
b. Select Test connection to validate the settings. Then select Finish .
c. You see that a new connection was created. Select Next .

IMPORTANT
In this walk-through, you use a managed identity for Azure resources to authenticate your Azure Data Lake
Storage Gen1. To grant the managed identity the proper permissions in Azure Data Lake Storage Gen1, follow
these instructions.
6. On the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder or file, and select Choose .

7. Specify the copy behavior by selecting the Copy files recursively and Binary copy options. Select
Next .
8. On the Destination data store page, select + Create new connection > Azure Data Lake Storage
Gen2 > Continue .

9. On the Specify Azure Data Lake Storage Gen2 connection page, follow these steps:
a. Select your Data Lake Storage Gen2 capable account from the Storage account name drop-down list.
b. Select Finish to create the connection. Then select Next .
10. On the Choose the output file or folder page, enter copyfromadlsgen1 as the output folder name,
and select Next . Data Factory creates the corresponding Azure Data Lake Storage Gen2 file system and
subfolders during copy if they don't exist.

11. On the Settings page, select Next to use the default settings.
12. On the Summary page, review the settings, and select Next.
13. On the Deployment page , select Monitor to monitor the pipeline.

14. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to
view activity run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To
switch back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the
list.

16. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used
configurations.

17. Verify that the data is copied into your Azure Data Lake Storage Gen2 account.

Best practices
To assess upgrading from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 in general, see
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
The following sections introduce best practices for using Data Factory for a data upgrade from Data Lake
Storage Gen1 to Data Lake Storage Gen2.
Data partition for historical data copy
If your total data size in Data Lake Storage Gen1 is less than 30 TB and the number of files is less than 1
million, you can copy all data in a single copy activity run.
If you have a larger amount of data to copy, or you want the flexibility to manage data migration in batches
and make each of them complete within a specific time frame, partition the data. Partitioning also reduces
the risk of any unexpected issue.
Use a proof of concept to verify the end-to-end solution and test the copy throughput in your environment.
Major proof-of-concept steps:
1. Create one Data Factory pipeline with a single copy activity to copy several TBs of data from Data Lake
Storage Gen1 to Data Lake Storage Gen2 to get a copy performance baseline. Start with data integration
units (DIUs) as 128.
2. Based on the copy throughput you get in step 1, calculate the estimated time that's required for the entire
data migration (a quick arithmetic sketch follows this list).
3. (Optional) Create a control table and define the file filter to partition the files to be migrated. The way to
partition the files is to:
Partition by folder name or folder name with a wildcard filter. We recommend this method.
Partition by a file's last modified time.
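
For step 2, a back-of-the-envelope calculation is usually enough. The PowerShell sketch below shows the arithmetic;
the data volume and throughput values are placeholders that you replace with your proof-of-concept measurements.

# Rough estimate of total migration time from the proof-of-concept baseline.
$totalDataTB    = 100   # total data to migrate, in TB
$observedGbps   = 2     # copy throughput measured in step 1, in gigabits per second
$estimatedHours = ($totalDataTB * 8 * 1024) / ($observedGbps * 3600)
Write-Host ("Estimated migration time: {0:N1} hours" -f $estimatedHours)
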
Network bandwidth and storage I/O
You can control the concurrency of Data Factory copy jobs that read data from Data Lake Storage Gen1 and
write data to Data Lake Storage Gen2. In this way, you can manage the use on that storage I/O to avoid affecting
the normal business work on Data Lake Storage Gen1 during the migration.
Permissions
In Data Factory, the Data Lake Storage Gen1 connector supports service principal and managed identity for
Azure resource authentications. The Data Lake Storage Gen2 connector supports account key, service principal,
and managed identity for Azure resource authentications. To make Data Factory able to navigate and copy all the
files or access control lists (ACLs) you need, grant high enough permissions for the account you provide to
access, read, or write all files and set ACLs if you choose to. Grant it a super-user or owner role during the
migration period.
Preserve ACLs from Data Lake Storage Gen1
If you want to replicate the ACLs along with data files when you upgrade from Data Lake Storage Gen1 to Data
Lake Storage Gen2, see Preserve ACLs from Data Lake Storage Gen1.
Incremental copy
You can use several approaches to load only the new or updated files from Data Lake Storage Gen1:
Load new or updated files by time partitioned folder or file name. An example is /2019/05/13/*.
Load new or updated files by LastModifiedDate.
Identify new or updated files by any third-party tool or solution. Then pass the file or folder name to the Data
Factory pipeline via parameter or a table or file.
The proper frequency to do incremental load depends on the total number of files in Azure Data Lake Storage
Gen1 and the volume of new or updated files to be loaded every time.
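
One way to wire up the time-partitioned approach is to parameterize the folder path in your copy pipeline and pass
the partition value when you trigger a run. The sketch below assumes a pipeline named CopyGen1ToGen2 with a
folderPath parameter (both names are hypothetical) and uses the -Parameter hashtable form of
Invoke-AzDataFactoryV2Pipeline.

# Trigger an incremental run for a single time-partitioned folder.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroup>" `
    -DataFactoryName "<dataFactory>" -PipelineName "CopyGen1ToGen2" `
    -Parameter @{ "folderPath" = "2019/05/13" }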

Next steps
Copy activity overview
Azure Data Lake Storage Gen1 connector
Azure Data Lake Storage Gen2 connector
Load data into Azure Synapse Analytics by using
Azure Data Factory
7/7/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Synapse Analytics is a cloud-based, scale-out database that's capable of processing massive volumes of
data, both relational and non-relational. Azure Synapse Analytics is built on the massively parallel processing
(MPP) architecture that's optimized for enterprise data warehouse workloads. It offers cloud elasticity with the
flexibility to scale storage and compute independently.
Getting started with Azure Synapse Analytics is now easier than ever when you use Azure Data Factory. Azure
Data Factory is a fully managed cloud-based data integration service. You can use the service to populate an
Azure Synapse Analytics with data from your existing system and save time when building your analytics
solutions.
Azure Data Factory offers the following benefits for loading data into Azure Synapse Analytics:
Easy to set up : An intuitive 5-step wizard with no scripting required.
Rich data store support : Built-in support for a rich set of on-premises and cloud-based data stores. For a
detailed list, see the table of Supported data stores.
Secure and compliant : Data is transferred over HTTPS or ExpressRoute. The global service presence
ensures that your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase : PolyBase is the most efficient way to move data into
Azure Synapse Analytics. Use the staging blob feature to achieve high load speeds from all types of data
stores, including Azure Blob storage and Data Lake Store. (PolyBase supports Azure Blob storage and Azure
Data Lake Store by default.) For details, see Copy activity performance.
This article shows you how to use the Data Factory Copy Data tool to load data from Azure SQL Database into
Azure Synapse Analytics. You can follow similar steps to copy data from other types of data stores.

NOTE
For more information, see Copy data to or from Azure Synapse Analytics by using Azure Data Factory.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Synapse Analytics: The data warehouse holds the data that's copied over from the SQL database. If you don't have an Azure Synapse Analytics instance, see the instructions in Create an Azure Synapse Analytics.
Azure SQL Database: This tutorial copies data from the Adventure Works LT sample dataset in Azure SQL
Database. You can create this sample database in SQL Database by following the instructions in Create a
sample database in Azure SQL Database.
Azure storage account: Azure Storage is used as the staging blob in the bulk copy operation. If you don't have
an Azure storage account, see the instructions in Create a storage account.

Create a data factory


1. On the left menu, select Create a resource > Data + Analytics > Data Factory:
2. On the New data factory page, provide values for the following items:
Name: Enter LoadSQLDWDemo for the name. The name for your data factory must be globally unique. If you receive the error "Data factory name 'LoadSQLDWDemo' is not available", enter a different name for the data factory. For example, you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These
data stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the following image:

Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in a separate tab.

Load data into Azure Synapse Analytics


1. In the home page of Azure Data Factory, select the Ingest tile to launch the Copy Data tool.
2. In the Properties page, specify CopyFromSQLToSQLDW for the Task name field, and select Next.
3. In the Source data store page, complete the following steps:

TIP
In this tutorial, you use SQL authentication as the authentication type for your source data store, but you can choose other supported authentication methods, such as Service Principal and Managed Identity, if needed. Refer to the corresponding sections in this article for details. To store secrets for data stores securely, we also recommend using Azure Key Vault. Refer to this article for detailed illustrations.

a. Click + Create new connection.


b. Select Azure SQL Database from the gallery, and select Continue . You can type "SQL" in the search
box to filter the connectors.
c. In the New Linked Service page, select your server name and DB name from the dropdown list, and specify the username and password. Click Test connection to validate the settings, then select Create.
d. Select the newly created linked service as source, then click Next .
4. In the Select tables from which to copy the data or use a custom quer y page, enter SalesLT to
filter the tables. Choose the (Select all) box to use all of the tables for the copy, and then select Next .
5. In the Apply filter page, specify your settings or select Next .
6. In the Destination data store page, complete the following steps:

TIP
In this tutorial, you use SQL authentication as the authentication type for your destination data store, but you can choose other supported authentication methods, such as Service Principal and Managed Identity, if needed. Refer to the corresponding sections in this article for details. To store secrets for data stores securely, we also recommend using Azure Key Vault. Refer to this article for detailed illustrations.

a. Click + Create new connection to add a connection


b. Select Azure Synapse Analytics from the gallery, and select Continue .
c. In the New Linked Service page, select your server name and DB name from the dropdown list, and specify the username and password. Click Test connection to validate the settings, then select Create.
d. Select the newly created linked service as sink, then click Next .
7. In the Table mapping page, review the content, and select Next . An intelligent table mapping displays.
The source tables are mapped to the destination tables based on the table names. If a source table doesn't
exist in the destination, Azure Data Factory creates a destination table with the same name by default. You
can also map a source table to an existing destination table.
8. In the Column mapping page, review the content, and select Next . The intelligent table mapping is
based on the column name. If you let Data Factory automatically create the tables, data type conversion
can occur when there are incompatibilities between the source and destination stores. If there's an
unsupported data type conversion between the source and destination column, you see an error
message next to the corresponding table.

9. In the Settings page, complete the following steps:


a. In the Staging settings section, click + New to create a staging storage linked service. The storage is used for staging the data before it loads into Azure Synapse Analytics by using PolyBase. After the copy is complete, the interim data in Azure Blob Storage is automatically cleaned up. (A JSON sketch of these staging settings appears after these steps.)
b. In the New Linked Service page, select your storage account, and select Create to deploy the linked service.
c. Deselect the Use type default option, and then select Next.

10. In the Summar y page, review the settings, and select Next .
11. On the Deployment page , select Monitor to monitor the pipeline (task).
12. Notice that the Monitor tab on the left is automatically selected. When the pipeline run completes
successfully, select the CopyFromSQLToSQLDW link under the PIPELINE NAME column to view
activity run details or to rerun the pipeline.

13. To switch back to the pipeline runs view, select the All pipeline runs link at the top. Select Refresh to
refresh the list.

14. To monitor the execution details for each copy activity, select the Details link (eyeglasses icon) under
ACTIVITY NAME in the activity runs view. You can monitor details like the volume of data copied from
the source to the sink, data throughput, execution steps with corresponding duration, and used
configurations.
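For reference, the staging and PolyBase settings from step 9 correspond roughly to the following copy activity sink fragment. This is a sketch rather than the exact JSON the tool generates, and it assumes a Blob storage linked service named StagingBlobStorageLinkedService (an illustrative name); the same pattern appears in the bulk-copy pipeline JSON later in this document.

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
    "linkedServiceName": {
        "referenceName": "StagingBlobStorageLinkedService",
        "type": "LinkedServiceReference"
    }
}

With enableStaging set to true, the service first copies the source data to the staging Blob storage account and then loads it into Azure Synapse Analytics with PolyBase; the interim data is cleaned up automatically after the copy completes.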
Next steps
Advance to the following article to learn about Azure Synapse Analytics support:
Azure Synapse Analytics connector
Copy data from SAP Business Warehouse by using
Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW) via Open
Hub to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data
stores.

TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction
flow, see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.

Prerequisites
Azure Data Factor y : If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD) with destination type "Database Table" : To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions :
Authorization for Remote Function Calls (RFC) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0 . Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Open on the Open Azure Data Factory Studio tile to open the Data Factory UI in a separate tab.
1. On the home page, select Ingest to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection . Select SAP BW Open Hub from
the connector gallery, and then select Continue . To filter the connectors, you can type SAP in the search
box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New , and then select Self-hosted . Enter a Name , and then
select Next . Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN), User name, and Password.
c. Select Test connection to validate the settings, and then select Finish .
d. A new connection is created. Select Next .
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next .
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP) execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data, clear the Exclude Last Request check box. A JSON sketch of these source settings appears after these steps.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this article. Select Validate to double-check what data will be returned. Then select Next.

7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue .
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next .
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name.
Then select Next .
10. On the File format setting page, select Next to use the default settings.

11. On the Settings page, expand Performance settings . Enter a value for Degree of copy parallelism
such as 5 to load from SAP BW in parallel. Then select Next .
12. On the Summar y page, review the settings. Then select Next .
13. On the Deployment page, select Monitor to monitor the pipeline.

14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back
to the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.

17. To view the maximum Request ID , go back to the activity-monitoring view and select Output under
Actions .
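For reference, the following fragment is a rough sketch of what the Exclude Last Request and request ID filter from step 6 look like in the copy activity source JSON. The property names shown here (excludeLastRequest, baseRequestId) come from the SAP BW Open Hub connector, but confirm their exact placement and types against the connector article before relying on them.

"source": {
    "type": "SapOpenHubSource",
    "excludeLastRequest": true,
    "baseRequestId": 0
}

Setting excludeLastRequest to false corresponds to clearing the Exclude Last Request check box; baseRequestId is what the incremental-copy template uses to skip requests that were already copied.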

Incremental copy from SAP BW Open Hub


TIP
See SAP BW Open Hub connector delta extraction flow to learn how the SAP BW Open Hub connector in Data Factory
copies incremental data from SAP BW. This article can also help you understand basic connector configuration.
Now, let's continue to configure incremental copy from SAP BW Open Hub.
Incremental copy uses a "high-watermark" mechanism that's based on the request ID . That ID is automatically
generated in SAP BW Open Hub Destination by the DTP. The following diagram shows this workflow:

On the data factory home page, select Pipeline templates in the Discover more section to use the built-in
template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage : In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub : This is the source to copy data from. Refer to the previous full-copy
walkthrough for detailed configuration.
Azure Data Lake Storage Gen2 : This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.

3. This template generates a pipeline with the following three activities and chains them on success: Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName : Specify the Open Hub table name to copy data from.
Data_Destination_Container : Specify the destination Azure Data Lake Storage Gen2 container
to copy data to. If the container doesn't exist, the Data Factory copy activity creates one during
execution.
Data_Destination_Director y : Specify the folder path under the Azure Data Lake Storage Gen2
container to copy data to. If the path doesn't exist, the Data Factory copy activity creates a path
during execution.
HighWatermarkBlobContainer : Specify the container to store the high-watermark value.
HighWatermarkBlobDirector y : Specify the folder path under the container to store the high-
watermark value.
HighWatermarkBlobName : Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobContainer+HighWatermarkBlobDirectory+HighWatermarkBlobName, such as
container/path/requestIdCache.txt. Create a blob with content 0.

LogicAppURL : In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST
URL .

a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go
to Logic Apps Designer .
b. Create a trigger of When an HTTP request is received . Specify the HTTP request body as
follows:

{
    "properties": {
        "sapOpenHubMaxRequestId": {
            "type": "string"
        }
    },
    "type": "object"
}

c. Add a Create blob action. For Folder path and Blob name , use the same values that you
configured previously in HighWatermarkBlobContainer+HighWatermarkBlobDirectory and
HighWatermarkBlobName.
d. Select Save. Then, copy the value of HTTP POST URL to use in the Data Factory pipeline. (An example of the request body that the pipeline posts to this URL appears after these steps.)
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to
validate the configuration. Or, select Publish to publish all the changes, and then select Add trigger to
execute a run.
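To illustrate how the high-watermark value is written back, the following is a rough sketch of a Web activity that posts the maximum request ID to the logic app. The activity name, the copy activity name, and the output property name are illustrative only; the template wires up the real ones, and you can read the actual output property from the copy activity output shown in step 17 of the full-copy walkthrough.

{
    "name": "UpdateHighWatermark",
    "type": "WebActivity",
    "typeProperties": {
        "url": "<HTTP POST URL copied from the logic app>",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "sapOpenHubMaxRequestId": "@{activity('CopyFromSAPBWOpenHub').output.sapOpenHubMaxRequestId}"
        }
    }
}

The body matches the schema defined in the When an HTTP request is received trigger, so the Create blob action can overwrite requestIdCache.txt with the new watermark.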

SAP BW Open Hub Destination configurations


This section introduces the configuration needed on the SAP BW side to use the SAP BW Open Hub connector in Data Factory to copy data.
Configure delta extraction in SAP BW
If you need both historical copy and incremental copy or only incremental copy, configure delta extraction in
SAP BW.
1. Create the Open Hub Destination. You can create the OHD in SAP Transaction RSA1, which automatically
creates the required transformation and data-transfer process. Use the following settings:
ObjectType : You can use any object type. Here, we use InfoCube as an example.
Destination Type : Select Database Table .
Key of the Table : Select Technical Key .
Extraction: Select Keep Data and Insert Records into Table.

You might increase the number of parallel running SAP work processes for the DTP:

2. Schedule the DTP in process chains.


A delta DTP for a cube only works if the necessary rows haven't been compressed. Make sure that BW
cube compression isn't running before the DTP to the Open Hub table. The easiest way to do this is to
integrate the DTP into your existing process chains. In the following example, the DTP (to the OHD) is
inserted into the process chain between the Adjust (aggregate rollup) and Collapse (cube compression)
steps.

Configure full extraction in SAP BW


In addition to delta extraction, you might want a full extraction of the same SAP BW InfoProvider. This usually applies if you want to do a full copy but not an incremental one, or if you want to resync delta extraction.
You can't have more than one DTP for the same OHD. So, you must create an additional OHD before delta
extraction.

For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records. Otherwise, data will be extracted many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full . You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request . Otherwise, nothing will
be extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy
activity until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways
to avoid this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched .
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched , you can use the following option to run the delta DTP manually:
No Data Transfer; Delta Status in Source: Fetched

Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Load data from Office 365 by using Azure Data
Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article shows you how to use Data Factory to load data from Office 365 into Azure Blob storage. You can follow similar steps to copy data to Azure Data Lake Storage Gen1 or Gen2. Refer to the Office 365 connector article for general information about copying data from Office 365.

Create a data factory


1. On the left menu, select Create a resource > Analytics > Data Factory:

2. In the New data factory page, provide values for the fields that are shown in the following image:
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory name LoadFromOffice365Demo is not available", enter a different name for the data factory. For example, you could use the name yournameLoadFromOffice365Demo. Try creating the data factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These
data stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the following image:
5. Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in a separate tab.

Create a pipeline
1. On the home page, select Orchestrate .

2. In the General tab for the pipeline, enter "CopyPipeline" for Name of the pipeline.
3. In the Activities tool box > Move & Transform category > drag and drop the Copy activity from the tool
box to the pipeline designer surface. Specify "CopyFromOffice365ToBlob" as activity name.
Configure source
1. Go to the pipeline > Source tab , click + New to create a source dataset.
2. In the New Dataset window, select Office 365 , and then select Continue .
3. You are now in the copy activity configuration tab. Click on the Edit button next to the Office 365 dataset
to continue the data configuration.
4. You see a new tab opened for Office 365 dataset. In the General tab at the bottom of the Properties
window, enter "SourceOffice365Dataset" for Name.
5. Go to the Connection tab of the Properties window. Next to the Linked service text box, click + New .
6. In the New Linked Service window, enter "Office365LinkedService" as name, enter the service principal ID
and service principal key, then test connection and select Create to deploy the linked service.

7. After the linked service is created, you are back in the dataset settings. Next to Table , choose the down-
arrow to expand the list of available Office 365 datasets, and choose "BasicDataSet_v0.Message_v0" from
the drop-down list:

8. Now go back to the pipeline > Source tab to continue configuring additional properties for Office 365
data extraction. User scope and user scope filter are optional predicates that you can define to restrict the
data you want to extract out of Office 365. See Office 365 dataset properties section for how you
configure these settings.
9. You are required to choose one of the date filters and provide the start time and end time values. A sketch of how these filters appear in the copy activity source JSON follows this list.
10. Click on the Import Schema tab to import the schema for the Message dataset.
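The following fragment is a rough sketch of the Office 365 source settings that the steps above configure, assuming the Message dataset and a date filter on a created-date column. The column name, time window, and user scope filter URI are placeholders; check the Office 365 connector article for the exact property names and the filterable columns of each dataset.

"source": {
    "type": "Office365Source",
    "dateFilterColumn": "CreatedDateTime",
    "startTime": "2021-06-01T00:00:00Z",
    "endTime": "2021-07-01T00:00:00Z",
    "userScopeFilterUri": "https://graph.microsoft.com/v1.0/users?$filter=Department eq 'Finance'"
}

The userScopeFilterUri predicate is optional; omit it to extract data for all users in the tenant.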
Configure sink
1. Go to the pipeline > Sink tab , and select + New to create a sink dataset.
2. In the New Dataset window, notice that only the supported destinations are selected when copying from
Office 365. Select Azure Blob Storage , select Binary format, and then select Continue . In this tutorial,
you copy Office 365 data into an Azure Blob Storage.
3. Click on Edit button next to the Azure Blob Storage dataset to continue the data configuration.
4. On the General tab of the Properties window, in Name, enter "OutputBlobDataset".
5. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New .
6. In the New Linked Service window, enter "AzureStorageLinkedService" as name, select "Service Principal"
from the dropdown list of authentication methods, fill in the Service Endpoint, Tenant, Service principal
ID, and Service principal key, then select Save to deploy the linked service. Refer here for how to set up
service principal authentication for Azure Blob Storage.
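For reference, the linked service created in this step corresponds roughly to the following JSON, assuming service principal authentication for Azure Blob Storage. The placeholder values are yours to fill in; treat this as a sketch and confirm the property names against the Azure Blob Storage connector article.

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "serviceEndpoint": "https://<storageaccountname>.blob.core.windows.net",
            "servicePrincipalId": "<service principal application ID>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant ID>"
        }
    }
}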

Validate the pipeline


To validate the pipeline, select Validate from the tool bar.
You can also see the JSON code associated with the pipeline by clicking Code on the upper-right.

Publish the pipeline


In the top toolbar, select Publish All . This action publishes entities (datasets, and pipelines) you created to Data
Factory.

Trigger the pipeline manually


Select Add Trigger on the toolbar, and then select Trigger Now . On the Pipeline Run page, select Finish .

Monitor the pipeline


Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger. You can use links
in the Actions column to view activity details and to rerun the pipeline.

To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions column.
In this example, there is only one activity, so you see only one entry in the list. For details about the copy
operation, select the Details link (eyeglasses icon) in the Actions column.

If this is the first time you are requesting data for this context (a combination of which data table is being accessed, which destination account the data is being loaded into, and which user identity is making the data access request), you will see the copy activity status as In Progress. Only when you click the Details link under Actions will you see the status RequestingConsent. A member of the data access approver group needs to approve the request in Privileged Access Management before the data extraction can proceed.
Status as requesting consent:
Status as extracting data:

Once the consent is provided, data extraction will continue and, after some time, the pipeline run will show as
succeeded.

Now go to the destination Azure Blob Storage and verify that Office 365 data has been extracted in Binary
format.

Next steps
Advance to the following article to learn about Office 365 connector support:
Office 365 connector
Copy multiple tables in bulk by using Azure Data
Factory in the Azure portal

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure Synapse Analytics. You can apply the same pattern in other copy scenarios as well, for example, copying tables from SQL Server or Oracle to Azure SQL Database, Azure Synapse Analytics, or Azure Blob storage, or copying different paths from Blob storage to Azure SQL Database tables.

NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.

At a high level, this tutorial involves following steps:


Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure portal. To learn about using other tools/SDKs to create a data factory, see Quickstarts.

End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to Azure Synapse
Analytics. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively, you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list, it copies the specific table in Azure SQL Database to the corresponding table in Azure Synapse Analytics by using staged copy via Blob storage and PolyBase for best performance. In this example, the first pipeline passes the list of tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account . The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database . This database contains the source data. Create a database in SQL Database with
Adventure Works LT sample data following Create a database in Azure SQL Database article. This tutorial
copies all the tables from this sample database to an Azure Synapse Analytics.
Azure Synapse Analytics . This data warehouse holds the data copied over from the SQL Database. If you
don't have an Azure Synapse Analytics workspace, see the Get started with Azure Synapse Analytics article
for steps to create one.

Azure services to access SQL server


For both SQL Database and Azure Synapse Analytics, allow Azure services to access SQL server. Ensure that the Allow Azure services and resources to access this server setting is turned ON for your server. This setting allows the Data Factory service to read data from your Azure SQL Database and write data to your Azure Synapse Analytics.
To verify and turn on this setting, go to your server > Security > Firewalls and virtual networks > set Allow Azure services and resources to access this server to ON.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure portal.
3. On the left of the Azure portal menu, select Create a resource > Integration > Data Factory.
4. On the New data factory page, enter ADFTutorialBulkCopyDF for the name.
The name of the Azure data factory must be globally unique. If you see the following error for the name field, change the name of the data factory (for example, yournameADFTutorialBulkCopyDF). See the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFTutorialBulkCopyDF" is not available

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version .
8. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.
9. Click Create.
10. After the creation is complete, select Go to resource to navigate to the Data Factory page.
11. Select Open on the Open Azure Data Factory Studio tile to launch the Data Factory UI application in a separate tab.

Create linked services


You create linked services to link your data stores and computes to a data factory. A linked service has the
connection information that the Data Factory service uses to connect to the data store at runtime.
In this tutorial, you link your Azure SQL Database, Azure Synapse Analytics, and Azure Blob Storage data stores
to your data factory. The Azure SQL Database is the source data store. The Azure Synapse Analytics is the
sink/destination data store. The Azure Blob Storage is to stage the data before the data is loaded into Azure
Synapse Analytics by using PolyBase.
Create the source Azure SQL Database linked service
In this step, you create a linked service to link your database in Azure SQL Database to the data factory.
1. Open Manage tab from the left pane.
2. On the Linked services page, select +New to create a new linked service.
3. In the New Linked Service window, select Azure SQL Database, and click Continue.
4. In the New Linked Service (Azure SQL Database) window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your server for Server name.
c. Select your database for Database name.
d. Enter the name of the user to connect to your database.
e. Enter the password for the user.
f. To test the connection to your database using the specified information, click Test connection.
g. Click Create to save the linked service.
Create the sink Azure Synapse Analytics linked service
1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure Synapse Analytics, and click Continue.
3. In the New Linked Service (Azure Synapse Analytics) window, do the following steps:
a. Enter AzureSqlDWLinkedService for Name.
b. Select your server for Server name.
c. Select your database for Database name.
d. Enter User name to connect to your database.
e. Enter Password for the user.
f. To test the connection to your database using the specified information, click Test connection.
g. Click Create.
Create the staging Azure Storage linked service
In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. In the Connections tab, click + New on the toolbar again.
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service (Azure Blob Storage) window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Create.

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedService. The linked service specifies the connection string to connect to the database. The dataset specifies the name of the database and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedService. The linked service specifies the connection string to connect to the Azure Synapse Analytics. The dataset specifies the database and the table to which the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Select Author tab from the left pane.
2. Select the + (plus) in the left pane, and then select Dataset .

3. In the New Dataset window, select Azure SQL Database , and then click Continue .
4. In the Set properties window, under Name, enter AzureSqlDatabaseDataset. Under Linked service, select AzureSqlDatabaseLinkedService. Then click OK.
5. Switch to the Connection tab, and select any table for Table. This table is a dummy table. You specify a query on the source dataset when creating a pipeline. The query is used to extract data from your database. Alternatively, you can select the Edit check box and enter dbo.dummyName as the table name.
Create a dataset for sink Azure Synapse Analytics
1. Click + (plus) in the left pane, and click Dataset .
2. In the New Dataset window, select Azure Synapse Analytics , and then click Continue .
3. In the Set properties window, under Name, enter AzureSqlDWDataset. Under Linked service, select AzureSqlDWLinkedService. Then click OK.
4. Switch to the Parameters tab, click + New , and enter DWTableName for the parameter name. Click +
New again, and enter DWSchema for the parameter name. If you copy/paste this name from the page,
ensure that there's no trailing space character at the end of DWTableName and DWSchema.
5. Switch to the Connection tab,
a. For Table, check the Edit option. Click inside the first input box and click the Add dynamic content link below it. In the Add Dynamic Content page, click DWSchema under Parameters, which automatically populates the top expression text box with @dataset().DWSchema, and then click Finish.

b. Click inside the second input box and click the Add dynamic content link below it. In the Add Dynamic Content page, click DWTableName under Parameters, which automatically populates the top expression text box with @dataset().DWTableName, and then click Finish.
c. The tableName property of the dataset is set to the values that are passed as arguments for the
DWSchema and DWTableName parameters. The ForEach activity iterates through a list of tables,
and passes one by one to the Copy activity.
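Put together, the dataset you just created corresponds roughly to the following JSON. This is a sketch of what the UI generates; the exact type properties (for example, schema and table versus a single tableName) can differ depending on the dataset version, so compare it with the JSON code view in the UI if you need the authoritative form.

{
    "name": "AzureSqlDWDataset",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDWLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "DWSchema": { "type": "String" },
            "DWTableName": { "type": "String" }
        },
        "typeProperties": {
            "schema": {
                "value": "@dataset().DWSchema",
                "type": "Expression"
            },
            "table": {
                "value": "@dataset().DWTableName",
                "type": "Expression"
            }
        }
    }
}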

Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData .
The GetTableListAndTriggerCopyData pipeline performs two actions:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables pipeline takes a list of tables as a parameter. For each table in the list, it copies
data from the table in Azure SQL Database to Azure Synapse Analytics using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus) , and click Pipeline .

2. In the General panel under Properties, specify IterateAndCopySQLTables for Name. Then collapse the panel by clicking the Properties icon in the top-right corner.
3. Switch to the Parameters tab, and do the following actions:
a. Click + New .
b. Enter tableList for the parameter Name .
c. Select Array for Type .
4. In the Activities toolbox, expand Iteration & Conditions , and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name .
b. Switch to the Settings tab, click the input box for Items , then click the Add dynamic content link
below.
c. In the Add Dynamic Content page, collapse the System Variables and Functions sections, and click tableList under Parameters, which automatically populates the top expression text box with @pipeline().parameters.tableList. Then click Finish.
d. Switch to Activities tab, click the pencil icon to add a child activity to the ForEach activity.

5. In the Activities toolbox, expand Move & Transform, and drag-drop the Copy data activity onto the pipeline designer surface. Notice the breadcrumb menu at the top. IterateAndCopySQLTables is the pipeline name and IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch back to the pipeline editor from the ForEach editor, you can click the link in the breadcrumb menu.

6. Switch to the Source tab, and do the following steps:


a. Select AzureSqlDatabaseDataset for Source Dataset .
b. Select Query option for Use query.
c. Click the Query input box -> select the Add dynamic content below -> enter the following expression for Query -> select Finish.

SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

7. Switch to the Sink tab, and do the following steps:


a. Select AzureSqlDWDataset for Sink Dataset .
b. Click the input box for the VALUE of DWTableName parameter -> select the Add dynamic
content below, enter @item().TABLE_NAME expression as script, -> select Finish .
c. Click the input box for the VALUE of DWSchema parameter -> select the Add dynamic content
below, enter @item().TABLE_SCHEMA expression as script, -> select Finish .
d. For Copy method, select PolyBase .
e. Clear the Use type default option.
f. For Table option, the default setting is "None". If you don’t have tables pre-created in the sink
Azure Synapse Analytics, enable Auto create table option, copy activity will then automatically
create tables for you based on the source data. For details, refer to Auto create sink tables.
g. Click the Pre-copy Script input box -> select the Add dynamic content below -> enter the
following expression as script -> select Finish .

IF EXISTS (SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]) TRUNCATE TABLE [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]
8. Switch to the Settings tab, and do the following steps:
a. Select the checkbox for Enable Staging .
b. Select AzureStorageLinkedSer vice for Store Account Linked Ser vice .
9. To validate the pipeline settings, click Validate on the top pipeline tool bar. Make sure that there's no
validation error. To close the Pipeline Validation Repor t , click the double angle brackets >> .
Create the pipeline GetTableListAndTriggerCopyData
This pipeline does two actions:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
Here are the steps to create the pipeline:
1. In the left pane, click + (plus) , and click Pipeline .
2. In the General panel under Proper ties , change the name of the pipeline to
GetTableListAndTriggerCopyData .
3. In the Activities toolbox, expand General , and drag-drop Lookup activity to the pipeline designer
surface, and do the following steps:
a. Enter LookupTableList for Name .
b. Enter Retrieve the table list from my database for Description .
4. Switch to the Settings tab, and do the following steps:
a. Select AzureSqlDatabaseDataset for Source Dataset .
b. Select Query for Use query.
c. Enter the following SQL query for Query.
SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE'
and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <> 'ProductModel'

d. Clear the checkbox for the First row only field.

5. Drag-drop Execute Pipeline activity from the Activities toolbox to the pipeline designer surface, and set
the name to TriggerCopy .
6. To connect the Lookup activity to the Execute Pipeline activity, drag the green box attached to the Lookup activity to the left of the Execute Pipeline activity.

7. Switch to the Settings tab of Execute Pipeline activity, and do the following steps:
a. Select IterateAndCopySQLTables for Invoked pipeline .
b. Clear the checkbox for Wait on completion .
c. In the Parameters section, click the input box under VALUE -> select the Add dynamic content below -> enter @activity('LookupTableList').output.value as the value -> select Finish.
You're setting the result list from the Lookup activity as an input to the second pipeline. The result list contains the tables whose data needs to be copied to the destination. (A JSON sketch of this Execute Pipeline activity appears after these steps.)

8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Repor t , click >> .
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish all on top of the
window. Wait until the publishing succeeds.
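For reference, the Execute Pipeline activity configured in step 7 corresponds roughly to the following JSON. This is a sketch of what the UI generates, not an authoritative definition; compare it with the pipeline's JSON code view if anything differs.

{
    "name": "TriggerCopy",
    "type": "ExecutePipeline",
    "dependsOn": [
        {
            "activity": "LookupTableList",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "pipeline": {
            "referenceName": "IterateAndCopySQLTables",
            "type": "PipelineReference"
        },
        "waitOnCompletion": false,
        "parameters": {
            "tableList": {
                "value": "@activity('LookupTableList').output.value",
                "type": "Expression"
            }
        }
    }
}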

Trigger a pipeline run


1. Go to pipeline GetTableListAndTriggerCopyData , click Add Trigger on the top pipeline tool bar, and
then click Trigger now .
2. Confirm the run on the Pipeline run page, and then select Finish .

Monitor the pipeline run


1. Switch to the Monitor tab. Click Refresh until you see runs for both the pipelines in your solution.
Continue refreshing the list until you see the Succeeded status.
2. To view activity runs associated with the GetTableListAndTriggerCopyData pipeline, click the pipeline
name link for the pipeline. You should see two activity runs for this pipeline run.

3. To view the output of the Lookup activity, click the Output link next to the activity under the ACTIVITY
NAME column. You can maximize and restore the Output window. After reviewing, click X to close the
Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}

4. To switch back to the Pipeline Runs view, click All Pipeline runs link at the top of the breadcrumb
menu. Click IterateAndCopySQLTables link (under PIPELINE NAME column) to view activity runs of
the pipeline. Notice that there's one Copy activity run for each table in the Lookup activity output.
5. Confirm that the data was copied to the target Azure Synapse Analytics you used in this tutorial.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copy data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data Factory and PowerShell

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This tutorial demonstrates copying a number of tables from Azure SQL Database to Azure Synapse Analytics. You can apply the same pattern in other copy scenarios as well, for example, copying tables from SQL Server or Oracle to Azure SQL Database, Azure Synapse Analytics, or Azure Blob storage, or copying different paths from Blob storage to Azure SQL Database tables.
At a high level, this tutorial involves following steps:
Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
This tutorial uses Azure PowerShell. To learn about using other tools/SDKs to create a data factory, see
Quickstarts.

End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to Azure Synapse
Analytics. Here is the logical sequence of steps in the workflow that happens in pipelines:

The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively, you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the
pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy
operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the list, it copies the specific table in Azure SQL Database to the corresponding table in Azure Synapse Analytics
using staged copy via Blob storage and PolyBase for best performance. In this example, the first pipeline
passes the list of tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account . The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database . This database contains the source data.
Azure Synapse Analytics . This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and Azure Synapse Analytics
Prepare the source Azure SQL Database :
Create a database with the Adventure Works LT sample data in SQL Database by following Create a database in
Azure SQL Database article. This tutorial copies all the tables from this sample database to Azure Synapse
Analytics.
Prepare the sink Azure Synapse Analytics :
1. If you don't have an Azure Synapse Analytics workspace, see the Get started with Azure Synapse
Analytics article for steps to create one.
2. Create corresponding table schemas in Azure Synapse Analytics. You use Azure Data Factory to
migrate/copy data in a later step.

Azure services to access SQL server


For both SQL Database and Azure Synapse Analytics, allow Azure services to access SQL server. Ensure that the Allow access to Azure services setting is turned ON for your server. This setting allows the Data Factory service to read data from your Azure SQL Database and write data to Azure Synapse Analytics. To verify and turn on this setting, do the following steps:
1. Click All services on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings page, click ON for Allow access to Azure services.

Create a data factory


1. Launch PowerShell . Keep Azure PowerShell open until the end of this tutorial. If you close and reopen,
you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace the placeholders with your own values before executing the command.

$resourceGroupName = "<your resource group to create the factory>"
$dataFactoryName = "<specify the name of data factory to create. It must be globally unique.>"
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error,
change the name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, you must be a Contributor or Administrator of the Azure
subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products
available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes
(HDInsight, etc.) used by data factory can be in other regions.

Create linked services


In this tutorial, you create three linked services for source, sink, and staging blob respectively, which includes
connections to your data stores:
Create the source Azure SQL Database linked service
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in the C:\ADFv2TutorialBulkCopy folder with the following content. (Create the folder ADFv2TutorialBulkCopy if it doesn't already exist.)

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.

{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

2. In Azure PowerShell , switch to the ADFv2TutorialBulkCopy folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service AzureSqlDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseLinkedService" -File ".\AzureSqlDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create the sink Azure Synapse Analytics linked service


1. Create a JSON file named AzureSqlDWLinkedService.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your Azure Synapse Analytics before saving the file.

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

2. To create the linked service AzureSqlDWLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWLinkedService" -File ".\AzureSqlDWLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService

Create the staging Azure Storage linked service


In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase for a better copy
performance.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2TutorialBulkCopy
folder, with the following content:
IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before saving the
file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}

2. To create the linked service AzureStorageLinkedService, run the Set-AzDataFactoryV2LinkedService cmdlet.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The "tableName" is a dummy value, because you later use a SQL query in the copy activity to retrieve the data.

{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}

2. To create the dataset AzureSqlDatabaseDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseDataset" -File ".\AzureSqlDatabaseDataset.json"

Here is the sample output:

DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for sink Azure Synapse Analytics


1. Create a JSON file named AzureSqlDWDataset.json in the C:\ADFv2TutorialBulkCopy folder, with
the following content. The "tableName" is set as a parameter; later, the copy activity that references this
dataset passes the actual value into the dataset.

{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}

2. To create the dataset AzureSqlDWDataset, run the Set-AzDataFactoryV2Dataset cmdlet.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDWDataset" -File ".\AzureSqlDWDataset.json"

Here is the sample output:

DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset

Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure Synapse Analytics using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:

{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from Azure SQL Database to Azure Synapse
Analytics",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}

2. To create the pipeline IterateAndCopySQLTables, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IterateAndCopySQLTables" -File ".\IterateAndCopySQLTables.json"

Here is the sample output:

PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Create the pipeline "GetTableListAndTriggerCopyData"


This pipeline performs two steps:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.
1. Create a JSON file named GetTableListAndTriggerCopyData.json in the C:\ADFv2TutorialBulkCopy
folder, with the following content:
{
"name":"GetTableListAndTriggerCopyData",
"properties":{
"activities":[
{
"name": "LookupTableList",
"description": "Retrieve the table list from Azure SQL dataabse",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM
information_schema.TABLES WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME
<> 'ProductModel'"
},
"dataset": {
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "TriggerCopy",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"tableList": {
"value": "@activity('LookupTableList').output.value",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "IterateAndCopySQLTables",
"type": "PipelineReference"
},
"waitOnCompletion": true
},
"dependsOn": [
{
"activity": "LookupTableList",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}

2. To create the pipeline GetTableListAndTriggerCopyData, run the Set-AzDataFactoryV2Pipeline cmdlet.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "GetTableListAndTriggerCopyData" -File ".\GetTableListAndTriggerCopyData.json"

Here is the sample output:


PipelineName : GetTableListAndTriggerCopyData
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {LookupTableList, TriggerCopy}
Parameters :

Start and monitor pipeline run


1. Start a pipeline run for the main "GetTableListAndTriggerCopyData" pipeline and capture the pipeline run
ID for future monitoring. Behind the scenes, it triggers a run of the pipeline "IterateAndCopySQLTables", as
specified in the ExecutePipeline activity.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName 'GetTableListAndTriggerCopyData'

2. Run the following script to continuously check the run status of pipeline
GetTableListAndTriggerCopyData , and print out the final pipeline run and activity run result.

while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId

if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
Write-Host "Pipeline run details:" -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}

Start-Sleep -Seconds 15
}

$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result

Here is the output of the sample run:


Pipeline run details:
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
RunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
LastUpdated : 9/18/2017 4:08:15 PM
Parameters : {}
RunStart : 9/18/2017 4:06:44 PM
RunEnd : 9/18/2017 4:08:15 PM
DurationInMs : 90637
Status : Succeeded
Message :

Activity run details:


ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : LookupTableList
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {source, dataset, firstRowOnly}
Output : {count, value, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:06:46 PM
ActivityRunEnd : 9/18/2017 4:07:09 PM
DurationInMs : 22995
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}

3. You can get the run ID of the pipeline "IterateAndCopySQLTables" and check the detailed activity run
result as follows.

Write-Host "Pipeline 'IterateAndCopySQLTables' run result:" -foregroundcolor "Yellow"
($result | Where-Object {$_.ActivityName -eq "TriggerCopy"}).Output.ToString()

Here is the output of the sample run:

{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}

$result2 = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId <copy above run ID> -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result2
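If you prefer not to copy the run ID by hand, you can also read it from the TriggerCopy output programmatically. The following is a minimal sketch (not part of the original steps) that reuses the $result variable captured in step 2:

# Parse the child pipeline run ID from the TriggerCopy activity output, then query its activity runs.
$childRunId = (($result | Where-Object {$_.ActivityName -eq "TriggerCopy"}).Output.ToString() | ConvertFrom-Json).pipelineRunId
Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $childRunId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)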

4. Connect to your sink Azure Synapse Analytics and confirm that data has been copied from Azure SQL
Database properly.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copy data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to
a destination data store
3/5/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used
scenario. The tutorials in this section show you different ways of loading data incrementally by using Azure Data
Factory.

Delta data loading from database by using a watermark


In this case, you define a watermark in your source database. A watermark is a column that has the last updated
time stamp or an incrementing key. The delta loading solution loads the changed data between an old
watermark and a new watermark. The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorials:


Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally copy data from multiple tables in a SQL Server instance to Azure SQL Database
For templates, see the following:
Delta copy with control table
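To make the pattern concrete, here is a minimal sketch of the same watermark logic expressed directly in SQL from PowerShell. It assumes the SqlServer module (for Invoke-Sqlcmd) and the data_source_table, watermarktable, and usp_write_watermark objects that the tutorials below create; it only illustrates the pattern that the Data Factory pipelines implement with Lookup, Copy, and Stored Procedure activities.

# A sketch of the watermark pattern (assumes the SqlServer PowerShell module and the
# table and stored procedure names used in the tutorials that follow).
$conn = @{ ServerInstance = "<server>.database.windows.net"; Database = "<database>"; Username = "<user>"; Password = "<password>" }

# 1. Read the old watermark and compute the new one.
$old = (Invoke-Sqlcmd @conn -Query "SELECT WatermarkValue FROM watermarktable WHERE TableName = 'data_source_table'").WatermarkValue.ToString('yyyy-MM-dd HH:mm:ss.fff')
$new = (Invoke-Sqlcmd @conn -Query "SELECT MAX(LastModifytime) AS NewWatermarkvalue FROM data_source_table").NewWatermarkvalue.ToString('yyyy-MM-dd HH:mm:ss.fff')

# 2. Select only the rows between the two watermarks (the Copy activity does this step in ADF).
$delta = Invoke-Sqlcmd @conn -Query "SELECT * FROM data_source_table WHERE LastModifytime > '$old' AND LastModifytime <= '$new'"

# 3. Advance the stored watermark so the next run picks up where this one ended.
Invoke-Sqlcmd @conn -Query "EXEC usp_write_watermark '$new', 'data_source_table'"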

Delta data loading from SQL DB by using the Change Tracking technology
Change Tracking technology is a lightweight solution in SQL Server and Azure SQL Database that provides an
efficient change tracking mechanism for applications. It enables an application to easily identify data that was
inserted, updated, or deleted.
The workflow for this approach is depicted in the following diagram:

For step-by-step instructions, see the following tutorial:


Incrementally copy data from Azure SQL Database to Azure Blob storage by using Change Tracking
technology
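At the SQL level, the mechanism looks roughly like the following sketch. It is only an illustration: it assumes the SqlServer PowerShell module, reuses the $conn splat from the watermark sketch above, and assumes a source table with a primary key (which Change Tracking requires); the data_source_table and PersonID names are example values, and the full Data Factory setup is covered in the tutorial linked above.

# One-time setup: enable Change Tracking on the database and on the table to track.
Invoke-Sqlcmd @conn -Query "ALTER DATABASE CURRENT SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);"
Invoke-Sqlcmd @conn -Query "ALTER TABLE data_source_table ENABLE CHANGE_TRACKING;"

# On each incremental run: read the rows inserted, updated, or deleted after the stored sync version (0 = from the beginning).
Invoke-Sqlcmd @conn -Query "SELECT ct.SYS_CHANGE_OPERATION, ct.PersonID FROM CHANGETABLE(CHANGES data_source_table, 0) AS ct;"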

Loading new and changed files only by using LastModifiedDate


You can copy only the new and changed files, identified by their LastModifiedDate, to the destination store. ADF
scans all the files in the source store, applies a file filter based on LastModifiedDate, and copies only the files
that are new or updated since the last run to the destination store. Be aware that if you let ADF scan a huge
number of files but copy only a few of them to the destination, the run can still take a long time, because the
file scanning itself is time consuming.
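The same filter is easy to picture outside of Data Factory. The following is only a sketch of the idea (not how ADF implements it) using the Az.Storage cmdlets; the container name, the $ctx storage context, and the $lastRunTime timestamp are assumed values.

# Illustration of the LastModifiedDate filter: list blobs and keep the ones changed since the last run.
Get-AzStorageBlob -Container "source" -Context $ctx |
    Where-Object { $_.LastModified -gt $lastRunTime } |
    ForEach-Object { Write-Host "New or changed since last run:" $_.Name }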
For step-by-step instructions, see the following tutorial:
Incrementally copy new and changed files based on LastModifiedDate from Azure Blob storage to Azure Blob
storage
For templates, see the following:
Copy new files by LastModifiedDate

Loading new files only by using a time partitioned folder or file name
You can copy only new files, where the files or folders have already been time partitioned with timeslice information
as part of the file or folder name (for example, /yyyy/mm/dd/file.csv). It is the most performant approach for
incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure
Blob storage

Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from Azure SQL Database
to Azure Blob storage using the Azure portal
7/7/2021 • 13 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure Data Factory with a pipeline that loads delta data from a table in Azure SQL
Database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column . Select one column in the source data store, which can be used to slice
the new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value . In this tutorial, you store the watermark value
in a SQL database.
3. Create a pipeline with the following workflow :
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to
the Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies
the delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next
time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see Create a database in Azure SQL Database for steps to create one.
Azure Storage . You use the blob storage as the sink data store. If you don't have a storage account, see
Create a storage account for steps to create one. Create a container named adftutorial.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(

TableName varchar(255),
WatermarkValue datetime,
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory:
3. In the New data factory page, enter ADFIncCopyTutorialDF for the name.
The name of the Azure Data Factory must be globally unique. If you see a red exclamation mark with
the following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF)
and try creating again. See Data Factory - Naming Rules article for naming rules for Data Factory
artifacts.
Data factory name "ADFIncCopyTutorialDF" is not available
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version .
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, Azure SQL Managed Instance, and so on)
and computes (HDInsight, etc.) used by data factory can be in other regions.
8. Click Create .
9. After the creation is complete, you see the Data Factory page as shown in the image.

10. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. On the home page of Data Factory UI, click the Orchestrate tile.
2. In the General panel under Properties, specify IncrementalCopyPipeline for Name. Then collapse the
panel by clicking the Properties icon in the top-right corner.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand
General , and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the
activity to LookupOldWaterMarkActivity .

4. Switch to the Settings tab, and click + New for Source Dataset . In this step, you create a dataset to
represent data in the watermarktable . This table contains the old watermark that was used in the
previous copy operation.
5. In the New Dataset window, select Azure SQL Database , and click Continue . You see a new window
opened for the dataset.
6. In the Set properties window for the dataset, enter WatermarkDataset for Name.
7. For Linked Service, select New, and then do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. Select your server for Server name.
c. Select your Database name from the dropdown list.
d. Enter your User name & Password.
e. To test the connection to your SQL database, click Test connection.
f. Click Finish.
g. Confirm that AzureSqlDatabaseLinkedService is selected for Linked service.

h. Select Finish .
8. In the Connection tab, select [dbo].[watermarktable] for Table . If you want to preview data in the
table, click Preview data .
9. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
10. In the Activities toolbox, expand General , and drag-drop another Lookup activity to the pipeline
designer surface, and set the name to LookupNewWaterMarkActivity in the General tab of the
properties window. This Lookup activity gets the new watermark value from the table with the source
data to be copied to the destination.
11. In the properties window for the second Lookup activity, switch to the Settings tab, and click New . You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
12. In the New Dataset window, select Azure SQL Database , and click Continue .
13. In the Set properties window, enter SourceDataset for Name. Select
AzureSqlDatabaseLinkedService for Linked service.
14. Select [dbo].[data_source_table] for Table. You specify a query on this dataset later in the tutorial. The
query takes precedence over the table you specify in this step.
15. Select Finish .
16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left. In the properties window for the Lookup activity, confirm that
SourceDataset is selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query. You select only the
maximum value of LastModifytime from data_source_table. Make sure that you have also
checked First row only.

select MAX(LastModifytime) as NewWatermarkvalue from data_source_table


18. In the Activities toolbox, expand Move & Transform , and drag-drop the Copy activity from the
Activities toolbox, and set the name to IncrementalCopyActivity .
19. Connect both Lookup activities to the Copy activity by dragging the green button attached to the
Lookup activities to the Copy activity. Release the mouse button when you see the border color of the
Copy activity changes to blue.

20. Select the Copy activity and confirm that you see the properties for the activity in the Proper ties
window.
21. Switch to the Source tab in the Proper ties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.
select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime
<= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'

22. Switch to the Sink tab, and click + New for the Sink Dataset field.
23. In this tutorial, the sink data store is of type Azure Blob Storage. Therefore, select Azure Blob Storage, and
click Continue in the New Dataset window.
24. In the Select Format window, select the format type of your data, and click Continue .
25. In the Set Properties window, enter SinkDataset for Name. For Linked Service, select + New. In this
step, you create a connection (linked service) to your Azure Blob storage.
26. In the New Linked Service (Azure Blob Storage) window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Test Connection and then click Finish.
27. In the Set Properties window, confirm that AzureStorageLinkedService is selected for Linked
service. Then select Finish.
28. Go to the Connection tab of SinkDataset and do the following steps:
a. For the File path field, enter adftutorial/incrementalcopy . adftutorial is the blob container name
and incrementalcopy is the folder name. This snippet assumes that you have a blob container
named adftutorial in your blob storage. Create the container if it doesn't exist, or set it to the name of
an existing one. Azure Data Factory automatically creates the output folder incrementalcopy if it
does not exist. You can also use the Browse button for the File path to navigate to a folder in a blob
container.
b. For the File part of the File path field, select Add dynamic content [Alt+P] , and then enter
@CONCAT('Incremental-', pipeline().RunId, '.txt') in the opened window. Then select Finish . The file
name is dynamically generated by using the expression. Each pipeline run has a unique ID. The Copy
activity uses the run ID to generate the file name.
29. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left.
30. In the Activities toolbox, expand General , and drag-drop the Stored Procedure activity from the
Activities toolbox to the pipeline designer surface. Connect the green (Success) output of the Copy
activity to the Stored Procedure activity.
31. Select Stored Procedure Activity in the pipeline designer, change its name to
StoredProceduretoWriteWatermarkActivity .
32. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
33. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select usp_write_watermark.
b. To specify values for the stored procedure parameters, click Import parameter, and enter the
following values for the parameters:

NAME                TYPE        VALUE

LastModifiedtime    DateTime    @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}

TableName           String      @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}

34. To validate the pipeline settings, click Validate on the toolbar. Confirm that there are no validation errors.
To close the Pipeline Validation Report window, click >>.
35. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.

Trigger a pipeline run


1. Click Add Trigger on the toolbar, and click Trigger Now .
2. In the Pipeline Run window, select Finish .

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You see the status of the pipeline run triggered by a manual trigger.
You can use links under the PIPELINE NAME column to view run details and to rerun the pipeline.
2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column.
For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME
column. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view,
select Refresh .

Review the results


1. Connect to your Azure Storage Account by using tools such as Azure Storage Explorer. Verify that an
output file is created in the incrementalcopy folder of the adftutorial container.

2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.

1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000

3. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the output:

TABLENAME              WATERMARKVALUE

data_source_table      2017-09-05 8:06:00.000

Add more data to source


Insert new data into your database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in your database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000
Trigger another pipeline run
1. Switch to the Edit tab. Click the pipeline in the tree view if it's not opened in the designer.
2. Click Add Trigger on the toolbar, and click Trigger Now .

Monitor the second pipeline run


1. Switch to the Monitor tab on the left. You see the status of the pipeline run triggered by a manual trigger.
You can use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column.
For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME
column. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view,
select Refresh .

Verify the second output


1. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-<GUID>.txt . Open that file, and you see two rows of records in it.

6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000

2. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable

sample output:

TABLENAME              WATERMARKVALUE

data_source_table      2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
In this tutorial, the pipeline copied data from a single table in SQL Database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in a SQL Server database to SQL Database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from Azure SQL Database
to Azure Blob storage using PowerShell
3/5/2021 • 13 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use Azure Data Factory to create a pipeline that loads delta data from a table in Azure SQL
Database to Azure Blob storage.
You perform the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.

Overview
Here is the high-level solution diagram:

Here are the important steps to create this solution:


1. Select the watermark column . Select one column in the source data store, which can be used to slice
the new or updated records for every run. Normally, the data in this selected column (for example,
last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this
column is used as a watermark.
2. Prepare a data store to store the watermark value .
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following workflow :
The pipeline in this solution has the following activities:
Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to
the Copy activity.
Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies
the delta data from the source data store to Blob storage as a new file.
Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next
time.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see Create a database in Azure SQL Database for steps to create one.
Azure Storage . You use the blob storage as the sink data store. If you don't have a storage account, see
Create a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell . Follow the instructions in Install and configure Azure PowerShell.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer, right-click the database, and choose New
Query.
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:

create table data_source_table


(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table


(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000

Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:

create table watermarktable


(

TableName varchar(255),
WatermarkValue datetime,
);

2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.

INSERT INTO watermarktable


VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable .

Select * from watermarktable

Output:

TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database


Run the following command to create a stored procedure in your SQL database:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.


$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.

$dataFactoryName = "ADFIncCopyTutorialFactory";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Storage, SQL Database, Azure SQL Managed Instance, and so on) and computes (Azure
HDInsight, etc.) used by the data factory can be in other regions.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your storage account and SQL Database.
Create a Storage linked service
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADF folder with the following
content. (Create the folder ADF if it doesn't already exist.) Replace <accountName> and <accountKey> with
the name and key of your storage account before you save the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
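If you don't have the account key at hand, you can look it up with the Az module before you edit the file. This is an optional sketch; it assumes the storage account is in the resource group referenced by $resourceGroupName.

# Optional: read the storage account key instead of copying it from the portal.
$storageKey = (Get-AzStorageAccountKey -ResourceGroupName $resourceGroupName -Name "<accountName>")[0].Value
Write-Host "AccountKey to paste into AzureStorageLinkedService.json:" $storageKey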

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create a SQL Database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the C:\ADF folder with the following
content. (Create the folder ADF if it doesn't already exist.) Replace <server>, <database>, <user id>, and
<password> with the name of your server, database, user ID, and password before you save the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=
<database>; Persist Security Info=False; User ID=<user> ; Password=<password>;
MultipleActiveResultSets = False; Encrypt = True; TrustServerCertificate = False; Connection Timeout
= 30;"
}
}
}

2. In PowerShell, switch to the ADF folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:


LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
ProvisioningState :

Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:

{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different
name.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content:
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incrementalcopy",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the
container if it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is
automatically created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by
using the expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .
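If you prefer to create the container from PowerShell rather than from the portal, the following optional sketch does it, reusing the account name and key that you placed in AzureStorageLinkedService.json.

# Optional: create the adftutorial container if it doesn't exist yet.
$ctx = New-AzStorageContext -StorageAccountName "<accountName>" -StorageAccountKey "<accountKey>"
if (-not (Get-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue)) {
    New-AzStorageContainer -Name "adftutorial" -Context $ctx
}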

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},

"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from
data_source_table"
},

"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},

{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],

"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},

{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {

"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": {
"value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}", "type":"String"}
}
},

"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},

"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]

}
}
2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity,
IncrementalCopyActivity, StoredProceduretoWriteWatermarkActivity}
Parameters :

Run the pipeline


1. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.
Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -dataFactoryName $dataFactoryName

2. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:42:50 AM
DurationInMs : 7777
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}

Review the results


1. In the blob storage (sink store), you see that the data was copied to the file defined in SinkDataset. In the
current tutorial, the file name is Incremental-d4bf3ce2-5d60-43f3-9318-923155f61037.txt. Open the file,
and you can see records in the file that are the same as the data in the SQL database.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000
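You can also list the output from PowerShell instead of browsing the storage account (an optional sketch; it assumes a storage context $ctx such as the one built in the container sketch earlier).

# Optional: list the files that the pipeline wrote to adftutorial/incrementalcopy.
Get-AzStorageBlob -Container "adftutorial" -Prefix "incrementalcopy/" -Context $ctx |
    Select-Object Name, Length, LastModified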

2. Check the latest value from watermarktable . You see that the watermark value was updated.

Select * from watermarktable

Here is the sample output:

TABLENAME              WATERMARKVALUE

data_source_table      2017-09-05 8:06:00.000

Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).

INSERT INTO data_source_table


VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table


VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in the SQL database is:

PersonID | Name | LastModifytime


-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

2. Run the pipeline IncrementalCopyPipeline again by using the Invoke-AzDataFactoryV2Pipeline
cmdlet. Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -dataFactoryName $dataFactoryName

3. Check the status of the pipeline by running the Get-AzDataFactoryV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".

Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $RunId -RunStartedAfter "<start time>" -RunStartedBefore "<end time>"

Here is the sample output:


ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupNewWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {NewWatermarkvalue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:58 AM
DurationInMs : 31758
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}

ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}

4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in
it.
5. Check the latest value from watermarktable . You see that the watermark value was updated again.

Select * from watermarktable


sample output:

TABLENAME              WATERMARKVALUE

data_source_table      2017-09-07 09:01:00.000

Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in Azure SQL Database to Blob storage. Advance to
the following tutorial to learn how to copy data from multiple tables in a SQL Server database to SQL Database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL
Server to a database in Azure SQL Database using
the Azure portal
7/7/2021 • 17 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure data factory with a pipeline that loads delta data from multiple tables in a
SQL Server database to a database in Azure SQL Database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column .
Select one column for each table in the source data store, which can be used to identify the new or
updated records for every run. Normally, the data in this selected column (for example, last_modify_time
or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as
a watermark.
2. Prepare a data store to store the watermark value .
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities :
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Azure Blob storage as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
SQL Server. You use a SQL Server database as the source data store in this tutorial.
Azure SQL Database . You use a database in Azure SQL Database as the sink data store. If you don't have a
database in SQL Database, see Create a database in Azure SQL Database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :

create table customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

create table project_table
(
    Project varchar(255),
    Creationtime datetime
);

INSERT INTO customer_table
(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table
(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your database


1. Open SQL Server Management Studio, and connect to your database in Azure SQL Database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
create table customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

create table project_table
(
    Project varchar(255),
    Creationtime datetime
);

Create another table in your database to store the high watermark value
1. Run the following SQL command against your database to create a table named watermarktable to store
the watermark value:

create table watermarktable
(
    TableName varchar(255),
    WatermarkValue datetime
);

2. Insert initial watermark values for both source tables into the watermark table.

INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in your database


Run the following command to create a stored procedure in your database. This stored procedure updates the
watermark value after every pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
    UPDATE watermarktable
    SET [WatermarkValue] = @LastModifiedtime
    WHERE [TableName] = @TableName
END

Create data types and additional stored procedures in your database


Run the following query to create two stored procedures and two data types in your database. They're used to
merge the data from source tables into destination tables.
To keep this tutorial simple, the Copy activity passes the delta data to these stored procedures through a table
variable, and the stored procedures then merge it into the destination tables. Be aware that a table variable is not
intended to hold a large number of delta rows (more than about 100).
If you do need to merge a large number of delta rows into the destination store, we suggest using a Copy activity
to copy all the delta data into a temporary "staging" table in the destination store first, and then building your
own stored procedure, without a table variable, to merge the rows from the "staging" table into the "final" table.
A sketch of that approach follows.
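For the staging-table approach, a minimal sketch might look like the following. The stage_customer_table name and the separate Copy activity that would fill it are assumptions for illustration, not part of this tutorial:

-- Assumed staging table with the same shape as the destination table.
create table stage_customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);
GO

-- Merge from the staging table into the final table, then clear the staging table.
CREATE PROCEDURE usp_upsert_customer_table_from_staging
AS
BEGIN
    MERGE customer_table AS target
    USING stage_customer_table AS source
    ON (target.PersonID = source.PersonID)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
    WHEN NOT MATCHED THEN
        INSERT (PersonID, Name, LastModifytime)
        VALUES (source.PersonID, source.Name, source.LastModifytime);
    TRUNCATE TABLE stage_customer_table;
END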
CREATE TYPE DataTypeforCustomerTable AS TABLE(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
    MERGE customer_table AS target
    USING @customer_table AS source
    ON (target.PersonID = source.PersonID)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
    WHEN NOT MATCHED THEN
        INSERT (PersonID, Name, LastModifytime)
        VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(
    Project varchar(255),
    Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
    MERGE project_table AS target
    USING @project_table AS source
    ON (target.Project = source.Project)
    WHEN MATCHED THEN
        UPDATE SET Creationtime = source.Creationtime
    WHEN NOT MATCHED THEN
        INSERT (Project, Creationtime)
        VALUES (source.Project, source.Creationtime);
END

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory:
3. In the New data factory page, enter ADFMultiIncCopyTutorialDF for the name.
The name of the Azure data factory must be globally unique . If you see a red exclamation mark with the
following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF) and
try creating again. See Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFIncCopyTutorialDF" is not available

4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version .
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. Click Create .
9. After the creation is complete, you see the Data Factory page as shown in the image.

10. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.

Create self-hosted integration runtime


As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a
self-hosted integration runtime (IR) in your on-premises environment. The self-hosted IR moves data between
your private network and Azure.
1. On the home page of Azure Data Factory UI, select the Manage tab from the leftmost pane.

2. Select Integration runtimes on the left pane, and then select +New .

3. In the Integration Runtime Setup window, select Perform data movement and dispatch
activities to external computes , and click Continue .
4. Select Self-Hosted , and click Continue .
5. Enter MySelfHostedIR for Name , and click Create .
6. Click Click here to launch the express setup for this computer in the Option 1: Express setup
section.

7. In the Integration Runtime (Self-hosted) Express Setup window, click Close .

8. In the Web browser, in the Integration Runtime Setup window, click Finish .
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your SQL Server database and your database in Azure SQL Database.
Create the SQL Server linked service
In this step, you link your SQL Server database to the data factory.
1. In the Connections window, switch from the Integration Runtimes tab to the Linked Services tab, and
click + New.

2. In the New Linked Service window, select SQL Server, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter SqlServerLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime. This is an important step. The
default integration runtime cannot connect to an on-premises data store. Use the self-hosted
integration runtime you created earlier.
c. For Server name, enter the name of your computer that has the SQL Server database.
d. For Database name, enter the name of the database in your SQL Server that has the source data. You
created a table and inserted data into this database as part of the prerequisites.
e. For Authentication type, select the type of authentication you want to use to connect to the
database.
f. For User name, enter the name of the user that has access to the SQL Server database. If you need to use
a slash character ( \ ) in your user account or server name, use the escape character ( \\ ). An example
is mydomain\\myuser.
g. For Password , enter the password for the user.
h. To test whether Data Factory can connect to your SQL Server database, click Test connection . Fix any
errors until the connection succeeds.
i. To save the linked service, click Finish .
Create the Azure SQL Database linked service
In the last step, you created a linked service to link your source SQL Server database to the data factory. In this
step, you link your destination/sink database to the data factory.
1. In the Connections window, switch from the Integration Runtimes tab to the Linked Services tab, and
click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your server from the drop-down list.
c. For Database name, select the database in which you created customer_table and project_table as
part of the prerequisites.
d. For User name, enter the name of the user that has access to the database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to your database in Azure SQL Database, click Test connection.
Fix any errors until the connection succeeds.
g. To save the linked service, click Finish.
4. Confirm that you see two linked services in the list.

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. In the left pane, click + (plus) , and click Dataset .
2. In the New Dataset window, select SQL Server, and click Continue.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the
tree view. In the General tab of the Properties window at the bottom, enter SourceDataset for Name .
4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for
Linked service. You do not select a table here. The Copy activity in the pipeline uses a SQL query to load
the data rather than load the entire table.

Create a sink dataset


1. In the left pane, click + (plus) , and click Dataset .
2. In the New Dataset window, select Azure SQL Database , and click Continue .
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the
tree view. In the General tab of the Properties window at the bottom, enter SinkDataset for Name .
4. Switch to the Parameters tab in the Properties window, and do the following steps:
a. Click New in the Create/update parameters section.
b. Enter SinkTableName for the name , and String for the type . This dataset takes
SinkTableName as a parameter. The SinkTableName parameter is set by the pipeline dynamically
at runtime. The ForEach activity in the pipeline iterates through a list of table names and passes the
table name to this dataset in each iteration.
5. Switch to the Connection tab in the Properties window, and select AzureSqlDatabaseLinkedService
for Linked service. For the Table property, click Add dynamic content.
6. In the Add Dynamic Content window, select SinkTableName in the Parameters section.
7. After clicking Finish , you see "@dataset().SinkTableName" as the table name.

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. In the left pane, click + (plus) , and click Dataset .
2. In the New Dataset window, select Azure SQL Database , and click Continue .
3. In the General tab of the Properties window at the bottom, enter WatermarkDataset for Name .
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[watermarktable] for Table .
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table
names and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in
the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark
column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to the
destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next
iteration.
Create the pipeline
1. In the left pane, click + (plus) , and click Pipeline .
2. In the General panel under Properties, specify IncrementalCopyPipeline for Name. Then collapse the
panel by clicking the Properties icon in the top-right corner.
3. In the Parameters tab, do the following steps:
a. Click + New .
b. Enter tableList for the parameter name .
c. Select Array for the parameter type .
4. In the Activities toolbox, expand Iteration & Conditionals , and drag-drop the ForEach activity to the
pipeline designer surface. In the General tab of the Properties window, enter IterateSQLTables.
5. Switch to the Settings tab, and enter @pipeline().parameters.tableList for Items . The ForEach activity
iterates through a list of tables and performs the incremental copy operation.

6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General , drag-drop the Lookup activity to the pipeline designer
surface, and enter LookupOldWaterMarkActivity for Name .
8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset .
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from watermarktable where TableName = '@{item().TABLE_NAME}'

9. Drag-drop the Lookup activity from the Activities toolbox, and enter LookupNewWaterMarkActivity
for Name .
10. Switch to the Settings tab.
a. Select SourceDataset for Source Dataset .
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from @{item().TABLE_NAME}

11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for
Name .
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box
attached to the Lookup activity and drop it on the Copy activity. Release the mouse button when the
border color of the Copy activity changes to blue .

13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset .
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'

14. Switch to the Sink tab, and select SinkDataset for Sink Dataset .
15. Do the following steps:
a. In the Dataset properties, for the SinkTableName parameter, enter @{item().TABLE_NAME}.
b. For Stored Procedure Name property, enter @{item().StoredProcedureNameForMergeOperation} .
c. For Table type property, enter @{item().TableType} .
d. For Table type parameter name , enter @{item().TABLE_NAME} .

16. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer
surface. Connect the Copy activity to the Stored Procedure activity.
17. Select the Stored Procedure activity in the pipeline, and enter
StoredProceduretoWriteWatermarkActivity for Name in the General tab of the Properties
window.
18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.

19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name , select [dbo].[usp_write_watermark] .
b. Select Import parameter.
c. Specify the following values for the parameters:
NAME                TYPE        VALUE

LastModifiedtime    DateTime    @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}

TableName           String      @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}

20. Select Publish All to publish the entities you created to the Data Factory service.
21. Wait until you see the Successfully published message. To see the notifications, click the Show
Notifications link. Close the notifications window by clicking X .

Run the pipeline


1. On the toolbar for the pipeline, click Add trigger , and click Trigger Now .
2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish .

[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
Monitor the pipeline
1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger . You can
use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column.
For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME
column.
3. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select
Refresh .

Review the results


In SQL Server Management Studio, run the following queries against the target SQL database to verify that the
data was copied from source tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.

Add more data to the source tables


Run the following query against the source SQL Server database to update an existing row in customer_table.
Insert a new row into project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table
(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. In the web browser window, switch to the Edit tab on the left.
2. On the toolbar for the pipeline, click Add trigger , and click Trigger Now .
3. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish .

[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]

Monitor the pipeline again


1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger . You can
use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column.
For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME
column.
3. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select
Refresh .

Review the final results


In SQL Server Management Studio, run the following queries against the target SQL database to verify that the
updated/new data was copied from source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to Azure SQL Database using PowerShell
7/7/2021 • 18 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure Data Factory with a pipeline that loads delta data from multiple tables in a
SQL Server database to Azure SQL Database.
You perform the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime.
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.

Overview
Here are the important steps to create this solution:
1. Select the watermark column .
Select one column for each table in the source data store, which can be used to identify the new or updated
records for every run. Normally, the data in this selected column (for example, last_modify_time or ID)
keeps increasing when rows are created or updated. The maximum value in this column is used as a
watermark.
2. Prepare a data store to store the watermark value .
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities :
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than or equal to the new watermark value. Then, it
copies that delta data from the source data store to the destination store.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
SQL Server. You use a SQL Server database as the source data store in this tutorial.
Azure SQL Database . You use a database in Azure SQL Database as the sink data store. If you don't have a
SQL database, see Create a database in Azure SQL Database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio (SSMS) or Azure Data Studio, and connect to your SQL Server
database.
2. In Server Explorer (SSMS) or in the Connections pane (Azure Data Studio), right-click the
database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
create table customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

create table project_table
(
    Project varchar(255),
    Creationtime datetime
);

INSERT INTO customer_table
(PersonID, Name, LastModifytime)
VALUES
(1, 'John','9/1/2017 12:56:00 AM'),
(2, 'Mike','9/2/2017 5:23:00 AM'),
(3, 'Alice','9/3/2017 2:36:00 AM'),
(4, 'Andy','9/4/2017 3:21:00 AM'),
(5, 'Anny','9/5/2017 8:06:00 AM');

INSERT INTO project_table
(Project, Creationtime)
VALUES
('project1','1/1/2015 0:00:00 AM'),
('project2','2/2/2016 1:23:00 AM'),
('project3','3/4/2017 5:16:00 AM');

Create destination tables in your Azure SQL Database


1. Open SQL Server Management Studio (SSMS) or Azure Data Studio, and connect to your database in
Azure SQL Database.
2. In Server Explorer (SSMS) or in the Connections pane (Azure Data Studio), right-click the
database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :

create table customer_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

create table project_table
(
    Project varchar(255),
    Creationtime datetime
);

Create another table in Azure SQL Database to store the high watermark value
1. Run the following SQL command against your database to create a table named watermarktable to store
the watermark value:
create table watermarktable
(
    TableName varchar(255),
    WatermarkValue datetime
);

2. Insert initial watermark values for both source tables into the watermark table.

INSERT INTO watermarktable
VALUES
('customer_table','1/1/2010 12:00:00 AM'),
('project_table','1/1/2010 12:00:00 AM');

Create a stored procedure in the Azure SQL Database


Run the following command to create a stored procedure in your database. This stored procedure updates the
watermark value after every pipeline run.

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
    UPDATE watermarktable
    SET [WatermarkValue] = @LastModifiedtime
    WHERE [TableName] = @TableName
END

Create data types and additional stored procedures in Azure SQL Database
Run the following query to create two stored procedures and two data types in your database. They're used to
merge the data from source tables into destination tables.
To keep this tutorial simple, the Copy activity passes the delta data to these stored procedures through a table
variable, and the stored procedures then merge it into the destination tables. Be aware that a table variable is not
intended to hold a large number of delta rows (more than about 100).
If you do need to merge a large number of delta rows into the destination store, we suggest using a Copy activity
to copy all the delta data into a temporary "staging" table in the destination store first, and then building your
own stored procedure, without a table variable, to merge the rows from the "staging" table into the "final" table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);

GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
    MERGE customer_table AS target
    USING @customer_table AS source
    ON (target.PersonID = source.PersonID)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
    WHEN NOT MATCHED THEN
        INSERT (PersonID, Name, LastModifytime)
        VALUES (source.PersonID, source.Name, source.LastModifytime);
END

GO

CREATE TYPE DataTypeforProjectTable AS TABLE(
    Project varchar(255),
    Creationtime datetime
);

GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
    MERGE project_table AS target
    USING @project_table AS source
    ON (target.Project = source.Project)
    WHEN MATCHED THEN
        UPDATE SET Creationtime = source.Creationtime
    WHEN NOT MATCHED THEN
        INSERT (Project, Creationtime)
        VALUES (source.Project, source.Creationtime);
END

Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.
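If you haven't installed the Az module yet, a minimal sketch of the install and sign-in steps follows (run in an elevated PowerShell session; both cmdlets are standard Az/PowerShellGet tooling):

# Install the Az module from the PowerShell Gallery, then sign in to your Azure account.
Install-Module -Name Az -Repository PSGallery -Force
Connect-AzAccount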

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotation
marks, and then run the command. An example is "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

2. Define a variable for the location of the data factory.


$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location

If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.

$dataFactoryName = "ADFIncMultiCopyTutorialFactory";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the data factory must be globally unique. If you receive the following error, change the name
and try again:

Set-AzDataFactoryV2 : HTTP Status Code: Conflict


Error Code: DataFactoryNameInUse
Error Message: The specified resource name 'ADFIncMultiCopyTutorialFactory' is already in use.
Resource names must be globally unique.

To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, SQL Database, SQL Managed Instance, and so on) and computes (Azure
HDInsight, etc.) used by the data factory can be in other regions.
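To check whether your signed-in user account already has the contributor or owner role on the subscription, a quick sketch (an assumption for user sign-ins; adjust the filter if you use custom roles):

# List contributor/owner role assignments for the currently signed-in account.
Get-AzRoleAssignment -SignInName (Get-AzContext).Account.Id |
    Where-Object { $_.RoleDefinitionName -in 'Contributor','Owner' }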

Create a self-hosted integration runtime


In this section, you create a self-hosted integration runtime and associate it with an on-premises machine with
the SQL Server database. The self-hosted integration runtime is the component that copies data from SQL
Server on your machine to Azure SQL Database.
1. Create a variable for the name of the integration runtime. Use a unique name, and make a note of it. You
use it later in this tutorial.

$integrationRuntimeName = "ADFTutorialIR"

2. Create a self-hosted integration runtime.


Set-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -Type SelfHosted -DataFactoryName
$dataFactoryName -ResourceGroupName $resourceGroupName

Here is the sample output:

Name : <Integration Runtime name>


Type : SelfHosted
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Description :
Id : /subscriptions/<subscription
ID>/resourceGroups/<ResourceGroupName>/providers/Microsoft.DataFactory/factories/<DataFactoryName>/in
tegrationruntimes/ADFTutorialIR

3. To retrieve the status of the created integration runtime, run the following command. Confirm that the
value of the State property is set to NeedRegistration .

Get-AzDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Status

Here is the sample output:

State : NeedRegistration
Version :
CreateTime : 9/24/2019 6:00:00 AM
AutoUpdate : On
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
InternalChannelEncryption :
Capabilities : {}
ServiceUrls : {eu.frontend.clouddatahub.net}
Nodes : {}
Links : {}
Name : ADFTutorialIR
Type : SelfHosted
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Description :
Id : /subscriptions/<subscription ID>/resourceGroups/<ResourceGroup
name>/providers/Microsoft.DataFactory/factories/<DataFactory name>/integrationruntimes/<Integration
Runtime name>

4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:

Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
5. Copy one of the keys (exclude the double quotation marks) used to register the self-hosted integration
runtime that you install on your machine in the following steps.
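If you prefer to capture the key in a PowerShell variable instead of copying it from the JSON output, a small sketch (AuthKey1 is one of the two key properties the cmdlet returns):

# Store the first authentication key for use when registering the self-hosted integration runtime.
$irKey = (Get-AzDataFactoryV2IntegrationRuntimeKey -Name $integrationRuntimeName `
    -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName).AuthKey1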

Install the integration runtime tool


1. If you already have the integration runtime on your machine, uninstall it by using Add or Remove
Programs .
2. Download the self-hosted integration runtime on a local Windows machine. Run the installation.
3. On the Welcome to Microsoft Integration Runtime Setup page, select Next .
4. On the End-User License Agreement page, accept the terms and license agreement, and select Next .
5. On the Destination Folder page, select Next .
6. On the Ready to install Microsoft Integration Runtime page, select Install .
7. On the Completed the Microsoft Integration Runtime Setup page, select Finish .
8. On the Register Integration Runtime (Self-hosted) page, paste the key you saved in the previous
section, and select Register .

9. On the New Integration Runtime (Self-hosted) Node page, select Finish .


10. When the self-hosted integration runtime is registered successfully, you see the following message:
11. On the Register Integration Runtime (Self-hosted) page, select Launch Configuration Manager .
12. When the node is connected to the cloud service, you see the following page:

13. Now, test the connectivity to your SQL Server database.


a. On the Configuration Manager page, go to the Diagnostics tab.
b. Select SqlServer for the data source type.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the user name.
g. Enter the password that's associated with the user name.
h. Select Test to confirm that the integration runtime can connect to SQL Server. If the connection is
successful, you see a green check mark. If the connection is not successful, you see an error message. Fix
any issues, and ensure that the integration runtime can connect to SQL Server.

NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your SQL Server database and your database in Azure SQL Database.
Create the SQL Server linked service
In this step, you link your SQL Server database to the data factory.
1. Create a JSON file named SqlSer verLinkedSer vice.json in the
C:\ADFTutorials\IncCopyMultiTableTutorial folder (create the local folders if they don't already exist) with
the following content. Select the right section based on the authentication you use to connect to SQL
Server.
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.

If you use SQL authentication, copy the following JSON definition:

{
"name":"SqlServerLinkedService",
"properties":{
"annotations":[

],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=False;data source=<servername>;initial catalog=
<database name>;user id=<username>;Password=<password>"
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}

If you use Windows authentication, copy the following JSON definition:

{
"name":"SqlServerLinkedService",
"properties":{
"annotations":[

],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=True;data source=<servername>;initial catalog=
<database name>",
"userName":"<username> or <domain>\\<username>",
"password":{
"type":"SecureString",
"value":"<password>"
}
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}

IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a slash character ( \ ) in your user account or server name, use the escape character ( \\ ).
An example is mydomain\\myuser .

2. In PowerShell, run the following cmdlet to switch to the C:\ADFTutorials\IncCopyMultiTableTutorial folder.


Set-Location 'C:\ADFTutorials\IncCopyMultiTableTutorial'

3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
SqlServerLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SqlServerLinkedService" -File ".\SqlServerLinkedService.json"

Here is the sample output:

LinkedServiceName : SqlServerLinkedService
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService

Create the SQL Database linked service


1. Create a JSON file named AzureSQLDatabaseLinkedSer vice.json in
C:\ADFTutorials\IncCopyMultiTableTutorial folder with the following content. (Create the folder if it
doesn't already exist.) Replace <servername>, <database name>, <user name>, and <password> with
the name of your SQL Server database, name of your database, user name, and password before you
save the file.

{
"name":"AzureSQLDatabaseLinkedService",
"properties":{
"annotations":[

],
"type":"AzureSqlDatabase",
"typeProperties":{
"connectionString":"integrated security=False;encrypt=True;connection timeout=30;data
source=<servername>.database.windows.net;initial catalog=<database name>;user id=<user
name>;Password=<password>;"
}
}
}

2. In PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:

{
"name":"SourceDataset",
"properties":{
"linkedServiceName":{
"referenceName":"SqlServerLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"SqlServerTable",
"schema":[

]
}
}

The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset

Create a sink dataset


1. Create a JSON file named SinkDataset.json in the same folder with the following content. The
tableName element is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline
iterates through a list of table names and passes the table name to this dataset in each iteration.
{
"name":"SinkDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureSQLDatabaseLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"SinkTableName":{
"type":"String"
}
},
"annotations":[

],
"type":"AzureSqlTable",
"typeProperties":{
"tableName":{
"value":"@dataset().SinkTableName",
"type":"Expression"
}
}
}
}

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SinkDataset.

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a dataset for a watermark


In this step, you create a dataset for storing a high watermark value.
1. Create a JSON file named WatermarkDataset.json in the same folder with the following content:

{
    "name": "WatermarkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset WatermarkDataset.


Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "WatermarkDataset" -File ".\WatermarkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : WatermarkDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table
names and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in
the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark
column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to
the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the
next iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following
content:

{
"name":"IncrementalCopyPipeline",
"properties":{
"activities":[
{
"name":"IterateSQLTables",
"type":"ForEach",
"dependsOn":[

],
"userProperties":[

],
"typeProperties":{
"items":{
"value":"@pipeline().parameters.tableList",
"type":"Expression"
},
"isSequential":false,
"activities":[
{
"name":"LookupOldWaterMarkActivity",
"type":"Lookup",
"dependsOn":[

                        ],
                        "policy":{
                            "timeout":"7.00:00:00",
                            "retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"source":{
"type":"AzureSqlSource",
"sqlReaderQuery":{
"value":"select * from watermarktable where TableName =
'@{item().TABLE_NAME}'",
"type":"Expression"
}
},
"dataset":{
"referenceName":"WatermarkDataset",
"type":"DatasetReference"
}
}
},
{
"name":"LookupNewWaterMarkActivity",
"type":"Lookup",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"source":{
"type":"SqlServerSource",
"sqlReaderQuery":{
"value":"select MAX(@{item().WaterMark_Column}) as
NewWatermarkvalue from @{item().TABLE_NAME}",
"type":"Expression"
}
},
"dataset":{
"referenceName":"SourceDataset",
"type":"DatasetReference"
},
"firstRowOnly":true
}
},
{
"name":"IncrementalCopyActivity",
"type":"Copy",
"dependsOn":[
{
"activity":"LookupOldWaterMarkActivity",
"dependencyConditions":[
"Succeeded"
]
},
{
"activity":"LookupNewWaterMarkActivity",
"dependencyConditions":[
"Succeeded"
                                ]
                            }
                        ],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"source":{
"type":"SqlServerSource",
"sqlReaderQuery":{
"value":"select * from @{item().TABLE_NAME} where
@{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'",
"type":"Expression"
}
},
"sink":{
"type":"AzureSqlSink",
"sqlWriterStoredProcedureName":{
"value":"@{item().StoredProcedureNameForMergeOperation}",
"type":"Expression"
},
"sqlWriterTableType":{
"value":"@{item().TableType}",
"type":"Expression"
},
"storedProcedureTableTypeParameterName":{
"value":"@{item().TABLE_NAME}",
"type":"Expression"
},
"disableMetricsCollection":false
},
"enableStaging":false
},
"inputs":[
{
"referenceName":"SourceDataset",
"type":"DatasetReference"
}
],
"outputs":[
{
"referenceName":"SinkDataset",
"type":"DatasetReference",
"parameters":{
"SinkTableName":{
"value":"@{item().TABLE_NAME}",
"type":"Expression"
}
}
}
]
},
{
"name":"StoredProceduretoWriteWatermarkActivity",
"type":"SqlServerStoredProcedure",
"dependsOn":[
{
"activity":"IncrementalCopyActivity",
"dependencyConditions":[
"Succeeded"
]
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"storedProcedureName":"[dbo].[usp_write_watermark]",
"storedProcedureParameters":{
"LastModifiedtime":{
"value":{

"value":"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type":"Expression"
},
"type":"DateTime"
},
"TableName":{
"value":{

"value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"Expression"
},
"type":"String"
}
}
},
"linkedServiceName":{
"referenceName":"AzureSQLDatabaseLinkedService",
"type":"LinkedServiceReference"
}
}
]
}
}
],
"parameters":{
"tableList":{
"type":"array"
}
},
"annotations":[

]
}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:


PipelineName : IncrementalCopyPipeline
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Run the pipeline


1. Create a parameter file named Parameters.json in the same folder with the following content:

{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}

2. Run the pipeline IncrementalCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.


Replace placeholders with your own resource group and data factory name.

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroup $resourceGroupName -dataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"
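As an alternative to the portal steps below, you can also check the run from PowerShell by using the $RunId captured above; a minimal sketch (the one-hour time window is an arbitrary assumption):

# Check the status of the pipeline run started above.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -PipelineRunId $RunId

# List the activity runs for that pipeline run.
Get-AzDataFactoryV2ActivityRun -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date).AddHours(1)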

Monitor the pipeline


1. Sign in to the Azure portal.
2. Select All services, search with the keyword Data factories, and select Data factories.
3. Search for your data factory in the list of data factories, and select it to open the Data factory page.
4. On the Data factory page, select Open on the Open Azure Data Factory Studio tile to launch Azure
Data Factory in a separate tab.
5. On the Azure Data Factory home page, select Monitor on the left side.
6. You can see all the pipeline runs and their status. Notice that in the following example, the status of the
pipeline run is Succeeded . To check parameters passed to the pipeline, select the link in the Parameters
column. If an error occurred, you see a link in the Error column.

7. When you select the link in the Actions column, you see all the activity runs for the pipeline.
8. To go back to the Pipeline Runs view, select All Pipeline Runs .

Review the results


In SQL Server Management Studio, run the following queries against the target SQL database to verify that the
data was copied from source tables to destination tables:
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000

Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000

Query

select * from watermarktable

Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000

Notice that the watermark values for both tables were updated.

Add more data to the source tables


Run the following query against the source SQL Server database to update an existing row in customer_table.
Insert a new row into project_table.

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table
(Project, Creationtime)
VALUES
('NewProject','10/1/2017 0:00:00 AM');

Rerun the pipeline


1. Now, rerun the pipeline by executing the following PowerShell command:

$RunId = Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroup $resourceGroupName -dataFactoryName $dataFactoryName -ParameterFile ".\Parameters.json"

2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. When the
pipeline status is In Progress , you see another action link under Actions to cancel the pipeline run.
3. Select Refresh to refresh the list until the pipeline run succeeds.
4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.

Review the final results


In SQL Server Management Studio, run the following queries against the target database to verify that the
updated/new data was copied from source tables to destination tables.
Query

select * from customer_table

Output

===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query

select * from project_table

Output

===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000

Notice that the NewProject entry was added to project_table.


Query

select * from watermarktable

Output

======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000

Notice that the watermark values for both tables were updated.

Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database
to Azure Blob Storage using change tracking
information using the Azure portal
7/7/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure Data Factory with a pipeline that loads delta data based on change
tracking information in the source database in Azure SQL Database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In
some cases, the changed data within a period in your source data store can easily be sliced up (for example, by
LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last
time you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database
and SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with
SQL Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob
Storage. For more information about SQL Change Tracking technology, see Change tracking in SQL
Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking
technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL
Database as the source data store. You can also use a SQL Server instance.

1. Initial loading of historical data (run once):


a. Enable Change Tracking technology in the source database in Azure SQL Database.
b. Get the initial value of SYS_CHANGE_VERSION in the database as the baseline to capture changed
data.
c. Load full data from the source database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between the two
SYS_CHANGE_VERSION values) from sys.change_tracking_tables with data in the source table,
and then move the delta data to the destination (see the query sketch after this list).
c. Update the SYS_CHANGE_VERSION for the delta loading next time.
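For illustration, here is roughly what step 2b looks like in T-SQL once the two versions are known. The version numbers 10 and 20 are placeholders; in the pipeline they are supplied by the lookup activities, and the table and column names match the sample table created later in this tutorial.

-- Sketch of step 2b with placeholder versions (10 = last synced version, 20 = current version).
SELECT s.PersonID, s.Name, s.Age, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM data_source_table AS s
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, 10) AS CT
    ON s.PersonID = CT.PersonID
WHERE CT.SYS_CHANGE_VERSION <= 20;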

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data
store (Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see the Create a database in Azure SQL Database article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named
adftutorial .
Create a data source table in Azure SQL Database
1. Launch SQL Server Management Studio, and connect to SQL Database.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your database to create a table named data_source_table as
the data source store.
create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table
(PersonID, Name, Age)
VALUES
(1, 'aaaa', 21),
(2, 'bbbb', 24),
(3, 'cccc', 20),
(4, 'dddd', 26),
(5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of the database in Azure SQL Database that has the
data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three
days or more, some changed data is not included. Either change the value of CHANGE_RETENTION to a
bigger number, or ensure that your period to load the changed data is within two days. For more
information, see Enable change tracking for a database.

ALTER DATABASE <your database name>
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:

create table table_store_ChangeTracking_version
(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version)

NOTE
If the data has not changed since you enabled change tracking for SQL Database, the value of the change
tracking version is 0.
6. Run the following query to create a stored procedure in your database. The pipeline invokes this stored
procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
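Optionally, you can sanity-check the change tracking setup at this point. The following queries are not part of the tutorial; they only confirm that change tracking is enabled with the expected retention, report the current version, and exercise the stored procedure you just created (on a fresh setup this simply rewrites the value stored in the previous step).

-- Optional sanity checks; not required by the tutorial.
-- 1. Change tracking status and retention for the database and the tracked table.
SELECT * FROM sys.change_tracking_databases;
SELECT * FROM sys.change_tracking_tables;

-- 2. The current change tracking version.
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS CurrentVersion;

-- 3. Exercise the stored procedure with the current version.
DECLARE @v BIGINT = CHANGE_TRACKING_CURRENT_VERSION();
EXEC Update_ChangeTracking_Version @CurrentTrackingVersion = @v, @TableName = 'data_source_table';
SELECT * FROM table_store_ChangeTracking_version;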

Azure PowerShell

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:
3. In the New data factory page, enter ADFTutorialDataFactory for the name.
The name of the Azure Data Factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See
the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFTutorialDataFactory" is not available.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. Select Pin to dashboard.
9. Click Create.
10. On the dashboard, you see the following tile with status: Deploying data factory.

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
13. In the home page, switch to the Manage tab in the left panel as shown in the following image:

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your Azure Storage account and your database in Azure SQL Database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Click Connections , and click + New .

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL Database linked service.
In this step, you link your database to the data factory.
1. Click Connections, and click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your server for the Server name field.
c. Select your database for the Database name field.
d. Enter the name of the user for the User name field.
e. Enter the password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus) , and click Dataset .
2. Select Azure SQL Database , and click Finish .
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.

4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[data_source_table] for Table.

Create a dataset to represent data copied to sink data store.


In this step, you create a dataset to represent the data that is copied from the source data store. You created the
adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the container if it does not
exist, or set it to the name of an existing one. In this tutorial, the output file name is dynamically generated by
using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt') .
1. In the treeview, click + (plus) , and click Dataset .
2. Select Azure Blob Storage , and click Finish .
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SinkDataset.

4. Switch to the Connection tab in the Properties window, and do the following steps:
a. Select AzureStorageLinkedService for Linked service.
b. Enter adftutorial/incchgtracking for the folder part of the filePath.
c. Enter @CONCAT('Incremental-', pipeline().RunId, '.txt') for the file part of the filePath.

Create a dataset to represent change tracking data


In this step, you create a dataset for storing the change tracking version. You created the table
table_store_ChangeTracking_version as part of the prerequisites.
1. In the treeview, click + (plus) , and click Dataset .
2. Select Azure SQL Database , and click Finish .
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to ChangeTrackingDataset.
4. Switch to the Connection tab, and do the following steps:
a. Select AzureSqlDatabaseLinkedService for Linked service.
b. Select [dbo].[table_store_ChangeTracking_version] for Table.

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
1. Click + (plus) in the left pane, and click Pipeline .

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow , and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity .
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.

5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish . Wait until the publishing
succeeds.

8. Wait until you see the Successfully published message.

9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X .
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now .

Monitor the full copy pipeline


1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh . The links in the Actions column let you view activity runs associated with the pipeline run and to
rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to
the pipeline runs view, click Pipelines link at the top.

Review the results


You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from your database:

1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22

Add more data to the source table


Run the following query against your database to add a row and update a row.
INSERT INTO data_source_table
(PersonID, Name, Age)
VALUES
(6, 'new','50');

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
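Optionally, you can preview the delta that the incremental pipeline will pick up before you build and run it. The query below plays the role of the two lookup activities plus the copy source query: it reads the stored version from table_store_ChangeTracking_version and joins the source table against CHANGETABLE, exactly as the pipeline does.

-- Optional: preview the delta rows the incremental pipeline will copy.
DECLARE @last_version BIGINT =
    (SELECT SYS_CHANGE_VERSION FROM table_store_ChangeTracking_version
     WHERE TableName = 'data_source_table');

SELECT s.PersonID, s.Name, s.Age, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM data_source_table AS s
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table, @last_version) AS CT
    ON s.PersonID = CT.PersonID
WHERE CT.SYS_CHANGE_VERSION <= CHANGE_TRACKING_CURRENT_VERSION();

If nothing else changed, this returns two rows: the updated row for PersonID 1 and the inserted row for PersonID 6.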

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get
the old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy
activity copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure
SQL Database to Azure Blob Storage. The stored procedure activity updates the value of
SYS_CHANGE_VERSION for the next pipeline run.
1. In the Data Factory UI, switch to the Edit tab. Click + (plus) in the left pane, and click Pipeline .

2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity . This activity gets
the change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version .
4. Switch to the Settings tab in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.

5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity . This activity gets the current
change tracking version.

6. Switch to the Settings tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.

SELECT CHANGE_TRACKING_CURRENT_VERSION() as CurrentChangeTrackingVersion


7. In the Activities toolbox, expand Data Flow , and drag-drop the Copy activity to the pipeline designer
surface. Set the name of the activity to IncrementalCopyActivity . This activity copies the data between
last change tracking version and the current change tracking version to the destination data store.

8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID, data_source_table.Name, data_source_table.Age,
CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION
from data_source_table
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT
on data_source_table.PersonID = CT.PersonID
where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}

9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.

10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached
to the Lookup activity to the Copy activity.

11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer
surface. Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity . This
activity updates the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.

13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify the following values for the parameters:

NAME                     TYPE      VALUE
CurrentTrackingVersion   Int64     @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
TableName                String    @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}
14. Connect the Copy activity to the Stored Procedure Activity . Drag-and-drop the green button
attached to the Copy activity to the Stored Procedure activity.

15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.

16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now .

2. In the Pipeline Run window, select Finish .


Monitor the incremental copy pipeline
1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh . The links in the Actions column let you view activity runs associated with the pipeline run and
to rerun the pipeline.
2. To view activity runs associated with the pipeline run, click the View Activity Runs link in the Actions
column. There is only one activity in the pipeline, so you see only one entry in the list. To switch back to
the pipeline runs view, click Pipelines link at the top.

Review the results


You see the second file in the incchgtracking folder of the adftutorial container.

The file should have only the delta data from your database. The record with U is the updated row in the
database, and the record with I is the newly added row.

1,update,10,2,U
6,new,50,1,I

The first three columns are changed data from data_source_table. The last two columns are the metadata from
the change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.

==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database
to Azure Blob Storage using change tracking
information using PowerShell
3/5/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change tracking
information in the source database in Azure SQL Database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source, sink, and change tracking datasets.
Create, run, and monitor the full copy pipeline
Add or update data in the source table
Create, run, and monitor the incremental copy pipeline

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In
some cases, the changed data within a period can be easily sliced by a column in your source data store (for
example, LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time
you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database
and SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with
SQL Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob
Storage. For more concrete information about SQL Change Tracking technology, see Change tracking in SQL
Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking
technology.

NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL
Database as the source data store. You can also use a SQL Server instance.
1. Initial loading of historical data (run once):
a. Enable Change Tracking technology in the source database in Azure SQL Database.
b. Get the initial value of SYS_CHANGE_VERSION in the database as the baseline to capture changed
data.
c. Load full data from the source database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between the two
SYS_CHANGE_VERSION values) returned by CHANGETABLE with data in the source table,
and then move the delta data to the destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.

High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data
store (Azure SQL Database) to the destination data store (Azure Blob Storage).

2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see the Create a database in Azure SQL Database article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named
adftutorial .
Create a data source table in your database
1. Launch SQL Server Management Studio, and connect to SQL Database.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your database to create a table named data_source_table as
the data source store.

create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);

INSERT INTO data_source_table
(PersonID, Name, Age)
VALUES
(1, 'aaaa', 21),
(2, 'bbbb', 24),
(3, 'cccc', 20),
(4, 'dddd', 26),
(5, 'eeee', 22);

4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:

NOTE
Replace <your database name> with the name of your database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data every three
days or more, some changed data is not included. Either change the value of CHANGE_RETENTION to a
bigger number, or ensure that your period to load the changed data is within two days. For more
information, see Enable change tracking for a database.

ALTER DATABASE <your database name>
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON)

5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
create table table_store_ChangeTracking_version
(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version)

NOTE
If the data has not changed since you enabled change tracking for SQL Database, the value of the change
tracking version is 0.

6. Run the following query to create a stored procedure in your database. The pipeline invokes this stored
procedure to update the change tracking version in the table you created in the previous step.

CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(50)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END

Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

Create a data factory


1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes,
and then run the command. For example: "adfrg" .

$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

2. Define a variable for the location of the data factory:

$location = "East US"

3. To create the Azure resource group, run the following command:

New-AzResourceGroup $resourceGroupName $location


If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.

4. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique.

$dataFactoryName = "IncCopyChgTrackingDF";

5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your Azure Storage account and your database in Azure SQL Database.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Create a JSON file named AzureStorageLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. (Create the
folder if it does not already exist.) Replace <accountName> and <accountKey> with the name and key of your
Azure storage account before saving the file.
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}

2. In Azure PowerShell , switch to the C:\ADFTutorials\IncCopyChangeTrackingTutorial folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureStorageLinkedService. In the following example, you pass values for the ResourceGroupName
and DataFactoryName parameters.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -File ".\AzureStorageLinkedService.json"

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService

Create Azure SQL Database linked service.


In this step, you link your database to the data factory.
1. Create a JSON file named AzureSQLDatabaseLinkedService.json in the
C:\ADFTutorials\IncCopyChangeTrackingTutorial folder with the following content. Replace
<server>, <database name>, <user name>, and <password> with the name of your server, the name of your
database, the user name, and the password before saving the file.

{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=
<database name>; Persist Security Info=False; User ID=<user name>; Password=<password>;
MultipleActiveResultSets = False; Encrypt = True; TrustServerCertificate = False; Connection Timeout
= 30;"
}
}
}

2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked
service: AzureSQLDatabaseLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSQLDatabaseLinkedService" -File ".\AzureSQLDatabaseLinkedService.json"

Here is the sample output:


LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService

Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:

{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SourceDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SourceDataset" -File ".\SourceDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a sink dataset


In this step, you create a dataset to represent the data that is copied from the source data store.
1. Create a JSON file named SinkDataset.json in the same folder with the following content:
{
"name": "SinkDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "adftutorial/incchgtracking",
"fileName": "@CONCAT('Incremental-', pipeline().RunId, '.txt')",
"format": {
"type": "TextFormat"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

You created the adftutorial container in your Azure Blob Storage as part of the prerequisites. Create the
container if it does not exist, or set it to the name of an existing one. In this tutorial, the output file name
is dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset

Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "SinkDataset" -File ".\SinkDataset.json"

Here is the sample output of the cmdlet:

DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset

Create a change tracking dataset


In this step, you create a dataset for storing the change tracking version.
1. Create a JSON file named ChangeTrackingDataset.json in the same folder with the following content:

{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}

You created the table table_store_ChangeTracking_version as part of the prerequisites.


2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: ChangeTrackingDataset
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "ChangeTrackingDataset" -File ".\ChangeTrackingDataset.json"

Here is the sample output of the cmdlet:

DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset

Create a pipeline for the full copy


In this step, you create a pipeline with a copy activity that copies the entire data from the source data store
(Azure SQL Database) to the destination data store (Azure Blob Storage).
1. Create a JSON file: FullCopyPipeline.json in same folder with the following content:

{
"name": "FullCopyPipeline",
"properties": {
"activities": [{
"name": "FullCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink"
}
},

"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: FullCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "FullCopyPipeline" -File ".\FullCopyPipeline.json"

Here is the sample output:

PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :
Run the full copy pipeline
Run the pipeline: FullCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "FullCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

Monitor the full copy pipeline


1. Log in to the Azure portal.
2. Click All services, search with the keyword data factories, and select Data factories.

3. Search for your data factory in the list of data factories, and select it to launch the Data factory page.

4. In the Data factory page, click the Monitor & Manage tile.


5. The Data Integration Application launches in a separate tab. You can see all the pipeline runs and
their statuses. Notice that in the following example, the status of the pipeline run is Succeeded. You can
check the parameters passed to the pipeline by clicking the link in the Parameters column. If there was an error,
you see a link in the Error column. Click the link in the Actions column.

6. When you click the link in the Actions column, you see the following page that shows all the activity
runs for the pipeline.

7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.

The file should have the data from your database:


1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22

Add more data to the source table


Run the following query against your database to add a row and update a row.

INSERT INTO data_source_table


(PersonID, Name, Age)
VALUES
(6, 'new','50');

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1

Create a pipeline for the delta copy


In this step, you create a pipeline with the following activities, and run it periodically. The lookup activities get
the old and new SYS_CHANGE_VERSION from Azure SQL Database and pass it to copy activity. The copy
activity copies the inserted/updated/deleted data between the two SYS_CHANGE_VERSION values from Azure
SQL Database to Azure Blob Storage. The stored procedure activity updates the value of
SYS_CHANGE_VERSION for the next pipeline run.
1. Create a JSON file: IncrementalCopyPipeline.json in same folder with the following content:

{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},
"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion
}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {
"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersio
n}",
"type": "INT64"
},
"TableName": {
"value":
"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}

2. Run the Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline: IncrementalCopyPipeline.

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IncrementalCopyPipeline" -File ".\IncrementalCopyPipeline.json"

Here is the sample output:

PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {LookupLastChangeTrackingVersionActivity,
LookupCurrentChangeTrackingVersionActivity, IncrementalCopyActivity,
StoredProceduretoUpdateChangeTrackingActivity}
Parameters :

Run the incremental copy pipeline


Run the pipeline: IncrementalCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.

Invoke-AzDataFactoryV2Pipeline -PipelineName "IncrementalCopyPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

Monitor the incremental copy pipeline


1. In the Data Integration Application , refresh the pipeline runs view. Confirm that you see the
IncrementalCopyPipeline in the list. Click the link in the Actions column.

2. When you click the link in the Actions column, you see the following page that shows all the activity
runs for the pipeline.
3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.

The file should have only the delta data from your database. The record with U is the updated row in the
database, and the record with I is the newly added row.

1,update,10,2,U
6,new,50,1,I

The first three columns are changed data from data_source_table. The last two columns are the metadata from
the change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.

==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Managed
Instance to Azure Storage using change data
capture (CDC)
7/7/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create an Azure data factory with a pipeline that loads delta data based on change data
capture (CDC) information in the source Azure SQL Managed Instance database to an Azure blob storage.
You perform the following steps in this tutorial:
Prepare the source data store
Create a data factory.
Create linked services.
Create source and sink datasets.
Create, debug and run the pipeline to check for changed data
Modify data in the source table
Complete, run and monitor the full incremental copy pipeline

Overview
The Change Data Capture technology supported by data stores such as Azure SQL Managed Instance (MI) and
SQL Server can be used to identify changed data. This tutorial describes how to use Azure Data Factory with
SQL Change Data Capture technology to incrementally load delta data from Azure SQL Managed Instance into
Azure Blob Storage. For more concrete information about SQL Change Data Capture technology, see Change
data capture in SQL Server.

End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Data Capture
technology.

NOTE
Both Azure SQL MI and SQL Server support the Change Data Capture technology. This tutorial uses Azure SQL Managed
Instance as the source data store. You can also use an on-premises SQL Server.

High-level solution
In this tutorial, you create a pipeline that performs the following operations:
1. Create a lookup activity to count the number of changed records in the SQL Database CDC table and pass
it to an IF Condition activity.
2. Create an If Condition to check whether there are changed records and if so, invoke the copy activity.
3. Create a copy activity to copy the inserted/updated/deleted data between the CDC table to Azure Blob
Storage.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure SQL Database Managed Instance . You use the database as the source data store. If you don't
have an Azure SQL Database Managed Instance, see the Create an Azure SQL Database Managed Instance
article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named raw .
Create a data source table in Azure SQL Database
1. Launch SQL Server Management Studio, and connect to your Azure SQL Managed Instance server.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL Managed Instance database to create a table
named customers as the data source store.

create table customers
(
customer_id int,
first_name varchar(50),
last_name varchar(50),
email varchar(100),
city varchar(50),
CONSTRAINT "PK_Customers" PRIMARY KEY CLUSTERED ("customer_id")
);

4. Enable Change Data Capture mechanism on your database and the source table (customers) by
running the following SQL query:

NOTE
Replace <your source schema name> with the schema of your Azure SQL MI that has the customers table.
Change data capture doesn't do anything as part of the transactions that change the table being tracked.
Instead, the insert, update, and delete operations are written to the transaction log. Data that is deposited in
change tables will grow unmanageably if you do not periodically and systematically prune the data. For more
information, see Enable Change Data Capture for a database

EXEC sys.sp_cdc_enable_db

EXEC sys.sp_cdc_enable_table
@source_schema = 'dbo',
@source_name = 'customers',
@role_name = 'null',
@supports_net_changes = 1

5. Insert data into the customers table by running the following command:

insert into customers
(customer_id, first_name, last_name, email, city)
values
(1, 'Chevy', 'Leward', '[email protected]', 'Reading'),
(2, 'Sayre', 'Ateggart', '[email protected]', 'Portsmouth'),
(3, 'Nathalia', 'Seckom', '[email protected]', 'Portsmouth');
NOTE
No historical changes to the table are captured prior to change data capture being enabled.
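Before building the pipeline, you can optionally verify the CDC setup from SSMS. These checks are not part of the tutorial; they confirm that CDC is enabled on the database, that the customers table is tracked (under the dbo_customers capture instance created by the command above), and they print a summary of the CDC configuration.

-- Optional CDC sanity checks; not required by the tutorial.
-- 1. Is CDC enabled on this database?
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();

-- 2. Which tables are tracked, and under which capture instance?
SELECT capture_instance, source_object_id, start_lsn FROM cdc.change_tables;

-- 3. Summary of the CDC configuration for the customers table.
EXEC sys.sp_cdc_help_change_data_capture @source_schema = N'dbo', @source_name = N'customers';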

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory:

3. In the New data factory page, enter ADFTutorialDataFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See
the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFTutorialDataFactory" is not available.
4. Select V2 for the version .
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. De-select Enable GIT .
9. Click Create .
10. Once the deployment is complete, click on Go to resource
11. After the creation is complete, you see the Data Factory page as shown in the image.

12. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
13. In the home page, switch to the Manage tab in the left panel as shown in the following image:

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this section, you create linked services to your Azure Storage account and Azure SQL MI.
Create Azure Storage linked service.
In this step, you link your Azure Storage Account to the data factory.
1. Click Connections , and click + New .

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL MI Database linked service.
In this step, you link your Azure SQL MI database to the data factory.

NOTE
For those using SQL MI, see here for information regarding access via public versus private endpoints. If using a private
endpoint, you need to run this pipeline by using a self-hosted integration runtime. The same applies to SQL
Server running on-premises, in a VM, or in VNet scenarios.

1. Click Connections , and click + New .


2. In the New Linked Service window, select Azure SQL Database Managed Instance, and click
Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlMI1 for the Name field.
b. Select your SQL server for the Server name field.
c. Select your SQL database for the Database name field.
d. Enter the name of the user for the User name field.
e. Enter the password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.

Create datasets
In this step, you create datasets to represent data source and data destination.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus) , and click Dataset .

2. Select Azure SQL Database Managed Instance , and click Continue .


3. In the Set properties tab, set the dataset name and connection information:
a. Select AzureSqlMI1 for Linked service.
b. Select [dbo].[dbo_customers_CT] for Table name. Note: this table was automatically created when
CDC was enabled on the customers table. Changed data is never queried from this table directly but is
instead extracted through the CDC functions.
Create a dataset to represent data copied to sink data store.
In this step, you create a dataset to represent the data that is copied from the source data store. You created the
raw container in your Azure Blob Storage as part of the prerequisites. Create the container if it does not
exist, or set it to the name of an existing one. In this tutorial, the output file name is dynamically generated by
using the trigger time, which will be configured later.
1. In the treeview, click + (plus) , and click Dataset .
2. Select Azure Blob Storage , and click Continue .
3. Select DelimitedText , and click Continue .

4. In the Set Properties tab, set the dataset name and connection information:
a. Select AzureStorageLinkedService for Linked service.
b. Enter raw for the container part of the filePath.
c. Enable First row as header.
d. Click OK.
Create a pipeline to copy the changed data
In this step, you create a pipeline, which first checks the number of changed records present in the change table
using a lookup activity . An IF condition activity checks whether the number of changed records is greater than
zero and runs a copy activity to copy the inserted/updated/deleted data from Azure SQL Database to Azure
Blob Storage. Lastly, a tumbling window trigger is configured and the start and end times will be passed to the
activities as the start and end window parameters.
1. In the Data Factory UI, switch to the Edit tab. Click + (plus) in the left pane, and click Pipeline .
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to IncrementalCopyPipeline.

3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to GetChangeCount . This activity gets the number of records in the
change table for a given time window.
4. Switch to the Settings tab in the Properties window:
a. Specify the SQL MI dataset name for the Source Dataset field.
b. Select the Query option and enter the following into the query box:

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_customers');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all')

c. Enable First row only


5. Click the Preview data button to ensure a valid output is obtained by the lookup activity

6. Expand Iteration & conditionals in the Activities toolbox, and drag-drop the If Condition activity to
the pipeline designer surface. Set the name of the activity to HasChangedRows .
7. Switch to the Activities tab in the Properties window:
a. Enter the following Expression:

@greater(int(activity('GetChangeCount').output.firstRow.changecount),0)

b. Click on the pencil icon to edit the True condition.


c. Expand General in the Activities toolbox and drag-drop a Wait activity to the pipeline designer
surface. This is a temporary activity in order to debug the If condition and will be changed later in the
tutorial.
d. Click on the IncrementalCopyPipeline breadcrumb to return to the main pipeline.
8. Run the pipeline in Debug mode to verify the pipeline executes successfully.
9. Next, return to the True condition step and delete the Wait activity. In the Activities toolbox, expand
Move & transform , and drag-drop a Copy activity to the pipeline designer surface. Set the name of the
activity to IncrementalCopyActivity .
10. Switch to the Source tab in the Properties window, and do the following steps:
a. Specify the SQL MI dataset name for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following for Query.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_customers');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all')

11. Click preview to verify that the query returns the changed rows correctly.
12. Switch to the Sink tab, and specify the Azure Storage dataset for the Sink Dataset field.

13. Click back to the main pipeline canvas and connect the Lookup activity to the If Condition activity one
by one. Drag the green button attached to the Lookup activity to the If Condition activity.

14. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.
15. Click Debug to test the pipeline and verify that a file is generated in the storage location.

16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish all button. Wait until you see the Publishing succeeded message.

Configure the tumbling window trigger and CDC window parameters


In this step, you create a tumbling window trigger to run the job on a frequent schedule. You will use the
WindowStart and WindowEnd system variables of the tumbling window trigger and pass them as parameters to
your pipeline to be used in the CDC query.
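Before walking through the steps, it may help to see the shape of the T-SQL that the parameterized CDC queries below resolve to at run time, once the trigger (or a debug run) supplies the window values. The timestamps here are placeholders standing in for triggerStartTime and triggerEndTime.

-- Shape of the CDC window query after parameter substitution (placeholder timestamps).
-- Note: as the tutorial warns, the window start must not precede CDC enablement on the table,
-- otherwise the CDC function call fails.
DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = '2021-07-07 10:00:00.000';  -- triggerStartTime
SET @end_time   = '2021-07-07 10:15:00.000';  -- triggerEndTime
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
SET @to_lsn   = sys.fn_cdc_map_time_to_lsn('largest less than', @end_time);
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all');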
1. Navigate to the Parameters tab of the IncrementalCopyPipeline pipeline and, using the + New
button, add two parameters (triggerStartTime and triggerEndTime) to the pipeline, which will
represent the tumbling window start and end time. For debugging purposes, add default values in the
format YYYY-MM-DD HH24:MI:SS.FFF, but ensure the triggerStartTime is not prior to CDC being
enabled on the table; otherwise this will result in an error.
2. Click on the settings tab of the Lookup activity and configure the query to use the start and end
parameters. Copy the following into the query:

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, ''all'')')

3. Navigate to the Copy activity in the True case of the If Condition activity and click on the Source tab.
Copy the following into the query:

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, ''all'')')

4. Click on the Sink tab of the Copy activity and click Open to edit the dataset properties. Click on the
Parameters tab and add a new parameter called triggerStart .
5. Next, configure the dataset properties to store the data in a customers/incremental subdirectory with
date-based partitions.
a. Click on the Connection tab of the dataset properties and add dynamic content for both the
Directory and the File sections.
b. Enter the following expression in the Directory section by clicking on the dynamic content link under
the textbox:

@concat('customers/incremental/',formatDateTime(dataset().triggerStart,'yyyy/MM/dd'))

c. Enter the following expression in the File section. This will create file names based on the trigger start
date and time, suffixed with the csv extension:

@concat(formatDateTime(dataset().triggerStart,'yyyyMMddHHmmssfff'),'.csv')
d. Navigate back to the Sink settings in Copy activity by clicking on the IncrementalCopyPipeline tab.
e. Expand the dataset properties and enter dynamic content in the triggerStart parameter value with the
following expression:

@pipeline().parameters.triggerStartTime

6. Click Debug to test the pipeline and ensure the folder structure and output file is generated as expected.
Download and open the file to verify the contents.
7. Ensure the parameters are being injected into the query by reviewing the Input parameters of the
pipeline run.

8. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish all button. Wait until you see the Publishing succeeded message.
9. Finally, configure a tumbling window trigger to run the pipeline at a regular interval and set start and end
time parameters.
a. Click the Add trigger button, and select New/Edit
b. Enter a trigger name and specify a start time, which is equal to the end time of the debug window
above.

c. On the next screen, specify the following values for the start and end parameters respectively (a JSON sketch of the resulting trigger definition follows the note below).

@formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')
@formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')
NOTE
Note the trigger will only run once it has been published. Additionally, the expected behavior of a tumbling window
trigger is to run all historical intervals from the start date until now. More information regarding tumbling window triggers can be found
here.
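
For reference, the tumbling window trigger that these steps produce is roughly equivalent to the following JSON definition. Treat this as a hedged sketch: the trigger name, start time, and 15-minute interval shown here are illustrative assumptions, and the two parameter expressions are the ones entered above.

{
    "name": "IncrementalCopyTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Minute",
            "interval": 15,
            "startTime": "2021-07-15T00:00:00Z",
            "delay": "00:00:00",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "IncrementalCopyPipeline",
                "type": "PipelineReference"
            },
            "parameters": {
                "triggerStartTime": "@formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')",
                "triggerEndTime": "@formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')"
            }
        }
    }
}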

10. Using SQL Server Management Studio, make some additional changes to the customers table by running
the following SQL (an optional query to verify that CDC captured the changes follows the block):

insert into customers (customer_id, first_name, last_name, email, city) values (4, 'Farlie',
'Hadigate', '[email protected]', 'Reading');
insert into customers (customer_id, first_name, last_name, email, city) values (5, 'Anet', 'MacColm',
'[email protected]', 'Portsmouth');
insert into customers (customer_id, first_name, last_name, email, city) values (6, 'Elonore',
'Bearham', '[email protected]', 'Portsmouth');
update customers set first_name='Elon' where customer_id=6;
delete from customers where customer_id=5;
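
Optionally, before the next trigger window runs, you can confirm that CDC captured these changes by querying the change table directly. This is a minimal check, assuming the capture instance is named dbo_customers as earlier in this tutorial; the capture job may take a few seconds to harvest the changes.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_customers');
SET @to_lsn = sys.fn_cdc_get_max_lsn();
-- __$operation: 1 = delete, 2 = insert, 4 = update
SELECT __$operation, customer_id, first_name, last_name, email, city
FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all');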

11. Click the Publish all button. Wait until you see the Publishing succeeded message.
12. After a few minutes, the pipeline will be triggered and a new file will be loaded into Azure Storage.
Monitor the incremental copy pipeline
1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh . Hover near the name of the pipeline to access the Rerun action and Consumption report.

2. To view activity runs associated with the pipeline run, click the pipeline name. If changed data was
detected, there will be three activities, including the copy activity; otherwise, there will be only two entries
in the list. To switch back to the pipeline runs view, click the All Pipelines link at the top.
Review the results
You see the second file in the customers/incremental/YYYY/MM/DD folder of the raw container.

Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
7/20/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a
pipeline that incrementally copies new and changed files only, from Azure Blob storage to Azure Blob storage. It
uses LastModifiedDate to determine which files to copy.
After you complete the steps here, Azure Data Factory will scan all the files in the source store, apply the file
filter by LastModifiedDate , and copy to the destination store only files that are new or have been updated since
last time. Note that if Data Factory scans large numbers of files, you should still expect long durations. File
scanning is time consuming, even when the amount of data copied is reduced.
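
Under the covers, the pipeline that the Copy Data tool generates applies this filter through the modifiedDatetimeStart and modifiedDatetimeEnd settings on the copy activity source. A rough sketch of the relevant fragment is shown below; the property values are illustrative assumptions rather than the exact output of the tool.

"source": {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "modifiedDatetimeStart": "@trigger().outputs.windowStartTime",
        "modifiedDatetimeEnd": "@trigger().outputs.windowEndTime"
    }
}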

NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you'll complete these tasks:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account : Use Blob storage for the source and sink data stores. If you don't have an Azure
Storage account, follow the instructions in Create a storage account.

Create two containers in Blob storage


Prepare your Blob storage for the tutorial by completing these steps:
1. Create a container named source . You can use various tools to perform this task, like Azure Storage
Explorer.
2. Create a container named destination .
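
If you prefer scripting over Azure Storage Explorer, a hedged Azure CLI equivalent is shown below; replace <yourStorageAccount> with your storage account name and make sure you're signed in with an identity that has Blob data access.

az storage container create --name source --account-name <yourStorageAccount> --auth-mode login
az storage container create --name destination --account-name <yourStorageAccount> --auth-mode login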

Create a data factory


1. In the left pane, select Create a resource . Select Integration > Data Factory :
2. On the New data factory page, under Name , enter ADFTutorialDataFactory .
The name for your data factory must be globally unique. If you receive an error message about the
name value, enter a different name for the data factory. For example, use the name
yournameADFTutorialDataFactory . For the naming rules for Data Factory artifacts, see Data Factory
naming rules.
3. Under Subscription , select the Azure subscription in which you'll create the new data factory.
4. Under Resource Group , take one of these steps:
Select Use existing and then select an existing resource group in the list.
Select Create new and then enter a name for the resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under Version , select V2 .
6. Under Location , select the location for the data factory. Only supported locations appear in the list. The
data stores (for example, Azure Storage and Azure SQL Database) and computes (for example, Azure
HDInsight) that your data factory uses can be in other locations and regions.
7. Select Create .
8. After the data factory is created, the data factory home page appears.
9. To open the Azure Data Factory user interface (UI) on a separate tab, select Open on the Open Azure
Data Factory Studio tile:
Use the Copy Data tool to create a pipeline
1. On the Azure Data Factory home page, select the Ingest tile to open the Copy Data tool:

2. On the Properties page, take the following steps:


a. Under Task type , select Built-in copy task .
b. Under Task cadence or task schedule , select Tumbling window .
c. Under Recurrence , enter 15 Minute(s) .
d. Select Next .
3. On the Source data store page, complete these steps:
a. Select + New connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue :
c. On the New connection (Azure Blob Storage) page, select your Azure subscription from the
Azure subscription list and your storage account from the Storage account name list. Test the
connection and then select Create .
d. Select the newly created connection in the Connection block.
e. In the File or folder section, select Browse and choose the source folder, and then select OK .
f. Under File loading behavior , select Incremental load: LastModifiedDate , and choose
Binary copy .
g. Select Next .
4. On the Destination data store page, complete these steps:
a. Select the AzureBlobStorage connection that you created. This is the same storage account as
the source data store.
b. In the Folder path section, browse for and select the destination folder, and then select OK .
c. Select Next .
5. On the Settings page, under Task name , enter DeltaCopyFromBlobPipeline , then select Next . Data
Factory creates a pipeline with the specified task name.
6. On the Summary page, review the settings and then select Next .
7. On the Deployment page, select Monitor to monitor the pipeline (task).
8. Notice that the application automatically switches to the Monitor tab on the left. You see the status of the
pipeline. Select Refresh to refresh the list. Select the link under Pipeline name to view activity run
details or to run the pipeline again.

9. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, on the Activity runs page, select the Details link (the eyeglasses icon) in the Activity
name column. For details about the properties, see Copy activity overview.
Because there are no files in the source container in your Blob storage account, you won't see any files
copied to the destination container in the account:

10. Create an empty text file and name it file1.txt . Upload this text file to the source container in your
storage account. You can use various tools to perform these tasks, like Azure Storage Explorer.

11. To go back to the Pipeline runs view, select the All pipeline runs link in the breadcrumb menu on the
Activity runs page, and wait for the same pipeline to be automatically triggered again.
12. When the second pipeline run completes, follow the same steps mentioned previously to review the
activity run details.
You'll see that one file (file1.txt) has been copied from the source container to the destination container of
your Blob storage account:

13. Create another empty text file and name it file2.txt . Upload this text file to the source container in your
Blob storage account.
14. Repeat steps 11 and 12 for the second text file. You'll see that only the new file (file2.txt) was copied from
the source container to the destination container of your storage account during this pipeline run.
You can also verify that only one file has been copied by using Azure Storage Explorer to scan the files:

Next steps
Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
7/20/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a
pipeline that incrementally copies new files based on time partitioned file name from Azure Blob storage to
Azure Blob storage.

NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:


Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.

Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure storage account : Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source . Create a folder path as 2021/07/15/06 in your container. Create an
empty text file, and name it as file1.txt . Upload the file1.txt to the folder path source/2021/07/15/06
in your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.

NOTE
Please adjust the folder name with your UTC time. For example, if the current UTC time is 6:10 AM on July 15,
2021, you can create the folder path as source/2021/07/15/06/ by the rule of
source/{Year}/{Month}/{Day}/{Hour}/.

2. Create a container named destination . You can use various tools to perform these tasks, such as Azure
Storage Explorer.
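
As an alternative to Azure Storage Explorer, the following Azure CLI commands create both containers and upload file1.txt to the time-partitioned path. This is a sketch: replace <yourStorageAccount> with your storage account name and adjust the folder path to your current UTC time.

az storage container create --name source --account-name <yourStorageAccount> --auth-mode login
az storage container create --name destination --account-name <yourStorageAccount> --auth-mode login
az storage blob upload --container-name source --name 2021/07/15/06/file1.txt --file file1.txt --account-name <yourStorageAccount> --auth-mode login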
Create a data factory
1. On the left menu, select Create a resource > Integration > Data Factory :

2. On the New data factory page, under Name , enter ADFTutorialDataFactory .

The name for your data factory must be globally unique. If you receive an error message about the name
value, enter a different name for the data factory. For example, use the name
yournameADFTutorialDataFactory . For the naming rules for Data Factory artifacts, see Data Factory
naming rules.
3. Select the Azure subscription in which to create the new data factory.
4. For Resource Group , take one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
5. Under version , select V2 for the version.
6. Under location , select the location for the data factory. Only supported locations are displayed in the
drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) that are used by your data factory can be in other locations and regions.
7. Select Create .
8. After creation is finished, the Data Factory home page is displayed.
9. To launch the Azure Data Factory user interface (UI) in a separate tab, select Open on the Open Azure
Data Factory Studio tile.
Use the Copy Data tool to create a pipeline
1. On the Azure Data Factory home page, select the Ingest tile to launch the Copy Data tool.

2. On the Properties page, take the following steps:


a. Under Task type , choose Built-in copy task .
b. Under Task cadence or task schedule , select Tumbling window .
c. Under Recurrence , enter 1 Hour(s) .
d. Select Next .
3. On the Source data store page, complete the following steps:
a. Select + New connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue .
c. On the New connection (Azure Blob Storage) page, enter a name for the connection. Select your
Azure subscription, and select your storage account from the Storage account name list. Test
connection and then select Create .
d. On the Source data store page, select the newly created connection in the Connection section.
e. In the File or folder section, browse and select the source container, then select OK .
f. Under File loading behavior , select Incremental load: time-partitioned folder/file names .
g. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/ , and change the format as
shown in the following screenshot.
h. Check Binar y copy and select Next .
4. On the Destination data store page, complete the following steps:
a. Select the AzureBlobStorage connection, which is the same storage account as the source data store.
b. Browse and select the destination folder, then select OK .
c. Write the dynamic folder path as destination/{year}/{month}/{day}/{hour}/ , and change the
format as shown in the following screenshot.
d. Select Next .
5. On the Settings page, under Task name , enter DeltaCopyFromBlobPipeline , and then select Next .
The Data Factory UI creates a pipeline with the specified task name.
6. On the Summary page, review the settings, and then select Next .
7. On the Deployment page, select Monitor to monitor the pipeline (task).
8. Notice that the Monitor tab on the left is automatically selected. You need to wait for the pipeline run to
be triggered automatically (after about one hour). When it runs, select the pipeline name link
DeltaCopyFromBlobPipeline to view activity run details or rerun the pipeline. Select Refresh to
refresh the list.

9. There's only one activity (copy activity) in the pipeline, so you see only one entry. Adjust the column width
of the Source and Destination columns (if necessary) to display more details. You can see that the source
file (file1.txt) has been copied from source/2021/07/15/06/ to destination/2021/07/15/06/ with the
same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the
files.

10. Create another empty text file named file2.txt . Upload the file2.txt file to the folder path
source/2021/07/15/07 in your storage account. You can use various tools to perform these tasks, such
as Azure Storage Explorer.

NOTE
Note that a new folder path needs to be created. Adjust the folder name to match your UTC time. For example,
if the current UTC time is 7:30 AM on July 15, 2021, you can create the folder path as
source/2021/07/15/07/ by the rule of {Year}/{Month}/{Day}/{Hour}/.

11. To go back to the Pipeline runs view, select All pipeline runs , and wait for the same pipeline to be
triggered again automatically after another hour.
12. Select the new DeltaCopyFromBlobPipeline link for the second pipeline run when it appears, and follow the
same steps to review details. You will see that the source file (file2.txt) has been copied from
source/2021/07/15/07/ to destination/2021/07/15/07/ with the same file name. You can also
verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files in
destination container.

Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Copy data securely from Azure Blob storage to a
SQL database by using private endpoints
7/7/2021 • 11 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this
data factory copies data securely from Azure Blob storage to an Azure SQL database (both allowing access to
only selected networks) by using private endpoints in Azure Data Factory Managed Virtual Network. The
configuration pattern in this tutorial applies to copying from a file-based data store to a relational data store. For
a list of data stores supported as sources and sinks, see the Supported data stores and formats table.

NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you do the following steps:


Create a data factory.
Create a pipeline with a copy activity.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Blob storage as a source data store. If you don't have a storage account,
see Create an Azure storage account for steps to create one. Ensure the storage account allows access only
from selected networks.
Azure SQL Database . You use the database as a sink data store. If you don't have an Azure SQL database,
see Create a SQL database for steps to create one. Ensure the SQL Database account allows access only from
selected networks.
Create a blob and a SQL table
Now, prepare your blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Open Notepad. Copy the following text, and save it as an emp.txt file on your disk:

FirstName,LastName
John,Doe
Jane,Doe

2. Create a container named adftutorial in your blob storage. Create a folder named input in this
container. Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure
Storage Explorer to do these tasks.
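
If you'd rather script this step, a rough Azure CLI equivalent is shown below (it assumes emp.txt is in your current directory and that <yourStorageAccount> is replaced with your storage account name). Because the storage account allows access only from selected networks, add your client IP to the firewall allow list first or use the Azure portal instead.

az storage container create --name adftutorial --account-name <yourStorageAccount> --auth-mode login
az storage blob upload --container-name adftutorial --name input/emp.txt --file emp.txt --account-name <yourStorageAccount> --auth-mode login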
Create a sink SQL table
Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

Create a data factory


In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome. Currently, only Microsoft Edge and Google Chrome web
browsers support the Data Factory UI.
2. On the left menu, select Create a resource > Analytics > Data Factory .
3. On the New data factory page, under Name , enter ADFTutorialDataFactory .
The name of the Azure data factory must be globally unique. If you receive an error message about the
name value, enter a different name for the data factory (for example, yournameADFTutorialDataFactory).
For naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported appear in the
drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in the Notifications center. Select Go to resource to go
to the Data Factory page.
10. Select Open on the Open Azure Data Factory Studio tile to launch the Data Factory UI in a separate
tab.

Create an Azure integration runtime in Data Factory Managed Virtual Network
In this step, you create an Azure integration runtime and enable Data Factory Managed Virtual Network.
1. In the Data Factory portal, go to Manage and select New to create a new Azure integration runtime.
2. On the Integration runtime setup page, choose what integration runtime to create based on required
capabilities. In this tutorial, select Azure, Self-Hosted and then click Continue .
3. Select Azure and then click Continue to create an Azure Integration runtime.

4. Under Virtual network configuration (Preview) , select Enable .


5. Select Create .

Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start by creating a pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the home page, select Orchestrate .
2. In the properties pane for the pipeline, enter CopyPipeline for the pipeline name.
3. In the Activities tool box, expand the Move and Transform category, and drag the Copy data activity
from the tool box to the pipeline designer surface. Enter CopyFromBlobToSql for the name.

Configure a source

TIP
In this tutorial, you use Account key as the authentication type for your source data store. You can also choose other
supported authentication methods, such as SAS URI , Service Principal , and Managed Identity if needed. For more
information, see the corresponding sections in Copy and transform data in Azure Blob storage by using Azure Data
Factory.
To store secrets for data stores securely, we also recommend that you use Azure Key Vault. For more information and
illustrations, see Store credentials in Azure Key Vault.

Create a source dataset and linked service


1. Go to the Source tab. Select + New to create a source dataset.
2. In the New Dataset dialog box, select Azure Blob Storage , and then select Continue . The source data
is in Blob storage, so you select Azure Blob Storage for the source dataset.
3. In the Select Format dialog box, select the format type of your data, and then select Continue .
4. In the Set Properties dialog box, enter SourceBlobDataset for Name . Select the check box for First
row as header . Under the Linked service text box, select + New .
5. In the New linked service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as
Name , and select your storage account from the Storage account name list.
6. Make sure you enable Interactive authoring . It might take around one minute to be enabled.

7. Select Test connection . It should fail because the storage account allows access only from Selected
networks, which requires Data Factory to create a managed private endpoint to the storage account that
must be approved before use. In the error message, you should see a link that you can follow to create a
managed private endpoint. An alternative is to go directly to the Manage tab and follow the
instructions in the next section to create a managed private endpoint.

NOTE
The Manage tab might not be available for all data factory instances. If you don't see it, you can access private
endpoints by selecting Author > Connections > Private Endpoint .

8. Keep the dialog box open, and then go to your storage account.
9. Follow instructions in this section to approve the private link.
10. Go back to the dialog box. Select Test connection again, and select Create to deploy the linked service.
11. After the linked service is created, it goes back to the Set properties page. Next to File path , select
Browse .
12. Go to the adftutorial/input folder, select the emp.txt file, and then select OK .
13. Select OK . It automatically goes to the pipeline page. On the Source tab, confirm that
SourceBlobDataset is selected. To preview data on this page, select Preview data .

Create a managed private endpoint


If you didn't select the hyperlink when you tested the connection, follow these steps instead. You'll now create a
managed private endpoint and connect it to the linked service you created.
1. Go to the Manage tab.

NOTE
The Manage tab might not be available for all Data Factory instances. If you don't see it, you can access private
endpoints by selecting Author > Connections > Private Endpoint .

2. Go to the Managed private endpoints section.


3. Select + New under Managed private endpoints .

4. Select the Azure Blob Storage tile from the list, and select Continue .
5. Enter the name of the storage account you created.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the storage account level.

Approval of a private link in a storage account


1. In the storage account, go to Private endpoint connections under the Settings section.
2. Select the check box for the private endpoint you created, and select Approve .
3. Add a description, and select yes .
4. Go back to the Managed private endpoints section of the Manage tab in Data Factory.
5. After about one or two minutes, you should see the approval of your private endpoint appear in the Data
Factory UI.
Configure a sink

TIP
In this tutorial, you use SQL authentication as the authentication type for your sink data store. You can also choose
other supported authentication methods, such as Service Principal and Managed Identity if needed. For more
information, see corresponding sections in Copy and transform data in Azure SQL Database by using Azure Data Factory.
To store secrets for data stores securely, we also recommend that you use Azure Key Vault. For more information and
illustrations, see Store credentials in Azure Key Vault.

Create a sink dataset and linked service


1. Go to the Sink tab, and select + New to create a sink dataset.
2. In the New Dataset dialog box, enter SQL in the search box to filter the connectors. Select Azure SQL
Database , and then select Continue . In this tutorial, you copy data to a SQL database.
3. In the Set Properties dialog box, enter OutputSqlDataset for Name . From the Linked service drop-
down list, select + New . A dataset must be associated with a linked service. The linked service has the
connection string that Data Factory uses to connect to the SQL database at runtime. The dataset specifies
the table into which the data is copied.
4. In the New linked service (Azure SQL Database) dialog box, take the following steps:
a. Under Name , enter AzureSqlDatabaseLinkedService .
b. Under Server name , select your SQL Server instance.
c. Make sure you enable Interactive authoring .
d. Under Database name , select your SQL database.
e. Under User name , enter the name of the user.
f. Under Password , enter the password for the user.
g. Select Test connection . It should fail because the SQL server allows access only from Selected
networks and requires Data Factory to create a private endpoint to it, which should be approved
prior to using it. In the error message, you should see a link to create a private endpoint that you can
follow to create a managed private endpoint. An alternative is to go directly to the Manage tab and
follow instructions in the next section to create a managed private endpoint.
h. Keep the dialog box open, and then go to your selected SQL server.
i. Follow instructions in this section to approve the private link.
j. Go back to the dialog box. Select Test connection again, and select Create to deploy the linked
service.
5. It automatically goes to the Set Properties dialog box. In Table , select [dbo].[emp] . Then select OK .
6. Go to the tab with the pipeline, and in Sink dataset , confirm that OutputSqlDataset is selected.
You can optionally map the schema of the source to the corresponding schema of the destination by following
Schema mapping in copy activity.
Create a managed private endpoint
If you didn't select the hyperlink when you tested the connection, follow these steps instead. You'll now create a
managed private endpoint and connect it to the linked service you created.
1. Go to the Manage tab.
2. Go to the Managed private endpoints section.
3. Select + New under Managed private endpoints .

4. Select the Azure SQL Database tile from the list, and select Continue .
5. Enter the name of the SQL server you selected.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the SQL server level.
Approval of a private link in SQL Server
1. In the SQL server, go to Private endpoint connections under the Settings section.
2. Select the check box for the private endpoint you created, and select Approve .
3. Add a description, and select yes .
4. Go back to the Managed private endpoints section of the Manage tab in Data Factory.
5. It should take one or two minutes for the approval to appear for your private endpoint.
Debug and publish the pipeline
You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory or
your own Azure Repos Git repository.
1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab
at the bottom of the window.
2. After the pipeline runs successfully, in the top toolbar, select Publish all . This action publishes entities
(datasets and pipelines) you created to Data Factory.
3. Wait until you see the Successfully published message. To see notification messages, select Show
Notifications in the upper-right corner (bell button).
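
After the debug run succeeds, you can optionally confirm the copy by querying the sink table in your SQL database. The two rows from emp.txt should appear:

SELECT ID, FirstName, LastName FROM dbo.emp;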
Summary
The pipeline in this sample copies data from Blob storage to SQL Database by using private endpoints in Data
Factory Managed Virtual Network. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Best practices for writing files to a data lake with
data flows
7/2/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you'll learn best practices that can be applied when writing files to ADLS Gen2 or Azure Blob
Storage using data flows. You'll need access to an Azure Blob Storage Account or Azure Data Lake Store Gen2
account for reading a parquet file and then storing the results in folders.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The steps in this tutorial assume that you have one of these storage accounts available.

Create a data factory


In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome . Currently, Data Factory UI is supported only in the Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory
3. On the New data factory page, under Name , enter ADFTutorialDataFactory
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.To learn about resource groups, see Use
resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in Notifications center. Select Go to resource to
navigate to the Data factory page.
10. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create a pipeline with a data flow activity


In this step, you'll create a pipeline that contains a data flow activity.
1. On the home page of Azure Data Factory, select Orchestrate .

2. In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
3. In the factory top bar, slide the Data Flow debug slider on. Debug mode allows for interactive testing of
transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up and
users are recommended to turn on debug first if they plan to do Data Flow development. For more
information, see Debug Mode.

4. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.

5. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DeltaLake . Click Finish when done.

Build transformation logic in the data flow canvas


You will take any source data (in this tutorial, we'll use a Parquet file source) and use a sink transformation to
land the data in Parquet format using the most effective mechanisms for data lake ETL.
Tutorial objectives
1. Choose any of your source datasets in a new data flow.
2. Use data flows to effectively partition your sink dataset.
3. Land your partitioned data in ADLS Gen2 lake folders.
Start from a blank data flow canvas
First, let's set up the data flow environment for each of the mechanisms described below for landing data in
ADLS Gen2.
1. Click on the source transformation.
2. Click the new button next to dataset in the bottom panel.
3. Choose a dataset or create a new one. For this demo, we'll use a Parquet dataset called User Data.
4. Add a Derived Column transformation. We'll use this as a way to set your desired folder names dynamically.
5. Add a sink transformation.
Hierarchical folder output
It is very common to use unique values in your data to create folder hierarchies to partition your data in the
lake. This is a very optimal way to organize and process data in the lake and in Spark (the compute engine
behind data flows). However, there will be a small performance cost to organize your output in this way. Expect
to see a small decrease in overall pipeline performance using this mechanism in the sink.
1. Go back to the data flow designer and edit the data flow created above. Click on the sink transformation.
2. Click Optimize > Set partitioning > Key
3. Pick the column(s) you wish to use to set your hierarchical folder structure.
4. Note the example below uses year and month as the columns for folder naming. The results will be folders of
the form releaseyear=1990/month=8 .
5. When accessing the data partitions in a data flow source, you will point to just the top-level folder above
releaseyear and use a wildcard pattern for each subsequent folder, ex: **/**/*.parquet
6. To manipulate the data values, or if you need to generate synthetic values for folder names, use the Derived
Column transformation to create the values you wish to use in your folder names (see the sketch below).
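
As a sketch of that last step, a Derived Column transformation can compute the partition key columns from a date field before the sink. The column names below (ReleaseDate, releaseyear, month) are assumptions for illustration; substitute the fields from your own dataset, then pick the derived columns under Optimize > Set partitioning > Key.

releaseyear = year(toDate(ReleaseDate, 'yyyy-MM-dd'))
month = month(toDate(ReleaseDate, 'yyyy-MM-dd'))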
Name folder as data values
A slightly better-performing sink technique for lake data in ADLS Gen2, which does not offer the same benefits
as key/value partitioning, is Name folder as column data . Whereas the key partitioning style of hierarchical
structure makes it easier to process data slices, this technique produces a flattened folder structure that can write
data more quickly.
1. Go back to the data flow designer and edit the data flow created above. Click on the sink transformation.
2. Click Optimize > Set partitioning > Use current partitioning.
3. Click Settings > Name folder as column data.
4. Pick the column that you wish to use for generating folder names.
5. To manipulate the data values, or if you need to generate synthetic values for folder names, use the Derived
Column transformation to create the values you wish to use in your folder names.

Name file as data values


The techniques listed in the above tutorials are good use cases for creating folder categories in your data lake.
The default file naming scheme being employed by those techniques is to use the Spark executor job ID.
Sometimes you may wish to set the name of the output file in a data flow text sink. This technique is only
suggested for use with small files. The process of merging partition files into a single output file is a long-
running process.
1. Go back to the data flow designer and edit the data flow created above. Click on the sink transformation.
2. Click Optimize > Set partitioning > Single partition. It is this single partition requirement that creates a
bottleneck in the execution process as files are merged. This option is only recommended for small files.
3. Click Settings > Name file as column data.
4. Pick the column that you wish to use for generating file names.
5. To manipulate the data values, or if you need to generate synthetic values for file names, use the Derived
Column transformation to create the values you wish to use in your file names (see the sketch below).
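
For example, a Derived Column transformation placed before the sink could build the file-name column. The column names here (genres, releaseyear, fileName) are illustrative assumptions; pick the resulting fileName column under Settings > Name file as column data.

fileName = concat(genres, '_', toString(releaseyear), '.csv')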

Next steps
Learn more about data flow sinks.
Dynamically set column names in data flows
6/23/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Many times, when processing data for ETL jobs, you will need to change the column names before writing the
results. Sometimes this is needed to align column names to a well-known target schema. Other times, you may
need to set column names at runtime based on evolving schemas. In this tutorial, you'll learn how to use data
flows to set column names for your destination files and database tables dynamically using external
configuration files and parameters.
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.

Create a data factory


In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome . Currently, Data Factory UI is supported only in the Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory
3. On the New data factory page, under Name , enter ADFTutorialDataFactory
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.To learn about resource groups, see Use
resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported are displayed in the
drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for example,
Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in Notifications center. Select Go to resource to navigate to
the Data factory page.
10. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create a pipeline with a data flow activity


In this step, you'll create a pipeline that contains a data flow activity.
1. From the ADF home page, select Create pipeline .
2. In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
3. In the factory top bar, slide the Data Flow debug slider on. Debug mode allows for interactive testing of
transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up and
users are recommended to turn on debug first if they plan to do Data Flow development. For more
information, see Debug Mode.

4. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.

5. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DynaCols . Click Finish when done.

Build dynamic column mapping in data flows


For this tutorial, we're going to use a sample movie ratings file and rename a few of the fields in the source to
a new set of target columns that can change over time. The datasets you'll create below should point to this
movies CSV file in your Blob Storage or ADLS Gen2 storage account. Download the movies file here and store
it in your Azure storage account.

Tutorial objectives
You'll learn how to dynamically set column names using a data flow
1. Create a source dataset for the movies CSV file.
2. Create a lookup dataset for a field mapping JSON configuration file.
3. Convert the columns from the source to your target column names.
Start from a blank data flow canvas
First, let's set up the data flow environment for each of the mechanisms described below for landing data in
ADLS Gen2.
1. Click on the source transformation and call it movies1 .
2. Click the new button next to dataset in the bottom panel.
3. Choose either Blob or ADLS Gen2 depending on where you stored the moviesDB.csv file from above.
4. Add a second source, which we will use to read the configuration JSON file that holds the field mappings.
5. Name it columnmappings .
6. For the dataset, point to a new JSON file that will store the column mapping configuration. You can
paste the following into the JSON file for this tutorial example:

[
{"prevcolumn":"title","newcolumn":"movietitle"},
{"prevcolumn":"year","newcolumn":"releaseyear"}
]

7. In the JSON settings for this source, set the document form to Array of documents .


8. Add a third source and call it movies2 . Configure it exactly the same as movies1 .
Parameterized column mapping
In this first scenario, you will set output column names in your data flow by mapping incoming fields to a
parameter that is a string array of column names, matching each array index to the incoming column's ordinal
position. When executing this data flow from a pipeline, you can set different column names on each pipeline
execution by passing this string array parameter to the data flow activity.

1. Go back to the data flow designer and edit the data flow created above.
2. Click on the parameters tab
3. Create a new parameter and choose string array data type
4. For the default value, enter ['a','b','c']

5. Use the top movies1 source to modify the column names to map to these array values
6. Add a Select transformation. The Select transformation will be used to map incoming columns to new
column names for output.
7. We're going to change the first 3 column names to the new names defined in the parameter
8. To do this, add 3 rule-based mapping entries in the bottom pane
9. For the first column, the matching rule will be position==1 and the name will be $parameter1[1]

10. Follow the same pattern for columns 2 and 3 (see the sketch after this list).


11. Click on the Inspect and Data Preview tabs of the Select transformation to see the new column name
values (a, b, c) replacing the original movie, title, and genres column names.
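
Put side by side, the three rule-based mapping entries take the following shape (matching condition on the left, output column name expression on the right). When you run the pipeline, you can then supply a different value for parameter1 on the Data Flow activity, for example ['movietitle','releaseyear','genres'], to rename the first three columns on that execution.

position == 1   ->   $parameter1[1]
position == 2   ->   $parameter1[2]
position == 3   ->   $parameter1[3]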
Create a cached lookup of external column mappings
Next, we'll create a cached sink for a later lookup. The cache will read an external JSON configuration file that
can be used to rename columns dynamically on each pipeline execution of your data flow.
1. Go back to the data flow designer and edit the data flow created above. Add a Sink transformation to the
columnmappings source.
2. Set sink type to Cache .
3. Under Settings, choose prevcolumn as the key column.
Lookup columns names from cached sink
Now that you've stored the configuration file contents in memory, you can dynamically map incoming column
names to new outgoing column names.
1. Go back to the data flow designer and edit the data flow created above. Click on the movies2 source
transformation.
2. Add a Select transformation. This time, we'll use the Select transformation to rename column names based
on the target name in the JSON configuration file that is being stored in the cached sink.
3. Add a rule-based mapping. For the Matching Condition, use this formula:
!isNull(cachedSink#lookup(name).prevcolumn) .
4. For the output column name, use this formula: cachedSink#lookup($$).newcolumn .
5. What we've done is find all column names that match the prevcolumn property from the external JSON
configuration file and rename each match to the new newcolumn name.
6. Click on the Data Preview and Inspect tabs in the Select transformation and you should now see the new
column names from the external mapping file.

Next steps
The completed pipeline from this tutorial can be downloaded from here
Learn more about data flow sinks.
Transform data in delta lake using mapping data
flows
7/2/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you'll use the data flow canvas to create data flows that allow you to analyze and transform data
in Azure Data Lake Storage (ADLS) Gen2 and store it in Delta Lake.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The file that we are transforming in this tutorial is MoviesDB.csv, which can be found here. To retrieve the file
from GitHub, copy the contents to a text editor of your choice to save locally as a .csv file. To upload the file to
your storage account, see Upload blobs with the Azure portal. The examples will be referencing a container
named 'sample-data'.

Create a data factory


In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome . Currently, Data Factory UI is supported only in the Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory
3. On the New data factory page, under Name , enter ADFTutorialDataFactory
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in Notifications center. Select Go to resource to
navigate to the Data factory page.
10. Select Author & Monitor to launch the Data Factory UI in a separate tab.
Create a pipeline with a data flow activity
In this step, you'll create a pipeline that contains a data flow activity.
1. On the home page, select Orchestrate .

2. In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
3. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.

4. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DeltaLake . Click Finish when done.

5. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for
interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes
to warm up and users are recommended to turn on debug first if they plan to do Data Flow development.
For more information, see Debug Mode.

Build transformation logic in the data flow canvas


You will generate two data flows in this tutorial. The first data flow is a simple source-to-sink flow that generates
a new Delta Lake from the movies CSV file above. Then, you'll create the flow design below to update data in
Delta Lake.
Tutorial objectives
1. Take the MoviesCSV dataset source from above and form a new Delta Lake from it.
2. Build the logic to update ratings for 1988 movies to '1'.
3. Delete all movies from 1950.
4. Insert new movies for 2021 by duplicating the movies from 1960.
Start from a blank data flow canvas
1. Click on the source transformation
2. Click new next to dataset in the bottom panel, and create a new linked service for ADLS Gen2.
3. Choose Delimited Text for the dataset type
4. Name the dataset “MoviesCSV”
5. Point to the MoviesCSV file that you uploaded to storage above
6. Set it to be comma delimited and include header on first row
7. Go to the source projection tab and click "Detect data types"
8. Once you have your projection set, you can continue
9. Add a sink transformation
10. Delta is an inline dataset type. You will need to point to your ADLS Gen2 storage account.

11. Choose a folder name in your storage container where you would like ADF to create the Delta Lake
12. Go back to the pipeline designer and click Debug to execute the pipeline in debug mode with just this
data flow activity on the canvas. This will generate your new Delta Lake in ADLS Gen2.
13. From Factory Resources, click new > Data flow
14. Use the MoviesCSV again as a source and click "Detect data types" again
15. Add a filter transformation to your source transformation in the graph
16. Only allow movie rows that match the three years you are going to work with which will be 1950, 1988,
and 1960
17. Update ratings for each 1988 movie to '1' by adding a derived column transformation after your filter
transformation.
18. In that same derived column transformation, create movies for 2021 by taking an existing year and changing
the year to 2021. Let’s pick 1960.
19. This is what your three derived columns will look like (see the sketch after this list for example expressions).

20. Update, insert, delete, and upsert policies are created in the alter Row transform. Add an alter row
transformation after your derived column.
21. Your alter row policies should look like this.

22. Now that you’ve set the proper policy for each alter row type, check that the proper update rules have
been set on the sink transformation

23. Here we are using the Delta Lake sink to your ADLS Gen2 data lake and allowing inserts, updates,
deletes.
24. Note that Key Columns is set to a composite key made up of the Movie primary key column and the year
column. This is because we created fake 2021 movies by duplicating the 1960 rows; the composite key avoids
collisions when looking up the existing rows by providing uniqueness.
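
As a hedged sketch of the expressions behind these screenshots (the tutorial's screenshots show three derived columns; two representative ones are shown here, and the column names year and Rating are assumptions based on the movies file):

Derived column:
    year   = iif(year == 1960, 2021, year)
    Rating = iif(year == 1988, '1', Rating)

Alter row:
    Update if:  year == 1988
    Delete if:  year == 1950
    Insert if:  year == 2021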
Download completed sample
Here is a sample solution for the Delta pipeline with a data flow for update/delete rows in the lake:

Next steps
Learn more about the data flow expression language.
Transform data using mapping data flows
7/2/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you'll use the Azure Data Factory user interface (UX) to create a pipeline that copies and
transforms data from an Azure Data Lake Storage (ADLS) Gen2 source to an ADLS Gen2 sink using mapping
data flow. The configuration pattern in this tutorial can be expanded upon when transforming data using
mapping data flows.

NOTE
This tutorial is meant for mapping data flows in general. Data flows are available both in Azure Data Factory and Synapse
Pipelines. If you are new to data flows in Azure Synapse Pipelines, please follow Data Flow using Azure Synapse Pipelines

In this tutorial, you do the following steps:


Create a data factory.
Create a pipeline with a Data Flow activity.
Build a mapping data flow with four transformations.
Test run the pipeline.
Monitor a Data Flow activity

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The file that we are transforming in this tutorial is MoviesDB.csv, which can be found here. To retrieve the file
from GitHub, copy the contents to a text editor of your choice to save locally as a .csv file. To upload the file to
your storage account, see Upload blobs with the Azure portal. The examples will be referencing a container
named 'sample-data'.

Create a data factory


In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome . Currently, Data Factory UI is supported only in the Microsoft
Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Integration > Data Factory :
3. On the New data factory page, under Name , enter ADFTutorialDataFactory .
The name of the Azure data factory must be globally unique. If you receive an error message about the
name value, enter a different name for the data factory (for example, yournameADFTutorialDataFactory).
For naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported are displayed in
the drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in Notifications center. Select Go to resource to
navigate to the Data factory page.
10. Select Author & Monitor to launch the Data Factory UI in a separate tab.
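If you'd rather script the factory creation, the same result can be achieved with the Az PowerShell cmdlets used elsewhere in this documentation. This is a minimal sketch; the resource group, region, and factory name are placeholders, and the factory name must be globally unique.

# Minimal sketch (placeholder names): create a resource group and a V2 data factory.
Connect-AzAccount
New-AzResourceGroup -Name "ADFTutorialResourceGroup" -Location "East US"
Set-AzDataFactoryV2 -ResourceGroupName "ADFTutorialResourceGroup" -Name "<yourname>ADFTutorialDataFactory" -Location "East US"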

Create a pipeline with a Data Flow activity


In this step, you'll create a pipeline that contains a Data Flow activity.
1. On the home page of Azure Data Factory, select Orchestrate.
2. In the General tab for the pipeline, enter TransformMovies for the Name of the pipeline.
3. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.

4. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow TransformMovies. Click Finish when done.

5. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for
interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes
to warm up, so it's recommended that you turn on debug first if you plan to do Data Flow development.
For more information, see Debug Mode.

Build transformation logic in the data flow canvas


Once you create your Data Flow, you'll be automatically sent to the data flow canvas. In this step, you'll build a
data flow that takes the moviesDB.csv in ADLS storage and aggregates the average rating of comedies from
1910 to 2000. You'll then write this file back to the ADLS storage.
1. In the data flow canvas, add a source by clicking on the Add Source box.
2. Name your source MoviesDB. Click on New to create a new source dataset.

3. Choose Azure Data Lake Storage Gen2. Click Continue.


4. Choose DelimitedText. Click Continue.

5. Name your dataset MoviesDB. In the linked service dropdown, choose New.
6. In the linked service creation screen, name your ADLS Gen2 linked service ADLSGen2 and specify your
authentication method. Then enter your connection credentials. In this tutorial, we're using Account key to
connect to our storage account. You can click Test connection to verify your credentials were entered
correctly. Click Create when finished.
7. Once you're back at the dataset creation screen, enter where your file is located under the File path field.
In this tutorial, the file moviesDB.csv is located in container sample-data. As the file has headers, check
First row as header. Select From connection/store to import the header schema directly from the
file in storage. Click OK when done.
8. If your debug cluster has started, go to the Data Preview tab of the source transformation and click
Refresh to get a snapshot of the data. You can use data preview to verify your transformation is
configured correctly.

9. Next to your source node on the data flow canvas, click on the plus icon to add a new transformation. The
first transformation you're adding is a Filter .
10. Name your filter transformation FilterYears . Click on the expression box next to Filter on to open the
expression builder. Here you'll specify your filtering condition.

11. The data flow expression builder lets you interactively build expressions to use in various
transformations. Expressions can include built-in functions, columns from the input schema, and user-
defined parameters. For more information on how to build expressions, see Data Flow expression builder.
In this tutorial, you want to filter movies of genre comedy that came out between the years 1910 and
2000. As year is currently a string, you need to convert it to an integer using the toInteger() function.
Use the greater than or equal to (>=) and less than or equal to (<=) operators to compare against the
literal year values 1910 and 2000. Combine these expressions with the and (&&) operator. The
expression comes out as:
toInteger(year) >= 1910 && toInteger(year) <= 2000

To find which movies are comedies, you can use the rlike() function to find the pattern 'Comedy' in the
column genres. Combine the rlike expression with the year comparison to get:
toInteger(year) >= 1910 && toInteger(year) <= 2000 && rlike(genres, 'Comedy')

If you have an active debug cluster, you can verify your logic by clicking Refresh to see the expression output
compared to the inputs used. There's more than one right answer for how you can accomplish this logic
using the data flow expression language.

Click Save and Finish once you're done with your expression.
12. Fetch a Data Preview to verify the filter is working correctly.

13. The next transformation you'll add is an Aggregate transformation under Schema modifier .

14. Name your aggregate transformation AggregateComedyRatings. In the Group by tab, select year
from the dropdown to group the aggregations by the year the movie came out.
15. Go to the Aggregates tab. In the left text box, name the aggregate column AverageComedyRating.
Click on the right expression box to enter the aggregate expression via the expression builder.

16. To get the average of column Rating, use the avg() aggregate function. As Rating is a string and
avg() takes a numerical input, we must convert the value to a number via the toInteger() function.
The expression looks like this:
avg(toInteger(Rating))

Click Save and Finish when done.

17. Go to the Data Preview tab to view the transformation output. Notice that only two columns are there:
year and AverageComedyRating.

18. Next, you want to add a Sink transformation under Destination.


19. Name your sink Sink. Click New to create your sink dataset.

20. Choose Azure Data Lake Storage Gen2. Click Continue.


21. Choose DelimitedText. Click Continue.

22. Name your sink dataset MoviesSink. For linked service, choose the ADLS Gen2 linked service you
created in step 6. Enter an output folder to write your data to. In this tutorial, we're writing to the folder
'output' in container 'sample-data'. The folder doesn't need to exist beforehand and can be dynamically
created. Set First row as header as true and select None for Import schema. Click Finish.

Now you've finished building your data flow. You're ready to run it in your pipeline.
Running and monitoring the Data Flow
You can debug a pipeline before you publish it. In this step, you're going to trigger a debug run of the data flow
pipeline. While data preview doesn't write data, a debug run will write data to your sink destination.
1. Go to the pipeline canvas. Click Debug to trigger a debug run.

2. Pipeline debug of Data Flow activities uses the active debug cluster but still takes at least a minute to
initialize. You can track the progress via the Output tab. Once the run is successful, click on the
eyeglasses icon to open the monitoring pane.

3. In the monitoring pane, you can see the number of rows and time spent in each transformation step.

4. Click on a transformation to get detailed information about the columns and partitioning of the data.
If you followed this tutorial correctly, you should have written 83 rows and 2 columns into your sink folder. You
can verify the data is correct by checking your blob storage.
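If you'd like to check from a script instead of the portal, a minimal Azure PowerShell sketch that lists what the data flow wrote to the output folder looks like the following. The account name and key are placeholders; the blob cmdlets generally work against ADLS Gen2 accounts, but adjust if your account requires the Gen2-specific cmdlets.

# Minimal sketch (placeholder credentials): list the files written to sample-data/output.
$ctx = New-AzStorageContext -StorageAccountName "<yourStorageAccount>" -StorageAccountKey "<yourStorageAccountKey>"
Get-AzStorageBlob -Container "sample-data" -Prefix "output/" -Context $ctx | Select-Object Name, Length, LastModified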

Next steps
The pipeline in this tutorial runs a data flow that aggregates the average rating of comedies from 1910 to 2000
and writes the data to ADLS. You learned how to:
Create a data factory.
Create a pipeline with a Data Flow activity.
Build a mapping data flow with four transformations.
Test run the pipeline.
Monitor a Data Flow activity
Learn more about the data flow expression language.
Mapping data flow video tutorials
6/23/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Below is a list of mapping data flow tutorial videos created by the Azure Data Factory team.
Because updates are constantly made to the product, some features may have added or different functionality in the
current Azure Data Factory user experience.

Getting Started
Getting started with mapping data flows in Azure Data Factory

Debugging and developing mapping data flows


Debugging and testing mapping data flows.
Data exploration
Data preview quick actions
Monitor and manage mapping data flow performance
Benchmark timings
Debugging workflows for data flows
Updated monitoring view

Transformation overviews
Aggregate transformation
Alter row transformation
Derived Column transformation
Join transformation
Self-join pattern
Lookup transformation
Lookup Transformation Updates & Tips
Pivot transformation
Pivot transformation: mapping drifted columns
Select transformation
Select transformation: Rule-based mapping
Select transformation: Large Datasets
Surrogate key transformation
Union transformation
Unpivot transformation
Window Transformation
Filter Transformation
Conditional Split Transformation
Exists Transformation
Dynamic Joins and Dynamic Lookups
Flatten transformation
Transform hierarchical data
Rank transformation
Cached lookup
Row context via Window transformation
Parse transformation
Transform complex data types
Output to next activity

Source and sink


Reading and writing JSONs
Parquet and delimited text files
CosmosDB connector
Infer data types in delimited text files
Reading and writing partitioned files
Transform and create multiple SQL tables
Partition your files in the data lake
Data warehouse loading pattern
Data lake file output options

Optimizing mapping data flows


Data lineage
Iterate files with parameters
Decrease start-up times
SQL DB performance
Logging and auditing
Dynamically optimize data flow cluster size at runtime
Optimize data flow start-up times
Azure Integration Runtimes for Data Flows
Quick cluster start-up time with Azure IR

Mapping data flow scenarios


Fuzzy lookups
Staging data pattern
Clean addresses pattern
Deduplication
Merge files
Slowly changing dimensions type 1: overwrite
Slowly changing dimensions type 2: history
Fact table loading
Transform SQL Server on-prem with delta data loading pattern
Parameterization
Distinct row & row counts
Handling truncation errors
Intelligent data routing
Data masking for sensitive data
Logical Models vs. Physical Models
Detect source data changes
Generic type 2 slowly changing dimension

Data flow expressions


Date/Time expressions
Splitting Arrays and Case Statement
Fun with string interpolation and parameters
Data Flow Script Intro: Copy, Paste, Snippets
Data Quality Expressions
Collect aggregate function

Metadata

Metadata validation rules


Prepare data with data wrangling
6/8/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Data wrangling in data factory allows you to build interactive Power Query mash-ups natively in ADF and then
execute them at scale inside an ADF pipeline.

NOTE
Power Query activity in ADF is currently available in public preview

Create a Power Query activity


There are two ways to create a Power Query in Azure Data Factory. One way is to click the plus icon and select
Data Flow in the factory resources pane.

NOTE
Previously, the data wrangling feature was located in the data flow workflow. Now, you will build your data wrangling
mash-up from New > Power Query.

The other method is in the activities pane of the pipeline canvas. Open the Power Query accordion and drag
the Power Query activity onto the canvas.
Author a Power Query data wrangling activity
Add a Source dataset for your Power Query mash-up. You can either choose an existing dataset or create a
new one. After you have saved your mash-up, you can then add the Power Query data wrangling activity to your
pipeline and select a sink dataset to tell ADF where to land your data. While you can choose one or more source
datasets, only one sink is allowed at this time. Choosing a sink dataset is optional, but at least one source dataset
is required.

Click Create to open the Power Query Online mashup editor.


First, you will choose a dataset source for the mashup editor.
Once you have completed building your Power Query, you can save it and add the mashup as an activity to your
pipeline. That is when you will set the sink dataset properties.

Author your wrangling Power Query using code-free data preparation. For the list of available functions, see
transformation functions. ADF translates the M script into a data flow script so that you can execute your Power
Query at scale using the Azure Data Factory data flow Spark environment.
Running and monitoring a Power Query data wrangling activity
To execute a pipeline debug run of a Power Query activity, click Debug in the pipeline canvas. Once you publish
your pipeline, Trigger now executes an on-demand run of the last published pipeline. Power Query pipelines
can be scheduled with all existing Azure Data Factory triggers.

Go to the Monitor tab to visualize the output of a triggered Power Query activity run.
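If you prefer to trigger and check a published Power Query pipeline from a script rather than the UI, a minimal Azure PowerShell sketch (resource group, factory, and pipeline names are placeholders) is:

# Minimal sketch (placeholder names): run the published pipeline on demand and check the run status.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -PipelineName "<pipelineName>"
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -PipelineRunId $runId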
Next steps
Learn how to create a mapping data flow.
Transform data in the cloud by using a Spark
activity in Azure Data Factory
7/2/2021 • 7 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data
by using a Spark activity and an on-demand Azure HDInsight linked service.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure storage account . You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.

NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a
premium or blob-only storage account.

Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:

import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder. If you prefer to script these uploads, see the PowerShell sketch below.
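Here's a minimal Azure PowerShell sketch for scripting these uploads; the storage account name and key are placeholders, and the 'folders' are simply prefixes in the blob names.

# Minimal sketch (placeholder credentials): create the adftutorial container and upload both files.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File ".\WordCount_Spark.py" -Container "adftutorial" -Blob "spark/script/WordCount_Spark.py" -Context $ctx
Set-AzStorageBlobContent -File ".\minecraftstory.txt" -Container "adftutorial" -Blob "spark/inputfiles/minecraftstory.txt" -Context $ctx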

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Select New on the left menu, select Data + Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the
name of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules
for Data Factory artifacts, see the Data Factory - naming rules article.

4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


You author two linked services in this section:
An Azure Storage linked service that links an Azure storage account to the data factory. This storage is
used by the on-demand HDInsight cluster. It also contains the Spark script to be run.
An on-demand HDInsight linked service. Azure Data Factory automatically creates an HDInsight cluster
and runs the Spark program. It then deletes the HDInsight cluster after the cluster is idle for a preconfigured
time.
Create an Azure Storage linked service
1. On the home page, switch to the Manage tab in the left panel.

2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select
Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureBlobStorage1. You created this linked service
earlier. If you used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an
HDInsight cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal. The Service principal id is equivalent to the Application ID, and a
Service principal key is equivalent to the value for a Client secret.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data
factory. The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-
demand HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its
primary storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more
information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.

Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.

2. In the Activities toolbox, expand HDInsight. Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureBlobStorage1.
b. Select Browse Storage.

c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow) button to
close the validation window.

6. Select Publish All . The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Add Trigger on the toolbar, and then select Trigger Now .

Monitor the pipeline run


1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 20 minutes to
create a Spark cluster.
2. Select Refresh periodically to check the status of the pipeline run.

3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.
You can switch back to the pipeline runs view by selecting the All Pipeline Runs link at the top.

Verify the output


Verify that the output file is created in the spark/outputfiles/wordcount folder of the adftutorial container.

The file should have each word from the input text file and the number of times the word appeared in the file.
For example:

(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)

Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked
service. You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual
network, advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity
in Azure Data Factory
3/5/2021 • 7 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Spark
Activity and an on-demand HDInsight linked service. You perform the following steps in this tutorial:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure Storage account . You create a python script and an input file, and upload them to the Azure storage.
The output from the spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Upload python script to your Blob Storage account
1. Create a Python file named WordCount_Spark.py with the following content:

import sys
from operator import add

from pyspark.sql import SparkSession

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)

    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()

2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark.
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.

Author linked services


You author two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is used
by the on-demand HDInsight cluster. It also contains the Spark script to be executed.
An On-Demand HDInsight Linked Service. Azure Data Factory automatically creates an HDInsight cluster, runs
the Spark program, and then deletes the HDInsight cluster after it's idle for a pre-configured time.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json.
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>"
}
}
}

Update the <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage
account.
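If you don't have the account key handy, you can retrieve it with Azure PowerShell before editing the JSON. This is a minimal sketch with placeholder resource group and account names.

# Minimal sketch (placeholder names): look up the primary key of the storage account.
$key = (Get-AzStorageAccountKey -ResourceGroupName "<resourceGroupName>" -Name "<storageAccountName>")[0].Value
# Paste this value in place of <storageAccountKey> in MyStorageLinkedService.json.
Write-Host $key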
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.

{
"name": "MyOnDemandSparkLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 2,
"clusterType": "spark",
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscriptionID> ",
"servicePrincipalId": "<servicePrincipalID>",
"servicePrincipalKey": {
"value": "<servicePrincipalKey>",
"type": "SecureString"
},
"tenant": "<tenant ID>",
"clusterResourceGroup": "<resourceGroupofHDICluster>",
"version": "3.6",
"osType": "Linux",
"clusterNamePrefix":"ADFSparkSample",
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}

Update values for the following properties in the linked service definition:
hostSubscriptionId. Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID
and key of your service principal in Azure Active Directory. This service principal needs to be a member
of the Contributor role of the subscription or the resource group in which the cluster is created. See Create an
Azure Active Directory application and service principal for details. The Service principal id is equivalent to
the Application ID, and a Service principal key is equivalent to the value for a Client secret.
clusterResourceGroup. Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
NOTE
Azure HDInsight has a limitation on the total number of cores you can use in each Azure region it supports. For the On-
Demand HDInsight Linked Service, the HDInsight cluster will be created in the same location as the Azure Storage account used as
its primary storage. Ensure that you have enough core quota for the cluster to be created successfully. For more
information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.

Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following JSON definition of a pipeline definition, and save
it as MySparkOnDemandPipeline.json .

{
"name": "MySparkOnDemandPipeline",
"properties": {
"activities": [
{
"name": "MySparkActivity",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyOnDemandSparkLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"rootPath": "adftutorial/spark",
"entryFilePath": "script/WordCount_Spark.py",
"getDebugInfo": "Failure",
"sparkJobLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}

Note the following points:


rootPath points to the spark folder of the adftutorial container.
entryFilePath points to the WordCount_Spark.py file in the script subfolder of the spark folder.

Create a data factory


You have authored linked service and pipeline definitions in JSON files. Now, let's create a data factory, and
deploy the linked Service and pipeline JSON files by using PowerShell cmdlets. Run the following PowerShell
commands one by one:
1. Set variables one by one.
Resource Group Name

$resourceGroupName = "ADFTutorialResourceGroup"

Data Factory Name. Must be globally unique.


$dataFactoryName = "MyDataFactory09102017"

Pipeline name

$pipelineName = "MySparkOnDemandPipeline" # Name of the pipeline

2. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen it,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.)
and computes (HDInsight, etc.) used by the data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

3. Create the resource group: ADFTutorialResourceGroup.

New-AzResourceGroup -Name $resourceGroupName -Location "East Us"

4. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:

$df

5. Switch to the folder where you created JSON files, and run the following command to deploy an Azure
Storage linked service:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"

6. Run the following command to deploy an on-demand Spark linked service:

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyOnDemandSparkLinkedService" -File "MyOnDemandSparkLinkedService.json"

7. Run the following command to deploy a pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MySparkOnDemandPipeline.json"

Start and monitor a pipeline run


1. Start a pipeline run. It also captures the pipeline run ID for future monitoring.

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -
RunStartedBefore (Get-Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

3. Here is the output of the sample run:


Pipeline run status: In Progress
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName :
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output :
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MySparkOnDemandPipeline' run finished. Result:


ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : MyDataFactory09102017
ActivityName : MySparkActivity
PipelineRunId : 94e71d08-a6fa-4191-b7d1-cf8c71cb4794
PipelineName : MySparkOnDemandPipeline
Input : {rootPath, entryFilePath, getDebugInfo, sparkJobLinkedService}
Output : {clusterInUse, jobId, ExecutionProgress, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 9/20/2017 6:33:47 AM
ActivityRunEnd : 9/20/2017 6:46:30 AM
DurationInMs : 763466
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"clusterInUse": "https://ADFSparkSamplexxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.azurehdinsight.net/"
"jobId": "0"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MySparkActivity"

4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container with the
output from the Spark program.
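To confirm this from PowerShell rather than the portal, a minimal sketch that lists the word-count output blobs (storage account name and key are placeholders) is:

# Minimal sketch (placeholder credentials): list the Spark job output under spark/outputfiles/wordcount.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
Get-AzStorageBlob -Container "adftutorial" -Prefix "spark/outputfiles/wordcount/" -Context $ctx | Select-Object Name, Length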

Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. You
learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running Hive script on an Azure HDInsight cluster
that is in a virtual network.
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
7/2/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks
notebook against the Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks
notebook during execution.
You perform the following steps in this tutorial:
Create a data factory.
Create a pipeline that uses Databricks Notebook Activity.
Trigger a pipeline run.
Monitor the pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Prerequisites
Azure Databricks workspace . Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it
using Azure Data Factory.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Select Create a resource on the left menu, select Analytics, and then select Data Factory.
3. In the New data factory pane, enter ADFTutorialDataFactory under Name.
The name of the Azure data factory must be globally unique. If you see the following error, change the
name of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules
for Data Factory artifacts, see the Data Factory - naming rules article.
4. For Subscription, select your Azure subscription in which you want to create the data factory.
5. For Resource Group, take one of the following steps:
Select Use existing and select an existing resource group from the drop-down list.
Select Create new and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version, select V2.
7. For Location, select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create.
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.

Create linked services


In this section, you author a Databricks linked service. This linked service contains the connection information to
the Databricks cluster:
Create an Azure Databricks linked service
1. On the home page, switch to the Manage tab in the left panel.

2. Select Connections at the bottom of the window, and then select + New.
3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.
4. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureDatabricks_LinkedService.
b. Select the appropriate Databricks workspace that you will run your notebook in.
c. For Select cluster, select New job cluster.
d. For Domain/Region, the info should auto-populate.
e. For Access Token, generate it from the Azure Databricks workspace. You can find the steps here.
f. For Cluster version, select 4.2 (with Apache Spark 2.3.1, Scala 2.11).
g. For Cluster node type, select Standard_D3_v2 under the General Purpose (HDD) category for
this tutorial.
h. For Workers, enter 2.
i. Select Finish.
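For reference, the UI settings above roughly correspond to an Azure Databricks linked service JSON definition that could also be deployed with the Set-AzDataFactoryV2LinkedService pattern used in other tutorials in this documentation. The property names below reflect the Azure Databricks linked service as I understand it; treat the values (domain URL, token, cluster settings) as placeholders and verify the property names against the Azure Databricks connector reference before relying on them.

# Hedged sketch (assumed JSON properties, placeholder values): deploy the Databricks linked service from a file.
$json = @'
{
    "name": "AzureDatabricks_LinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<your-databricks-workspace-url>",
            "accessToken": { "type": "SecureString", "value": "<access token>" },
            "newClusterVersion": "<runtime version, for example 4.2.x-scala2.11>",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_D3_v2"
        }
    }
}
'@
Set-Content -Path ".\AzureDatabricks_LinkedService.json" -Value $json
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -Name "AzureDatabricks_LinkedService" -File ".\AzureDatabricks_LinkedService.json"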
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the Pipeline. Later you pass this parameter to the Databricks
Notebook Activity. In the empty pipeline, click on the Parameters tab, then New, and name it 'name'.
3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox
to the pipeline designer surface.

4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks Notebook path. Let's create a notebook and specify the path here. You
get the Notebook Path by following the next few steps.
a. Launch your Azure Databricks workspace.
b. Create a New Folder in the workspace and call it adftutorial.

c. Create a new notebook (Python), let's call it mynotebook, under the adftutorial folder, and click Create.
d. In the newly created notebook "mynotebook" add the following code:

# Creating widgets for leveraging parameters, and printing the parameters

dbutils.widgets.text("input", "","")
y = dbutils.widgets.get("input")
print ("Param -\'input':")
print (y)

e. The Notebook Path in this case is /adftutorial/mynotebook


5. Switch back to the Data Factory UI authoring tool. Navigate to the Settings tab under the Notebook1
activity.
a. Add a Parameter to the Notebook activity. You use the same parameter that you added earlier to the
Pipeline.
b. Name the parameter input and provide the value as the expression @pipeline().parameters.name.
6. To validate the pipeline, select the Validate button on the toolbar. To close the validation window, select
the >> (right arrow) button.

7. Select Publish All . The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now .

The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to
create a Databricks job cluster, where the notebook is executed.

2. Select Refresh periodically to check the status of the pipeline run.


3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
Verify the output
You can log on to the Azure Databricks workspace , go to Clusters and you can see the Job status as
pending execution, running, or terminated.

You can click on the Job name and navigate to see further details. On successful run, you can validate the
parameters passed and the output of the Python notebook.

Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned
how to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory using the Azure
portal
7/2/2021 • 9 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use Azure portal to create a Data Factory pipeline that transforms data using Hive Activity on
a HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure Storage account . You create a hive script, and upload it to the Azure storage. The output from
the Hive script is stored in this storage account. In this sample, HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Vir tual Network . If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight is in an Azure Virtual Network. Here is a sample configuration
of Azure Virtual Network.
HDInsight cluster. Create a HDInsight cluster and join it to the virtual network you created in the
previous step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a
sample configuration of HDInsight in a virtual network.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
A vir tual machine . Create an Azure virtual machine VM and join it into the same virtual network that
contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable;
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts .
4. Upload the hivescript.hql file to the hivescripts subfolder.

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Log in to the Azure portal.
3. Click New on the left menu, click Data + Analytics, and click Data Factory.
4. In the New data factory page, enter ADFTutorialHiveFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again.
See the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "MyAzureSsisDataFactory" is not available
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version.
8. Select the location for the data factory. Only locations that are supported for creation of data factories
are shown in the list.
9. Select Pin to dashboard.
10. Click Create.
11. On the dashboard, you see the following tile with status: Deploying data factory.
12. After the creation is complete, you see the Data Factory page as shown in the image.

13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the home page, switch to the Manage tab in the left panel as shown in the following image:
Create a self-hosted integration runtime


As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR) in the
same virtual network. In this section, you create a new VM, join it to the same virtual network, and install self-
hosted IR on it. The self-hosted IR allows Data Factory service to dispatch processing requests to a compute
service such as HDInsight inside a virtual network. It also allows you to move data to/from data stores inside a
virtual network to Azure. You use a self-hosted IR when the data store or compute is in an on-premises
environment as well.
1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration
Runtimes tab, and click + New button on the toolbar.
2. In the Integration Runtime Setup window, Select Perform data movement and dispatch
activities to external computes option, and click Next .

3. Select Private Network , and click Next .


4. Enter MySelfHostedIR for Name , and click Next .

5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep
the window open. You use this key to register the IR installed in a virtual machine.
Install IR on a virtual machine
1. On the Azure VM, download self-hosted integration runtime. Use the authentication key obtained in
the previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.

3. Click Launch Configuration Manager . You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI, you should see the name of the self-hosted VM and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.

Create linked services


You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is
the primary storage used by your HDInsight cluster. In this case, you use this Azure Storage account to store
the Hive script and the output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Create Azure Storage linked service
1. Switch to the Linked Services tab, and click New.

2. In the New Linked Service window, select Azure Blob Storage, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureStorageLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime.
c. Select your Azure storage account for Storage account name.
d. To test the connection to the storage account, click Test connection.
e. Click Save.

Create HDInsight linked service


1. Click New again to create another linked service.
2. Switch to the Compute tab, select Azure HDInsight, and click Continue.

3. In the New Linked Service window, do the following steps:


a. Enter AzureHDInsightLinkedService for Name.
b. Select Bring your own HDInsight.
c. Select your HDInsight cluster for Hdi cluster.
d. Enter the user name for the HDInsight cluster.
e. Enter the password for the user.
This article assumes that you have access to the cluster over the internet. For example, that you can connect to
the cluster at https://clustername.azurehdinsight.net. This address uses the public gateway, which is not
available if you have used network security groups (NSGs) or user-defined routes (UDRs) to restrict access from
the internet. For Data Factory to be able to submit jobs to an HDInsight cluster in an Azure Virtual Network, you need
to configure your Azure Virtual Network in such a way that the URL can be resolved to the private IP address of the
gateway used by HDInsight.
1. From the Azure portal, open the Virtual Network the HDInsight cluster is in. Open the network interface with the name
starting with nic-gateway-0. Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has a DNS server, update the DNS record so the HDInsight cluster URL
https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15. If you don't have a DNS server
in your Azure Virtual Network, you can temporarily work around this by editing the hosts file
(C:\Windows\System32\drivers\etc) of all VMs that are registered as self-hosted integration runtime nodes
by adding an entry similar to the following one (a scripted version follows the example):
10.6.0.15 myHDIClusterName.azurehdinsight.net
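If you have many self-hosted IR VMs to update, the hosts-file edit can be scripted. This is a minimal sketch to run in an elevated PowerShell session on each node; the IP address and cluster name are the example values from the step above, so substitute your own.

# Minimal sketch (example values): append the HDInsight gateway entry to the local hosts file.
# Run as Administrator on each VM that hosts a self-hosted integration runtime node.
Add-Content -Path "C:\Windows\System32\drivers\etc\hosts" -Value "10.6.0.15 myHDIClusterName.azurehdinsight.net"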

Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes Hive script to return data from a
sample table and save it to a path you defined.
Note the following points:
scriptPath points to path to Hive script on the Azure Storage Account you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasbs://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline .

2. In the Activities toolbox, expand HDInsight , and drag-drop Hive activity to the pipeline designer
surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service.

4. Switch to the Scripts tab, and do the following steps:


a. Select AzureStorageLinkedService for Script Linked Service.
b. For File Path, click Browse Storage.
c. In the Choose a file or folder window, navigate to the hivescripts folder of the adftutorial
container, select hivescript.hql, and click Finish.

d. Confirm that you see adftutorial/hivescripts/hivescript.hql for File Path.

e. In the Script tab, expand the Advanced section.


f. Click Auto-fill from script for Parameters.
g. Enter the value for the Output parameter in the following format:
wasbs://<Blob Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ (for example,
wasbs://adftutorial@<StorageAccount>.blob.core.windows.net/outputfolder/).
5. To publish artifacts to Data Factory, click Publish.

Trigger a pipeline run


1. First, validate the pipeline by clicking the Validate button on the toolbar. Close the Pipeline Validation
Output window by clicking right-arrow (>>) .
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

Monitor the pipeline run


1. Switch to the Monitor tab on the left. You see a pipeline run in the Pipeline Runs list.

2. To refresh the list, click Refresh .


3. To view activity runs associated with the pipeline runs, click View activity runs in the Action column.
Other action links are for stopping/rerunning the pipeline.

4. You see only one activity run since there is only one activity in the pipeline of type HDInsightHive . To
switch back to the previous view, click Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
5/28/2021 • 9 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you use Azure PowerShell to create a Data Factory pipeline that transforms data using Hive
Activity on a HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this
tutorial:
Create a data factory.
Author and setup self-hosted integration runtime
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run
verify the output.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure Storage account . You create a hive script, and upload it to the Azure storage. The output from
the Hive script is stored in this storage account. In this sample, HDInsight cluster uses this Azure Storage
account as the primary storage.
Azure Vir tual Network . If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight is in an Azure Virtual Network. Here is a sample configuration
of Azure Virtual Network.
HDInsight cluster. Create a HDInsight cluster and join it to the virtual network you created in the
previous step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a
sample configuration of HDInsight in a virtual network.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:

DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

INSERT OVERWRITE TABLE HiveSampleOut
SELECT
    clientid,
    market,
    devicemodel,
    state
FROM hivesampletable;

2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts .
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use
an existing resource group if you like.

$resourceGroupName = "ADFTutorialResourceGroup"

2. Specify the data factory name. The name must be globally unique.

$dataFactoryName = "MyDataFactory09142017"

3. Specify a name for the pipeline.

$pipelineName = "MyHivePipeline"

4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when
the Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.

$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"

5. Launch PowerShell . Keep Azure PowerShell open until the end of this tutorial. If you close and reopen it,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory : Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.)
and computes (HDInsight, etc.) used by the data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:

Connect-AzAccount

Run the following command to view all the subscriptions for this account:

Get-AzSubscription

Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:

Select-AzSubscription -SubscriptionId "<SubscriptionId>"

6. Create the resource group: ADFTutorialResourceGroup if it does not exist already in your subscription.

New-AzResourceGroup -Name $resourceGroupName -Location "East US"

7. Create the data factory.

$df = Set-AzDataFactoryV2 -Location EastUS -Name $dataFactoryName -ResourceGroupName $resourceGroupName

Execute the following command to see the output:

$df

Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network where your HDInsight cluster is in.
1. Create the self-hosted integration runtime. Use a unique name in case an integration runtime with the
same name already exists.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted

This command creates a logical registration of the self-hosted integration runtime.


2. Use PowerShell to retrieve the authentication keys for the self-hosted integration runtime. Copy one of
the keys; you use it to register the self-hosted integration runtime.

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json

Here is the sample output:

{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}

Note down the value of AuthKey1 without the quotation marks, or capture it with the PowerShell sketch below.
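The following is a minimal sketch that stores AuthKey1 in a variable; it assumes the same PowerShell session and variables used earlier in this tutorial.

# Capture AuthKey1 from the integration runtime key object (same session and variables as above).
$authKey = (Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName).AuthKey1
$authKey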


3. Create an Azure VM and join it to the same virtual network that contains your HDInsight cluster. For
details, see How to create virtual machines.
4. On the Azure VM, download the self-hosted integration runtime. Use the authentication key obtained in the
previous step to manually register the self-hosted integration runtime.
You see the following message when the self-hosted integration runtime is registered successfully:

You see the following page when the node is connected to the cloud service:
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep
the Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json .

{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>"
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}

Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account. If you prefer, retrieve the key with the PowerShell sketch below.
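A minimal sketch for retrieving the key, assuming the storage account is in the resource group used in this tutorial and that <storageAccountName> is replaced with your account name:

# Retrieve the primary key of the storage account (assumes the account lives in $resourceGroupName).
$storageAccountKey = (Get-AzStorageAccountKey -ResourceGroupName $resourceGroupName -Name "<storageAccountName>")[0].Value
$storageAccountKey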
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json .
{
"name": "MyHDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<clustername>.azurehdinsight.net",
"userName": "<username>",
"password": {
"value": "<password>",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}

Update values for the following properties in the linked service definition:
userName . Name of the cluster login user that you specified when creating the cluster.
password . The password for the user.
clusterUri . Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net . This article assumes that you have access to the cluster over
the internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net . This
address uses the public gateway, which is not available if you have used network security groups (NSGs)
or user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such
a way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From the Azure portal, open the virtual network that the HDInsight cluster is in. Open the network interface with
the name starting with nic-gateway-0 . Note down its private IP address, for example, 10.6.0.15.
2. If your Azure Virtual Network has a DNS server, update the DNS record so that the HDInsight cluster
URL https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . This is the
recommended approach. If you don't have a DNS server in your Azure Virtual Network, you can
temporarily work around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all
VMs that are registered as self-hosted integration runtime nodes, adding an entry like the following
(a PowerShell sketch for adding the entry follows this list):
10.6.0.15 myHDIClusterName.azurehdinsight.net
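A minimal sketch for the hosts-file workaround, assuming you run it in an elevated PowerShell session on each self-hosted integration runtime VM; the IP address and cluster name are the example values from this article and must be replaced with your own.

# Append the HDInsight gateway entry to the hosts file (requires an elevated session).
# 10.6.0.15 and myHDIClusterName are example values; use your gateway IP and cluster name.
Add-Content -Path "$env:windir\System32\drivers\etc\hosts" -Value "10.6.0.15 myHDIClusterName.azurehdinsight.net"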

Create linked services


1. In PowerShell, switch to the folder where you created the JSON files.
2. Run the following command to create an Azure Storage linked service.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyStorageLinkedService" -File "MyStorageLinkedService.json"
3. Run the following command to create an Azure HDInsight linked service. (A sketch for verifying both linked services follows these steps.)

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "MyHDInsightLinkedService" -File "MyHDInsightLinkedService.json"

Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes the Hive script to return data from a
sample table and save it to a path that you define. Create a JSON file in your preferred editor, copy the following
JSON pipeline definition, and save it as MyHivePipeline.json .

{
"name": "MyHivePipeline",
"properties": {
"activities": [
{
"name": "MyHiveActivity",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDILinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
"getDebugInfo": "Failure",
"defines": {
"Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
},
"scriptLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}

Note the following points:


scriptPath points to the path of the Hive script in the Azure storage account that you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format
wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder in
your Azure Storage account. The path is case-sensitive.
Switch to the folder where you created JSON files, and run the following command to deploy the pipeline:

Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name $pipelineName -File "MyHivePipeline.json"

Start the pipeline


1. Start a pipeline run. The following command also captures the pipeline run ID for later monitoring.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineName $pipelineName

2. Run the following script to continuously check the pipeline run status until it finishes.

while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -
RunStartedBefore (Get-Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

Here is the output of the sample run:


Pipeline run status: In Progress

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :

Pipeline ' MyHivePipeline' run finished. Result:

ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"logLocation": "wasbs://[email protected]/HiveQueryJobs/000000000-0000-
47c3-9b28-1cdc7f3f2ba2/18_09_2017_06_58_18_023/Status"
"clusterInUse": "https://adfv2HivePrivate.azurehdinsight.net"
"jobId": "job_1505387997356_0024"
"ExecutionProgress": "Succeeded"
"effectiveIntegrationRuntime": "MySelfhostedIR"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MyHiveActivity"

3. Check the outputfolder folder for a new file created as the Hive query result. It should look like the
following sample output (a PowerShell sketch for downloading the file follows the sample):

8 en-US SCH-i500 California


23 en-US Incredible Pennsylvania
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
212 en-US SCH-i500 New York
246 en-US SCH-i500 District Of Columbia
246 en-US SCH-i500 District Of Columbia
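To verify the output from PowerShell instead of the portal, here is a minimal sketch; it assumes the storage context ($ctx) from earlier in this tutorial, that the output was written to the outputfolder folder of the adftutorial container, and that the exact output blob name (for example, 000000_0) is taken from the listing.

# List the blobs that the Hive query wrote to the output folder (same $ctx as earlier).
Get-AzStorageBlob -Container "adftutorial" -Prefix "outputfolder/" -Context $ctx

# Download one of the listed blobs for inspection; "outputfolder/000000_0" is an assumed example name.
Get-AzStorageBlobContent -Container "adftutorial" -Blob "outputfolder/000000_0" -Destination ".\hiveoutput.txt" -Context $ctx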
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Author and set up a self-hosted integration runtime.
Author and deploy linked services.
Author and deploy a pipeline that contains a Hive activity.
Start a pipeline run.
Monitor the pipeline run.
Verify the output.
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Branching and chaining Data Factory control flow
Transform data securely by using mapping data flow
7/2/2021 • 10 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you'll use the Data Factory user interface (UI) to create a pipeline that copies and transforms data
from an Azure Data Lake Storage Gen2 source to a Data Lake Storage Gen2 sink (both allowing access to only
selected networks) by using mapping data flow in Data Factory Managed Virtual Network. You can expand on
the configuration pattern in this tutorial when you transform data by using mapping data flow.
In this tutorial, you do the following steps:
Create a data factory.
Create a pipeline with a data flow activity.
Build a mapping data flow with four transformations.
Test run the pipeline.
Monitor a data flow activity.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Data Lake Storage as source and sink data stores. If you don't have a
storage account, see Create an Azure storage account for steps to create one. Ensure the storage account
allows access only from selected networks.
The file that we'll transform in this tutorial is moviesDB.csv, which can be found at this GitHub content site. To
retrieve the file from GitHub, copy the contents to a text editor of your choice to save it locally as a .csv file. To
upload the file to your storage account, see Upload blobs with the Azure portal. The examples will reference a
container named sample-data .

Create a data factory


In this step, you create a data factory and open the Data Factory UI to create a pipeline in the data factory.
1. Open Microsoft Edge or Google Chrome. Currently, only Microsoft Edge and Google Chrome web
browsers support the Data Factory UI.
2. On the left menu, select Create a resource > Analytics > Data Factor y .
3. On the New data factory page, under Name , enter ADFTutorialDataFactory .
The name of the data factory must be globally unique. If you receive an error message about the name
value, enter a different name for the data factory (for example, yournameADFTutorialDataFactory). For
naming rules for Data Factory artifacts, see Data Factory naming rules.
4. Select the Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
6. Under Version , select V2 .
7. Under Location , select a location for the data factory. Only locations that are supported appear in the
drop-down list. Data stores (for example, Azure Storage and Azure SQL Database) and computes (for
example, Azure HDInsight) used by the data factory can be in other regions.
8. Select Create .
9. After the creation is finished, you see the notice in the Notifications center. Select Go to resource to go
to the Data Factory page.
10. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create an Azure IR in Data Factory Managed Virtual Network


In this step, you create an Azure IR and enable Data Factory Managed Virtual Network.
1. In the Data Factory portal, go to Manage , and select New to create a new Azure IR.

2. On the Integration runtime setup page, choose what integration runtime to create based on required
capabilities. In this tutorial, select Azure, Self-Hosted and then click Continue .
3. Select Azure and then click Continue to create an Azure Integration runtime.
4. Under Virtual network configuration (Preview) , select Enable .
5. Select Create .

Create a pipeline with a data flow activity


In this step, you'll create a pipeline that contains a data flow activity.
1. On the home page of Azure Data Factory, select Orchestrate .
2. In the properties pane for the pipeline, enter TransformMovies for the pipeline name.
3. In the Activities pane, expand Move and Transform . Drag the Data Flow activity from the pane to the
pipeline canvas.
4. In the Adding data flow pop-up, select Create new data flow and then select Mapping Data Flow .
Select OK when you're finished.

5. Name your data flow TransformMovies in the properties pane.


6. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for
interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes
to warm up, so we recommend that you turn on debug first if you plan to do Data Flow development.
For more information, see Debug Mode.

Build transformation logic in the data flow canvas


After you create your data flow, you'll be automatically sent to the data flow canvas. In this step, you'll build a
data flow that takes the moviesDB.csv file in Data Lake Storage and aggregates the average rating of comedies
from 1910 to 2000. You'll then write this file back to Data Lake Storage.
Add the source transformation
In this step, you set up Data Lake Storage Gen2 as a source.
1. In the data flow canvas, add a source by selecting the Add Source box.
2. Name your source MoviesDB . Select New to create a new source dataset.
3. Select Azure Data Lake Storage Gen2 , and then select Continue .
4. Select DelimitedText , and then select Continue .
5. Name your dataset MoviesDB . In the linked service drop-down, select New .
6. In the linked service creation screen, name your Data Lake Storage Gen2 linked service ADLSGen2 and
specify your authentication method. Then enter your connection credentials. In this tutorial, we're using
Account key to connect to our storage account.
7. Make sure you enable Interactive authoring . It might take a minute to be enabled.
8. Select Test connection . It should fail because the storage account doesn't allow access to it without
the creation and approval of a private endpoint. In the error message, you should see a link to create a
private endpoint that you can follow to create a managed private endpoint. An alternative is to go directly
to the Manage tab and follow instructions in this section to create a managed private endpoint.
9. Keep the dialog box open, and then go to your storage account.
10. Follow instructions in this section to approve the private link.
11. Go back to the dialog box. Select Test connection again, and select Create to deploy the linked service.
12. On the dataset creation screen, enter where your file is located under the File path field. In this tutorial,
the file moviesDB.csv is located in the container sample-data . Because the file has headers, select the
First row as header check box. Select From connection/store to import the header schema directly
from the file in storage. Select OK when you're finished.
13. If your debug cluster has started, go to the Data Preview tab of the source transformation and select
Refresh to get a snapshot of the data. You can use the data preview to verify your transformation is
configured correctly.

Create a managed private endpoint


If you didn't use the hyperlink when you tested the preceding connection, follow these steps. You need to
create a managed private endpoint and connect it to the linked service you created.
1. Go to the Manage tab.
NOTE
The Manage tab might not be available for all Data Factory instances. If you don't see it, you can access private
endpoints by selecting Author > Connections > Private Endpoint .

2. Go to the Managed private endpoints section.


3. Select + New under Managed private endpoints .

4. Select the Azure Data Lake Storage Gen2 tile from the list, and select Continue .
5. Enter the name of the storage account you created.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the storage account level.
Approval of a private link in a storage account
1. In the storage account, go to Private endpoint connections under the Settings section.
2. Select the check box by the private endpoint you created, and select Approve .

3. Add a description, and select yes .


4. Go back to the Managed private endpoints section of the Manage tab in Data Factory.
5. After about a minute, you should see the approval appear for your private endpoint.
Add the filter transformation
1. Next to your source node on the data flow canvas, select the plus icon to add a new transformation. The
first transformation you'll add is a Filter .

2. Name your filter transformation FilterYears . Select the expression box next to Filter on to open the
expression builder. Here you'll specify your filtering condition.
3. The data flow expression builder lets you interactively build expressions to use in various
transformations. Expressions can include built-in functions, columns from the input schema, and user-
defined parameters. For more information on how to build expressions, see Data flow expression builder.
In this tutorial, you want to filter movies in the comedy genre that came out between the years
1910 and 2000. Because the year is currently a string, you need to convert it to an integer by using
the toInteger() function. Use the greater than or equal to (>=) and less than or equal to (<=)
operators to compare against the literal year values 1910 and 2000. Union these expressions
together with the and (&&) operator. The expression comes out as:
toInteger(year) >= 1910 && toInteger(year) <= 2000

To find which movies are comedies, you can use the rlike() function to find the pattern 'Comedy'
in the column genres. Union the rlike expression with the year comparison to get:
toInteger(year) >= 1910 && toInteger(year) <= 2000 && rlike(genres, 'Comedy')

If you have a debug cluster active, you can verify your logic by selecting Refresh to see the
expression output compared to the inputs used. There's more than one right answer on how you
can accomplish this logic by using the data flow expression language.

Select Save and finish after you're finished with your expression.
4. Fetch a Data Preview to verify the filter is working correctly.

Add the aggregate transformation


1. The next transformation you'll add is an Aggregate transformation under Schema modifier .

2. Name your aggregate transformation AggregateComedyRating . On the Group by tab, select year
from the drop-down box to group the aggregations by the year the movie came out.

3. Go to the Aggregates tab. In the left text box, name the aggregate column AverageComedyRating .
Select the right expression box to enter the aggregate expression via the expression builder.
4. To get the average of column Rating , use the avg() aggregate function. Because Rating is a string and
avg() takes in a numerical input, we must convert the value to a number via the toInteger() function.
This expression looks like:
avg(toInteger(Rating))

5. Select Save and finish after you're finished.

6. Go to the Data Preview tab to view the transformation output. Notice only two columns are there, year
and AverageComedyRating .
Add the sink transformation
1. Next, you want to add a Sink transformation under Destination .

2. Name your sink Sink . Select New to create your sink dataset.
3. On the New dataset page, select Azure Data Lake Storage Gen2 and then select Continue .
4. On the Select format page, select DelimitedText and then select Continue .
5. Name your sink dataset MoviesSink . For linked service, choose the same ADLSGen2 linked service you
created for source transformation. Enter an output folder to write your data to. In this tutorial, we're
writing to the folder output in the container sample-data . The folder doesn't need to exist beforehand
and can be dynamically created. Select the First row as header check box, and select None for Import
schema . Select OK .

Now you've finished building your data flow. You're ready to run it in your pipeline.

Run and monitor the data flow


You can debug a pipeline before you publish it. In this step, you trigger a debug run of the data flow pipeline.
While the data preview doesn't write data, a debug run will write data to your sink destination.
1. Go to the pipeline canvas. Select Debug to trigger a debug run.
2. Pipeline debugging of data flow activities uses the active debug cluster but still takes at least a minute to
initialize. You can track the progress via the Output tab. After the run is successful, select the eyeglasses
icon for run details.
3. On the details page, you can see the number of rows and the time spent on each transformation step.

4. Select a transformation to get detailed information about the columns and partitioning of the data.
If you followed this tutorial correctly, you should have written 83 rows and 2 columns into your sink folder. You
can verify the data is correct by checking your blob storage.

Summary
In this tutorial, you used the Data Factory UI to create a pipeline that copies and transforms data from a Data
Lake Storage Gen2 source to a Data Lake Storage Gen2 sink (both allowing access to only selected networks) by
using mapping data flow in Data Factory Managed Virtual Network.
Branching and chaining activities in an Azure Data
Factory pipeline using the Azure portal
7/2/2021 • 10 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create a Data Factory pipeline that showcases some of the control flow features. This pipeline
does a simple copy from a container in Azure Blob Storage to another container in the same storage account. If
the copy activity succeeds, the pipeline sends details of the successful copy operation (such as the amount of
data written) in a success email. If the copy activity fails, the pipeline sends details of copy failure (such as the
error message) in a failure email. Throughout the tutorial, you see how to pass parameters.
A high-level overview of the scenario:

You perform the following steps in this tutorial:


Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a Copy activity and a Web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses Azure portal. You can use other mechanisms to interact with Azure Data Factory, refer to
"Quickstarts" in the table of contents.

Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account . You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database . You use the database as sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.

John,Doe
Jane,Doe

2. Use tools such as Azure Storage Explorer to do the following steps:


a. Create the adfv2branch container.
b. Create input folder in the adfv2branch container.
c. Upload input.txt file to the container.

Create email workflow endpoints


To trigger sending an email from the pipeline, you use Logic Apps to define the workflow. For details on creating
a Logic App workflow, see How to create a logic app.
Success email workflow
Create a Logic App workflow named CopySuccessEmail . Define the workflow trigger as
When an HTTP request is received , and add an action of Office 365 Outlook – Send an email .

For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}

The Request in the Logic App Designer should look like the following image:

For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Fail email workflow


Follow the same steps to create another Logic Apps workflow of CopyFailEmail . In the request trigger, the
Request Body JSON schema is the same. Change the format of your email like the Subject to tailor toward a
failure email. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your failure email workflow:

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

You should now have two workflow URLs:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Create a data factory


1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. On the left menu, select Create a resource > Data + Analytics > Data Factory :
3. In the New data factory page, enter ADFTutorialDataFactory for the name .
The name of the Azure data factory must be globally unique . If you receive the following error, change
the name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See
Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name “ADFTutorialDataFactory” is not available.
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version .
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. Select Pin to dashboard .
9. Click Create .
10. On the dashboard, you see the following tile with status: Deploying data factory .

11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the home page of Data Factory UI, click the Orchestrate tile.

2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add
the following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.

3. In the Activities toolbox, expand Data Flow , and drag-drop Copy activity to the pipeline designer
surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click +
New . You create a source dataset for the copy activity in this step.

5. In the New Dataset window, select Azure Blob Storage , and click Finish .
6. You see a new tab titled AzureBlob1 . Change the name of the dataset to SourceBlobDataset .
7. Switch to the Connection tab in the Properties window, and click New for the Linked service . You
create a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name .
b. Select your Azure storage account for the Storage account name .
c. Click Save .
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and emp.txt for the file name. You use
the sourceBlobContainer pipeline parameter to set the folder path for the dataset.
10. Switch to the pipeline tab (or) click the pipeline in the treeview. Confirm that SourceBlobDataset is
selected for Source Dataset .

13. In the properties window, switch to the Sink tab, and click + New for Sink Dataset . You create a sink
dataset for the copy activity in this step similar to the way you created the source dataset.

14. In the New Dataset window, select Azure Blob Storage , and click Finish .
15. In the General settings page for the dataset, enter SinkBlobDataset for Name .
16. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for Linked service .
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the
current pipeline run for the file name. For the supported list of system variables and expressions,
see System variables and Expression language.

17. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity . The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web
Activity. This pipeline uses a Web Activity to call the Logic Apps email workflow.
18. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL , specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method .
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json .
e. Specify the following JSON for Body .

{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}

The message body contains the following properties:


Message – Passing value of @{activity('Copy1').output.dataWritten} . Accesses a property of
the previous copy activity and passes the value of dataWritten. For the failure case, pass the
error output instead: @{activity('Copy1').error.message} .
Data Factory Name – Passing value of @{pipeline().DataFactory} . This is a system variable,
allowing you to access the corresponding data factory name. For a list of system variables,
see the System Variables article.
Pipeline Name – Passing value of @{pipeline().Pipeline} . This is also a system variable,
allowing you to access the corresponding pipeline name.
Receiver – Passing value of "@pipeline().parameters.receiver" . Accesses the pipeline
parameters.

19. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and
dropping on the Web activity.

20. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the
name to SendFailureEmailActivity .
21. Switch to the Settings tab, and do the following steps:
a. For URL , specify URL for the logic apps workflow that sends the failure email.
b. Select POST for Method .
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json .
e. Specify the following JSON for Body .

{
"message": "@{activity('Copy1').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
22. Select the Copy activity in the pipeline designer, click the +-> button, and select Error .

23. Drag the red button next to the Copy activity to the second Web activity SendFailureEmailActivity . You
can move the activities around so that the pipeline looks like in the following image:
24. To validate the pipeline, click Validate button on the toolbar. Close the Pipeline Validation Output
window by clicking the >> button.

25. To publish the entities (datasets, pipelines, etc.) to Data Factory service, select Publish All . Wait until you
see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now .

2. In the Pipeline Run window, do the following steps:


a. Enter adftutorial/adfv2branch/input for the sourceBlobContainer parameter.
b. Enter adftutorial/adfv2branch/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver .
d. Click Finish . (If you prefer scripting, a PowerShell sketch of the same run follows these steps.)
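A minimal PowerShell sketch of the same run, assuming you prefer scripting over the portal; the resource group, data factory, and pipeline names are placeholders for the values you used above.

# Trigger the pipeline with the same parameter values as in the portal steps above.
$params = @{
    sourceBlobContainer = "adftutorial/adfv2branch/input"
    sinkBlobContainer = "adftutorial/adfv2branch/output"
    receiver = "<receiver email address>"
}
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<resourceGroupName>" -DataFactoryName "<dataFactoryName>" -PipelineName "<pipelineName>" -Parameter $params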
Monitor the successful pipeline run
1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.

2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the
list.

Trigger a pipeline run that fails


1. Switch to the Edit tab on the left.
2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now .
3. In the Pipeline Run window, do the following steps:
a. Enter adftutorial/dummy/input for the sourceBlobContainer parameter. Ensure that the dummy
folder does not exist in the adftutorial container.
b. Enter adftutorial/dummy/output for the sinkBlobContainer parameter.
c. Enter an email address of the receiver .
d. Click Finish .

Monitor the failed pipeline run


1. To monitor the pipeline run, switch to the Monitor tab on the left. You see the pipeline run that was
triggered manually by you. Use the Refresh button to refresh the list.
2. Click Error link for the pipeline run to see details about the error.

3. To view activity runs associated with this pipeline run, click the first link in the Actions column. Use the
Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed. The Web activity
succeeded in sending the failure email to the specified receiver.

4. Click Error link in the Actions column to see details about the error.

Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline
4/22/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you create a Data Factory pipeline that showcases some control flow features. This pipeline
copies from a container in Azure Blob Storage to another container in the same storage account. If the copy
activity succeeds, the pipeline sends details of the successful copy operation in an email. That information could
include the amount of data written. If the copy activity fails, it sends details of the copy failure, such as the error
message, in an email. Throughout the tutorial, you see how to pass parameters.
This graphic provides an overview of the scenario:

This tutorial shows you how to do the following tasks:


Create a data factory
Create an Azure Storage linked service
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Use parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
This tutorial uses .NET SDK. You can use other mechanisms to interact with Azure Data Factory. For Data Factory
quickstarts, see 5-Minute Quickstarts.
If you don't have an Azure subscription, create a free account before you begin.

Prerequisites
Azure Storage account. You use blob storage as a source data store. If you don't have an Azure storage
account, see Create a storage account.
Azure Storage Explorer. To install this tool, see Azure Storage Explorer.
Azure SQL Database. You use the database as a sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database.
Visual Studio. This article uses Visual Studio 2019.
Azure .NET SDK. Download and install the Azure .NET SDK.
For a list of Azure regions in which Data Factory is currently available, see Products available by region. The data
stores and computes can be in other regions. The stores include Azure Storage and Azure SQL Database. The
computes include HDInsight, which Data Factory uses.
Create an application as described in Create an Azure Active Directory application. Assign the application to the
Contributor role by following the instructions in the same article. You'll need several values for later parts of this
tutorial, such as Application (client) ID and Directory (tenant) ID . (A PowerShell sketch of this setup follows.)
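A minimal sketch of the application setup, assuming the Az.Resources module; property names on the service principal object can vary between Az versions, so treat this as illustrative rather than exact.

# Create a service principal and grant it the Contributor role on the current subscription.
$sp = New-AzADServicePrincipal -DisplayName "ADFv2BranchTutorialApp"
New-AzRoleAssignment -ApplicationId $sp.AppId -RoleDefinitionName "Contributor"

# Values to use later in the tutorial (AppId may be named ApplicationId in older Az versions).
$sp.AppId                    # Application (client) ID
(Get-AzContext).Tenant.Id    # Directory (tenant) ID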
Create a blob table
1. Open a text editor. Copy the following text and save it locally as input.txt.

Ethel|Berg
Tamika|Walsh

2. Open Azure Storage Explorer. Expand your storage account. Right-click Blob Containers and select
Create Blob Container .
3. Name the new container adfv2branch and select Upload to add your input.txt file to the container.

Create Visual Studio project


Create a C# .NET console application:
1. Start Visual Studio and select Create a new project .
2. In Create a new project , choose Console App (.NET Framework) for C# and select Next .
3. Name the project ADFv2BranchTutorial.
4. Select .NET version 4.5.2 or above and then select Create .
Install NuGet packages
1. Select Tools > NuGet Package Manager > Package Manager Console .
2. In the Package Manager Console , run the following commands to install packages. Refer to
Microsoft.Azure.Management.DataFactory nuget package for details.

Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -IncludePrerelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory

Create a data factory client


1. Open Program.cs and add the following statements:
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

2. Add these static variables to the Program class. Replace place-holders with your own values.

// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";

static string region = "East US";
static string dataFactoryName = "<Data factory name>";

// Specify the source Azure Blob information
static string storageAccount = "<Azure Storage account name>";
static string storageKey = "<Azure Storage account key>";
// Confirm that you have the input.txt file placed in the input folder of the adfv2branch container.
static string inputBlobPath = "adfv2branch/input";
static string inputBlobName = "input.txt";
static string outputBlobPath = "adfv2branch/output";
static string emailReceiver = "<specify email address of the receiver>";

static string storageLinkedServiceName = "AzureStorageLinkedService";
static string blobSourceDatasetName = "SourceStorageDataset";
static string blobSinkDatasetName = "SinkStorageDataset";
static string pipelineName = "Adfv2TutorialBranchCopy";

static string copyBlobActivity = "CopyBlobtoBlob";
static string sendFailEmailActivity = "SendFailEmailActivity";
static string sendSuccessEmailActivity = "SendSuccessEmailActivity";

3. Add the following code to the Main method. This code creates an instance of the
DataFactoryManagementClient class. You then use this object to create the data factory, linked service, datasets,
and pipeline. You can also use this object to monitor the pipeline run details.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Create a data factory


1. Add a CreateOrUpdateDataFactory method to your Program.cs file:
static Factory CreateOrUpdateDataFactory(DataFactoryManagementClient client)
{
Console.WriteLine("Creating data factory " + dataFactoryName + "...");
Factory resource = new Factory
{
Location = region
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource, client.SerializationSettings));

Factory response;
{
response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);
}

while (client.Factories.Get(resourceGroup, dataFactoryName).ProvisioningState == "PendingCreation")
{
    System.Threading.Thread.Sleep(1000);
}
return response;
}

2. Add the following line to the Main method that creates a data factory:

Factory df = CreateOrUpdateDataFactory(client);

Create an Azure Storage linked service


1. Add a StorageLinkedServiceDefinition method to your Program.cs file:

static LinkedServiceResource StorageLinkedServiceDefinition(DataFactoryManagementClient client)
{
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");
AzureStorageLinkedService storageLinkedService = new AzureStorageLinkedService
{
ConnectionString = new SecureString("DefaultEndpointsProtocol=https;AccountName=" +
storageAccount + ";AccountKey=" + storageKey)
};
Console.WriteLine(SafeJsonConvert.SerializeObject(storageLinkedService,
client.SerializationSettings));
LinkedServiceResource linkedService = new LinkedServiceResource(storageLinkedService,
name:storageLinkedServiceName);
return linkedService;
}

2. Add the following line to the Main method that creates an Azure Storage linked service:

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName, StorageLinkedServiceDefinition(client));

For more information about supported properties and details, see Linked service properties.

Create datasets
In this section, you create two datasets, one for the source and one for the sink.
Create a dataset for a source Azure Blob
Add a method that creates an Azure blob dataset. For more information about supported properties and details,
see Azure Blob dataset properties.
Add a SourceBlobDatasetDefinition method to your Program.cs file:

static DatasetResource SourceBlobDatasetDefinition(DataFactoryManagementClient client)
{
Console.WriteLine("Creating dataset " + blobSourceDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sourceBlobContainer" },
FileName = inputBlobName,
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name:blobSourceDatasetName);
return dataset;
}

You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service that you created in the previous step. The Blob dataset describes the location of the blob to copy from:
FolderPath and FileName.
Notice the use of parameters for the FolderPath. sourceBlobContainer is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>

Create a dataset for a sink Azure Blob


1. Add a SinkBlobDatasetDefinition method to your Program.cs file:

static DatasetResource SinkBlobDatasetDefinition(DataFactoryManagementClient client)
{
Console.WriteLine("Creating dataset " + blobSinkDatasetName + "...");
AzureBlobDataset blobDataset = new AzureBlobDataset
{
FolderPath = new Expression { Value = "@pipeline().parameters.sinkBlobContainer" },
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
DatasetResource dataset = new DatasetResource(blobDataset, name: blobSinkDatasetName);
return dataset;
}

2. Add the following code to the Main method that creates both Azure Blob source and sink datasets.

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName, SourceBlobDatasetDefinition(client));

client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName, SinkBlobDatasetDefinition(client));

Create a C# class: EmailRequest


In your C# project, create a class named EmailRequest . This class defines what properties the pipeline sends in
the body request when sending an email. In this tutorial, the pipeline sends four properties from the pipeline to
the email:
Message. Body of the email. For a successful copy, this property contains the amount of data written. For a
failed copy, this property contains details of the error.
Data factory name. Name of the data factory.
Pipeline name. Name of the pipeline.
Receiver. Parameter that passes through. This property specifies the receiver of the email.

class EmailRequest
{
[Newtonsoft.Json.JsonProperty(PropertyName = "message")]
public string message;

[Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
public string dataFactoryName;

[Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
public string pipelineName;

[Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
public string receiver;

public EmailRequest(string input, string df, string pipeline, string receiverName)
{
message = input;
dataFactoryName = df;
pipelineName = pipeline;
receiver = receiverName;
}
}

Create email workflow endpoints


To trigger sending an email, you use Logic Apps to define the workflow. For details on creating a Logic Apps
workflow, see How to create a Logic App.
Success email workflow
In the Azure portal, create a Logic Apps workflow named CopySuccessEmail. Define the workflow trigger as
When an HTTP request is received . For your request trigger, fill in the Request Body JSON Schema with the
following JSON:

{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
Your workflow looks something like the following example:

This JSON content aligns with the EmailRequest class you created in the previous section.
Add an action of Office 365 Outlook – Send an email . For the Send an email action, customize how you wish
to format the email, using the properties passed in the request Body JSON schema. Here's an example:
After you save the workflow, copy and save the HTTP POST URL value from the trigger.

Fail email workflow


Clone CopySuccessEmail as another Logic Apps workflow named CopyFailEmail. In the request trigger, the
Request Body JSON schema is the same. Change the format of your email like the Subject to tailor toward a
failure email. Here is an example:
After you save the workflow, copy and save the HTTP POST URL value from the trigger.
You should now have two workflow URLs, like the following examples:

//Success Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

//Fail Request Url


https://prodxxx.eastus.logic.azure.com:443/workflows/000000/triggers/manual/paths/invoke?api-version=2016-
10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=000000

Create a pipeline
Go back to your project in Visual Studio. We'll now add the code that creates a pipeline with a copy activity and
DependsOn property. In this tutorial, the pipeline contains one activity, a copy activity, which takes in the Blob
dataset as a source and another Blob dataset as a sink. If the copy activity succeeds or fails, it calls different
email tasks.
In this pipeline, you use the following features:
Parameters
Web activity
Activity dependency
Using output from an activity as an input to another activity
1. Add this method to your project. The following sections explain it in more detail.

static PipelineResource PipelineDefinition(DataFactoryManagementClient client)
{
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource resource = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
    { "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
    { "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
    { "receiver", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/ma
nual/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url =
"https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manu
al/paths/invoke?api-version=2016-10-
01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource,
client.SerializationSettings));
return resource;
}

2. Add the following line to the Main method that creates the pipeline:

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName,
    PipelineDefinition(client));

Parameters
The first section of our pipeline code defines parameters.
sourceBlobContainer . The source blob dataset consumes this parameter in the pipeline.
sinkBlobContainer . The sink blob dataset consumes this parameter in the pipeline.
receiver . The two Web activities in the pipeline that send success or failure emails to the receiver use this
parameter.

Parameters = new Dictionary<string, ParameterSpecification>


{
{ "sourceBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "sinkBlobContainer", new ParameterSpecification { Type = ParameterType.String } },
{ "receiver", new ParameterSpecification { Type = ParameterType.String } }
},

Web activity
The Web activity allows a call to any REST endpoint. For more information about the activity, see Web activity in
Azure Data Factory. This pipeline uses a web activity to call the Logic Apps email workflow. You create two web
activities: one that calls the CopySuccessEmail workflow and one that calls the CopyFailEmail workflow.

new WebActivity
{
Name = sendCopyEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}

In the Url property, paste the HTTP POST URL endpoints from your Logic Apps workflows. In the Body
property, pass an instance of the EmailRequest class. The email request contains the following properties:
Message. Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten}. This expression accesses a
property of the previous copy activity and passes the value of dataWritten. For the failure case, pass the error
output @{activity('CopyBlobtoBlob').error.message} instead.
Data Factory Name. Passes the value of @{pipeline().DataFactory}. This system variable allows you to access the
corresponding data factory name. For a list of system variables, see System Variables.
Pipeline Name. Passes the value of @{pipeline().Pipeline}. This system variable allows you to access the
corresponding pipeline name.
Receiver. Passes the value of "@pipeline().parameters.receiver". Accesses the pipeline parameters.
This code creates a new Activity Dependency that depends on the previous copy activity.

Create a pipeline run


Add the following code to the Main method that triggers a pipeline run.

// Create a pipeline run


Console.WriteLine("Creating pipeline run...");
Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
    dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);

Main class
Your final Main method should look like this.

// Authenticate and create a data factory management client


var context = new AuthenticationContext("https://login.windows.net/" + tenantID);
ClientCredential cc = new ClientCredential(applicationId, authenticationKey);
AuthenticationResult result = context.AcquireTokenAsync("https://management.azure.com/", cc).Result;
ServiceClientCredentials cred = new TokenCredentials(result.AccessToken);
var client = new DataFactoryManagementClient(cred) { SubscriptionId = subscriptionId };

Factory df = CreateOrUpdateDataFactory(client);

client.LinkedServices.CreateOrUpdate(resourceGroup, dataFactoryName, storageLinkedServiceName,
    StorageLinkedServiceDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSourceDatasetName,
SourceBlobDatasetDefinition(client));
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobSinkDatasetName,
SinkBlobDatasetDefinition(client));

client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, PipelineDefinition(client));

Console.WriteLine("Creating pipeline run...");


Dictionary<string, object> arguments = new Dictionary<string, object>
{
{ "sourceBlobContainer", inputBlobPath },
{ "sinkBlobContainer", outputBlobPath },
{ "receiver", emailReceiver }
};

CreateRunResponse runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup,
    dataFactoryName, pipelineName, arguments).Result.Body;
Console.WriteLine("Pipeline run ID: " + runResponse.RunId);
Build and run your program to trigger a pipeline run!

Monitor a pipeline run


1. Add the following code to the Main method:

// Monitor the pipeline run


Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress")
System.Threading.Thread.Sleep(15000);
else
break;
}

This code continuously checks the status of the run until it finishes copying the data.
2. Add the following code to the Main method that retrieves copy activity run details, for example, size of
the data read/written:

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

List<ActivityRun> activityRuns = client.ActivityRuns.ListByPipelineRun(
    resourceGroup, dataFactoryName, runResponse.RunId, DateTime.UtcNow.AddMinutes(-10),
    DateTime.UtcNow.AddMinutes(10)).ToList();

if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
//SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output,
client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);

Console.WriteLine("\nPress any key to exit...");


Console.ReadKey();

Run the code


Build and start the application, then verify the pipeline execution.
The application displays the progress of creating the data factory, linked service, datasets, pipeline, and pipeline run.
It then checks the pipeline run status. Wait until you see the copy activity run details with the size of data read/written.
Then, use tools such as Azure Storage Explorer to check that the blob was copied to outputBlobPath from
inputBlobPath, as you specified in variables.
Your output should resemble the following sample:

Creating data factory DFTutorialTest...


{
"location": "East US"
}
Creating linked service AzureStorageLinkedService...
{
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***"
}
}
Creating dataset SourceStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sourceBlobContainer"
},
"fileName": "input.txt"
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating dataset SinkStorageDataset...
{
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"type": "Expression",
"value": "@pipeline().parameters.sinkBlobContainer"
}
},
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "AzureStorageLinkedService"
}
}
Creating pipeline Adfv2TutorialBranchCopy...
{
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [
{
"type": "DatasetReference",
"referenceName": "SourceStorageDataset"
}
],
"outputs": [
{
"type": "DatasetReference",
"referenceName": "SinkStorageDataset"
}
],
"name": "CopyBlobtoBlob"
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendSuccessEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Succeeded"
]
}
]
},
{
"type": "WebActivity",
"typeProperties": {
"method": "POST",
"url": "https://xxx.eastus.logic.azure.com:443/workflows/... ",
"body": {
"message": "@{activity('CopyBlobtoBlob').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
},
"name": "SendFailEmailActivity",
"dependsOn": [
{
"activity": "CopyBlobtoBlob",
"dependencyConditions": [
"Failed"
]
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
},
"receiver": {
"type": "String"
}
}
}
}
Creating pipeline run...
Pipeline run ID: 00000000-0000-0000-0000-0000000000000
Checking pipeline run status...
Status: InProgress
Status: InProgress
Status: Succeeded
Checking copy activity run details...
{
"dataRead": 20,
"dataWritten": 20,
"copyDuration": 4,
"throughput": 0.01,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
}
{}

Press any key to exit...


Next steps
You did the following tasks in this tutorial:
Create a data factory
Create an Azure Storage linked service
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Use parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now continue to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Provision the Azure-SSIS integration runtime in
Azure Data Factory
7/20/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This tutorial provides steps for using the Azure portal to provision an Azure-SQL Server Integration Services
(SSIS) integration runtime (IR) in Azure Data Factory (ADF). An Azure-SSIS IR supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
After an Azure-SSIS IR is provisioned, you can use familiar tools to deploy and run your packages in Azure.
These tools are already Azure-enabled and include SQL Server Data Tools (SSDT), SQL Server Management
Studio (SSMS), and command-line utilities like dtutil and AzureDTExec.
For conceptual information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.
In this tutorial, you complete the following steps:
Create a data factory.
Provision an Azure-SSIS integration runtime.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server (optional). If you don't already have a database server, create one in the
Azure portal before you get started. Data Factory will in turn create an SSISDB instance on this database
server.
We recommend that you create the database server in the same Azure region as the integration runtime.
This configuration lets the integration runtime write execution logs into SSISDB without crossing Azure
regions.
Keep these points in mind:
Based on the selected database server, the SSISDB instance can be created on your behalf as a
single database, as part of an elastic pool, or in a managed instance. It can be accessible in a public
network or by joining a virtual network. For guidance in choosing the type of database server to
host SSISDB, see Compare SQL Database and SQL Managed Instance.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Create an Azure-SSIS IR in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a managed instance with private endpoint to host SSISDB. For more
information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see
New-AzSqlServerFirewallRule (a minimal PowerShell sketch follows after this list).
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure Active Directory (Azure AD) authentication with the specified
system/user-assigned managed identity for your data factory. For the latter, you need to add the
specified system/user-assigned managed identity for your data factory into an Azure AD group
with access permissions to the database server. For more information, see Create an Azure-SSIS IR
with Azure AD authentication.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
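For reference, the following is a minimal PowerShell sketch of the firewall configuration described in the points above. It assumes you've installed the Az PowerShell module and signed in with Connect-AzAccount; the resource group, server name, and client IP address values are placeholders that you'd replace with your own.

# Allow Azure services to access the database server (the "Allow access to Azure services" setting)
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" -ServerName "<your server name>" -AllowAllAzureIPs

# Add the client machine's IP address (placeholder shown) to the server-level firewall rules
New-AzSqlServerFirewallRule -ResourceGroupName "<your resource group>" `
    -ServerName "<your server name>" `
    -FirewallRuleName "ClientIPAddress" `
    -StartIpAddress "203.0.113.10" -EndIpAddress "203.0.113.10"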

NOTE
For a list of Azure regions in which Data Factory and an Azure-SSIS IR are currently available, see Data Factory and SSIS IR
availability by region.

Create a data factory


To create your data factory via the Azure portal, follow the step-by-step instructions in Create a data factory via
the UI. Select Pin to dashboard while doing so, to allow quick access after its creation.
After your data factory is created, open its overview page in the Azure portal. Select the Author & Monitor tile
to open the Let's get started page on a separate tab. There, you can continue to create your Azure-SSIS IR.

Create an Azure-SSIS integration runtime


From the Data Factory overview
1. On the home page, select the Configure SSIS tile.

2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the authoring UI
1. In the Azure Data Factory UI, switch to the Manage tab, and then switch to the Integration runtimes
tab to view existing integration runtimes in your data factory.

2. Select New to create an Azure-SSIS IR and open the Integration runtime setup pane.

3. In the Integration runtime setup pane, select the Lift-and-shift existing SSIS packages to
execute in Azure tile, and then select Continue .
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.

Provision an Azure-SSIS integration runtime


The Integration runtime setup pane has three pages where you successively configure general, deployment,
and advanced settings.
General settings page
On the General settings page of Integration runtime setup pane, complete the following steps.

1. For Name , enter the name of your integration runtime.


2. For Description , enter the description of your integration runtime.
3. For Location , select the location of your integration runtime. Only supported locations are displayed. We
recommend that you select the same location of your database server to host SSISDB.
4. For Node Size , select the size of node in your integration runtime cluster. Only supported node sizes are
displayed. Select a large node size (scale up) if you want to run many compute-intensive or memory-
intensive packages.
5. For Node Number , select the number of nodes in your integration runtime cluster. Only supported node
numbers are displayed. Select a large cluster with many nodes (scale out) if you want to run many
packages in parallel.
6. For Edition/License , select the SQL Server edition for your integration runtime: Standard or Enterprise.
Select Enterprise if you want to use advanced features on your integration runtime.
7. For Save Money , select the Azure Hybrid Benefit option for your integration runtime: Yes or No . Select
Yes if you want to bring your own SQL Server license with Software Assurance to benefit from cost
savings with hybrid use.
8. Select Continue .
Deployment settings page
On the Deployment settings page of Integration runtime setup pane, you have the options to create
SSISDB and/or Azure-SSIS IR package stores.
Creating SSISDB
On the Deployment settings page of Integration runtime setup pane, if you want to deploy your packages
into SSISDB (Project Deployment Model), select the Create SSIS catalog (SSISDB) hosted by Azure SQL
Database server/Managed Instance to store your projects/packages/environments/execution logs
check box. Alternatively, if you want to deploy your packages into file system, Azure Files, or SQL Server
database (MSDB) hosted by Azure SQL Managed Instance (Package Deployment Model), you don't need to
create SSISDB or select the check box.
Regardless of your deployment model, if you want to use SQL Server Agent hosted by Azure SQL Managed
Instance to orchestrate/schedule your package executions, it's enabled by SSISDB, so select the check box
anyway. For more information, see Schedule SSIS package executions via Azure SQL Managed Instance Agent.
If you select the check box, complete the following steps to bring your own database server to host SSISDB that
we'll create and manage on your behalf.
1. For Subscription , select the Azure subscription that has your database server to host SSISDB.
2. For Location , select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
3. For Catalog Database Server Endpoint , select the endpoint of your database server to host SSISDB.
Based on the selected database server, the SSISDB instance can be created on your behalf as a single
database, as part of an elastic pool, or in a managed instance. It can be accessible in a public network or
by joining a virtual network. For guidance in choosing the type of database server to host SSISDB, see
Compare SQL Database and SQL Managed Instance.
If you select an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a
managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual network. For more
information, see Create an Azure-SSIS IR in a virtual network.
4. Select either the Use AAD authentication with the system managed identity for Data Factory or
Use AAD authentication with a user-assigned managed identity for Data Factory check box to
choose Azure AD authentication method for Azure-SSIS IR to access your database server that hosts
SSISDB. Don't select any of the check boxes to choose SQL authentication method instead.
If you select any of the check boxes, you'll need to add the specified system/user-assigned managed
identity for your data factory into an Azure AD group with access permissions to your database server. If
you select the Use AAD authentication with a user-assigned managed identity for Data Factory
check box, you can then select any existing credentials created using your specified user-assigned
managed identities or create new ones. For more information, see Create an Azure-SSIS IR with Azure AD
authentication.
5. For Admin Username , enter the SQL authentication username for your database server that hosts
SSISDB.
6. For Admin Password , enter the SQL authentication password for your database server that hosts
SSISDB.
7. Select the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover check
box to configure a dual standby Azure SSIS IR pair that works in sync with Azure SQL Database/Managed
Instance failover group for business continuity and disaster recovery (BCDR).
If you select the check box, enter a name to identify your pair of primary and secondary Azure-SSIS IRs in
the Dual standby pair name text box. You need to enter the same pair name when creating your
primary and secondary Azure-SSIS IRs.
For more information, see Configure your Azure-SSIS IR for BCDR.
8. For Catalog Database Service Tier , select the service tier for your database server to host SSISDB.
Select the Basic, Standard, or Premium tier, or select an elastic pool name.
Select Test connection when applicable and if it's successful, select Continue .
Creating Azure-SSIS IR package stores
On the Deployment settings page of Integration runtime setup pane, if you want to manage your
packages that are deployed into MSDB, file system, or Azure Files (Package Deployment Model) with Azure-SSIS
IR package stores, select the Create package stores to manage your packages that are deployed into
file system/Azure Files/SQL Server database (MSDB) hosted by Azure SQL Managed Instance check
box.
Azure-SSIS IR package store allows you to import/export/delete/run packages and monitor/stop running
packages via SSMS similar to the legacy SSIS package store. For more information, see Manage SSIS packages
with Azure-SSIS IR package stores.
If you select this check box, you can add multiple package stores to your Azure-SSIS IR by selecting New .
Conversely, one package store can be shared by multiple Azure-SSIS IRs.
On the Add package store pane, complete the following steps.
1. For Package store name , enter the name of your package store.
2. For Package store linked service , select your existing linked service that stores the access information
for file system/Azure Files/Azure SQL Managed Instance where your packages are deployed or create a
new one by selecting New . On the New linked service pane, complete the following steps.

NOTE
You can use either Azure File Storage or File System linked services to access Azure Files. If you use Azure
File Storage linked service, Azure-SSIS IR package store supports only Basic (not Account key nor SAS URI )
authentication method for now.
a. For Name , enter the name of your linked service.
b. For Description , enter the description of your linked service.
c. For Type , select Azure File Storage , Azure SQL Managed Instance , or File System .
d. You can ignore Connect via integration runtime , since we always use your Azure-SSIS IR to
fetch the access information for package stores.
e. If you select Azure File Storage , for Authentication method , select Basic , and then complete
the following steps.
a. For Account selection method , select From Azure subscription or Enter manually .
b. If you select From Azure subscription , select the relevant Azure subscription , Storage
account name , and File share .
c. If you select Enter manually , enter
\\<storage account name>.file.core.windows.net\<file share name> for Host ,
Azure\<storage account name> for Username , and <storage account key> for Password or
select your Azure Key Vault where it's stored as a secret.
f. If you select Azure SQL Managed Instance , complete the following steps.
a. Select Connection string or your Azure Key Vault where it's stored as a secret.
b. If you select Connection string , complete the following steps.
a. For Account selection method , if you choose From Azure subscription , select
the relevant Azure subscription , Server name , Endpoint type and Database
name . If you choose Enter manually , complete the following steps.
a. For Fully qualified domain name , enter
<server name>.<dns prefix>.database.windows.net or
<server name>.public.<dns prefix>.database.windows.net,3342 as the private
or public endpoint of your Azure SQL Managed Instance, respectively. If you
enter the private endpoint, Test connection isn't applicable, since ADF UI
can't reach it.
b. For Database name , enter msdb .
b. For Authentication type , select SQL Authentication , Managed Identity ,
Service Principal , or User-Assigned Managed Identity .
If you select SQL Authentication , enter the relevant Username and
Password or select your Azure Key Vault where it's stored as a secret.
If you select Managed Identity , grant the system managed identity for your
ADF access to your Azure SQL Managed Instance.
If you select Service Principal , enter the relevant Service principal ID and
Service principal key or select your Azure Key Vault where it's stored as a
secret.
If you select User-Assigned Managed Identity , grant the specified user-
assigned managed identity for your ADF access to your Azure SQL Managed
Instance. You can then select any existing credentials created using your
specified user-assigned managed identities or create new ones.
g. If you select File system , enter the UNC path of folder where your packages are deployed for
Host , as well as the relevant Username and Password or select your Azure Key Vault where it's
stored as a secret.
h. Select Test connection when applicable and if it's successful, select Create .
3. Your added package stores will appear on the Deployment settings page. To remove them, select their
check boxes, and then select Delete .
Select Test connection when applicable and if it's successful, select Continue .
Advanced settings page
On the Advanced settings page of Integration runtime setup pane, complete the following steps.
1. For Maximum Parallel Executions Per Node , select the maximum number of packages to run
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number if you want to use more than one core to run a single large package that's
compute or memory intensive. Select a high number if you want to run one or more small packages in a
single core.
2. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box to choose whether you want to add
standard/express custom setups on your Azure-SSIS IR. For more information, see Custom setup for an
Azure-SSIS IR.
3. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to create
certain network resources, and optionally bring your own static public IP addresses check box
to choose whether you want to join your Azure-SSIS IR to a virtual network.
Select it if you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
without configuring a self-hosted IR. For more information, see Create an Azure-SSIS IR in a virtual
network.
4. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS Integration
Runtime check box to choose whether you want to configure a self-hosted IR as proxy for your Azure-
SSIS IR. For more information, see Set up a self-hosted IR as proxy.
5. Select Continue .
On the Summary page of Integration runtime setup pane, review all provisioning settings, bookmark the
recommended documentation links, and select Create to start the creation of your integration runtime.
NOTE
Excluding any custom setup time, this process should finish within 5 minutes.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.

Connections pane
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .

You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. Editing/deleting your
Azure-SSIS IR can only be done when it's stopped.
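If you prefer scripting to the portal buttons, the same operations are available through Azure PowerShell. Here is a minimal sketch, assuming the Az.DataFactory module is installed and you're signed in; the resource group, data factory, and integration runtime names are placeholders.

# Check the current state of the Azure-SSIS IR
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<your resource group>" `
    -DataFactoryName "<your data factory>" `
    -Name "<your Azure-SSIS IR>" -Status

# Stop the Azure-SSIS IR, which is required before you edit or delete it
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<your resource group>" `
    -DataFactoryName "<your data factory>" `
    -Name "<your Azure-SSIS IR>" -Force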

Deploy SSIS packages


If you use SSISDB, you can deploy your packages into it and run them on your Azure-SSIS IR by using the
Azure-enabled SSDT or SSMS tools. These tools connect to your database server via its server endpoint:
For an Azure SQL Database server, the server endpoint format is <server name>.database.windows.net .
For a managed instance with private endpoint, the server endpoint format is
<server name>.<dns prefix>.database.windows.net .
For a managed instance with public endpoint, the server endpoint format is
<server name>.public.<dns prefix>.database.windows.net,3342 .

If you don't use SSISDB, you can deploy your packages into file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
See also the following SSIS documentation:
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Schedule package executions in Azure
Connect to on-premises data sources with Windows authentication

Next steps
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize an Azure-SSIS IR
Set up an Azure-SSIS IR in Azure Data Factory by
using PowerShell
7/20/2021 • 16 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This tutorial provides steps for using PowerShell to provision an Azure-SQL Server Integration Services (SSIS)
Integration Runtime (IR) in Azure Data Factory (ADF). An Azure-SSIS IR supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
After an Azure-SSIS IR is provisioned, you can use familiar tools to deploy and run your packages in Azure.
These tools are already Azure-enabled and include SQL Server Data Tools (SSDT), SQL Server Management
Studio (SSMS), and command-line utilities like dtutil and AzureDTExec.
For conceptual information on Azure-SSIS IRs, see Azure-SSIS integration runtime overview.

NOTE
This article demonstrates using Azure PowerShell to set up an Azure-SSIS IR. To use the Azure portal or an Azure Data
Factory app to set up the Azure-SSIS IR, see Tutorial: Set up an Azure-SSIS IR.

In this tutorial, you will:


Create a data factory.
Create an Azure-SSIS Integration Runtime.
Start the Azure-SSIS Integration Runtime.
Review the complete script.
Deploy SSIS packages.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server or managed instance (optional). If you don't already have a database
server, create one in the Azure portal before you get started. Data Factory will in turn create an SSISDB
instance on this database server.
We recommend that you create the database server in the same Azure region as the integration runtime.
This configuration lets the integration runtime write execution logs into SSISDB without crossing Azure
regions.
Keep these points in mind:
Based on the selected database server, the SSISDB instance can be created on your behalf as a
single database, as part of an elastic pool, or in a managed instance. It can be accessible in a public
network or by joining a virtual network. For guidance in choosing the type of database server to
host SSISDB, see Compare SQL Database and SQL Managed Instance.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Create an Azure-SSIS IR in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a managed instance with private endpoint to host SSISDB. For more
information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see New-
AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure AD authentication with the specified system/user-assigned managed
identity for your data factory. For the latter, you need to add the specified system/user-assigned
managed identity for your data factory into an Azure AD group with access permissions to the
database server. For more information, see Create an Azure-SSIS IR with Azure AD authentication.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
Azure PowerShell . To run a PowerShell script to set up your Azure-SSIS IR, follow the instructions in
Install and configure Azure PowerShell.
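As a quick reference, here is a minimal sketch of installing the Az module from the PowerShell Gallery and signing in; it assumes PowerShellGet is available and that the current-user scope is acceptable.

# Install the Az PowerShell module for the current user (one-time setup)
Install-Module -Name Az -Repository PSGallery -Scope CurrentUser -Force

# Sign in to Azure
Connect-AzAccount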

NOTE
For a list of Azure regions in which Azure Data Factory and Azure-SSIS IR are currently available, see Azure Data Factory
and Azure-SSIS IR availability by region.

Open the Windows PowerShell ISE


Open the Windows PowerShell Integrated Scripting Environment (ISE) with administrator permissions.

Create variables
Copy the following script to the ISE. Specify values for the variables.
### Azure Data Factory info
# If your input contains a PSH special character (for example, "$"), precede it with the escape character
"`" (for example, "`$")
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
# Data factory name - Must be globally unique
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS Integration Runtime info; this is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, although Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
existing SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit (AHB)
option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
(2 x number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "
[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22i
s.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.Integra
tionService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom
setup without script

### SSISDB info


$SSISDBServerEndpoint = "[your server name.database.windows.net or managed instance name.public.DNS
prefix.database.windows.net,3342 or leave it empty if you're not using SSISDB]" # WARNING: If you use
SSISDB, please ensure that there is no existing SSISDB on your database server, so we can prepare and manage
one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify
"S0", "S1", "S2", "S3", etc., see https://docs.microsoft.com/azure/sql-database/sql-database-resource-
limits-database-server
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for SQL Database or leave it empty for SQL Managed Instance]"

### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access

Sign in and select your subscription


To sign in and select your Azure subscription, add the following code to the script:
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Validate the connection to your database server


To validate the connection, add the following script:

# Validate only if you're using SSISDB


if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to
proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}

To create an Azure SQL Database server as part of the script, see the following example. Set values for the
variables that haven't been defined already (for example, SSISDBServerName, FirewallIPAddress).

New-AzSqlServer -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -Location $DataFactoryLocation `
    -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $SSISDBServerAdminUserName, $(ConvertTo-SecureString -String $SSISDBServerAdminPassword -AsPlainText -Force))

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -FirewallRuleName "ClientIPAddress_$today" -StartIpAddress $FirewallIPAddress -EndIpAddress $FirewallIPAddress

New-AzSqlServerFirewallRule -ResourceGroupName $ResourceGroupName -ServerName $SSISDBServerName -AllowAllAzureIPs

Create a resource group


Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a logical
container to which Azure resources are deployed and managed as a group.
If your resource group already exists, don't copy this code to your script.

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName


Create a data factory
Run the following command to create a data factory:

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Location $DataFactoryLocation `
    -Name $DataFactoryName

Create an Azure-SSIS Integration Runtime


To create an Azure-SSIS Integration Runtime that runs SSIS packages in Azure, run the following commands. If
you're not using SSISDB, you can omit the CatalogServerEndpoint, CatalogPricingTier, and
CatalogAdminCredential parameters.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode

# Add CatalogServerEndpoint, CatalogPricingTier, and CatalogAdminCredential parameters if you're using SSISDB
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogPricingTier $SSISDBPricingTier `
-CatalogAdminCredential $serverCreds
}

# Add custom setup parameters if you use standard/express custom setups


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
if(![string]::IsNullOrEmpty($ExpressCustomSetup))
{
if($ExpressCustomSetup -eq "RunCmdkey")
{
$addCmdkeyArgument = "YourFileShareServerName or YourAzureStorageAccountName.file.core.windows.net"
$userCmdkeyArgument = "YourDomainName\YourUsername or azure\YourAzureStorageAccountName"
$passCmdkeyArgument = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourPassword or YourAccessKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.CmdkeySetup($addCmdkeyArgument,
$userCmdkeyArgument, $passCmdkeyArgument)
}
if($ExpressCustomSetup -eq "SetEnvironmentVariable")
{
$variableName = "YourVariableName"
$variableValue = "YourVariableValue"
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.EnvironmentVariableSetup($variableName, $variableValue)
}
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
{
$moduleVersion = "YourAzModuleVersion"
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.AzPowerShellSetup($moduleVersion)
}
if($ExpressCustomSetup -eq "SentryOne.TaskFactory")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.SQLPhonetics.NET")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.HEDDA.IO")
{
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup)
}
if($ExpressCustomSetup -eq "KingswaySoft.IntegrationToolkit")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "KingswaySoft.ProductivityPack")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "Theobald.XtractIS")
{
$jsonData = Get-Content -Raw -Path YourLicenseFile.json
$jsonData = $jsonData -replace '\s',''
$jsonData = $jsonData.replace('"','\"')
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString($jsonData)
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "AecorSoft.IntegrationService")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Standard")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Extended")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
# Create an array of one or more express custom setups
$setups = New-Object System.Collections.ArrayList
$setups.Add($setup)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-ExpressCustomSetup $setups
}

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and !
[string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}

Start the Azure-SSIS Integration Runtime


To start the Azure-SSIS IR, run the following commands:

write-host("##### Starting #####")


Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

NOTE
Excluding any custom setup time, this process should finish within 5 minutes.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.
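While the integration runtime is starting, you can poll its state from the same session. The following is a minimal sketch, assuming the variables defined earlier in this script and that the returned status object exposes a State property (for example, Starting or Started).

# Check the Azure-SSIS IR state
$status = Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName -Status
Write-Host("Azure-SSIS IR state: " + $status.State)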

Full script
The PowerShell script in this section configures an instance of Azure-SSIS IR that runs SSIS packages. After you
run this script successfully, you can deploy and run SSIS packages in Azure.
1. Open the ISE.
2. At the ISE command prompt, run the following command:

Set-ExecutionPolicy Unrestricted -Scope CurrentUser

3. Copy the PowerShell script in this section to the ISE.


4. Provide appropriate values for all parameters at the beginning of the script.
5. Run the script.

### Azure Data Factory info


# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like
"`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
# Data factory name - Must be globally unique
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS Integration Runtime info - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
existing SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit (AHB)
option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
(2 x the number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "
[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22i
s.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.Integra
tionService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom
setup without script

### SSISDB info


$SSISDBServerEndpoint = "[your server name.database.windows.net or managed instance name.public.DNS
prefix.database.windows.net,3342 or leave it empty if you're not using SSISDB]" # WARNING: If you want to
use SSISDB, ensure that there is no existing SSISDB on your database server, so we can prepare and manage
one on your behalf
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication]"
# For the basic pricing tier, specify "Basic", not "B" - For standard/premium/elastic pool tiers, specify
"S0", "S1", "S2", "S3", etc., see https://docs.microsoft.com/azure/sql-database/sql-database-resource-
limits-database-server
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for SQL Database or leave it empty for SQL Managed Instance]"

### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access

### Sign in and select subscription


Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

### Validate the connection to database server


# Validate only if you're using SSISDB
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you want to
proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}

### Create a data factory


Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

### Create an integration runtime


Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode

# Add CatalogServerEndpoint, CatalogPricingTier, and CatalogAdminCredential parameters if you're using SSISDB
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogPricingTier $SSISDBPricingTier `
-CatalogAdminCredential $serverCreds
}

# Add custom setup parameters if you use standard/express custom setups


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
if(![string]::IsNullOrEmpty($ExpressCustomSetup))
{
if($ExpressCustomSetup -eq "RunCmdkey")
{
$addCmdkeyArgument = "YourFileShareServerName or YourAzureStorageAccountName.file.core.windows.net"
$userCmdkeyArgument = "YourDomainName\YourUsername or azure\YourAzureStorageAccountName"
$passCmdkeyArgument = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourPassword or YourAccessKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.CmdkeySetup($addCmdkeyArgument, $userCmdkeyArgument, $passCmdkeyArgument)
}
if($ExpressCustomSetup -eq "SetEnvironmentVariable")
{
$variableName = "YourVariableName"
$variableValue = "YourVariableValue"
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.EnvironmentVariableSetup($variableName, $variableValue)
}
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
{
$moduleVersion = "YourAzModuleVersion"
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.AzPowerShellSetup($moduleVersion)
}
if($ExpressCustomSetup -eq "SentryOne.TaskFactory")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.SQLPhonetics.NET")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.HEDDA.IO")
{
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup)
}
if($ExpressCustomSetup -eq "KingswaySoft.IntegrationToolkit")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "KingswaySoft.ProductivityPack")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "Theobald.XtractIS")
{
$jsonData = Get-Content -Raw -Path YourLicenseFile.json
$jsonData = $jsonData -replace '\s',''
$jsonData = $jsonData.replace('"','\"')
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString($jsonData)
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "AecorSoft.IntegrationService")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Standard")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Extended")
{
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
# Create an array of one or more express custom setups
$setups = New-Object System.Collections.ArrayList
$setups.Add($setup)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-ExpressCustomSetup $setups
}

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}

### Start integration runtime


write-host("##### Starting #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

Monitor and manage your Azure-SSIS IR


For information about monitoring and managing the Azure-SSIS IR, see:
Monitor the Azure-SSIS IR
Manage the Azure-SSIS IR
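
For a quick scripted check, the same Az.DataFactory cmdlets used above can report the runtime state. A minimal sketch, assuming the variables defined earlier in this article's script are still in scope:

# Check the current state and node details of the Azure-SSIS IR (for example, Started or Stopped)
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -Status

# Stop the Azure-SSIS IR when you aren't running packages, to avoid unnecessary charges
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                       -DataFactoryName $DataFactoryName `
                                       -Name $AzureSSISName `
                                       -Force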

Deploy SSIS packages


If you use SSISDB, you can deploy your packages into it and run them on your Azure-SSIS IR by using the
Azure-enabled SSDT or SSMS tools. These tools connect to your database server via its server endpoint:
For an Azure SQL Database server, the server endpoint format is <server name>.database.windows.net .
For a managed instance with private endpoint, the server endpoint format is
<server name>.<dns prefix>.database.windows.net .
For a managed instance with public endpoint, the server endpoint format is
<server name>.public.<dns prefix>.database.windows.net,3342 .

If you don't use SSISDB, you can deploy your packages into file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
For more SSIS documentation, see:
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Connect to on-premises data sources with Windows authentication
Schedule package executions in Azure

Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS Integration Runtime.
Start the Azure-SSIS Integration Runtime.
Review the complete script.
Deploy SSIS packages.
To learn about customizing your Azure-SSIS Integration Runtime, see the following article:
Customize your Azure-SSIS IR
Configure an Azure-SQL Server Integration
Services (SSIS) integration runtime (IR) to join a
virtual network

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This tutorial provides basic steps for using the Azure portal to configure an Azure-SQL Server Integration
Services (SSIS) integration runtime (IR) to join a virtual network.
The steps include:
Configure a virtual network.
Join the Azure-SSIS IR to a virtual network from Azure Data Factory portal.

Prerequisites
Azure-SSIS integration runtime. If you do not have an Azure-SSIS integration runtime, provision an
Azure-SSIS integration runtime in Azure Data Factory before you begin.
User permission. The user who creates the Azure-SSIS IR must have a role assignment, at least on
the Azure Data Factory resource, with one of the options below:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/* permission,
which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission. If you also want to bring your own
public IP addresses for the Azure-SSIS IR while joining it to an Azure Resource Manager virtual network,
also include the Microsoft.Network/publicIPAddresses/*/join/action permission in the role (see the sketch after this list).
Virtual network.
If you do not have a virtual network, create a virtual network using the Azure portal.
Make sure that the virtual network's resource group can create and delete certain Azure network
resources.
The Azure-SSIS IR needs to create certain network resources under the same resource group as
the virtual network. These resources include:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip
Those resources will be created when your Azure-SSIS IR starts. They'll be deleted when your
Azure-SSIS IR stops. To avoid blocking your Azure-SSIS IR from stopping, don't reuse these
network resources in your other resources.
Make sure that you have no resource lock on the resource group/subscription to which the virtual
network belongs. If you configure a read-only/delete lock, starting and stopping your Azure-SSIS
IR will fail, or it will stop responding.
Make sure that you don't have an Azure Policy assignment that prevents the following resources
from being created under the resource group/subscription to which the virtual network belongs:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
The following network configuration scenarios are not covered in this tutorial:
If you bring your own public IP addresses for the Azure-SSIS IR.
If you use your own Domain Name System (DNS) server.
If you use a network security group (NSG) on the subnet.
If you use Azure ExpressRoute or a user-defined route (UDR).
If you use a customized Azure-SSIS IR.
For more info, check virtual network configuration.
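
The custom role mentioned in the prerequisites can also be created with Azure PowerShell. This is only an illustrative sketch; the role name, description, and subscription ID are placeholders, not values defined elsewhere in this tutorial:

# Clone an existing role definition as a template and keep only the VNet join permission
$role = Get-AzRoleDefinition -Name "Reader"
$role.Id = $null
$role.Name = "Virtual Network Joiner"                  # placeholder name
$role.Description = "Can join subnets of virtual networks."
$role.Actions.Clear()
$role.Actions.Add("Microsoft.Network/virtualNetworks/*/join/action")
# Uncomment the next line if you bring your own public IP addresses for the Azure-SSIS IR
# $role.Actions.Add("Microsoft.Network/publicIPAddresses/*/join/action")
$role.AssignableScopes.Clear()
$role.AssignableScopes.Add("/subscriptions/<yourSubscriptionId>")   # placeholder subscription ID
New-AzRoleDefinition -Role $role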

Configure a virtual network


Use the Azure portal to configure a virtual network before you try to join an Azure-SSIS IR to it.
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks.
4. Filter for and select your virtual network in the list.
5. On the Virtual network page, select Properties.
6. Select the copy button for RESOURCE ID to copy the resource ID for the virtual network to the
clipboard. Save the ID from the clipboard in OneNote or a file.
7. On the left menu, select Subnets.
Ensure that the subnet you select has enough available address space for the Azure-SSIS IR to use.
Leave available IP addresses for at least two times the IR node number. Azure reserves some IP
addresses within each subnet. These addresses can't be used. The first and last IP addresses of the
subnets are reserved for protocol conformance, and three more addresses are used for Azure
services. For more information, see Are there any restrictions on using IP addresses within these
subnets?
Don't select the GatewaySubnet to deploy an Azure-SSIS IR. It's dedicated for virtual network
gateways.
Don't use a subnet that is exclusively occupied by other Azure services (for example, SQL Database
Managed Instance, App Service, and so on).
8. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory portal,
the Azure Batch provider is automatically registered for you.)
a. In the Azure portal, on the left menu, select Subscriptions .
b. Select your subscription.
c. On the left, select Resource providers , and confirm that Microsoft.Batch is a registered
provider.
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
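
If you prefer to script the Azure Batch provider check in step 8, here is a minimal Azure PowerShell sketch (assuming you're signed in to the subscription that contains the virtual network):

# List the registration state of the Microsoft.Batch resource provider
Get-AzResourceProvider -ProviderNamespace Microsoft.Batch |
    Select-Object ProviderNamespace, RegistrationState

# Register the provider if its state is not "Registered"
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch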

Join the Azure-SSIS IR to a virtual network


After you've configured your Azure Resource Manager virtual network or classic virtual network, you can join
the Azure-SSIS IR to the virtual network:
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. In the Azure portal, on the left menu, select Data factories. If you don't see Data factories on the
menu, select More services, and then in the INTELLIGENCE + ANALYTICS section, select Data
factories.

3. Select your data factory with the Azure-SSIS IR in the list. You see the home page for your data factory.
Select the Author & Monitor tile. You see the Data Factory UI on a separate tab.
4. In the Data Factory UI, switch to the Edit tab, select Connections , and switch to the Integration
Runtimes tab.

5. If your Azure-SSIS IR is running, in the Integration Runtimes list, in the Actions column, select the
Stop button for your Azure-SSIS IR. You can't edit your Azure-SSIS IR until you stop it.

6. In the Integration Runtimes list, in the Actions column, select the Edit button for your Azure-SSIS IR.

7. On the integration runtime setup panel, advance through the General Settings and SQL Settings
sections by selecting the Next button.
8. On the Advanced Settings section:
a. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to
create certain network resources, and optionally bring your own static public IP
addresses check box.
b. For Subscription , select the Azure subscription that has your virtual network.
c. For Location , the same location of your integration runtime is selected.
d. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
e. For VNet Name , select the name of your virtual network. It should be the same one used for SQL
Database with virtual network service endpoints or SQL Managed Instance with private endpoint
to host SSISDB. Or it should be the same one connected to your on-premises network. Otherwise,
it can be any virtual network to bring your own static public IP addresses for Azure-SSIS IR.
f. For Subnet Name , select the name of subnet for your virtual network. It should be the same one
used for SQL Database with virtual network service endpoints to host SSISDB. Or it should be a
different subnet from the one used for your SQL Managed Instance with private endpoint to host
SSISDB. Otherwise, it can be any subnet to bring your own static public IP addresses for Azure-
SSIS IR.
g. Select VNet Validation . If the validation is successful, select Continue .
9. On the Summary section, review all settings for your Azure-SSIS IR. Then select Update.
10. Start your Azure-SSIS IR by selecting the Start button in the Actions column for your Azure-SSIS IR. It
takes about 20 to 30 minutes to start the Azure-SSIS IR that joins a virtual network.
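
The same join can also be performed with Azure PowerShell instead of the UI. A hedged sketch, using placeholder resource names and an Azure Resource Manager virtual network:

$ResourceGroupName = "<your resource group>"
$DataFactoryName   = "<your data factory>"
$AzureSSISName     = "<your Azure-SSIS IR>"
$VnetId            = "<resource ID of your virtual network>"
$SubnetName        = "<name of the subnet>"

# The Azure-SSIS IR must be stopped before it can be reconfigured
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                       -DataFactoryName $DataFactoryName `
                                       -Name $AzureSSISName `
                                       -Force

# Join the IR to the virtual network, then start it again (takes about 20 to 30 minutes)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -VNetId $VnetId `
                                      -Subnet $SubnetName
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                        -DataFactoryName $DataFactoryName `
                                        -Name $AzureSSISName `
                                        -Force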

Next Steps
Learn more about joining Azure-SSIS IR to a virtual network.
Push Data Factory lineage data to Azure Purview
(Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this tutorial, you'll use the Data Factory user interface (UI) to create a pipeline that runs activities and reports
lineage data to an Azure Purview account. Then you can view all the lineage information in your Azure Purview
account.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
Azure Data Factory. If you don't have an Azure Data Factory, see Create an Azure Data Factory.
Azure Purview account. The Purview account captures all lineage data generated by the data factory. If you
don't have an Azure Purview account, see Create an Azure Purview account.

Run Data Factory activities and push lineage data to Azure Purview
Step 1: Connect Data Factory to your Purview account
Sign in to your Purview account in the Purview portal and go to Management Center. Choose Data Factory in
External connections and click the New button to create a connection to a new Data Factory.

In the popup page, you can choose the Data Factory you want to connect to this Purview account.
You can check the status after creating the connection. Connected means that the connection between Data
Factory and this Purview account has been established successfully.

NOTE
To create the connection between Data Factory and Azure Purview, you need to be assigned one of the roles below in the
Purview account, as well as the Data Factory Contributor role.
Owner
User Access Administrator

Step 2: Run Copy and Dataflow activities in Data Factory


You can create pipelines with Copy activities and Dataflow activities in Data Factory. You don't need any additional
configuration for lineage data capture. The lineage data is automatically captured during activity
execution.

If you don't know how to create Copy and Dataflow activities, see Copy data from Azure Blob storage to a
database in Azure SQL Database by using Azure Data Factory and Transform data using mapping data flows.
Step 3: Run Execute SSIS Package activities in Data Factory
You can create pipelines with Execute SSIS Package activities in Data Factory. You don't need any additional
configuration for lineage data capture. The lineage data is automatically captured during activity
execution.

If you don't know how to create Execute SSIS Package activities, see Run SSIS Packages in Azure.
Step 4: View lineage information in your Purview account
Go back to your Purview account. On the home page, select Browse assets. Choose the asset you want, and
click the Lineage tab. You will see all the lineage information.

You can see lineage data for Copy activity.

You also can see lineage data for Dataflow activity.


NOTE
For the lineage of Dataflow activity, we only support source and sink. The lineage for Dataflow transformation is not
supported yet.

You also can see lineage data for Execute SSIS Package activity.

NOTE
For the lineage of Execute SSIS Package activity, we only support source and destination. The lineage for transformation is
not supported yet.

Next steps
Catalog lineage user guide
Connect Data Factory to Azure Purview
Data integration using Azure Data Factory and
Azure Data Share

APPLIES TO: Azure Data Factory Azure Synapse Analytics


As customers embark on their modern data warehouse and analytics projects, they require not only more data
but also more visibility into their data across their data estate. This workshop dives into how improvements to
Azure Data Factory and Azure Data Share simplify data integration and management in Azure.
From enabling code-free ETL/ELT to creating a comprehensive view over your data, improvements in Azure Data
Factory will empower your data engineers to confidently bring in more data, and thus more value, to your
enterprise. Azure Data Share will allow you to do business-to-business sharing in a governed manner.
In this workshop, you'll use Azure Data Factory (ADF) to ingest data from Azure SQL Database into Azure Data
Lake Storage Gen2 (ADLS Gen2). Once you land the data in the lake, you'll transform it via mapping data flows,
data factory's native transformation service, and sink it into Azure Synapse Analytics. Then, you'll share the table
with transformed data along with some additional data using Azure Data Share.
The data used in this lab is New York City taxi data. To import it into your database in SQL Database, download
the taxi-data bacpac file.

Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database : If you don't have a SQL DB, learn how to create a SQL DB account
Azure Data Lake Storage Gen2 storage account : If you don't have an ADLS Gen2 storage account,
learn how to create an ADLS Gen2 storage account.
Azure Synapse Analytics : If you don't have an Azure Synapse Analytics, learn how to create an Azure
Synapse Analytics instance.
Azure Data Factory: If you haven't created a data factory, see how to create a data factory.
Azure Data Share : If you haven't created a data share, see how to create a data share.

Set up your Azure Data Factory environment


In this section, you'll learn how to access the Azure Data Factory user experience (ADF UX) from the Azure portal.
Once in the ADF UX, you'll configure three linked services, one for each of the data stores we are using: Azure SQL DB,
ADLS Gen2, and Azure Synapse Analytics.
In Azure Data Factory, linked services define the connection information to external resources. Azure Data
Factory currently supports over 85 connectors.
Open the Azure Data Factory UX
1. Open the Azure portal in either Microsoft Edge or Google Chrome.
2. Using the search bar at the top of the page, search for 'Data Factories'
3. Click on your data factory resource to open up its resource blade.

4. Click on Author and Monitor to open up the ADF UX. The ADF UX can also be accessed at
adf.azure.com.

5. You'll be redirected to the homepage of the ADF UX. This page contains quick-starts, instructional videos,
and links to tutorials to learn data factory concepts. To start authoring, click on the pencil icon in left side-
bar.
Create an Azure SQL Database linked service
1. To create a linked service, select Manage hub in the left side-bar, on the Connections pane, select
Linked ser vices and then select New to add a new linked service.

2. The first linked service you'll configure is an Azure SQL DB. You can use the search bar to filter the data
store list. Click on the Azure SQL Database tile and click continue.
3. In the SQL DB configuration pane, enter 'SQLDB' as your linked service name. Enter in your credentials to
allow data factory to connect to your database. If you're using SQL authentication, enter in the server
name, the database, your user name and password. You can verify your connection information is correct
by clicking Test connection . Click Create when finished.
Create an Azure Synapse Analytics linked service
1. Repeat the same process to add an Azure Synapse Analytics linked service. In the connections tab, click
New . Select the Azure Synapse Analytics tile and click continue.
2. In the linked service configuration pane, enter 'SQLDW' as your linked service name. Enter in your
credentials to allow data factory to connect to your database. If you're using SQL authentication, enter in
the server name, the database, your user name and password. You can verify your connection
information is correct by clicking Test connection . Click Create when finished.
Create an Azure Data Lake Storage Gen2 linked service
1. The last linked service needed for this lab is an Azure Data Lake Storage gen2. In the connections tab,
click New . Select the Azure Data Lake Storage Gen2 tile and click continue.
2. In the linked service configuration pane, enter 'ADLSGen2' as your linked service name. If you're using
Account key authentication, select your ADLS Gen2 storage account from the Storage account name
dropdown. You can verify your connection information is correct by clicking Test connection . Click
Create when finished.
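
The three linked services above were created through the ADF UX, but they can also be created programmatically. Below is a minimal, hypothetical Azure PowerShell sketch for the 'SQLDB' linked service; the server, database, and credentials are placeholders rather than values from this lab, and in practice you'd keep the secret in Azure Key Vault instead of a plain connection string.

# Write an illustrative JSON definition for the 'SQLDB' linked service
@'
{
    "name": "SQLDB",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<yourserver>.database.windows.net,1433;Database=<yourdb>;User ID=<user>;Password=<password>;Encrypt=true;Connection Timeout=30"
        }
    }
}
'@ | Set-Content -Path .\AzureSqlDatabaseLinkedService.json

# Register the linked service in the data factory (resource group and factory names are placeholders)
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<your resource group>" `
                                 -DataFactoryName "<your data factory>" `
                                 -Name "SQLDB" `
                                 -DefinitionFile ".\AzureSqlDatabaseLinkedService.json"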

Turn on data flow debug mode


In the section Transform data using mapping data flow, you'll be building mapping data flows. A best practice before
building mapping data flows is to turn on debug mode, which allows you to test transformation logic in seconds
on an active Spark cluster.
To turn on debug, click the Data flow debug slider in the top bar of the data flow canvas or pipeline canvas when
you have Data flow activities. Click OK when the confirmation dialog pops up. The cluster will take about 5-7
minutes to start up. Continue on to Ingest data from Azure SQL DB into ADLS Gen2 using the copy activity while
it is initializing.
Ingest data using the copy activity
In this section, you'll create a pipeline with a copy activity that ingests one table from an Azure SQL DB into an
ADLS Gen2 storage account. You'll learn how to add a pipeline, configure a dataset and debug a pipeline via the
ADF UX. The configuration pattern used in this section can be applied to copying from a relational data store to
a file-based data store.
In Azure Data Factory, a pipeline is a logical grouping of activities that together perform a task. An activity
defines an operation to perform on your data. A dataset points to the data you wish to use in a linked service.
Create a pipeline with a copy activity
1. In the factory resources pane, click on the plus icon to open the new resource menu. Select Pipeline .

2. In the General tab of the pipeline canvas, name your pipeline something descriptive such as
'IngestAndTransformTaxiData'.
3. In the activities pane of the pipeline canvas, open the Move and Transform accordion and drag the
Copy data activity onto the canvas. Give the copy activity a descriptive name such as 'IngestIntoADLS'.

Configure Azure SQL DB source dataset


1. Click on the Source tab of the copy activity. To create a new dataset, click New . Your source will be the
table 'dbo.TripData' located in the linked service 'SQLDB' configured earlier.
2. Search for Azure SQL Database and click continue.

3. Call your dataset 'TripData'. Select 'SQLDB' as your linked service. Select table name 'dbo.TripData' from
the table name dropdown. Import the schema From connection/store . Click OK when finished.
You have successfully created your source dataset. Make sure in the source settings, the default value Table is
selected in the use query field.
Configure ADLS Gen2 sink dataset
1. Click on the Sink tab of the copy activity. To create a new dataset, click New .

2. Search for Azure Data Lake Storage Gen2 and click continue.
3. In the select format pane, select DelimitedText as you're writing to a csv file. Click continue.

4. Name your sink dataset 'TripDataCSV'. Select 'ADLSGen2' as your linked service. Enter where you want to
write your csv file. For example, you can write your data to file trip-data.csv in container
staging-container . Set First row as header to true as you want your output data to have headers.
Since no file exists in the destination yet, set Impor t schema to None . Click OK when finished.
Test the copy activity with a pipeline debug run
1. To verify your copy activity is working correctly, click Debug at the top of the pipeline canvas to execute a
debug run. A debug run allows you to test your pipeline either end-to-end or until a breakpoint before
publishing it to the data factory service.

2. To monitor your debug run, go to the Output tab of the pipeline canvas. The monitoring screen
autorefreshes every 20 seconds or when you manually click the refresh button. The copy activity has a
special monitoring view, which can be accessed by clicking the eye-glasses icon in the Actions column.
3. The copy monitoring view gives the activity's execution details and performance characteristics. You can
see information such as data read/written, rows read/written, files read/written, and throughput. If you
have configured everything correctly, you should see 49,999 rows written into one file in your ADLS sink.

4. Before moving on to the next section, it's suggested that you publish your changes to the data factory
service by clicking Publish all in the factory top bar. While not covered in this lab, Azure Data Factory
supports full git integration. Git integration allows for version control, iterative saving in a repository, and
collaboration on a data factory. For more information, see source control in Azure Data Factory.
Transform data using mapping data flow
Now that you have successfully copied data into Azure Data Lake Storage, it is time to join and aggregate that
data into a data warehouse. We will use mapping data flow, Azure Data Factory's visually designed
transformation service. Mapping data flows allow users to develop transformation logic code-free and execute
them on spark clusters managed by the ADF service.
The data flow created in this step inner joins the 'TripDataCSV' dataset created in the previous section with a
table 'dbo.TripFares' stored in 'SQLDB' based on four key columns. Then the data gets aggregated based upon
column payment_type to calculate the average of certain fields and written in an Azure Synapse Analytics table.
Add a data flow activity to your pipeline
1. In the activities pane of the pipeline canvas, open the Move and Transform accordion and drag the
Data flow activity onto the canvas.

2. In the side pane that opens, select Create new data flow and choose Mapping data flow . Click OK .
3. You'll be directed to the data flow canvas where you'll be building your transformation logic. In the
general tab, name your data flow 'JoinAndAggregateData'.

Configure your trip data csv source


1. The first thing you want to do is configure your two source transformations. The first source will point to
the 'TripDataCSV' DelimitedText dataset. To add a source transformation, click on the Add Source box in
the canvas.
2. Name your source 'TripDataCSV' and select the 'TripDataCSV' dataset from the source drop-down. If you
remember, you didn't import a schema initially when creating this dataset as there was no data there.
Since trip-data.csv exists now, click Edit to go to the dataset settings tab.

3. Go to tab Schema and click Impor t schema . Select From connection/store to import directly from
the file store. 14 columns of type string should appear.
4. Go back to data flow 'JoinAndAggregateData'. If your debug cluster has started (indicated by a green
circle next to the debug slider), you can get a snapshot of the data in the Data Preview tab. Click
Refresh to fetch a data preview.

NOTE
Data preview does not write data.

Configure your trip fares SQL DB source


1. The second source you're adding will point at the SQL DB table 'dbo.TripFares'. Under your 'TripDataCSV'
source, there will be another Add Source box. Click it to add a new source transformation.

2. Name this source 'TripFaresSQL'. Click New next to the source dataset field to create a new SQL DB
dataset.

3. Select the Azure SQL Database tile and click continue. Note: You may notice many of the connectors in
data factory are not supported in mapping data flow. To transform data from one of these sources, ingest
it into a supported source using the copy activity.
4. Call your dataset 'TripFares'. Select 'SQLDB' as your linked service. Select table name 'dbo.TripFares' from
the table name dropdown. Import the schema From connection/store . Click OK when finished.

5. To verify your data, fetch a data preview in the Data Preview tab.
Inner join TripDataCSV and TripFaresSQL
1. To add a new transformation, click the plus icon in the bottom-right corner of 'TripDataCSV'. Under
Multiple inputs/outputs , select Join .
2. Name your join transformation 'InnerJoinWithTripFares'. Select 'TripFaresSQL' from the right stream
dropdown. Select Inner as the join type. To learn more about the different join types in mapping data
flow, see join types.
Select which columns you wish to match on from each stream via the Join conditions dropdown. To
add an additional join condition, click on the plus icon next to an existing condition. By default, all join
conditions are combined with an AND operator, which means all conditions must be met for a match. In
this lab, we want to match on columns medallion , hack_license , vendor_id , and pickup_datetime

3. Verify you successfully joined 25 columns together with a data preview.


Aggregate by payment_type
1. After you complete your join transformation, add an aggregate transformation by clicking the plus icon
next to 'InnerJoinWithTripFares'. Choose Aggregate under Schema modifier.

2. Name your aggregate transformation 'AggregateByPaymentType'. Select payment_type as the group by
column.
3. Go to the Aggregates tab. Here, you'll specify two aggregations:
The average fare grouped by payment type
The total trip distance grouped by payment type
First, you'll create the average fare expression. In the text box labeled Add or select a column , enter
'average_fare'.

4. To enter an aggregation expression, click the blue box labeled Enter expression . This will open up the
data flow expression builder, a tool used to visually create data flow expressions using input schema,
built-in functions and operations, and user-defined parameters. For more information on the capabilities
of the expression builder, see the expression builder documentation.
To get the average fare, use the avg() aggregation function to aggregate the total_amount column cast
to an integer with toInteger() . In the data flow expression language, this is defined as
avg(toInteger(total_amount)) . Click Save and finish when you're done.
5. To add an additional aggregation expression, click on the plus icon next to average_fare . Select Add
column .

6. In the text box labeled Add or select a column , enter 'total_trip_distance'. As in the last step, open the
expression builder to enter in the expression.
To get the total trip distance, use the sum() aggregation function to aggregate the trip_distance
column cast to an integer with toInteger() . In the data flow expression language, this is defined as
sum(toInteger(trip_distance)) . Click Save and finish when you're done.

7. Test your transformation logic in the Data Preview tab. As you can see, there are significantly fewer
rows and columns than previously. Only the three group-by and aggregation columns defined in this
transformation continue downstream. As there are only five payment type groups in the sample, only five
rows are outputted.
Configure your Azure Synapse Analytics sink
1. Now that we have finished our transformation logic, we are ready to sink our data in an Azure Synapse
Analytics table. Add a sink transformation under the Destination section.

2. Name your sink 'SQLDWSink'. Click New next to the sink dataset field to create a new Azure Synapse
Analytics dataset.

3. Select the Azure Synapse Analytics tile and click continue.


4. Call your dataset 'AggregatedTaxiData'. Select 'SQLDW' as your linked service. Select Create new table
and name the new table dbo.AggregateTaxiData. Click OK when finished

5. Go to the Settings tab of the sink. Since we are creating a new table, we need to select Recreate table
under table action. Unselect Enable staging , which toggles whether we are inserting row-by-row or in
batch.
You have successfully created your data flow. Now it's time to run it in a pipeline activity.
Debug your pipeline end-to-end
1. Go back to the tab for the 'IngestAndTransformTaxiData' pipeline. Notice the green box on the
'IngestIntoADLS' copy activity. Drag it over to the 'JoinAndAggregateData' data flow activity. This creates
an 'on success' dependency, which causes the data flow activity to only run if the copy is successful.

2. As we did for the copy activity, click Debug to execute a debug run. For debug runs, the data flow activity
will use the active debug cluster instead of spinning up a new cluster. This pipeline will take a little over a
minute to execute.

3. Like the copy activity, the data flow has a special monitoring view accessed by the eyeglasses icon on
completion of the activity.
4. In the monitoring view, you can see a simplified data flow graph along with the execution times and rows
at each execution stage. If done correctly, you should have aggregated 49,999 rows into five rows in this
activity.

5. You can click a transformation to get additional details on its execution such as partitioning information
and new/updated/dropped columns.
You have now completed the data factory portion of this lab. Publish your resources if you wish to
operationalize them with triggers. You successfully ran a pipeline that ingested data from Azure SQL Database into
Azure Data Lake Storage using the copy activity and then aggregated that data into an Azure Synapse Analytics table.
You can verify the data was successfully written by looking at the SQL Server itself.

Share data using Azure Data Share


In this section, you'll learn how to set up a new data share using the Azure portal. This will involve creating a
new data share that will contain datasets from Azure Data Lake Store Gen2 and Azure Synapse Analytics. You'll
then configure a snapshot schedule, which will give the data consumers an option to automatically refresh the
data being shared with them. Then, you'll invite recipients to your data share.
Once you have created a data share, you'll then switch hats and become the data consumer. As the data
consumer, you'll walk through the flow of accepting a data share invitation, configuring where you'd like the
data to be received and mapping datasets to different storage locations. Then you'll trigger a snapshot, which
will copy the data shared with you into the destination specified.
Sharing data (Data Provider flow)
1. Open the Azure portal in either Microsoft Edge or Google Chrome.
2. Using the search bar at the top of the page, search for Data Shares
3. Select the data share account with 'Provider' in the name. For example, DataProvider0102 .
4. Select Start sharing your data

5. Select +Create to start configuring your new data share.


6. Under Share name , specify a name of your choice. This is the share name that will be seen by your data
consumer, so be sure to give it a descriptive name such as TaxiData.
7. Under Description , put in a sentence, which describes the contents of the data share. The data share will
contain world-wide taxi trip data that is stored in a number of stores including Azure Synapse Analytics
and Azure Data Lake Store.
8. Under Terms of use , specify a set of terms that you would like your data consumer to adhere to. Some
examples include "Do not distribute this data outside your organization" or "Refer to legal agreement".
9. Select Continue .
10. Select Add datasets
11. Select Azure Synapse Analytics to select a table from Azure Synapse Analytics that your ADF
transformations landed in.

12. You'll be given a script to run before you can proceed. The script provided creates a user in the SQL
database to allow the Azure Data Share MSI to authenticate on its behalf.

IMPORTANT
Before running the script, you must set yourself as the Active Directory Admin for the SQL Server.

1. Open a new tab and navigate to the Azure portal. Copy the script provided to create a user in the
database that you want to share data from. Do this by logging into the EDW database using Query
Explorer (preview) using AAD authentication.
You'll need to modify the script so that the user created is contained within brackets. For example:
create user [dataprovider-xxxx] from external login; exec sp_addrolemember db_owner, [dataprovider-xxxx];
2. Switch back to Azure Data Share where you were adding datasets to your data share.
3. Select EDW , then select AggregatedTaxiData for the table.
4. Select Add dataset
We now have a SQL table that is part of our dataset. Next, we will add additional datasets from Azure
Data Lake Store.
5. Select Add dataset and select Azure Data Lake Store Gen2
6. Select Next
7. Expand wwtaxidata. Expand Boston Taxi Data. Notice that you can share down to the file level.
8. Select the Boston Taxi Data folder to add the entire folder to your data share.
9. Select Add datasets
10. Review the datasets that have been added. You should have a SQL table and an ADLS Gen2 folder added
to your data share.
11. Select Continue
12. In this screen, you can add recipients to your data share. The recipients you add will receive invitations to
your data share. For the purpose of this lab, you must add in 2 e-mail addresses:
a. The e-mail address of the Azure subscription you're in.

b. Add in the fictional data consumer named [email protected].


13. In this screen, you can configure a Snapshot Setting for your data consumer. This will allow them to
receive regular updates of your data at an interval defined by you.
14. Check Snapshot Schedule and configure an hourly refresh of your data by using the Recurrence drop
down.
15. Select Create .
You now have an active data share. Let's review what you can see as a data provider when you create a
data share.
16. Select the data share that you created, titled DataProvider . You can navigate to it by selecting Sent
Shares in Data Share .
17. Click on Snapshot schedule. You can disable the snapshot schedule if you choose.
18. Next, select the Datasets tab. You can add additional datasets to this data share after it has been created.
19. Select the Share subscriptions tab. No share subscriptions exist yet because your data consumer hasn't
yet accepted your invitation.
20. Navigate to the Invitations tab. Here, you'll see a list of pending invitation(s).

21. Select the invitation to [email protected]. Select Delete. If your recipient hasn't yet accepted the
invitation, they will no longer be able to do so.
22. Select the History tab. Nothing is displayed as yet because your data consumer hasn't yet accepted your
invitation and triggered a snapshot.
Receiving data (Data consumer flow)
Now that we have reviewed our data share, we are ready to switch context and wear our data consumer hat.
You should now have an Azure Data Share invitation in your inbox from Microsoft Azure. Launch Outlook Web
Access (outlook.com) and log in using the credentials supplied for your Azure subscription.
In the e-mail that you should have received, click on "View invitation >". At this point, you're going to be
simulating the data consumer experience when accepting a data providers invitation to their data share.
You may be prompted to select a subscription. Make sure you select the subscription you have been working in
for this lab.
1. Click on the invitation titled DataProvider.
2. In this Invitation screen, you'll notice various details about the data share that you configured earlier as a
data provider. Review the details and accept the terms of use if provided.
3. Select the Subscription and Resource Group that already exists for your lab.
4. For Data share account , select DataConsumer . You can also create a new data share account.
5. Next to Received share name, you'll notice the default share name is the name that was specified by
the data provider. Give the share a friendly name that describes the data you're about to receive, e.g.,
TaxiDataShare.
6. You can choose to Accept and configure now or Accept and configure later . If you choose to accept
and configure now, you'll specify a storage account where all data should be copied. If you choose to
accept and configure later, the datasets in the share will be unmapped and you'll need to manually map
them. We will opt for the latter option here.
7. Select Accept and configure later .
In configuring this option, a share subscription is created but there is nowhere for the data to land since
no destination has been mapped.
Next, we will configure dataset mappings for the data share.
8. Select the Received Share (the name you specified in step 5).
Trigger snapshot is greyed out but the share is Active.
9. Select the Datasets tab. Notice that each dataset is Unmapped, which means that it has no destination to
copy data to.
10. Select the Azure Synapse Analytics Table and then select + Map to Target .
11. On the right-hand side of the screen, select the Target Data Type drop down.
You can map the SQL data to a wide range of data stores. In this case, we'll be mapping to an Azure SQL
Database.

(Optional) Select Azure Data Lake Store Gen2 as the target data type.
(Optional) Select the Subscription, Resource Group and Storage account you have been working in.
(Optional) You can choose to receive the data into your data lake in either csv or parquet format.
12. Next to Target data type , select Azure SQL Database.
13. Select the Subscription, Resource Group and Storage account you have been working in.
14. Before you can proceed, you'll need to create a new user in the SQL Server by running the script
provided. First, copy the script provided to your clipboard.
15. Open a new Azure portal tab. Don't close your existing tab as you'll need to come back to it in a moment.
16. In the new tab you opened, navigate to SQL databases .
17. Select the SQL database (there should only be one in your subscription). Be careful not to select the data
warehouse.
18. Select Query editor (preview)
19. Use AAD authentication to log in to Query editor.
20. Run the query provided in your data share (copied to clipboard in step 14).
This command allows the Azure Data Share service to use Managed Identities for Azure Services to
authenticate to the SQL Server to be able to copy data into it.
21. Go back to the original tab, and select Map to target .
22. Next, select the Azure Data Lake Gen2 folder that is part of the dataset and map it to an Azure Blob
Storage account.
With all datasets mapped, you're now ready to start receiving data from the data provider.

23. Select Details .


Notice that Trigger snapshot is no longer greyed out, since the data share now has destinations to copy
into.
24. Select Trigger snapshot -> Full Copy.
This will start copying data into your new data share account. In a real world scenario, this data would be
coming from a third party.
It will take approximately 3-5 minutes for the data to come across. You can monitor progress by clicking
on the History tab.
While you wait, navigate to the original data share (DataProvider) and view the status of the Share
Subscriptions and History tabs. Notice that there is now an active subscription, and as a data provider,
you can also monitor when the data consumer has started to receive the data shared with them.
25. Navigate back to the Data consumer's data share. Once the status of the trigger is successful, navigate to
the destination SQL database and data lake to see that the data has landed in the respective stores.
Congratulations, you have completed the lab!
Tutorial: How to access on-premises SQL Server
from Data Factory Managed VNet using Private
Endpoint

This tutorial provides steps for using the Azure portal to set up a Private Link service and access an on-premises SQL
Server from a Data Factory Managed VNet using a Private Endpoint.

NOTE
The solution presented in this article describes SQL Server connectivity, but you can use a similar approach to connect
and query other available on-premises connectors that are supported in Azure Data Factory.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Virtual network. If you don't have a virtual network, create one following Create Virtual Network.
Virtual network to on-premises network. Create a connection between the virtual network and your on-
premises network using either ExpressRoute or VPN.
Data Factory with Managed VNet enabled. If you don't have a Data Factory, or if Managed VNet is not
enabled, create one following Create Data Factory with Managed VNet.

Create subnets for resources


Use the portal to create subnets in your virtual network.

SUBNET        DESCRIPTION
be-subnet     subnet for backend servers
fe-subnet     subnet for standard internal load balancer
pls-subnet    subnet for Private Link Service
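
If you'd rather script this step, here is a minimal Azure PowerShell sketch. The virtual network name, resource group, and address prefixes are placeholders; the pls-subnet additionally needs private link service network policies disabled (parameter name as in recent Az.Network versions):

$vnet = Get-AzVirtualNetwork -Name "<your virtual network>" -ResourceGroupName "<your resource group>"

# Add the three subnets used in this tutorial (address prefixes are placeholders)
Add-AzVirtualNetworkSubnetConfig -Name "be-subnet"  -AddressPrefix "10.0.1.0/24" -VirtualNetwork $vnet
Add-AzVirtualNetworkSubnetConfig -Name "fe-subnet"  -AddressPrefix "10.0.2.0/24" -VirtualNetwork $vnet
# Private Link Service requires private link service network policies to be disabled on its subnet
Add-AzVirtualNetworkSubnetConfig -Name "pls-subnet" -AddressPrefix "10.0.3.0/24" -VirtualNetwork $vnet `
                                 -PrivateLinkServiceNetworkPoliciesFlag "Disabled"

# Persist the changes to the virtual network
$vnet | Set-AzVirtualNetwork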

Create a standard load balancer


Use the portal to create a standard internal load balancer.
1. On the top left-hand side of the screen, select Create a resource > Networking > Load Balancer .
2. In the Basics tab of the Create load balancer page, enter, or select the following information:

SETTING                   VALUE
Subscription              Select your subscription.
Resource group            Select your resource group.
Name                      Enter myLoadBalancer.
Region                    Select East US.
Type                      Select Internal.
SKU                       Select Standard.
Virtual network           Select your virtual network.
Subnet                    Select fe-subnet created in the previous step.
IP address assignment     Select Dynamic.
Availability zone         Select Zone-redundant.

3. Accept the defaults for the remaining settings, and then select Review + create .
4. In the Review + create tab, select Create .
Create load balancer resources
Create a backend pool
A backend address pool contains the IP addresses of the virtual machine network interfaces (NICs) connected to the load balancer.
Create the backend address pool myBackendPool to include virtual machines for load-balancing internet
traffic.
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from the
resources list.
2. Under Settings, select Backend pools, then select Add.
3. On the Add a backend pool page, type myBackendPool as the name for your backend pool,
and then select Add.
Create a health probe
The load balancer monitors the status of your app with a health probe.
The health probe adds or removes VMs from the load balancer based on their response to health checks.
Create a health probe named myHealthProbe to monitor the health of the VMs.
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from
the resources list.
2. Under Settings, select Health probes, then select Add.

SETTING                VALUE
Name                   Enter myHealthProbe.
Protocol               Select TCP.
Port                   Enter 22.
Interval               Enter 15 for the number of seconds between probe attempts.
Unhealthy threshold    Select 2 for the number of consecutive probe failures that must occur before a VM is considered unhealthy.

3. Leave the rest of the defaults and select OK.


Create a load balancer rule
A load balancer rule is used to define how traffic is distributed to the VMs. You define the frontend IP
configuration for the incoming traffic and the backend IP pool to receive the traffic. The source and destination
port are defined in the rule.
In this section, you'll create a load balancer rule:
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from
the resources list.
2. Under Settings, select Load-balancing rules, then select Add.
3. Use these values to configure the load-balancing rule:

SETTING                    VALUE
Name                       Enter myRule.
IP Version                 Select IPv4.
Frontend IP address        Select LoadBalancerFrontEnd.
Protocol                   Select TCP.
Port                       Enter 1433.
Backend port               Enter 1433.
Backend pool               Select myBackendPool.
Health probe               Select myHealthProbe.
Idle timeout (minutes)     Move the slider to 15 minutes.
TCP reset                  Select Disabled.

4. Leave the rest of the defaults and then select OK .
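
For reference, the same health probe and load-balancing rule can also be added with Azure PowerShell. A hedged sketch against the load balancer created above (the resource group name is a placeholder):

$lb = Get-AzLoadBalancer -Name "myLoadBalancer" -ResourceGroupName "<your resource group>"

# Health probe on TCP port 22, every 15 seconds, unhealthy after 2 consecutive failures
$lb | Add-AzLoadBalancerProbeConfig -Name "myHealthProbe" -Protocol Tcp -Port 22 `
                                    -IntervalInSeconds 15 -ProbeCount 2 | Out-Null

# Load-balancing rule forwarding TCP 1433 to the backend pool
$frontend = Get-AzLoadBalancerFrontendIpConfig -LoadBalancer $lb -Name "LoadBalancerFrontEnd"
$pool     = Get-AzLoadBalancerBackendAddressPoolConfig -LoadBalancer $lb -Name "myBackendPool"
$probe    = Get-AzLoadBalancerProbeConfig -LoadBalancer $lb -Name "myHealthProbe"

$lb | Add-AzLoadBalancerRuleConfig -Name "myRule" -Protocol Tcp -FrontendPort 1433 -BackendPort 1433 `
                                   -FrontendIpConfiguration $frontend -BackendAddressPool $pool `
                                   -Probe $probe -IdleTimeoutInMinutes 15 | Out-Null

# Push the updated configuration to Azure
$lb | Set-AzLoadBalancer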


Create a private link service
In this section, you'll create a Private Link service behind a standard load balancer.
1. On the upper-left part of the page in the Azure portal, select Create a resource .
2. Search for Private Link in the Search the Marketplace box.
3. Select Create .
4. In Overview under Private Link Center, select the blue Create private link service button.
5. In the Basics tab under Create private link service, enter or select the following information:

SETTING            VALUE
Project details
Subscription       Select your subscription.
Resource Group     Select your resource group.
Instance details
Name               Enter myPrivateLinkService.
Region             Select East US.

6. Select the Outbound settings tab or select Next: Outbound settings at the bottom of the page.
7. In the Outbound settings tab, enter or select the following information:

SETTING                                VALUE
Load balancer                          Select myLoadBalancer.
Load balancer frontend IP address      Select LoadBalancerFrontEnd.
Source NAT subnet                      Select pls-subnet.
Enable TCP proxy V2                    Leave the default of No.
Private IP address settings            Leave the default settings.

8. Select the Access security tab or select Next: Access security at the bottom of the page.
9. Leave the default of Role-based access control only in the Access security tab.
10. Select the Tags tab or select Next: Tags at the bottom of the page.
11. Select the Review + create tab or select Next: Review + create at the bottom of the page.
12. Select Create in the Review + create tab.

Create backend servers


1. On the upper-left side of the portal, select Create a resource > Compute > Virtual machine.
2. In Create a virtual machine, type or select the values in the Basics tab:

SETTING                   VALUE
Project details
Subscription              Select your Azure subscription.
Resource Group            Select your resource group.
Instance details
Virtual machine name      Enter myVM1.
Region                    Select East US.
Availability Options      Select Availability zones.
Availability zone         Select 1.
Image                     Select Ubuntu Server 18.04 LTS – Gen1.
Azure Spot instance       Select No.
Size                      Choose a VM size or take the default setting.
Administrator account
Username                  Enter a username.
SSH public key source     Generate new key pair.
Key pair name             mySSHKey.
Inbound port rules
Public inbound ports      None

3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:

SETTING                                                                    VALUE
Network interface
Virtual network                                                            Select your virtual network.
Subnet                                                                     be-subnet.
Public IP                                                                  Select None.
NIC network security group                                                 Select None.
Load balancing
Place this virtual machine behind an existing load balancing solution?     Select Yes.
Load balancing settings
Load balancing options                                                     Select Azure load balancing.
Select a load balancer                                                     Select myLoadBalancer.
Select a backend pool                                                      Select myBackendPool.

5. Select Review + create .


6. Review the settings, and then select Create .
7. You can repeat step 1 to 6 to have more than 1 backend server VM for HA.

Creating Forwarding Rule to Endpoint


1. Log in and copy the script ip_fwd.sh to your backend server VMs.
2. Run the script with the following options:
sudo ./ip_fwd.sh -i eth0 -f 1433 -a <FQDN/IP> -b 1433
<FQDN/IP> is your target SQL Server IP.

NOTE
FQDN doesn't work for on-premises SQL Server unless you add a record in Azure DNS zone.

3. Run the command below and check the iptables in your backend server VMs. You should see one record in your
iptables with your target IP.
sudo iptables -t nat -v -L PREROUTING -n --line-number
NOTE
If you have more than one SQL Server or data source, you need to define multiple load balancer rules and IP
table records with different ports. Otherwise, there will be conflicts. For example:

                PORT IN LOAD BALANCER RULE    BACKEND PORT IN LOAD BALANCER RULE    COMMAND RUN IN BACKEND SERVER VM
SQL Server 1    1433                          1433                                  sudo ./ip_fwd.sh -i eth0 -f 1433 -a <FQDN/IP> -b 1433
SQL Server 2    1434                          1434                                  sudo ./ip_fwd.sh -i eth0 -f 1434 -a <FQDN/IP> -b 1433

Create a Private Endpoint to Private Link Service


1. Select All services in the left-hand menu, select All resources, and then select your data factory from the
resources list.
2. Select Author & Monitor to launch the Data Factory UI in a separate tab.
3. Go to the Manage tab and then go to the Managed private endpoints section.
4. Select + New under Managed private endpoints .
5. Select the Private Link Ser vice tile from the list and select Continue .
6. Enter the name of private endpoint and select myPrivateLinkSer vice in private link service list.
7. Add FQDN of your target on-premises SQL Server and NAT IPs of your private link Service.
8. Create private endpoint.

Create a linked service and test the connection


1. Go to the Manage tab and then go to the Linked services section.
2. Select + New under Linked services.
3. Select the SQL Server tile from the list and select Continue.
4. Enable Interactive Authoring.
5. Enter the FQDN of your on-premises SQL Server, the user name, and the password.
6. Then click Test connection .
Troubleshooting
Go to the backend server VM and confirm that telnet to the SQL Server works: telnet <FQDN> 1433.

Next steps
Advance to the following tutorial to learn about accessing Microsoft Azure SQL Managed Instance from Data
Factory Managed VNet using Private Endpoint:
Access SQL Managed Instance from Data Factory Managed VNet
Tutorial: How to access SQL Managed Instance from
Data Factory Managed VNET using Private
Endpoint

This tutorial provides steps for using the Azure portal to set up a Private Link service and access a SQL Managed
Instance from a Data Factory Managed VNet using a Private Endpoint.

Prerequisites
Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Virtual network. If you don't have a virtual network, create one following Create Virtual Network.
Virtual network to on-premises network. Create a connection between the virtual network and your on-
premises network using either ExpressRoute or VPN.
Data Factory with Managed VNet enabled. If you don't have a Data Factory, or if Managed VNet is not
enabled, create one following Create Data Factory with Managed VNet.

Create subnets for resources


Use the portal to create subnets in your virtual network.

SUBNET        DESCRIPTION
be-subnet     subnet for backend servers
fe-subnet     subnet for standard internal load balancer
pls-subnet    subnet for Private Link Service


Create a standard load balancer
Use the portal to create a standard internal load balancer.
1. On the top left-hand side of the screen, select Create a resource > Networking > Load Balancer .
2. In the Basics tab of the Create load balancer page, enter, or select the following information:

SETTING                   VALUE
Subscription              Select your subscription.
Resource group            Select your resource group.
Name                      Enter myLoadBalancer.
Region                    Select East US.
Type                      Select Internal.
SKU                       Select Standard.
Virtual network           Select your virtual network.
Subnet                    Select fe-subnet created in the previous step.
IP address assignment     Select Dynamic.
Availability zone         Select Zone-redundant.

3. Accept the defaults for the remaining settings, and then select Review + create .
4. In the Review + create tab, select Create .
Create load balancer resources
Create a backend pool
A backend address pool contains the IP addresses of the virtual machine network interfaces (NICs) connected to the load balancer.
Create the backend address pool myBackendPool to include virtual machines for load-balancing internet
traffic.
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from the
resources list.
2. Under Settings, select Backend pools, then select Add.
3. On the Add a backend pool page, type myBackendPool as the name for your backend pool,
and then select Add.
Create a health probe
The load balancer monitors the status of your app with a health probe.
The health probe adds or removes VMs from the load balancer based on their response to health checks.
Create a health probe named myHealthProbe to monitor the health of the VMs.
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from
the resources list.
2. Under Settings , select Health probes , then select Add .
SETTING    VALUE

Name Enter myHealthProbe .

Protocol Select TCP .

Port Enter 22.

Interval    Enter 15 for the number of seconds between probe attempts.

Unhealthy threshold    Select 2 for the number of consecutive probe failures that must occur before a VM is considered unhealthy.

3. Leave the rest of the settings at their defaults and select OK.


Create a load balancer rule
A load balancer rule is used to define how traffic is distributed to the VMs. You define the frontend IP
configuration for the incoming traffic and the backend IP pool to receive the traffic. The source and destination
port are defined in the rule.
In this section, you'll create a load balancer rule:
1. Select All services in the left-hand menu, select All resources, and then select myLoadBalancer from
the resources list.
2. Under Settings , select Load-balancing rules , then select Add .
3. Use these values to configure the load-balancing rule:

SETTING    VALUE

Name Enter myRule .

IP Version Select IPv4 .

Frontend IP address Select LoadBalancerFrontEnd .

Protocol Select TCP .

Port Enter 1433 .

Backend port Enter 1433 .

Backend pool Select myBackendPool.

Health probe Select myHealthProbe .

Idle timeout (minutes) Move the slider to 15 minutes.

TCP reset Select Disabled .

4. Leave the rest of the defaults and then select OK .


Create a private link service
In this section, you'll create a Private Link service behind a standard load balancer.
1. On the upper-left part of the page in the Azure portal, select Create a resource .
2. Search for Private Link in the Search the Marketplace box.
3. Select Create .
4. In Overview under Private Link Center, select the blue Create private link service button.
5. In the Basics tab under Create private link ser vice , enter, or select the following information:

SETTING    VALUE

Project details

Subscription Select your subscription.

Resource Group Select your resource group.

Instance details

Name    Enter myPrivateLinkService.

Region Select East US.

6. Select the Outbound settings tab or select Next: Outbound settings at the bottom of the page.
7. In the Outbound settings tab, enter or select the following information:

SETTING    VALUE

Load balancer Select myLoadBalancer .

Load balancer frontend IP address Select LoadBalancerFrontEnd .

Source NAT subnet Select pls-subnet .

Enable TCP proxy V2 Leave the default of No .

Private IP address settings

Leave the default settings.

8. Select the Access security tab or select Next: Access security at the bottom of the page.
9. Leave the default of Role-based access control only in the Access security tab.
10. Select the Tags tab or select Next: Tags at the bottom of the page.
11. Select the Review + create tab or select Next: Review + create at the bottom of the page.
12. Select Create in the Review + create tab.

Create backend servers


1. On the upper-left side of the portal, select Create a resource > Compute > Virtual machine.
2. In Create a virtual machine, type or select the values in the Basics tab:

SETTING    VALUE

Project details

Subscription Select your Azure subscription.

Resource Group Select your resource group.

Instance details

Virtual machine name Enter myVM1 .

Region Select East US.

Availability Options Select Availability zones .

Availability zone Select 1 .

Image    Select Ubuntu Server 18.04 LTS – Gen1.

Azure Spot instance Select No .

Size Choose VM size or take default setting.

Administrator account

Username Enter a username.

SSH public key source Generate new key pair.

Key pair name mySSHKey.

Inbound port rules

Public inbound ports None.

3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:

SETTING    VALUE

Network interface

Virtual network Select your virtual network.

Subnet be-subnet .

Public IP Select None .



NIC network security group Select None .

Load balancing

Place this virtual machine behind an existing load balancing solution?    Select Yes.

Load balancing settings

Load balancing options Select Azure load balancing .

Select a load balancer Select myLoadBalancer .

Select a backend pool Select myBackendPool.

5. Select Review + create .


6. Review the settings, and then select Create .
7. You can repeat steps 1 to 6 to create more than one backend server VM for high availability.

Create a forwarding rule to the endpoint


1. Log in and copy the ip_fwd.sh script to your backend server VMs.
2. Run the script with the following options:
sudo ./ip_fwd.sh -i eth0 -f 1433 -a <FQDN/IP> -b 1433
<FQDN/IP> is the host of your SQL Managed Instance.

3. Run the following command and check the iptables in your backend server VMs. You should see one record in
your iptables with your target IP.
sudo iptables -t nat -v -L PREROUTING -n --line-number
NOTE
If you have more than one SQL MI or other data sources, you need to define multiple load balancer rules
and iptables records with different ports. Otherwise, there will be conflicts. For example:

             PORT IN LOAD BALANCER RULE    BACKEND PORT IN LOAD BALANCER RULE    COMMAND RUN IN BACKEND SERVER VM

SQL MI 1     1433                          1433                                  sudo ./ip_fwd.sh -i eth0 -f 1433 -a <FQDN/IP> -b 1433

SQL MI 2     1434                          1434                                  sudo ./ip_fwd.sh -i eth0 -f 1434 -a <FQDN/IP> -b 1433

Create a Private Endpoint to Private Link Service


1. Select All services in the left-hand menu, select All resources, and then select your data factory from the
resources list.
2. Select Author & Monitor to launch the Data Factory UI in a separate tab.
3. Go to the Manage tab and then go to the Managed private endpoints section.
4. Select + New under Managed private endpoints .
5. Select the Private Link Service tile from the list and select Continue.
6. Enter a name for the private endpoint and select myPrivateLinkService in the private link service list.
7. Add the FQDN of your target SQL Managed Instance and the NAT IPs of your Private Link service.
8. Create private endpoint.

Create a linked service and test the connection


1. Go to the Manage tab and then go to the Linked services section.
2. Select + New under Linked services.
3. Select the Azure SQL Database Managed Instance tile from the list and select Continue.
4. Enable Interactive Authoring.
5. Enter the host of your SQL Managed Instance, the user name, and the password.

NOTE
Enter the SQL Managed Instance host manually. Otherwise, the value offered in the selection list is not a fully
qualified domain name.

6. Then click Test connection .


Next steps
Advance to the following tutorial to learn about accessing an on-premises SQL Server from a Data Factory Managed
VNET using Private Endpoint:
Access on-premises SQL Server from Data Factory Managed VNET
Azure PowerShell samples for Azure Data Factory
4/22/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The following table includes links to sample Azure PowerShell scripts for Azure Data Factory.

SCRIPT    DESCRIPTION

Copy data

Copy blobs from a folder to another folder in an Azure Blob This PowerShell script copies blobs from a folder in Azure
Storage Blob Storage to another folder in the same Blob Storage.

Copy data from SQL Server to Azure Blob Storage This PowerShell script copies data from a SQL Server
database to an Azure blob storage.

Bulk copy This sample PowerShell script copies data from multiple
tables in a database in Azure SQL Database to Azure
Synapse Analytics.

Incremental copy This sample PowerShell script loads only new or updated
records from a source data store to a sink data store after
the initial full copy of data from the source to the sink.

Transform data

Transform data using a Spark cluster This PowerShell script transforms data by running a program
on a Spark cluster.

Lift and shift SSIS packages to Azure

Create Azure-SSIS integration runtime This PowerShell script provisions an Azure-SSIS integration
runtime that runs SQL Server Integration Services (SSIS)
packages in Azure.
Pipelines and activities in Azure Data Factory
6/24/2021 • 16 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article helps you understand pipelines and activities in Azure Data Factory and use them to construct end-
to-end data-driven workflows for your data movement and data processing scenarios.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. For example, a pipeline could contain a set of activities that ingest and clean log data, and then kick off a
mapping data flow to analyze the log data. The pipeline allows you to manage the activities as a set instead of
each one individually. You deploy and schedule the pipeline instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to
copy data from SQL Server to an Azure Blob Storage. Then, use a data flow activity or a Databricks Notebook
activity to process and transform data from the blob storage to an Azure Synapse Analytics pool on top of which
business intelligence reporting solutions are built.
Data Factory has three groupings of activities: data movement activities, data transformation activities, and
control activities. An activity can take zero or more input datasets and produce one or more output datasets. The
following diagram shows the relationship between pipeline, activity, and dataset in Data Factory:

An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output
for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents.
After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an
input/output dataset of a Copy Activity or an HDInsightHive Activity. For more information about datasets, see
Datasets in Azure Data Factory article.

Data movement activities


Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the
data stores listed in the table in this section. Data from any source can be written to any sink. Click a data store
to learn how to copy data to and from that store.

CATEGORY    DATA STORE    SUPPORTED AS A SOURCE    SUPPORTED AS A SINK    SUPPORTED BY AZURE IR    SUPPORTED BY SELF-HOSTED IR

Azure    Azure Blob storage    ✓    ✓    ✓    ✓

Azure Cognitive ✓ ✓ ✓
Search index

Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)

Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data Lake ✓ ✓ ✓ ✓


Storage Gen1

Azure Data Lake ✓ ✓ ✓ ✓


Storage Gen2

Azure Database ✓ ✓ ✓
for MariaDB

Azure Database ✓ ✓ ✓ ✓
for MySQL

Azure Database ✓ ✓ ✓ ✓
for PostgreSQL

Azure Databricks ✓ ✓ ✓ ✓
Delta Lake

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓ ✓
Managed
Instance

Azure Synapse ✓ ✓ ✓ ✓
Analytics

Azure Table ✓ ✓ ✓ ✓
storage

Database Amazon Redshift ✓ ✓ ✓

DB2 ✓ ✓ ✓

Drill ✓ ✓ ✓

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache Impala ✓ ✓ ✓

Informix ✓ ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft Access ✓ ✓ ✓

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓

SAP Business ✓ ✓
Warehouse via
Open Hub

SAP Business ✓ ✓
Warehouse via
MDX

SAP HANA ✓ ✓ ✓

SAP table ✓ ✓

Snowflake ✓ ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓ ✓

MongoDB Atlas ✓ ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

Amazon S3 ✓ ✓ ✓
Compatible
Storage

File system ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google Cloud ✓ ✓ ✓
Storage

HDFS ✓ ✓ ✓

Oracle Cloud ✓ ✓ ✓
Storage

SFTP ✓ ✓ ✓ ✓

Generic protocol    Generic HTTP    ✓    ✓    ✓

Generic OData ✓ ✓ ✓

Generic ODBC ✓ ✓ ✓

Generic REST ✓ ✓ ✓ ✓

Services and apps    Amazon Marketplace Web Service    ✓    ✓    ✓

Concur (Preview) ✓ ✓ ✓

Dataverse ✓ ✓ ✓ ✓

Dynamics 365 ✓ ✓ ✓ ✓

Dynamics AX ✓ ✓ ✓

Dynamics CRM ✓ ✓ ✓ ✓

Google AdWords ✓ ✓ ✓

HubSpot ✓ ✓ ✓

Jira ✓ ✓ ✓

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Microsoft 365 ✓ ✓ ✓

Oracle Eloqua ✓ ✓ ✓
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle Service ✓ ✓ ✓
Cloud (Preview)

PayPal (Preview) ✓ ✓ ✓

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service Cloud

Salesforce ✓ ✓ ✓
Marketing Cloud

SAP Cloud for ✓ ✓ ✓ ✓


Customer (C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

SharePoint ✓ ✓ ✓
Online List

Shopify (Preview) ✓ ✓ ✓

Square (Preview) ✓ ✓ ✓

Web table ✓ ✓
(HTML table)

Xero ✓ ✓ ✓

Zoho (Preview) ✓ ✓ ✓

NOTE
If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, contact Azure support.

For more information, see Copy Activity - Overview article.

Data transformation activities


Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.

DATA TRANSFORMATION ACTIVITY    COMPUTE ENVIRONMENT

Data Flow Azure Databricks managed by Azure Data Factory

Azure Function Azure Functions

Hive HDInsight [Hadoop]

Pig HDInsight [Hadoop]

MapReduce HDInsight [Hadoop]

Hadoop Streaming HDInsight [Hadoop]

Spark HDInsight [Hadoop]

Azure Machine Learning Studio (classic) activities: Batch Execution and Update Resource    Azure VM

Stored Procedure Azure SQL, Azure Synapse Analytics, or SQL Server

U-SQL Azure Data Lake Analytics

Custom Activity Azure Batch

Databricks Notebook Azure Databricks

Databricks Jar Activity Azure Databricks

Databricks Python Activity Azure Databricks

For more information, see the data transformation activities article.


Control flow activities
The following control flow activities are supported:

CONTROL ACTIVITY    DESCRIPTION

Append Variable Add a value to an existing array variable.

Execute Pipeline Execute Pipeline activity allows a Data Factory pipeline to


invoke another pipeline.

Filter Apply a filter expression to an input array

For Each ForEach Activity defines a repeating control flow in your


pipeline. This activity is used to iterate over a collection and
executes specified activities in a loop. The loop
implementation of this activity is similar to the Foreach
looping structure in programming languages.

Get Metadata GetMetadata activity can be used to retrieve metadata of


any data in Azure Data Factory.

If Condition Activity The If Condition can be used to branch based on condition


that evaluates to true or false. The If Condition activity
provides the same functionality that an if statement provides
in programming languages. It evaluates a set of activities
when the condition evaluates to true and another set of
activities when the condition evaluates to false.

Lookup Activity    Lookup Activity can be used to read or look up a record, table name, or value from any external source. This output can further be referenced by succeeding activities.

Set Variable Set the value of an existing variable.

Until Activity Implements Do-Until loop that is similar to Do-Until looping


structure in programming languages. It executes a set of
activities in a loop until the condition associated with the
activity evaluates to true. You can specify a timeout value for
the until activity in Data Factory.

Validation Activity Ensure a pipeline only continues execution if a reference


dataset exists, meets a specified criteria, or a timeout has
been reached.

Wait Activity When you use a Wait activity in a pipeline, the pipeline waits
for the specified time before continuing with execution of
subsequent activities.

Web Activity Web Activity can be used to call a custom REST endpoint
from a Data Factory pipeline. You can pass datasets and
linked services to be consumed and accessed by the activity.

Webhook Activity Using the webhook activity, call an endpoint, and pass a
callback URL. The pipeline run waits for the callback to be
invoked before proceeding to the next activity.
Pipeline JSON
Here is how a pipeline is defined in JSON format:

{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
},
"concurrency": <your max pipeline concurrency>,
"annotations": [
]
}
}

TAG    DESCRIPTION    TYPE    REQUIRED

name    Name of the pipeline. Specify a name that represents the action that the pipeline performs.    String    Yes
            Maximum number of characters: 140
            Must start with a letter, number, or an underscore (_)
            The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description Specify the text describing String No


what the pipeline is used
for.

activities The activities section can Array Yes


have one or more activities
defined within it. See the
Activity JSON section for
details about the activities
JSON element.

parameters The parameters section List No


can have one or more
parameters defined within
the pipeline, making your
pipeline flexible for reuse.

concurrency The maximum number of Number No


concurrent runs the pipeline
can have. By default, there
is no maximum. If the
concurrency limit is reached,
additional pipeline runs are
queued until earlier ones
complete

annotations A list of tags associated Array No


with the pipeline

Activity JSON
The activities section can have one or more activities defined within it. There are two main types of activities:
Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the following top-level
structure:

{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}

The following table describes properties in the activity JSON definition:

TAG    DESCRIPTION    REQUIRED

name    Name of the activity. Specify a name that represents the action that the activity performs.    Yes
            Maximum number of characters: 55
            Must start with a letter, number, or an underscore (_)
            The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description    Text describing what the activity is used for    Yes

type Type of the activity. See the Data Yes


Movement Activities, Data
Transformation Activities, and Control
Activities sections for different types of
activities.

linkedServiceName Name of the linked service used by the Yes for HDInsight Activity, Azure
activity. Machine Learning Studio (classic) Batch
Scoring Activity, Stored Procedure
An activity may require that you Activity.
specify the linked service that links to
the required compute environment. No for all others

typeProperties Properties in the typeProperties No


section depend on each type of
activity. To see type properties for an
activity, click links to the activity in the
previous section.

policy Policies that affect the run-time No


behavior of the activity. This property
includes a timeout and retry behavior.
If it isn't specified, default values are
used. For more information, see
Activity policy section.

dependsOn This property is used to define activity No


dependencies, and how subsequent
activities depend on previous activities.
For more information, see Activity
dependency

Activity policy
Policies affect the run-time behavior of an activity, giving configurability options. Activity Policies are only
available for execution activities.
Activity policy JSON definition
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity",
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}

JSON NAME    DESCRIPTION    ALLOWED VALUES    REQUIRED

timeout Specifies the timeout for the Timespan No. Default timeout is 7
activity to run. days.

retry Maximum retry attempts Integer No. Default is 0

retryIntervalInSeconds The delay between retry Integer No. Default is 30 seconds


attempts in seconds

secureOutput    When set to true, output from the activity is treated as secure and isn't logged for monitoring.    Boolean    No. Default is false.

Control activity
Control activities have the following top-level structure:

{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}

TAG    DESCRIPTION    REQUIRED

name    Name of the activity. Specify a name that represents the action that the activity performs.    Yes
            Maximum number of characters: 55
            Must start with a letter, number, or an underscore (_)
            The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

description    Text describing what the activity is used for    Yes

type Type of the activity. See the data Yes


movement activities, data
transformation activities, and control
activities sections for different types of
activities.

typeProperties Properties in the typeProperties No


section depend on each type of
activity. To see type properties for an
activity, click links to the activity in the
previous section.

dependsOn This property is used to define Activity No


Dependency, and how subsequent
activities depend on previous activities.
For more information, see activity
dependency.

Activity dependency
Activity Dependency defines how subsequent activities depend on previous activities, determining the condition
of whether to continue executing the next task. An activity can depend on one or multiple previous activities
with different dependency conditions.
The different dependency conditions are: Succeeded, Failed, Skipped, Completed.
For example, if a pipeline has Activity A -> Activity B, the different scenarios that can happen are:
Activity B has dependency condition on Activity A with succeeded : Activity B only runs if Activity A has a
final status of succeeded
Activity B has dependency condition on Activity A with failed : Activity B only runs if Activity A has a final
status of failed
Activity B has dependency condition on Activity A with completed : Activity B runs if Activity A has a final
status of succeeded or failed
Activity B has a dependency condition on Activity A with skipped : Activity B runs if Activity A has a final
status of skipped. Skipped occurs in the scenario of Activity X -> Activity Y -> Activity Z, where each activity
runs only if the previous activity succeeds. If Activity X fails, then Activity Y has a status of "Skipped" because
it never executes. Similarly, Activity Z has a status of "Skipped" as well.
Example: Activity 2 depends on the Activity 1 succeeding
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}

Sample copy pipeline


In the following sample pipeline, there is one activity of type Copy in the activities section. In this sample, the
copy activity copies data from an Azure Blob storage to a database in Azure SQL Database.
{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"policy": {
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to Copy .
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset . See Datasets
article for defining datasets in JSON.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the
sink type. In the data movement activities section, click the data store that you want to use as a source or a
sink to learn more about moving data to/from that data store.
For a complete walkthrough of creating this pipeline, see Quickstart: create a data factory.
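
Once a pipeline like this has been deployed and run, you can check the copy activity's status from Azure PowerShell. This is a minimal monitoring sketch, assuming the Az.DataFactory module and that $runId holds a pipeline run ID returned by Invoke-AzDataFactoryV2Pipeline; the resource group and factory names are placeholders.

# Minimal monitoring sketch (assumptions: Az.DataFactory installed, $runId holds the run ID
# returned by Invoke-AzDataFactoryV2Pipeline; names are placeholders).
Get-AzDataFactoryV2ActivityRun -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddHours(-1) `
    -RunStartedBefore (Get-Date) |
    Select-Object ActivityName, Status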

Sample transformation pipeline


In the following sample pipeline, there is one activity of type HDInsightHive in the activities section. In this
sample, the HDInsight Hive activity transforms data from an Azure Blob storage by running a Hive script file on
an Azure HDInsight Hadoop cluster.
{
"name": "TransformPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"retry": 3
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
]
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to HDInsightHive .
The Hive script file, partitionweblogs.hql, is stored in the Azure Storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), in the script folder in the adfgetstarted container.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive
configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).

The typeProperties section is different for each transformation activity. To learn about the type properties
supported for a transformation activity, click the transformation activity in the Data transformation activities section.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.

Multiple activities in a pipeline


The previous two sample pipelines have only one activity in them. You can have more than one activity in a
pipeline. If you have multiple activities in a pipeline and subsequent activities are not dependent on previous
activities, the activities may run in parallel.
You can chain two activities by using activity dependency, which defines how subsequent activities depend on
previous activities, determining the condition whether to continue executing the next task. An activity can
depend on one or more previous activities with different dependency conditions.

Scheduling pipelines
Pipelines are scheduled by triggers. There are different types of triggers (Scheduler trigger, which allows
pipelines to be triggered on a wall-clock schedule, as well as the manual trigger, which triggers pipelines on-
demand). For more information about triggers, see pipeline execution and triggers article.
To have your trigger kick off a pipeline run, you must include a pipeline reference of the particular pipeline in the
trigger definition. Pipelines and triggers have a many-to-many (n:m) relationship. Multiple triggers can kick off a single pipeline,
and the same trigger can kick off multiple pipelines. Once the trigger is defined, you must start the trigger to
have it start triggering the pipeline. For more information about triggers, see pipeline execution and triggers
article.
For example, say you have a Scheduler trigger, "Trigger A," that you want to use to kick off your pipeline, "MyCopyPipeline."
You define the trigger, as shown in the following example:
Trigger A definition

{
    "name": "TriggerA",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            ...
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyCopyPipeline"
            },
            "parameters": {
                "copySourceName": "FileSource"
            }
        }
    }
}

Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
How to achieve CI/CD (continuous integration and delivery) using Azure Data Factory
Continuous integration and delivery in Azure Data Factory
Linked services in Azure Data Factory
4/22/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes what linked services are, how they're defined in JSON format, and how they're used in
Azure Data Factory pipelines.
If you're new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together
perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a
copy activity to copy data from SQL Server to Azure Blob storage. Then, you might use a Hive activity that runs a
Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you
might use a second copy activity to copy the output data to Azure Synapse Analytics, on top of which business
intelligence (BI) reporting solutions are built. For more information about pipelines and activities, see Pipelines
and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points to or references the data you want to use in your
activities as inputs and outputs.
Before you create a dataset, you must create a linked ser vice to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way; the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For example, an Azure
Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder within that Azure Storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create two linked services:
Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure
Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service).
The Azure Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at
runtime to connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset
specifies the blob container and blob folder that contains the input blobs in your Blob storage. The Azure SQL
Table dataset specifies the SQL table in your SQL Database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data
Factory:
Linked service JSON
A linked service in Data Factory is defined in JSON format as follows:

{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

The following table describes properties in the above JSON:

PROPERTY    DESCRIPTION    REQUIRED

name Name of the linked service. See Azure Yes


Data Factory - Naming rules.

type Type of the linked service. For example: Yes


AzureBlobStorage (data store) or
AzureBatch (compute). See the
description for typeProperties.

typeProperties The type properties are different for Yes


each data store or compute.

For the supported data store types


and their type properties, see the
connector overview article. Navigate to
the data store connector article to
learn about type properties specific to
a data store.

For the supported compute types and


their type properties, see Compute
linked services.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in a private
network). If not specified, it uses the
default Azure Integration Runtime.

Linked service example


The following linked service is an Azure Blob storage linked service. Notice that the type is set to Azure Blob
storage. The type properties for the Azure Blob storage linked service include a connection string. The Data
Factory service uses this connection string to connect to the data store at runtime.
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Create linked services


Linked services can be created in the Azure Data Factory UX via the management hub and from any activities,
datasets, or data flows that reference them.
You can create linked services by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure
Resource Manager Template, and Azure portal.
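
With Azure PowerShell, for example, a linked service definition like the one above can be deployed from a local JSON file. This is a minimal sketch, assuming the Az.DataFactory module, an existing data factory, and that the JSON is saved as AzureBlobStorageLinkedService.json; the resource group and factory names are placeholders.

# Minimal sketch (assumptions: Az.DataFactory installed, linked service JSON saved locally;
# resource group and factory names are placeholders).
Set-AzDataFactoryV2LinkedService -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "AzureBlobStorageLinkedService" `
    -DefinitionFile ".\AzureBlobStorageLinkedService.json"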

Data store linked services


You can find the list of data stores supported by Data Factory in the Connector overview article. Click a data store
to learn about its supported connection properties.

Compute linked services


See Compute environments supported for details about the different compute environments you can connect
to from your data factory, as well as the different configurations.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory
3/22/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure
Data Factory pipelines.
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together
perform a task. The activities in a pipeline define actions to perform on your data. Now, a dataset is a named
view of data that simply points to or references the data you want to use in your activities as inputs and outputs.
Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an
Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read
the data.
Before you create a dataset, you must create a linked ser vice to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way; the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For example, an Azure
Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder within that Azure Storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create two linked services:
Azure Blob Storage and Azure SQL Database. Then, create two datasets: Delimited Text dataset (which refers to
the Azure Blob Storage linked service, assuming you have text files as source) and Azure SQL Table dataset
(which refers to the Azure SQL Database linked service). The Azure Blob Storage and Azure SQL Database linked
services contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and
Azure SQL Database, respectively. The Delimited Text dataset specifies the blob container and blob folder that
contains the input blobs in your Blob storage, along with format-related settings. The Azure SQL Table dataset
specifies the SQL table in your SQL Database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data
Factory:

Dataset JSON
A dataset in Data Factory is defined in the following JSON format:

{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: DelimitedText, AzureSqlTable etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema":[

],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}

The following table describes properties in the above JSON:

PROPERTY    DESCRIPTION    REQUIRED

name Name of the dataset. See Azure Data Yes


Factory - Naming rules.

type Type of the dataset. Specify one of the Yes


types supported by Data Factory (for
example: DelimitedText, AzureSqlTable).

For details, see Dataset types.

schema Schema of the dataset, represents the No


physical data type and shape.

typeProperties The type properties are different for Yes


each type. For details on the
supported types and their properties,
see Dataset type.

When you import the schema of a dataset, select the Import Schema button and choose to import from the
source or from a local file. In most cases, you'll import the schema directly from the source. But if you already
have a local schema file (a Parquet file or CSV with headers), you can direct Data Factory to base the schema on
that file.
In copy activity, datasets are used in source and sink. Schema defined in dataset is optional as reference. If you
want to apply column/field mapping between source and sink, refer to Schema and type mapping.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data schemas. If
your data has no schema, you can use schema drift for your source and sink. Metadata from the datasets
appears in your source transformation as the source projection. The projection in the source transformation
represents the Data Flow data with defined names and types.

Dataset type
Azure Data Factory supports many different types of datasets, depending on the data stores you use. You can
find the list of data stores supported by Data Factory from Connector overview article. Click a data store to learn
how to create a linked service and a dataset for it.
For example, for a Delimited Text dataset, the dataset type is set to DelimitedText as shown in the following
JSON sample:

{
"name": "DelimitedTextInput",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "input.log",
"folderPath": "inputdata",
"container": "adfgetstarted"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}

Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource
Manager Template, and Azure portal.
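
With Azure PowerShell, for instance, the DelimitedText dataset shown above can be created from a saved JSON file. This is a minimal sketch, assuming the Az.DataFactory module and that the definition is saved as DelimitedTextInput.json; the resource group and factory names are placeholders.

# Minimal sketch (assumptions: Az.DataFactory installed, dataset JSON saved locally;
# resource group and factory names are placeholders).
Set-AzDataFactoryV2Dataset -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "DelimitedTextInput" `
    -DefinitionFile ".\DelimitedTextInput.json"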

Current version vs. version 1 datasets


Here are some differences between Data Factory and Data Factory version 1 datasets:
The external property is not supported in the current version. It's replaced by a trigger.
The policy and availability properties are not supported in the current version. The start time for a pipeline
depends on triggers.
Scoped datasets (datasets defined in a pipeline) are not supported in the current version.

Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory
5/28/2021 • 16 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a
pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the
pipeline or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely defines
that particular pipeline run.
Pipeline runs are typically instantiated by passing arguments to parameters that you define in the pipeline. You
can execute a pipeline either manually or by using a trigger. This article provides details about both ways of
executing a pipeline.

Manual execution (on-demand)


The manual execution of a pipeline is also referred to as on-demand execution.
For example, say you have a basic pipeline named copyPipeline that you want to execute. The pipeline has a
single activity that copies from an Azure Blob storage source folder to a destination folder in the same storage.
The following JSON definition shows this sample pipeline:
{
"name": "copyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "CopyBlobtoBlob",
"inputs": [
{
"referenceName": "sourceBlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sinkBlobDataset",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
}
}
}
}

In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer . You
pass values to these parameters at runtime.
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
REST API
The following sample command shows you how to run your pipeline by using the REST API manually:

POST
https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFa
ctory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview

For a complete sample, see Quickstart: Create a data factory by using the REST API.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

The following sample command shows you how to manually run your pipeline by using Azure PowerShell:

Invoke-AzDataFactoryV2Pipeline -DataFactory $df -PipelineName "Adfv2QuickStartPipeline" -ParameterFile .\PipelineParameters.json

You pass parameters in the body of the request payload. In the .NET SDK, Azure PowerShell, and the Python SDK,
you pass values in a dictionary that's passed as an argument to the call:

{
"sourceBlobContainer": "MySourceFolder",
"sinkBlobContainer": "MySinkFolder"
}

The response payload is a unique ID of the pipeline run:

{
"runId": "0448d45a-a0bd-23f3-90a5-bfeea9264aed"
}

For a complete sample, see Quickstart: Create a data factory by using Azure PowerShell.
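
Putting the pieces together, an on-demand run from PowerShell might look like the following sketch. It assumes the Az.DataFactory module, an existing factory and pipeline, and a PipelineParameters.json file containing the parameter values shown above; all resource names are placeholders.

# Minimal on-demand run sketch (assumptions: Az.DataFactory installed, factory and pipeline already exist,
# parameter values saved in .\PipelineParameters.json; names are placeholders).
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineName "copyPipeline" `
    -ParameterFile ".\PipelineParameters.json"

# The returned value is the pipeline run ID; use it to check the run status.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineRunId $runId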
.NET SDK
The following sample call shows you how to run your pipeline by using the .NET SDK manually:

client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName, parameters)

For a complete sample, see Quickstart: Create a data factory by using the .NET SDK.

NOTE
You can use the .NET SDK to invoke Data Factory pipelines from Azure Functions, from your web services, and so on.

Trigger execution
Triggers are another way that you can execute a pipeline run. Triggers represent a unit of processing that
determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of
triggers:
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple
triggers can kick off a single pipeline, or a single trigger can kick off multiple pipelines. In the following trigger
definition, the pipelines property refers to a list of pipelines that are triggered by the particular trigger. The
property definition includes values for the pipeline parameters.
Basic trigger definition

{
"properties": {
"name": "MyTrigger",
"type": "<type of trigger>",
"typeProperties": {...},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]
}
}

Schedule trigger
A schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced calendar
options. For example, the trigger supports intervals like "weekly" or "Monday at 5:00 PM and Thursday at 9:00
PM." The schedule trigger is flexible because the dataset pattern is agnostic, and the trigger doesn't discern
between time-series and non-time-series data.
For more information about schedule triggers, and for examples, see Create a trigger that runs a pipeline on a
schedule.

Schedule trigger definition


When you create a schedule trigger, you specify scheduling and recurrence by using a JSON definition.
To have your schedule trigger kick off a pipeline run, include a pipeline reference of the particular pipeline in the
trigger definition. Pipelines and triggers have a many-to-many relationship. Multiple triggers can kick off a
single pipeline. A single trigger can kick off multiple pipelines.
{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Year>>,
"interval": <<int>>, // How often to fire
"startTime": <<datetime>>,
"endTime": <<datetime>>,
"timeZone": "UTC",
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-24>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-60>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]}
}

IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.

Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling a trigger:

JSON PROPERTY    DESCRIPTION

startTime    A date-time value. For basic schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value.

endTime The end date and time for the trigger. The trigger doesn't
execute after the specified end date and time. The value for
the property can't be in the past.

timeZone The time zone. For a list of supported time zones, see Create
a trigger that runs a pipeline on a schedule.

recurrence    A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.

frequency The unit of frequency at which the trigger recurs. The


supported values include "minute", "hour", "day", "week", and
"month".

interval    A positive integer that denotes the interval for the frequency value. The frequency value determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week", the trigger recurs every three weeks.

schedule The recurrence schedule for the trigger. A trigger with a


specified frequency value alters its recurrence based on a
recurrence schedule. The schedule property contains
modifications for the recurrence that are based on minutes,
hours, weekdays, month days, and week number.

Schedule trigger example

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-01T09:00:00-08:00",
"endTime": "2017-11-02T22:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToBlobPipeline"
},
"parameters": {}
},
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToAzureSQLPipeline"
},
"parameters": {}
}
]
}
}
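
If you prefer to manage triggers from Azure PowerShell, a definition like the example above can be deployed and then started from a local JSON file. This is a minimal sketch, assuming the Az.DataFactory module, an existing data factory, and that the trigger JSON is saved as MyTrigger.json; the resource group and factory names are placeholders.

# Minimal sketch (assumptions: Az.DataFactory installed, trigger JSON saved locally as .\MyTrigger.json;
# resource group and factory names are placeholders).
Set-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "MyTrigger" `
    -DefinitionFile ".\MyTrigger.json"

# A trigger doesn't fire until it's started.
Start-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "MyTrigger"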

Schema defaults, limits, and examples


JSON PROPERTY    TYPE      REQUIRED    DEFAULT VALUE    VALID VALUES    EXAMPLE

startTime        string    Yes         None             ISO 8601 date-times    "startTime" : "2013-01-09T09:30:00-08:00"

recurrence       object    Yes         None             A recurrence object    "recurrence" : { "frequency" : "monthly", "interval" : 1 }

interval         number    No          1                1 to 1000              "interval":10

endTime          string    Yes         None             A date-time value that represents a time in the future    "endTime" : "2013-02-09T09:30:00-08:00"

schedule         object    No          None             A schedule object      "schedule" : { "minute" : [30], "hour" : [8,17] }

startTime property
The following table shows you how the startTime property controls a trigger run:

STARTTIME VALUE    RECURRENCE WITHOUT SCHEDULE    RECURRENCE WITH SCHEDULE

Start time is in the past    Calculates the first future execution time after the start time, and runs at that time. Runs subsequent executions calculated from the last execution time. See the example that follows this table.    The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule, calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Start time is in the future or the current time    Runs once at the specified start time. Runs subsequent executions calculated from the last execution time.    The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule, calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Let's look at an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00, the start time is 2017-04-07 14:00, and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is 2017-04-09 at 14:00. The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00 PM. The next instance is two days from
that time, which is on 2017-04-09 at 2:00 PM.
The first execution time is the same whether startTime is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are on 2017-04-11 at 2:00 PM, then on 2017-04-13 at 2:00 PM, then on 2017-04-15 at 2:00 PM, and
so on.
Finally, when hours or minutes aren't set in the schedule for a trigger, the hours or minutes of the first execution
are used as defaults.
schedule property
You can use schedule to limit the number of trigger executions. For example, if a trigger with a monthly
frequency is scheduled to run only on day 31, the trigger runs only in those months that have a thirty-first day.
You can also use schedule to expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the first and second days of the month, rather
than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting: week number, month day, weekday, hour, minute.
The following table describes the schedule elements in detail:

JSON ELEMENT    DESCRIPTION    VALID VALUES

minutes Minutes of the hour at which the - Integer


trigger runs. - Array of integers

hours Hours of the day at which the trigger - Integer


runs. - Array of integers

weekDays Days of the week the trigger runs. The


value can be specified only with a - Monday
weekly frequency. - Tuesday
- Wednesday
- Thursday
- Friday
- Saturday
- Sunday
- Array of day values (maximum array
size is 7)

Day values aren't case-sensitive

monthlyOccurrences Days of the month on which the - Array of monthlyOccurrence


trigger runs. The value can be specified objects:
with a monthly frequency only. { "day": day, "occurrence":
occurrence }
- The day attribute is the day of the
week on which the trigger runs. For
example, a monthlyOccurrences
property with a day value of
{Sunday} means every Sunday of the
month. The day attribute is required.
- The occurrence attribute is the
occurrence of the specified day during
the month. For example, a
monthlyOccurrences property with
day and occurrence values of
{Sunday, -1} means the last Sunday
of the month. The occurrence
attribute is optional.

monthDays Day of the month on which the trigger - Any value <= -1 and >= -31
runs. The value can be specified with a - Any value >= 1 and <= 31
monthly frequency only. - Array of values

Tumbling window trigger


Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals.
For more information about tumbling window triggers, and for examples, see Create a tumbling window trigger.

Examples of trigger recurrence schedules


This section provides examples of recurrence schedules. It focuses on the schedule object and its elements.
The examples assume that the inter val value is 1 and that the frequency value is correct according to the
schedule definition. For example, you can't have a frequency value of "day" and also have a monthDays
modification in the schedule object. These kinds of restrictions are described in the table in the preceding
section.

EXAMPLE | DESCRIPTION
{"hours":[5]} | Run at 5:00 AM every day.
{"minutes":[15], "hours":[5]} | Run at 5:15 AM every day.
{"minutes":[15], "hours":[5,17]} | Run at 5:15 AM and 5:15 PM every day.
{"minutes":[15,45], "hours":[5,17]} | Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.
{"minutes":[0,15,30,45]} | Run every 15 minutes.
{"hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]} | Run every hour. The minutes are controlled by the startTime value, when a value is specified. If a value isn't specified, the minutes are controlled by the creation time. For example, if the start time or creation time (whichever applies) is 12:25 PM, the trigger runs at 00:25, 01:25, 02:25, ..., and 23:25. This schedule is equivalent to having a trigger with a frequency value of "hour", an interval value of 1, and no schedule. This schedule can be used with different frequency and interval values to create other triggers. For example, when the frequency value is "month", the schedule runs only once a month, rather than every day when the frequency value is "day".
{"minutes":[0]} | Run every hour on the hour. This trigger runs every hour on the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so on. This schedule is equivalent to a trigger with a frequency value of "hour" and a startTime value of zero minutes, and no schedule but a frequency value of "day". If the frequency value is "week" or "month", the schedule executes one day a week or one day a month only, respectively.
{"minutes":[15]} | Run at 15 minutes past every hour. This trigger runs every hour at 15 minutes past the hour starting at 12:15 AM, 1:15 AM, 2:15 AM, and so on, and ending at 11:15 PM.
{"hours":[17], "weekDays":["saturday"]} | Run at 5:00 PM on Saturdays every week.
{"hours":[17], "weekDays":["monday", "wednesday", "friday"]} | Run at 5:00 PM on Monday, Wednesday, and Friday every week.
{"minutes":[15,45], "hours":[17], "weekDays":["monday", "wednesday", "friday"]} | Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and Friday every week.
{"minutes":[0,15,30,45], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} | Run every 15 minutes on weekdays.
{"minutes":[0,15,30,45], "hours":[9, 10, 11, 12, 13, 14, 15, 16], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} | Run every 15 minutes on weekdays between 9:00 AM and 4:45 PM.
{"weekDays":["tuesday", "thursday"]} | Run on Tuesdays and Thursdays at the specified start time.
{"minutes":[0], "hours":[6], "monthDays":[28]} | Run at 6:00 AM on the twenty-eighth day of every month (assuming a frequency value of "month").
{"minutes":[0], "hours":[6], "monthDays":[-1]} | Run at 6:00 AM on the last day of the month. To run a trigger on the last day of a month, use -1 instead of day 28, 29, 30, or 31.
{"minutes":[0], "hours":[6], "monthDays":[1,-1]} | Run at 6:00 AM on the first and last day of every month.
{"monthDays":[1,14]} | Run on the first and fourteenth day of every month at the specified start time.
{"minutes":[0], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1}]} | Run on the first Friday of every month at 5:00 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1}]} | Run on the first Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":-3}]} | Run on the third Friday from the end of the month, every month, at the specified start time.
{"minutes":[15], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} | Run on the first and last Friday of every month at 5:15 AM.
{"monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} | Run on the first and last Friday of every month at the specified start time.
{"monthlyOccurrences":[{"day":"friday", "occurrence":5}]} | Run on the fifth Friday of every month at the specified start time. When there's no fifth Friday in a month, the pipeline doesn't run. To run the trigger on the last occurring Friday of the month, consider using -1 instead of 5 for the occurrence value.
{"minutes":[0,15,30,45], "monthlyOccurrences":[{"day":"friday", "occurrence":-1}]} | Run every 15 minutes on the last Friday of the month.
{"minutes":[15,45], "hours":[5,17], "monthlyOccurrences":[{"day":"wednesday", "occurrence":3}]} | Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the third Wednesday of every month.

Trigger type comparison


The tumbling window trigger and the schedule trigger both operate on time heartbeats. How are they different?

NOTE
The tumbling window trigger run waits for the triggered pipeline run to finish. Its run state reflects the state of the
triggered pipeline run. For example, if a triggered pipeline run is cancelled, the corresponding tumbling window trigger run
is marked cancelled. This is different from the "fire and forget" behavior of the schedule trigger, which is marked successful
as long as a pipeline run started.

The following table provides a comparison of the tumbling window trigger and schedule trigger:

ITEM | TUMBLING WINDOW TRIGGER | SCHEDULE TRIGGER
Backfill scenarios | Supported. Pipeline runs can be scheduled for windows in the past. | Not supported. Pipeline runs can be executed only on time periods from the current time and the future.
Reliability | 100% reliability. Pipeline runs can be scheduled for all windows from a specified start date without gaps. | Less reliable.
Retry capability | Supported. Failed pipeline runs have a default retry policy of 0, or a policy that's specified by the user in the trigger definition. Automatically retries when pipeline runs fail due to concurrency/server/throttling limits (that is, status codes 400: User Error, 429: Too many requests, and 500: Internal Server Error). | Not supported.
Concurrency | Supported. Users can explicitly set concurrency limits for the trigger. Allows between 1 and 50 concurrent triggered pipeline runs. | Not supported.
System variables | Along with @trigger().scheduledTime and @trigger().startTime, it also supports the use of the WindowStart and WindowEnd system variables. Users can access trigger().outputs.windowStartTime and trigger().outputs.windowEndTime as trigger system variables in the trigger definition. The values are used as the window start time and window end time, respectively. For example, for a tumbling window trigger that runs every hour, for the window 1:00 AM to 2:00 AM, the definition is trigger().outputs.windowStartTime = 2017-09-01T01:00:00Z and trigger().outputs.windowEndTime = 2017-09-01T02:00:00Z. | Only supports the default @trigger().scheduledTime and @trigger().startTime variables.
Pipeline-to-trigger relationship | Supports a one-to-one relationship. Only one pipeline can be triggered. | Supports many-to-many relationships. Multiple triggers can kick off a single pipeline. A single trigger can kick off multiple pipelines.

Event-based trigger
An event-based trigger runs pipelines in response to an event. There are two flavors of event-based triggers.
A storage event trigger runs a pipeline in response to events in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.
A custom event trigger processes and handles custom topics in Event Grid.
For more information about event-based triggers, see Storage Event Trigger and Custom Event Trigger.
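For orientation, a storage event trigger definition looks roughly like the following sketch, which fires when a .csv blob is created in the storage account identified by scope. The resource IDs, paths, and pipeline name are placeholders, and exact property names may vary slightly by service version:

{
    "name": "NewCsvFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/input/blobs/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccount>",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "ProcessNewFilePipeline"
                }
            }
        ]
    }
}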

Next steps
See the following tutorials:
Quickstart: Create a data factory by using the .NET SDK
Create a schedule trigger
Create a tumbling window trigger
Integration runtime in Azure Data Factory
6/18/2021 • 12 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide the following
data integration capabilities across different network environments:
Data Flow: Execute a Data Flow in a managed Azure compute environment.
Data movement: Copy data across data stores in a public network and data stores in a private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs or gets dispatched from. This way, the activity can be performed in the region closest to the target data store or compute service in the most performant way while meeting security and compliance needs.
Integration runtimes can be created in the Azure Data Factory UX via the management hub and from any activities, datasets, or data flows that reference them.

Integration runtime types


Data Factory offers three types of Integration Runtime (IR), and you should choose the type that best serves the data integration capabilities and network environment needs you're looking for. These three types are:
Azure
Self-hosted
Azure-SSIS
The following table describes the capabilities and network support for each of the integration runtime types:

IR TYPE | PUBLIC NETWORK | PRIVATE NETWORK
Azure | Data Flow, Data movement, Activity dispatch | Data Flow, Data movement, Activity dispatch
Self-hosted | Data movement, Activity dispatch | Data movement, Activity dispatch
Azure-SSIS | SSIS package execution | SSIS package execution

Azure integration runtime


An Azure integration runtime can:
Run Data Flows in Azure
Run copy activity between cloud data stores
Dispatch the following transform activities in public network: Databricks Notebook/ Jar/ Python activity,
HDInsight Hive activity, HDInsight Pig activity, HDInsight MapReduce activity, HDInsight Spark activity,
HDInsight Streaming activity, Azure Machine Learning Studio (classic) Batch Execution activity, Azure Machine
Learning Studio (classic) Update Resource activities, Stored Procedure activity, Data Lake Analytics U-SQL
activity, .NET custom activity, Web activity, Lookup activity, and Get Metadata activity.
Azure IR network environment
Azure Integration Runtime supports connecting to data stores and compute services with publicly accessible endpoints. By enabling Managed Virtual Network, Azure Integration Runtime supports connecting to data stores using a private link service in a private network environment.
Azure IR compute resource and scaling
Azure integration runtime provides a fully managed, serverless compute in Azure. You don't have to worry
about infrastructure provision, software installation, patching, or capacity scaling. In addition, you only pay for
the duration of the actual utilization.
Azure integration runtime provides the native compute to move data between cloud data stores in a secure, reliable, and high-performance manner. You can set how many data integration units to use on the copy activity, and the compute size of the Azure IR is elastically scaled up accordingly without you having to explicitly adjust the size of the Azure Integration Runtime.
Activity dispatch is a lightweight operation to route the activity to the target compute service, so there is no need to scale up the compute size for this scenario.
For information about creating and configuring an Azure IR, see How to create and configure Azure Integration
Runtime.

NOTE
Azure Integration Runtime has properties related to the Data Flow runtime, which define the underlying compute infrastructure used to run data flows.
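For illustration, an Azure IR that customizes the Data Flow runtime looks roughly like the following resource definition. The name and compute values (compute type, core count, time to live) are example settings, and the exact resource shape may differ depending on the authoring tool:

{
    "name": "DataFlowAzureIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 10
                }
            }
        }
    }
}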

Self-hosted integration runtime


A self-hosted IR is capable of:
Running copy activity between cloud data stores and a data store in a private network.
Dispatching the following transform activities against compute resources in on-premises or Azure Virtual
Network: HDInsight Hive activity (BYOC-Bring Your Own Cluster), HDInsight Pig activity (BYOC), HDInsight
MapReduce activity (BYOC), HDInsight Spark activity (BYOC), HDInsight Streaming activity (BYOC), Azure
Machine Learning Studio (classic) Batch Execution activity, Azure Machine Learning Studio (classic) Update
Resource activities, Stored Procedure activity, Data Lake Analytics U-SQL activity, Custom activity (runs on
Azure Batch), Lookup activity, and Get Metadata activity.

NOTE
Use a self-hosted integration runtime to support data stores that require a bring-your-own driver, such as SAP HANA, MySQL, and so on. For more information, see supported data stores.
NOTE
Java Runtime Environment (JRE) is a dependency of the self-hosted IR. Make sure you have JRE installed on the same host.

Self-hosted IR network environment


If you want to perform data integration securely in a private network environment that doesn't have a direct line of sight from the public cloud environment, you can install a self-hosted IR in an on-premises environment behind your corporate firewall, or inside a virtual private network. The self-hosted integration runtime only makes outbound HTTP-based connections to the open internet.
Self-hosted IR compute resource and scaling
Install Self-hosted IR on an on-premises machine or a virtual machine inside a private network. Currently, we
only support running the self-hosted IR on a Windows operating system.
For high availability and scalability, you can scale out the self-hosted IR by associating the logical instance with
multiple on-premises machines in active-active mode. For more information, see the how to create and configure a self-hosted IR article under the how-to guides.
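The logical self-hosted IR resource itself is lightweight. A minimal sketch of its definition is shown below (the name and description are placeholders); after creating the logical IR, you install the self-hosted integration runtime software on one or more Windows machines and register each node with an authentication key, as described in the how-to article:

{
    "name": "MySelfHostedIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Self-hosted IR for data stores behind the corporate firewall"
    }
}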

Azure-SSIS Integration Runtime


To lift and shift existing SSIS workloads, you can create an Azure-SSIS IR to natively execute SSIS packages.
Azure-SSIS IR network environment
Azure-SSIS IR can be provisioned in either a public network or a private network. On-premises data access is supported by joining the Azure-SSIS IR to a virtual network that is connected to your on-premises network.
Azure-SSIS IR compute resource and scaling
Azure-SSIS IR is a fully managed cluster of Azure VMs dedicated to run your SSIS packages. You can bring your
own Azure SQL Database or SQL Managed Instance for the catalog of SSIS projects/packages (SSISDB). You can
scale up the power of the compute by specifying node size and scale it out by specifying the number of nodes in
the cluster. You can manage the cost of running your Azure-SSIS Integration Runtime by stopping and starting it
as you see fit.
For more information, see the how to create and configure an Azure-SSIS IR article under the how-to guides. Once created,
you can deploy and manage your existing SSIS packages with little to no change using familiar tools such as
SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS), just like using SSIS on premises.
For more information about Azure-SSIS runtime, see the following articles:
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses an Azure SQL Database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using SQL Managed Instance and joining the IR to a virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or remove an Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding more nodes to the IR.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.

Integration runtime location


Relationship between factory location and IR location
When a customer creates a data factory instance, they need to specify a location for the data factory. The Data Factory location is where the metadata of the data factory is stored and where the triggering of the pipeline is initiated from. Metadata for the factory is only stored in the region of the customer's choice and will not be stored in other regions.
Meanwhile, a data factory can access data stores and compute services in other Azure regions to move data
between data stores or process data using compute services. This behavior is realized through the globally
available IR to ensure data compliance, efficiency, and reduced network egress costs.
The IR Location defines the location of its back-end compute, and essentially the location where the data
movement, activity dispatching, and SSIS package execution are performed. The IR location can be different
from the location of the data factory it belongs to.
Azure IR location
You can set a certain location of an Azure IR, in which case the activity execution or dispatch will happen in that
specific region.
If you choose to use the auto-resolve Azure IR in public network, which is the default:
For copy activity, ADF makes a best effort to automatically detect your sink data store's location, then uses the IR in either the same region if available or the closest one in the same geography; if the sink data store's region is not detectable, the IR in the data factory region is used as an alternative.
For example, suppose your factory is created in East US:
When copying data to Azure Blob storage in West US, if ADF successfully detects that the Blob storage is in West US, the copy activity is executed on the IR in West US; if the region detection fails, the copy activity is executed on the IR in East US.
When copying data to Salesforce, whose region is not detectable, the copy activity is executed on the IR in East US.

TIP
If you have strict data compliance requirements and need to ensure that data doesn't leave a certain geography, you can explicitly create an Azure IR in a certain region and point the linked service to this IR by using the connectVia property (see the sketch at the end of this section). For example, if you want to copy data from Blob storage in UK South to Azure Synapse Analytics in UK South and want to ensure data doesn't leave the UK, create an Azure IR in UK South and link both linked services to this IR.

For Lookup/GetMetadata/Delete activity execution (also known as Pipeline activities), transformation activity dispatching (also known as External activities), and authoring operations (test connection, browse folder list and table list, preview data), ADF uses the IR in the data factory region.
For Data Flow, ADF uses the IR in the data factory region.

TIP
A good practice is to ensure your data flow runs in the same region as your corresponding data stores (if possible). You can achieve this either with the auto-resolve Azure IR (if the data store location is the same as the Data Factory location), or by creating a new Azure IR instance in the same region as your data stores and then executing the data flow on it.

If you enable Managed Virtual Network for auto-resolve Azure IR, ADF uses the IR in the data factory region.
You can monitor which IR location takes effect during activity execution in pipeline activity monitoring view on
UI or activity monitoring payload.
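As the tip above suggests, you can pin a linked service to an explicitly created regional Azure IR through its connectVia property. The following is a minimal sketch, assuming an Azure IR named AzureIRUKSouth has already been created in UK South; the linked service name and connection string are placeholders:

{
    "name": "UkSouthBlobLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "<connection string or Key Vault reference>"
        },
        "connectVia": {
            "referenceName": "AzureIRUKSouth",
            "type": "IntegrationRuntimeReference"
        }
    }
}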
Self-hosted IR location
The self-hosted IR is logically registered to the Data Factory and the compute used to support its functionalities
is provided by you. Therefore there is no explicit location property for self-hosted IR.
When used to perform data movement, the self-hosted IR extracts data from the source and writes into the
destination.
Azure-SSIS IR location
Selecting the right location for your Azure-SSIS IR is essential to achieve high performance in your extract-
transform-load (ETL) workflows.
The location of your Azure-SSIS IR does not need to be the same as the location of your data factory, but it should be the same as the location of your own Azure SQL Database or SQL Managed Instance that hosts SSISDB. This way, your Azure-SSIS Integration Runtime can easily access SSISDB without incurring excessive traffic between different locations.
If you do not have an existing SQL Database or SQL Managed Instance, but you have on-premises data sources/destinations, you should create a new Azure SQL Database or SQL Managed Instance in the same location as a virtual network connected to your on-premises network. This way, you can create your Azure-SSIS IR using the new Azure SQL Database or SQL Managed Instance and join that virtual network, all in the same location, effectively minimizing data movement across different locations.
If the location of your existing Azure SQL Database or SQL Managed Instance is not the same as the location of a virtual network connected to your on-premises network, first create your Azure-SSIS IR using the existing Azure SQL Database or SQL Managed Instance and join another virtual network in the same location, and then configure a virtual network to virtual network connection between the different locations.
The following diagram shows the location settings of Data Factory and its integration runtimes:

Determining which IR to use


If one data factory activity is associated with more than one type of integration runtime, it resolves to one of them. The self-hosted integration runtime takes precedence over the Azure integration runtime in an Azure Data Factory managed virtual network, and the latter takes precedence over the public Azure integration runtime. For example, suppose one copy activity is used to copy data from a source to a sink. If the public Azure integration runtime is associated with the linked service for the source and an Azure integration runtime in an Azure Data Factory managed virtual network is associated with the linked service for the sink, then both the source and sink linked services use the Azure integration runtime in the Azure Data Factory managed virtual network. But if a self-hosted integration runtime is associated with the linked service for the source, then both the source and sink linked services use the self-hosted integration runtime.
Copy activity
The Copy activity requires source and sink linked services to define the direction of data flow. The following logic is used to determine which integration runtime instance is used to perform the copy:
Copying between two cloud data sources: when both source and sink linked services use an Azure IR, ADF uses the regional Azure IR if you specified one, or automatically determines the location of the Azure IR if you chose the auto-resolve IR (the default), as described in the Integration runtime location section.
Copying between a cloud data source and a data source in a private network: if either the source or sink linked service points to a self-hosted IR, the copy activity is executed on that self-hosted integration runtime.
Copying between two data sources in a private network: both the source and sink linked services must point to the same instance of integration runtime, and that integration runtime is used to execute the copy activity.
Lookup and GetMetadata activity
The Lookup and GetMetadata activities are executed on the integration runtime associated with the data store linked service.
External transformation activity
Each external transformation activity that utilizes an external compute engine has a target compute Linked
Service, which points to an integration runtime. This integration runtime instance determines the location where
that external hand-coded transformation activity is dispatched from.
Data Flow activity
Data Flow activities are executed on the Azure integration runtime associated with them. The Spark compute utilized by data flows is determined by the data flow properties in your Azure integration runtime and is fully managed by ADF.

Next steps
See the following articles:
Create Azure integration runtime
Create self-hosted integration runtime
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides instructions on
using SQL Managed Instance and joining the IR to a virtual network.
Mapping data flows in Azure Data Factory
5/25/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics

What are mapping data flows?


Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data
engineers to develop data transformation logic without writing code. The resulting data flows are executed as
activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can
be operationalized using existing Azure Data Factory scheduling, control, flow, and monitoring capabilities.
Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on ADF-
managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation,
path optimization, and execution of your data flow jobs.

Getting started
Data flows are created from the factory resources pane like pipelines and datasets. To create a data flow, select the plus sign next to Factory Resources, and then select Data Flow.

This action takes you to the data flow canvas, where you can create your transformation logic. Select Add
source to start configuring your source transformation. For more information, see Source transformation.

Authoring data flows


Mapping data flow has a unique authoring canvas designed to make building transformation logic easy. The
data flow canvas is separated into three parts: the top bar, the graph, and the configuration panel.
Graph
The graph displays the transformation stream. It shows the lineage of source data as it flows into one or more
sinks. To add a new source, select Add source . To add a new transformation, select the plus sign on the lower
right of an existing transformation. Learn more on how to manage the data flow graph.

Configuration panel
The configuration panel shows the settings specific to the currently selected transformation. If no transformation
is selected, it shows the data flow. In the overall data flow configuration, you can add parameters via the
Parameters tab. For more information, see Mapping data flow parameters.
Each transformation contains at least four configuration tabs.
Transformation settings
The first tab in each transformation's configuration pane contains the settings specific to that transformation. For
more information, see that transformation's documentation page.

Optimize
The Optimize tab contains settings to configure partitioning schemes. To learn more about how to optimize
your data flows, see the mapping data flow performance guide.

Inspect
The Inspect tab provides a view into the metadata of the data stream that you're transforming. You can see
column counts, the columns changed, the columns added, data types, the column order, and column references.
Inspect is a read-only view of your metadata. You don't need to have debug mode enabled to see metadata in
the Inspect pane.
As you change the shape of your data through transformations, you'll see the metadata changes flow in the
Inspect pane. If there isn't a defined schema in your source transformation, then metadata won't be visible in
the Inspect pane. Lack of metadata is common in schema drift scenarios.
Data preview
If debug mode is on, the Data Preview tab gives you an interactive snapshot of the data at each transform. For
more information, see Data preview in debug mode.
Top bar
The top bar contains actions that affect the whole data flow, like saving and validation. You can view the
underlying JSON code and data flow script of your transformation logic as well. For more information, learn
about the data flow script.

Available transformations
View the mapping data flow transformation overview to get a list of available transformations.

Data flow data types


array
binary
boolean
complex
decimal (includes precision)
date
float
integer
long
map
short
string
timestamp

Data flow activity


Mapping data flows are operationalized within ADF pipelines using the data flow activity. All a user has to do is
specify which integration runtime to use and pass in parameter values. For more information, learn about the
Azure integration runtime.
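For illustration, a data flow activity inside a pipeline definition looks roughly like the following sketch. The activity name, data flow name, and compute values are placeholders and example settings, and exact property names can vary by service version:

{
    "name": "RunMyMappingDataFlow",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "MyMappingDataFlow",
            "type": "DataFlowReference"
        },
        "compute": {
            "computeType": "General",
            "coreCount": 8
        },
        "traceLevel": "Fine"
    }
}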

Debug mode
Debug mode allows you to interactively see the results of each transformation step while you build and debug
your data flows. The debug session can be used both when building your data flow logic and when running pipeline debug runs with data flow activities. To learn more, see the debug mode documentation.

Monitoring data flows


Mapping data flow integrates with existing Azure Data Factory monitoring capabilities. To learn how to
understand data flow monitoring output, see monitoring mapping data flows.
The Azure Data Factory team has created a performance tuning guide to help you optimize the execution time of
your data flows after building your business logic.
Available regions
Mapping data flows are available in the following regions in ADF:

AZURE REGION | DATA FLOWS IN ADF

Australia Central

Australia Central 2

Australia East ✓

Australia Southeast ✓

Brazil South ✓

Canada Central ✓

Central India ✓

Central US ✓

China East

China East 2

China Non-Regional

China North ✓

China North 2 ✓

East Asia ✓

East US ✓

East US 2 ✓

France Central ✓

France South

Germany Central (Sovereign)

Germany Non-Regional (Sovereign)

Germany North (Public)

Germany Northeast (Sovereign)

Germany West Central (Public)

Japan East ✓

Japan West

Korea Central ✓

Korea South

North Central US ✓

North Europe ✓

Norway East ✓

Norway West

South Africa North ✓

South Africa West

South Central US

South India

Southeast Asia ✓

Switzerland North

Switzerland West

UAE Central

UAE North ✓

UK South ✓

UK West

US DoD Central

US DoD East

US Gov Arizona ✓

US Gov Non-Regional

US Gov Texas

US Gov Virginia ✓

West Central US

West Europe ✓

West India

West US ✓

West US 2 ✓

Next steps
Learn how to create a source transformation.
Learn how to build your data flows in debug mode.
Mapping data flow Debug Mode
4/16/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Overview
Azure Data Factory mapping data flow's debug mode allows you to interactively watch the data shape transform
while you build and debug your data flows. The debug session can be used both in Data Flow design sessions as
well as during pipeline debug execution of data flows. To turn on debug mode, use the Data Flow Debug
button in the top bar of data flow canvas or pipeline canvas when you have data flow activities.

Once you turn on the slider, you will be prompted to select which integration runtime configuration you wish to
use. If AutoResolveIntegrationRuntime is chosen, a cluster with eight cores of general compute with a default
60-minute time to live will be spun up. If you'd like to allow for more idle time before your session times out,
you can choose a higher TTL setting. For more information on data flow integration runtimes, see Data flow
performance.

When Debug mode is on, you'll interactively build your data flow with an active Spark cluster. The session will
close once you turn debug off in Azure Data Factory. You should be aware of the hourly charges incurred by
Azure Databricks during the time that you have the debug session turned on.
In most cases, it's a good practice to build your Data Flows in debug mode so that you can validate your
business logic and view your data transformations before publishing your work in Azure Data Factory. Use the
"Debug" button on the pipeline panel to test your data flow in a pipeline.
NOTE
Every debug session that a user starts from their ADF browser UI is a new session with its own Spark cluster. You can use
the monitoring view for debug sessions above to view and manage debug sessions per factory. You are charged for every
hour that each debug session is executing including the TTL time.

Cluster status
The cluster status indicator at the top of the design surface turns green when the cluster is ready for debug. If
your cluster is already warm, then the green indicator will appear almost instantly. If your cluster wasn't already
running when you entered debug mode, then the Spark cluster will perform a cold boot. The indicator will spin
until the environment is ready for interactive debugging.
When you are finished with your debugging, turn the Debug switch off so that your Spark cluster can terminate
and you'll no longer be billed for debug activity.

Debug settings
Once you turn on debug mode, you can edit how a data flow previews data. Debug settings can be edited by
clicking "Debug Settings" on the Data Flow canvas toolbar. You can select the row limit or file source to use for
each of your Source transformations here. The row limits in this setting are only for the current debug session.
You can also select the staging linked service to be used for an Azure Synapse Analytics source.
If you have parameters in your Data Flow or any of its referenced datasets, you can specify what values to use
during debugging by selecting the Parameters tab.
Use the sampling settings here to point to sample files or sample tables of data so that you do not have to
change your source datasets. By using a sample file or table here, you can maintain the same logic and property
settings in your data flow while testing against a subset of data.

The default IR used for debug mode in ADF data flows is a small 4-core single worker node with a 4-core single
driver node. This works fine with smaller samples of data when testing your data flow logic. If you expand the
row limits in your debug settings during data preview or set a higher number of sampled rows in your source
during pipeline debug, then you may wish to consider setting a larger compute environment in a new Azure
Integration Runtime. Then you can restart your debug session using the larger compute environment.

Data preview
With debug on, the Data Preview tab will light-up on the bottom panel. Without debug mode on, Data Flow will
show you only the current metadata in and out of each of your transformations in the Inspect tab. The data
preview will only query the number of rows that you have set as your limit in your debug settings. Click
Refresh to fetch the data preview.
NOTE
File sources only limit the rows that you see, not the rows being read. For very large datasets, it is recommended that you
take a small portion of that file and use it for your testing. You can select a temporary file in Debug Settings for each
source that is a file dataset type.

When running in Debug Mode in Data Flow, your data will not be written to the Sink transform. A Debug session
is intended to serve as a test harness for your transformations. Sinks are not required during debug and are
ignored in your data flow. If you wish to test writing the data in your Sink, execute the Data Flow from an Azure
Data Factory Pipeline and use the Debug execution from a pipeline.
Data Preview is a snapshot of your transformed data using row limits and data sampling from data frames in
Spark memory. Therefore, the sink drivers are not utilized or tested in this scenario.
Testing join conditions
When unit testing Joins, Exists, or Lookup transformations, make sure that you use a small set of known data for
your test. You can use the Debug Settings option above to set a temporary file to use for your testing. This is
needed because when limiting or sampling rows from a large dataset, you cannot predict which rows and which
keys will be read into the flow for testing. The result is non-deterministic, meaning that your join conditions may
fail.
Quick actions
Once you see the data preview, you can generate a quick transformation to typecast, remove, or do a
modification on a column. Click on the column header and then select one of the options from the data preview
toolbar.
Once you select a modification, the data preview will immediately refresh. Click Confirm in the top-right corner
to generate a new transformation.

Typecast and Modify will generate a Derived Column transformation and Remove will generate a Select
transformation.

NOTE
If you edit your Data Flow, you need to re-fetch the data preview before adding a quick transformation.

Data profiling
Selecting a column in your data preview tab and clicking Statistics in the data preview toolbar will pop up a
chart on the far-right of your data grid with detailed statistics about each field. Azure Data Factory will make a
determination based upon the data sampling of which type of chart to display. High-cardinality fields will default
to NULL/NOT NULL charts while categorical and numeric data that has low cardinality will display bar charts
showing data value frequency. You'll also see the maximum length of string fields, min/max values in numeric fields, standard deviation, percentiles, counts, and averages.

Next steps
Once you're finished building and debugging your data flow, execute it from a pipeline.
When testing your pipeline with a data flow, use the pipeline Debug run execution option.
Schema drift in mapping data flow
11/2/2020 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Schema drift is the case where your sources often change metadata. Fields, columns, and types can be added,
removed, or changed on the fly. Without handling for schema drift, your data flow becomes vulnerable to
upstream data source changes. Typical ETL patterns fail when incoming columns and fields change because they
tend to be tied to those source names.
To protect against schema drift, it's important to have the facilities in a data flow tool to allow you, as a Data
Engineer, to:
Define sources that have mutable field names, data types, values, and sizes
Define transformation parameters that can work with data patterns instead of hard-coded fields and values
Define expressions that understand patterns to match incoming fields, instead of using named fields
Azure Data Factory natively supports flexible schemas that change from execution to execution so that you can
build generic data transformation logic without the need to recompile your data flows.
You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When
you do this, you can protect against schema changes from the sources. However, you'll lose early-binding of
your columns and types throughout your data flow. Azure Data Factory treats schema drift flows as late-binding
flows, so when you build your transformations, the drifted column names won't be available to you in the
schema views throughout the flow.
This video provides an introduction to some of the complex solutions that you can build easily in ADF with data
flow's schema drift feature. In this example, we build reusable patterns based on flexible database schemas:

Schema drift in source


Columns coming into your data flow from your source definition are defined as "drifted" when they are not
present in your source projection. You can view your source projection from the projection tab in the source
transformation. When you select a dataset for your source, ADF will automatically take the schema from the
dataset and create a projection from that dataset schema definition.
In a source transformation, schema drift is defined as reading columns that aren't defined in your dataset schema.
To enable schema drift, check Allow schema drift in your source transformation.
When schema drift is enabled, all incoming fields are read from your source during execution and passed
through the entire flow to the Sink. By default, all newly detected columns, known as drifted columns, arrive as a
string data type. If you wish for your data flow to automatically infer data types of drifted columns, check Infer
drifted column types in your source settings.

Schema drift in sink


In a sink transformation, schema drift is when you write additional columns on top of what is defined in the sink
data schema. To enable schema drift, check Allow schema drift in your sink transformation.

If schema drift is enabled, make sure the Auto-mapping slider in the Mapping tab is turned on. With this slider
on, all incoming columns are written to your destination. Otherwise you must use rule-based mapping to write
drifted columns.
Transforming drifted columns
When your data flow has drifted columns, you can access them in your transformations with the following
methods:
Use the byPosition and byName expressions to explicitly reference a column by name or position number.
Add a column pattern in a Derived Column or Aggregate transformation to match on any combination of
name, stream, position, origin, or type
Add rule-based mapping in a Select or Sink transformation to match drifted columns to column aliases via a pattern
For more information on how to implement column patterns, see Column patterns in mapping data flow.
Map drifted columns quick action
To explicitly reference drifted columns, you can quickly generate mappings for these columns via a data preview
quick action. Once debug mode is on, go to the Data Preview tab and click Refresh to fetch a data preview. If
data factory detects that drifted columns exist, you can click Map Drifted and generate a derived column that
allows you to reference all drifted columns in schema views downstream.

In the generated Derived Column transformation, each drifted column is mapped to its detected name and data
type. In the above data preview, the column 'movieId' is detected as an integer. After Map Drifted is clicked,
movieId is defined in the Derived Column as toInteger(byName('movieId')) and included in schema views in
downstream transformations.
Next steps
In the Data Flow Expression Language, you'll find additional facilities for column patterns and schema drift
including "byName" and "byPosition".
Using column patterns in mapping data flow
5/25/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Several mapping data flow transformations allow you to reference template columns based on patterns instead
of hard-coded column names. This matching is known as column patterns. You can define patterns to match
columns based on name, data type, stream, origin, or position instead of requiring exact field names. There are
two scenarios where column patterns are useful:
If incoming source fields change often, as in the case of changing columns in text files or NoSQL databases. This scenario is known as schema drift.
If you wish to do a common operation on a large group of columns. For example, wanting to cast every
column that has 'total' in its column name into a double.

Column patterns in derived column and aggregate


To add a column pattern in a derived column, aggregate, or window transformation, click on Add above the
column list or the plus icon next to an existing derived column. Choose Add column pattern .

Use the expression builder to enter the match condition. Create a boolean expression that matches columns
based on the name , type , stream , origin , and position of the column. The pattern will affect any column,
drifted or defined, where the condition returns true.

The above column pattern matches every column of type double and creates one derived column per match. By stating $$ as the column name field, each matched column is updated with the same name. The value of each column is the existing value rounded to two decimal points.
To verify your matching condition is correct, you can validate the output schema of defined columns in the
Inspect tab or get a snapshot of the data in the Data preview tab.

Hierarchical pattern matching


You can build pattern matching inside of complex hierarchical structures as well. Expand the section
Each MoviesStruct that matches where you will be prompted for each hierarchy in your data stream. You can
then build matching patterns for properties within that chosen hierarchy.

Rule-based mapping in select and sink


When mapping columns in source and select transformations, you can add either fixed mapping or rule-based
mappings. Match based on the name , type , stream , origin , and position of columns. You can have any
combination of fixed and rule-based mappings. By default, all projections with greater than 50 columns will
default to a rule-based mapping that matches on every column and outputs the inputted name.
To add a rule-based mapping, click Add mapping and select Rule-based mapping .

Each rule-based mapping requires two inputs: the condition on which to match by and what to name each
mapped column. Both values are inputted via the expression builder. In the left expression box, enter your
boolean match condition. In the right expression box, specify what the matched column will be mapped to.
Use $$ syntax to reference the input name of a matched column. Using the above image as an example, say a user wants to match on all string columns whose names are shorter than six characters. If one incoming column is named test, the expression $$ + '_short' renames the column to test_short. If that's the only mapping that exists, all columns that don't meet the condition are dropped from the output data.
Patterns match both drifted and defined columns. To see which defined columns are mapped by a rule, click the
eyeglasses icon next to the rule. Verify your output using data preview.
Regex mapping
If you click the downward chevron icon, you can specify a regex-mapping condition. A regex-mapping condition
matches all column names that match the specified regex condition. This can be used in combination with
standard rule-based mappings.

The above example matches on regex pattern (r) or any column name that contains a lower case r. Similar to
standard rule-based mapping, all matched columns are altered by the condition on the right using $$ syntax.
Rule -based hierarchies
If your defined projection has a hierarchy, you can use rule-based mapping to map the hierarchies subcolumns.
Specify a matching condition and the complex column whose subcolumns you wish to map. Every matched
subcolumn will be outputted using the 'Name as' rule specified on the right.

The above example matches on all subcolumns of complex column a . a contains two subcolumns b and c .
The output schema will include two columns b and c as the 'Name as' condition is $$ .

Pattern matching expression values.


$$ translates to the name or value of each match at run time. Think of $$ as equivalent to this .
name represents the name of each incoming column
type represents the data type of each incoming column. The list of data types in the data flow type system
can be found here.
stream represents the name associated with each stream, or transformation in your flow
position is the ordinal position of columns in your data flow
origin is the transformation where a column originated or was last updated

Next steps
Learn more about the mapping data flow expression language for data transformations
Use column patterns in the sink transformation and select transformation with rule-based mapping
Monitor Data Flows
6/23/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


After you have completed building and debugging your data flow, you'll want to schedule it to execute on a schedule within the context of a pipeline. You can schedule the pipeline from Azure Data Factory using triggers. For testing and debugging your data flow from a pipeline, you can use the Debug button on the toolbar ribbon or the Trigger Now option from the Azure Data Factory Pipeline Builder to execute a single-run execution to test your data flow within the pipeline context.

When you execute your pipeline, you can monitor the pipeline and all of the activities contained in the pipeline
including the Data Flow activity. Click on the monitor icon in the left-hand Azure Data Factory UI panel. You can
see a screen similar to the one below. The highlighted icons allow you to drill into the activities in the pipeline,
including the Data Flow activity.

You see statistics at this level as well, including the run times and status. The Run ID at the activity level is different from the Run ID at the pipeline level; the latter is for the pipeline run. Selecting the eyeglasses icon gives you deep details on your data flow execution.

When you're in the graphical node monitoring view, you can see a simplified view-only version of your data flow
graph. To see the details view with larger graph nodes that include transformation stage labels, use the zoom
slider on the right side of your canvas. You can also use the search button on the right side to find parts of your
data flow logic in the graph.
View Data Flow Execution Plans
When your Data Flow is executed in Spark, Azure Data Factory determines optimal code paths based on the
entirety of your data flow. Additionally, the execution paths may occur on different scale-out nodes and data
partitions. Therefore, the monitoring graph represents the design of your flow, taking into account the execution
path of your transformations. When you select individual nodes, you can see "stages" that represent code that
was executed together on the cluster. The timings and counts that you see represent those groups or stages as
opposed to the individual steps in your design.

When you select the open space in the monitoring window, the stats in the bottom pane display timing
and row counts for each Sink and the transformations that led to the sink data for transformation lineage.
When you select individual transformations, you receive additional feedback on the right-hand panel that
shows partition stats, column counts, skewness (how evenly is the data distributed across partitions), and
kurtosis (how spiky is the data).
Sorting by processing time will help you to identify which stages in your data flow took the most time.
To find which transformations inside each stage took the most time, sort on highest processing time.
The rows written is also sortable as a way to identify which streams inside your data flow are writing the
most data.
When you select the Sink in the node view, you can see column lineage. There are three different
methods that columns are accumulated throughout your data flow to land in the Sink. They are:
Computed: You use the column for conditional processing or within an expression in your data flow,
but don't land it in the Sink
Derived: The column is a new column that you generated in your flow, that is, it was not present in the
Source
Mapped: The column originated from the source and you are mapping it to a sink field
Data flow status: The current status of your execution
Cluster startup time: Amount of time to acquire the JIT Spark compute environment for your data flow
execution
Number of transforms: How many transformation steps are being executed in your flow

Total Sink Processing Time vs. Transformation Processing Time


Each transformation stage includes a total time for that stage to complete with each partition execution time
totaled together. When you click on the Sink you will see "Sink Processing Time". This time includes the total of
the transformation time plus the I/O time it took to write your data to your destination store. The difference
between the Sink Processing Time and the total of the transformation is the I/O time to write the data.
You can also see detailed timing for each partition transformation step if you open the JSON output from your
data flow activity in the ADF pipeline monitoring view. The JSON contains millisecond timing for each partition,
whereas the UX monitoring view is an aggregate timing of partitions added together:

{
"stage": 4,
"partitionTimes": [
14353,
14914,
14246,
14912,
...
]
}

Sink processing time


When you select a sink transformation icon in your map, the slide-in panel on the right will show an additional
data point called "post processing time" at the bottom. This is the amount time spent executing your job on the
Spark cluster after your data has been loaded, transformed, and written. This time can include closing
connection pools, driver shutdown, deleting files, coalescing files, etc. When you perform actions in your flow
like "move files" and "output to single file", you will likely see an increase in the post processing time value.
Write stage duration: The time to write the data to a staging location for Synapse SQL
Table operation SQL duration: The time spent moving data from temp tables to target table
Pre SQL duration & Post SQL duration: The time spent running pre/post SQL commands
Pre commands duration & post commands duration: The time spent running any pre/post operations for file
based source/sinks. For example move or delete files after processing.
Merge duration: The time spent merging the file, merge files are used for file based sinks when writing to
single file or when "File name as column data" is used. If significant time is spent in this metric, you should
avoid using these options.
Stage time: Total amount of time spent inside of Spark to complete the operation as a stage.
Temporary staging table: Name of the temporary table used by data flows to stage data in the database.

Error rows
Enabling error row handling in your data flow sink will be reflected in the monitoring output. When you set the
sink to "report success on error", the monitoring output will show the number of success and failed rows when
you click on the sink monitoring node.

When you select "report failure on error", the same output will be shown only in the activity monitoring output
text. This is because the data flow activity will return failure for execution and the detailed monitoring view will
be unavailable.
Monitor Icons
This icon means that the transformation data was already cached on the cluster, so the timings and execution
path have taken that into account:

You also see green circle icons in the transformation. They represent a count of the number of sinks that data is
flowing into.
Mapping data flows performance and tuning guide
6/8/2021 • 21 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Mapping data flows in Azure Data Factory provide a code-free interface to design and run data transformations
at scale. If you're not familiar with mapping data flows, see the Mapping Data Flow Overview. This article
highlights various ways to tune and optimize your data flows so that they meet your performance benchmarks.
Watch the video below to see some sample timings for transforming data with data flows.

Testing data flow logic


When designing and testing data flows from the ADF UX, debug mode allows you to interactively test against a
live Spark cluster. This allows you to preview data and execute your data flows without waiting for a cluster to
warm up. For more information, see Debug Mode.

Monitoring data flow performance


Once you verify your transformation logic using debug mode, run your data flow end-to-end as an activity in a
pipeline. Data flows are operationalized in a pipeline using the execute data flow activity. The data flow activity
has a unique monitoring experience compared to other Azure Data Factory activities that displays a detailed
execution plan and performance profile of the transformation logic. To view detailed monitoring information of
a data flow, click on the eyeglasses icon in the activity run output of a pipeline. For more information, see
Monitoring mapping data flows.

When monitoring data flow performance, there are four possible bottlenecks to look out for:
Cluster start-up time
Reading from a source
Transformation time
Writing to a sink
Cluster start-up time is the time it takes to spin up an Apache Spark cluster. This value is located in the top-right
corner of the monitoring screen. Data flows run on a just-in-time model where each job uses an isolated cluster.
This start-up time generally takes 3-5 minutes. For sequential jobs, this can be reduced by enabling a time to live
value. For more information, see optimizing the Azure Integration Runtime.
Data flows utilize a Spark optimizer that reorders and runs your business logic in 'stages' to perform as quickly
as possible. For each sink that your data flow writes to, the monitoring output lists the duration of each
transformation stage, along with the time it takes to write data into the sink. The stage with the longest duration is likely
the bottleneck of your data flow. If the longest transformation stage contains a source, you
may want to look at further optimizing your read time. If a transformation is taking a long time, you may
need to repartition or increase the size of your integration runtime. If the sink processing time is large, you may
need to scale up your database or verify that you are not outputting to a single file.
Once you have identified the bottleneck of your data flow, use the optimization strategies below to improve
performance.

Optimize tab
The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. This tab exists in
every transformation of a data flow and specifies whether you want to repartition the data after the
transformation has completed. Adjusting the partitioning provides control over the distribution of your data
across compute nodes and data locality optimizations that can have both positive and negative effects on your
overall data flow performance.
By default, Use current partitioning is selected, which instructs Azure Data Factory to keep the current output
partitioning of the transformation. As repartitioning data takes time, Use current partitioning is recommended
in most scenarios. Scenarios where you may want to repartition your data include after aggregates and joins
that significantly skew your data or when using Source partitioning on a SQL DB.
To change the partitioning on any transformation, select the Optimize tab and select the Set Partitioning
radio button. You are presented with a series of options for partitioning. The best method of partitioning differs
based on your data volumes, candidate keys, null values, and cardinality.

IMPORTANT
Single partition combines all the distributed data into a single partition. This is a very slow operation that also significantly
affects all downstream transformations and writes. The Azure Data Factory team highly recommends against using this option
unless there is an explicit business reason to do so.

The following partitioning options are available in every transformation:


Round robin
Round robin distributes data equally across partitions. Use round-robin when you don't have good key
candidates to implement a solid, smart partitioning strategy. You can set the number of physical partitions.
Hash
Azure Data Factory produces a hash of columns to produce uniform partitions such that rows with similar
values fall in the same partition. When you use the Hash option, test for possible partition skew. You can set the
number of physical partitions.
Dynamic range
The dynamic range uses Spark dynamic ranges based on the columns or expressions that you provide. You can
set the number of physical partitions.
Fixed range
Build an expression that provides a fixed range for values within your partitioned data columns. To avoid
partition skew, you should have a good understanding of your data before you use this option. The values you
enter for the expression are used as part of a partition function. You can set the number of physical partitions.
Key
If you have a good understanding of the cardinality of your data, key partitioning might be a good strategy. Key
partitioning creates partitions for each unique value in your column. You can't set the number of partitions
because the number is based on unique values in the data.
TIP
Manually setting the partitioning scheme reshuffles the data and can offset the benefits of the Spark optimizer. A best
practice is to not manually set the partitioning unless you need to.

Logging level
If you do not require every pipeline execution of your data flow activities to fully log all verbose telemetry logs,
you can optionally set your logging level to "Basic" or "None". When executing your data flows in "Verbose"
mode (the default), you are requesting ADF to fully log activity at each individual partition level during your data
transformation. This can be an expensive operation, so enabling verbose logging only when troubleshooting can
improve your overall data flow and pipeline performance. "Basic" mode only logs transformation durations,
while "None" only provides a summary of durations.

Optimizing the Azure Integration Runtime


Data flows run on Spark clusters that are spun up at run-time. The configuration for the cluster used is defined
in the integration runtime (IR) of the activity. There are three performance considerations to make when defining
your integration runtime: cluster type, cluster size, and time to live.
For more information on how to create an integration runtime, see Integration Runtime in Azure Data Factory.
Cluster type
There are three available options for the type of Spark cluster spun up: general purpose, memory optimized, and
compute optimized.
General purpose clusters are the default selection and will be ideal for most data flow workloads. These tend
to be the best balance of performance and cost.
If your data flow has many joins and lookups, you may want to use a memory optimized cluster. Memory
optimized clusters can store more data in memory and will minimize any out-of-memory errors you may get.
Memory optimized clusters have the highest price point per core, but also tend to result in more successful pipelines. If
you experience any out-of-memory errors when executing data flows, switch to a memory optimized Azure IR
configuration.
Compute optimized clusters aren't ideal for ETL workflows and aren't recommended by the Azure Data Factory team
for most production workloads. For simpler, non-memory-intensive data transformations such as filtering data
or adding derived columns, compute optimized clusters can be used at a cheaper price per core.
Cluster size
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in
parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More
nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to
reduce the processing time.
The default cluster size is four driver nodes and four worker nodes. As you process more data, larger clusters are
recommended. Below are the possible sizing options:

WORKER CORES    DRIVER CORES    TOTAL CORES    NOTES
4               4               8              Not available for compute optimized
8               8               16
16              16              32
32              16              48
64              16              80
128             16              144
256             16              272

Data flows are priced at vcore-hours, meaning that both cluster size and execution time factor into the cost. As you
scale up, your cluster cost per minute will increase, but your overall execution time will decrease.
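As an illustrative calculation (not a price quote): a run that finishes in 30 minutes on 16 cores consumes 8 vcore-hours, the same as an 8-core run that takes a full hour, so doubling the cores is roughly cost-neutral whenever it comes close to halving the execution time.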

TIP
There is a ceiling on how much the size of a cluster affects the performance of a data flow. Depending on the size of your
data, there is a point where increasing the size of a cluster will stop improving performance. For example, if you have
more nodes than partitions of data, adding additional nodes won't help. A best practice is to start small and scale up to
meet your performance needs.

Time to live
By default, every data flow activity spins up a new Spark cluster based upon the Azure IR configuration. Cold
cluster start-up time takes a few minutes and data processing can't start until it is complete. If your pipelines
contain multiple sequential data flows, you can enable a time to live (TTL) value. Specifying a time to live value
keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR
during the TTL time, it will reuse the existing cluster and start-up time will be greatly reduced. After the second job
completes, the cluster will again stay alive for the TTL time.
You can additionally minimize the start-up time of warm clusters by setting the "Quick re-use" option in the
Azure Integration runtime under Data Flow Properties. Setting this to true tells ADF not to tear down the
existing cluster after each job and instead re-use it, essentially keeping the compute
environment you've set in your Azure IR alive for up to the period of time specified in your TTL. This option
makes for the shortest start-up time of your data flow activities when executing from a pipeline.
However, if most of your data flows execute in parallel, it is not recommended that you enable TTL for the IR that
you use for those activities. Only one job can run on a single cluster at a time. If there is an available cluster, but
two data flows start, only one will use the live cluster. The second job will spin up its own isolated cluster.

NOTE
Time to live is not available when using the auto-resolve integration runtime

Optimizing sources
For every source except Azure SQL Database, it is recommended that you keep Use current partitioning as
the selected value. When reading from all other source systems, data flows automatically partition data evenly
based upon the size of the data. A new partition is created for about every 128 MB of data. As your data size
increases, the number of partitions increases.
Any custom partitioning happens after Spark reads in the data and will negatively impact your data flow
performance. As the data is evenly partitioned on read, custom partitioning of sources is not recommended.

NOTE
Read speeds can be limited by the throughput of your source system.

Azure SQL Database sources


Azure SQL Database has a unique partitioning option called 'Source' partitioning. Enabling source partitioning
can improve your read times from Azure SQL DB by enabling parallel connections on the source system. Specify
the number of partitions and how to partition your data. Use a partition column with high cardinality. You can
also enter a query that matches the partitioning scheme of your source table.
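As a minimal sketch (the table and column names below are hypothetical, not from this article), a source query can pre-filter the table while still exposing the high-cardinality column that source partitioning will split on:

-- Hypothetical table; OrderID serves as the high-cardinality partition column.
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.SalesOrders
WHERE OrderDate >= '2021-01-01';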

TIP
For source partitioning, the I/O of the SQL Server is the bottleneck. Adding too many partitions may saturate your
source database. Generally, four or five partitions are ideal when using this option.

Isolation level
The isolation level of the read on an Azure SQL source system has an impact on performance. Choosing 'Read
uncommitted' will provide the fastest performance and prevent any database locks. To learn more about SQL
Isolation levels, please see Understanding isolation levels.
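For context, this setting maps to the standard SQL Server isolation levels. A minimal T-SQL sketch of what 'Read uncommitted' means outside of a data flow (the table and column names are hypothetical):

-- Read without taking shared locks; uncommitted ("dirty") rows may be returned.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT OrderID, Amount FROM dbo.SalesOrders;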
Read using query
You can read from Azure SQL Database using a table or a SQL query. If you are executing a SQL query, the
query must complete before transformation can start. SQL queries can be useful for pushing down operations
such as SELECT, WHERE, and JOIN clauses that may execute faster in the database and reduce the amount of
data read from SQL Server. When pushing down operations, you lose the ability to track lineage and performance of the
transformations before the data comes into the data flow.
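As a hedged example (table and column names are hypothetical), a query like the following pushes the filter and join down to the database so that only the reduced result set is read into the data flow:

-- Filter and join in the database; only the reduced result set reaches the data flow.
SELECT o.OrderID, o.OrderDate, o.Amount, c.Region
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
  ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '2021-01-01';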
Azure Synapse Analytics sources
When using Azure Synapse Analytics, a setting called Enable staging exists in the source options. This allows
ADF to read from Synapse using staging, which greatly improves read performance. Enabling staging
requires you to specify an Azure Blob Storage or Azure Data Lake Storage Gen2 staging location in the data flow
activity settings.
File-based sources
While data flows support a variety of file types, the Azure Data Factory team recommends using the Spark-native
Parquet format for optimal read and write times.
If you're running the same data flow on a set of files, we recommend reading from a folder, using wildcard paths
or reading from a list of files. A single data flow activity run can process all of your files in batch. More
information on how to set these settings can be found in the connector documentation such as Azure Blob
Storage.
If possible, avoid using the For-Each activity to run data flows over a set of files. This will cause each iteration of
the for-each to spin up its own Spark cluster, which is often not necessary and can be expensive.

Optimizing sinks
When data flows write to sinks, any custom partitioning will happen immediately before the write. Like the
source, in most cases it is recommended that you keep Use current partitioning as the selected partition
option. Partitioned data will write significantly quicker than unpartitioned data, even if your destination is not
partitioned. Below are the individual considerations for various sink types.
Azure SQL Database sinks
With Azure SQL Database, the default partitioning should work in most cases. There is a chance that your sink
may have too many partitions for your SQL database to handle. If you are running into this, reduce the number
of partitions outputted by your SQL Database sink.
Impact of error row handling to performance
When you enable error row handling ("continue on error") in the sink transformation, ADF takes an
additional step before writing the compatible rows to your destination table. This additional step has a
small performance penalty, in the range of 5%, with another small performance hit added if you also set the
option to write the incompatible rows to a log file.
Disabling indexes using a SQL Script
Disabling indexes before a load in a SQL database can greatly improve performance of writing to the table. Run
the below command before writing to your SQL sink.
ALTER INDEX ALL ON dbo.[Table Name] DISABLE

After the write has completed, rebuild the indexes using the following command:
ALTER INDEX ALL ON dbo.[Table Name] REBUILD

These can both be done natively using Pre and Post-SQL scripts within an Azure SQL DB or Synapse sink in
mapping data flows.
WARNING
When disabling indexes, the data flow is effectively taking control of a database and queries are unlikely to succeed at this
time. As a result, many ETL jobs are triggered in the middle of the night to avoid this conflict. For more information, learn
about the constraints of disabling indexes

Scaling up your database


Schedule a resizing of your source and sink Azure SQL DB and DW before your pipeline run to increase the
throughput and minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete,
resize your databases back to their normal run rate.
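For an Azure SQL Database, one way to script the resize is with T-SQL before and after the run; the database name and service objectives below are hypothetical and depend on your own sizing:

-- Before the pipeline run: scale up for higher throughput.
ALTER DATABASE [SalesDb] MODIFY (SERVICE_OBJECTIVE = 'P2');

-- After the pipeline run: scale back down to the normal tier.
ALTER DATABASE [SalesDb] MODIFY (SERVICE_OBJECTIVE = 'S3');

The scale operation completes asynchronously, so schedule it far enough ahead of the data flow activity for the new tier to take effect.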
Azure Synapse Analytics sinks
When writing to Azure Synapse Analytics, make sure that Enable staging is set to true. This enables ADF to
write using the SQL COPY command, which effectively loads the data in bulk. You will need to reference an Azure
Data Lake Storage Gen2 or Azure Blob Storage account for staging of the data when using staging.
Other than Staging, the same best practices apply to Azure Synapse Analytics as Azure SQL Database.
File-based sinks
While data flows support a variety of file types, the Azure Data Factory team recommends using the Spark-native
Parquet format for optimal read and write times.
If the data is evenly distributed, Use current partitioning will be the fastest partitioning option for writing
files.
File name options
When writing files, you have a choice of naming options that each have a performance impact.
Selecting the Default option will write the fastest. Each partition will equate to a file with the Spark default
name. This is useful if you are just reading from the folder of data.
Setting a naming Pattern will rename each partition file to a more user-friendly name. This operation happens
after write and is slightly slower than choosing the default. Per partition allows you to name each individual
partition manually.
If a column corresponds to how you wish to output the data, you can select As data in column. This reshuffles
the data and can impact performance if the columns are not evenly distributed.
Output to single file combines all the data into a single partition. This leads to long write times, especially for
large datasets. The Azure Data Factory team highly recommends not choosing this option unless there is an
explicit business reason to do so.
CosmosDB sinks
When writing to CosmosDB, altering throughput and batch size during data flow execution can improve
performance. These changes only take effect during the data flow activity run and will return to the original
collection settings after conclusion.
Batch size: Usually, starting with the default batch size is sufficient. To further tune this value, calculate the
rough object size of your data, and make sure that object size * batch size is less than 2 MB. If it is, you can
increase the batch size to get better throughput (see the worked example after this list).
Throughput: Set a higher throughput setting here to allow documents to write faster to CosmosDB. Keep in
mind the higher RU costs based upon a high throughput setting.
Write throughput budget: Use a value which is smaller than total RUs per minute. If you have a data flow
with a high number of Spark partitions, setting a budget throughput will allow more balance across those
partitions.
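As a worked example of the batch size guidance above: if the average document in your dataset is roughly 4 KB, a batch size of 400 gives 4 KB × 400 ≈ 1.6 MB, which stays under the 2 MB guideline, while documents that average around 1 KB could support a batch size closer to 2,000.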

Optimizing transformations
Optimizing Joins, Exists, and Lookups
Broadcasting
In joins, lookups, and exists transformations, if one or both data streams are small enough to fit into worker
node memory, you can optimize performance by enabling Broadcasting . Broadcasting is when you send small
data frames to all nodes in the cluster. This allows for the Spark engine to perform a join without reshuffling the
data in the large stream. By default, the Spark engine will automatically decide whether or not to broadcast one
side of a join. If you are familiar with your incoming data and know that one stream will be significantly smaller
than the other, you can select Fixed broadcasting. Fixed broadcasting forces Spark to broadcast the selected
stream.
If the size of the broadcasted data is too large for the Spark node, you may get an out-of-memory error. To avoid
out-of-memory errors, use memory optimized clusters. If you experience broadcast timeouts during data flow
executions, you can switch off the broadcast optimization. However, this will result in slower-performing data
flows.
When working with data sources that can take longer to query, like large database queries, it is recommended to
turn broadcast off for joins. Sources with long query times can cause Spark timeouts when the cluster attempts
to broadcast to compute nodes. Another good choice for turning off broadcast is when you have a stream in
your data flow that is aggregating values for use in a lookup transformation later. This pattern can confuse the
Spark optimizer and cause timeouts.

Cross joins
If you use literal values in your join conditions or have multiple matches on both sides of a join, Spark will run
the join as a cross join. A cross join is a full cartesian product that then filters out the joined values. This is
significantly slower than other join types. Ensure that you have column references on both sides of your join
conditions to avoid the performance impact.
Sorting before joins
Unlike merge join in tools like SSIS, the join transformation isn't a mandatory merge join operation. The join
keys don't require sorting prior to the transformation. The Azure Data Factory team doesn't recommend using
Sort transformations in mapping data flows.
Window transformation performance
The Window transformation partitions your data by the values in the columns that you select as part of the over()
clause in the transformation settings. A number of very popular aggregate and analytical functions
are exposed in the Window transformation. However, if your use case is to generate a window over your
entire dataset for the purpose of ranking ( rank() ) or row numbering ( rowNumber() ), it is recommended that you
instead use the Rank transformation and the Surrogate Key transformation. Those transformations perform
better against full dataset operations using those functions.
Repartitioning skewed data
Certain transformations such as joins and aggregates reshuffle your data partitions and can occasionally lead to
skewed data. Skewed data means that data is not evenly distributed across the partitions. Heavily skewed data
can lead to slower downstream transformations and sink writes. You can check the skewness of your data at any
point in a data flow run by clicking on the transformation in the monitoring display.
The monitoring display shows how the data is distributed across each partition along with two metrics,
skewness and kurtosis. Skewness is a measure of how asymmetrical the data is and can have a positive, zero,
negative, or undefined value. Negative skew means the left tail is longer than the right. Kurtosis is the measure
of whether the data is heavy-tailed or light-tailed. High kurtosis values are not desirable. Ideal values of
skewness lie between -3 and 3, and ideal values of kurtosis are less than 10. An easy way to interpret these numbers
is to look at the partition chart and see whether one bar is significantly larger than the rest.
If your data is not evenly partitioned after a transformation, you can use the optimize tab to repartition.
Reshuffling data takes time and may not improve your data flow performance.

TIP
If you repartition your data, but have downstream transformations that reshuffle your data, use hash partitioning on a
column used as a join key.
Using data flows in pipelines
When building complex pipelines with multiple data flows, your logical flow can have a big impact on timing
and cost. This section covers the impact of different architecture strategies.
Executing data flows in parallel
If you execute multiple data flows in parallel, ADF spins up separate Spark clusters for each activity. This allows
for each job to be isolated and run in parallel, but will lead to multiple clusters running at the same time.
If your data flows execute in parallel, it's recommended to not enable the Azure IR time to live property, as it will
lead to multiple unused warm pools.

TIP
Instead of running the same data flow multiple times in a for each activity, stage your data in a data lake and use wildcard
paths to process the data in a single data flow.

Execute data flows sequentially


If you execute your data flow activities in sequence, it is recommended that you set a TTL in the Azure IR
configuration. ADF will reuse the compute resources, resulting in a faster cluster start-up time. Each activity will
still be isolated and will receive a new Spark context for each execution. To reduce the time between sequential activities
even more, set the "quick re-use" checkbox on the Azure IR to tell ADF to re-use the existing cluster.
Overloading a single data flow
If you put all of your logic inside of a single data flow, ADF will execute the entire job on a single Spark instance.
While this may seem like a way to reduce costs, it mixes together different logical flows and can be difficult to
monitor and debug. If one component fails, all other parts of the job will fail as well. The Azure Data Factory
team recommends organizing data flows by independent flows of business logic. If your data flow becomes too
large, splitting it into separate components will make monitoring and debugging easier. While there is no hard
limit on the number of transformations in a data flow, having too many will make the job complex.
Execute sinks in parallel
The default behavior of data flow sinks is to execute each sink sequentially, in a serial manner, and to fail the data
flow when an error is encountered in the sink. Additionally, all sinks are defaulted to the same group unless you
go into the data flow properties and set different priorities for the sinks.
Data flows allow you to group sinks together into groups from the data flow properties tab in the UI designer.
You can both set the order of execution of your sinks as well as group sinks together using the same group
number. To help manage groups, you can ask ADF to run sinks in the same group in parallel.
On the pipeline execute data flow activity, under the "Sink Properties" section there is an option to turn on parallel sink
loading. When you enable "run in parallel", you are instructing data flows to write to connected sinks at the same
time rather than in a sequential manner. In order to utilize the parallel option, the sinks must be grouped together
and connected to the same stream via a New Branch or Conditional Split.

Next steps
See other Data Flow articles related to performance:
Data Flow activity
Monitor Data Flow performance
Managing the mapping data flow graph

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Mapping data flows are authored using a design surface known as the data flow graph. In the graph,
transformation logic is built left-to-right and additional data streams are added top-down. To add a new
transformation, select the plus sign on the lower right of an existing transformation.

As your data flows get more complex, use the following mechanisms to effectively navigate and manage the
data flow graph.

Moving transformations
In mapping data flows, a set of connected transformation logic is known as a stream . The Incoming stream
field dictates which data stream is feeding the current transformation. Each transformation has one or two
incoming streams depending on its function and represents an output stream. The output schema of the
incoming streams determines which column metadata can be referenced by the current transformation.
Unlike the pipeline canvas, data flow transformations aren't edited using a drag-and-drop model. To change the
incoming stream of, or "move", a transformation, choose a different value from the Incoming stream
dropdown. When you do this, all downstream transformations will move alongside the edited transformation.
The graph will automatically update to show the new logical flow. If you change the incoming stream to a
transformation that already has a downstream transformation, a new branch or parallel data stream will be
created. Learn more about new branches in mapping data flow.

Hide graph and show graph


When editing your transformation, you can expand the configuration panel to take up the entire canvas, hiding
the graph. Click on the upward-facing chevron located on the right side of the canvas.
When the graph is hidden, you can move between transformations within a stream by clicking Next or
Previous . Click the downward-facing chevron to show the graph.

Searching for transformations


To quickly find a transformation in your graph, click on the Search icon above the zoom setting.
You can search by transformation name or description to locate a transformation.

Hide reference nodes


If your data flow has any join, lookup, exists, or union transformations, data flow shows reference nodes to all
incoming streams. If you wish to minimize the amount of vertical space taken, you can minimize your reference
nodes. To do so, right-click on the canvas and select Hide reference nodes.
Next steps
After completing your data flow logic, turn on debug mode and test it out in a data preview.
Build expressions in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In mapping data flow, many transformation properties are entered as expressions. These expressions are
composed of column values, parameters, functions, operators, and literals that evaluate to a Spark data type at
run time. Mapping data flows have a dedicated experience aimed at helping you build these expressions, called
the Expression Builder. Utilizing IntelliSense code completion for highlighting, syntax checking, and
autocompleting, the expression builder is designed to make building data flows easy. This article explains how to
use the expression builder to effectively build your business logic.

Open Expression Builder


There are multiple entry points to opening the expression builder. These are all dependent on the specific
context of the data flow transformation. The most common use case is in transformations like derived column
and aggregate where users create or update columns using the data flow expression language. The expression
builder can be opened by selecting Open expression builder above the list of columns. You can also click on a
column context and open the expression builder directly to that expression.

In some transformations like filter, clicking on a blue expression text box will open the expression builder.
When you reference columns in a matching or group-by condition, an expression can extract values from
columns. To create an expression, select Computed column .

In cases where an expression or a literal value are valid inputs, select Add dynamic content to build an
expression that evaluates to a literal value.

Expression elements
In mapping data flows, expressions can be composed of column values, parameters, functions, local variables,
operators, and literals. These expressions must evaluate to a Spark data type such as string, boolean, or integer.

Functions
Mapping data flows has built-in functions and operators that can be used in expressions. For a list of available
functions, see the mapping data flow language reference.
Address array indexes
When dealing with columns or functions that return array types, use brackets ([]) to access a specific element. If
the index doesn't exist, the expression evaluates into NULL.
IMPORTANT
In mapping data flows, arrays are one-based meaning the first element is referenced by index one. For example,
myArray[1] will access the first element of an array called 'myArray'.

Input schema
If your data flow uses a defined schema in any of its sources, you can reference a column by name in many
expressions. If you are utilizing schema drift, you can reference columns explicitly using the byName() or
byNames() functions or match using column patterns.

Column names with special characters


When you have column names that include special characters or spaces, surround the name with curly braces to
reference them in an expression.
{[dbo].this_is my complex name$$$}

Parameters
Parameters are values that are passed into a data flow at run time from a pipeline. To reference a parameter,
either click on the parameter from the Expression elements view or reference it with a dollar sign in front of
its name. For example, a parameter called parameter1 would be referenced by $parameter1 . To learn more, see
parameterizing mapping data flows.
Cached lookup
A cached lookup allows you to do an inline lookup of the output of a cached sink. There are two functions
available to use on each sink, lookup() and outputs() . The syntax to reference these functions is
cacheSinkName#functionName() . For more information, see cache sinks.

lookup() takes in the matching columns in the current transformation as parameters and returns a complex
column equal to the row matching the key columns in the cache sink. The complex column returned contains a
subcolumn for each column mapped in the cache sink. For example, if you have an error code cache sink
errorCodeCache that has a key column matching on the code and a column called Message, calling
errorCodeCache#lookup(errorCode).Message would return the message corresponding to the code passed in.

outputs() takes no parameters and returns the entire cache sink as an array of complex columns. This can't be
called if key columns are specified in the sink and should only be used if there is a small number of rows in the
cache sink. A common use case is appending the max value of an incrementing key. If a cached single
aggregated row CacheMaxKey contains a column MaxKey , you can reference the first value by calling
CacheMaxKey#outputs()[1].MaxKey .
Locals
If you are sharing logic across multiple columns or want to compartmentalize your logic, you can create a local
within a derived column. To reference a local, either click on the local from the Expression elements view or
reference it with a colon in front of its name. For example, a local called local1 would be referenced by :local1 .
Learn more about locals in the derived column documentation.

Preview expression results


If debug mode is switched on, you can interactively use the debug cluster to preview what your expression
evaluates to. Select Refresh next to data preview to update the results of the data preview. You can see the
output of each row given the input columns.

String interpolation
When creating long strings that use expression elements, use string interpolation to easily build up complex
string logic. String interpolation avoids extensive use of string concatenation when parameters are included in
query strings. Use double quotation marks to enclose literal string text together with expressions. You can
include expression functions, columns, and parameters. To use expression syntax, enclose it in curly braces.
Some examples of string interpolation:
"My favorite movie is {iif(instr(title,', The')>0,"The {split(title,', The')[1]}",title)}"

"select * from {$tablename} where orderyear > {$year}"

"Total cost with sales tax is {round(totalcost * 1.08,2)}"

"{:playerName} is a {:playerRating} player"

NOTE
When using string interpolation syntax in SQL source queries, the query string must be on one single line, without '\n'.

Commenting expressions
Add comments to your expressions by using single-line and multiline comment syntax.
The following examples are valid comments:
/* This is my comment */

/* This is a
multi-line comment */

If you put a comment at the top of your expression, it appears in the transformation text box to document your
transformation expressions.

Regular expressions
Many expression language functions use regular expression syntax. When you use regular expression functions,
Expression Builder tries to interpret a backslash (\) as an escape character sequence. When you use backslashes
in your regular expression, either enclose the entire regex in backticks (`) or use a double backslash.
An example that uses backticks:

regex_replace('100 and 200', `(\d+)`, 'digits')

An example that uses double slashes:

regex_replace('100 and 200', '(\\d+)', 'digits')

Keyboard shortcuts
Below is a list of shortcuts available in the expression builder. Most IntelliSense shortcuts are available when
creating expressions.
Ctrl+K Ctrl+C: Comment entire line.
Ctrl+K Ctrl+U: Uncomment.
F1: Provide editor help commands.
Alt+Down arrow key: Move down current line.
Alt+Up arrow key: Move up current line.
Ctrl+Spacebar: Show context help.

Commonly used expressions


Convert to dates or timestamps
To include string literals in your timestamp output, wrap your conversion in toString() .
toString(toTimestamp('12/31/2016T00:12:00', 'MM/dd/yyyy\'T\'HH:mm:ss'), 'MM/dd/yyyy\'T\'HH:mm:ss')

To convert milliseconds from epoch to a date or timestamp, use toTimestamp(<number of milliseconds>) . If time
is coming in seconds, multiply by 1,000.
toTimestamp(1574127407*1000l)

The trailing "l" at the end of the previous expression signifies conversion to a long type as inline syntax.
Find time from epoch or Unix Time
toLong( currentTimestamp() - toTimestamp('1970-01-01 00:00:00.000', 'yyyy-MM-dd HH:mm:ss.SSS') ) * 1000l
Data flow time evaluation
Data flows process timestamps down to the millisecond. For 2018-07-31T20:00:00.2170000, you will see 2018-07-31T20:00:00.217
in the output. In the ADF portal, the timestamp is shown in your current browser setting, which can hide the 217,
but when you run the data flow end to end, the milliseconds part (217) is processed as well. You can use
toString(myDateTimeColumn) as an expression to see the full-precision data in the preview. Process datetime as datetime
rather than as string for all practical purposes.

Next steps
Begin building data transformation expressions
Data transformation expressions in mapping data
flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Expression functions
In Data Factory, use the expression language of the mapping data flow feature to configure data
transformations.

abs

abs(<value1> : number) => number

Absolute value of a number.


abs(-20) -> 20
abs(10) -> 10

acos

acos(<value1> : number) => double

Calculates a cosine inverse value.


acos(1) -> 0.0

add

add(<value1> : any, <value2> : any) => any

Adds a pair of strings or numbers. Adds a date to a number of days. Adds a duration to a timestamp. Appends
one array of similar type to another. Same as the + operator.
add(10, 20) -> 30
10 + 20 -> 30
add('ice', 'cream') -> 'icecream'
'ice' + 'cream' + ' cone' -> 'icecream cone'
add(toDate('2012-12-12'), 3) -> toDate('2012-12-15')
toDate('2012-12-12') + 3 -> toDate('2012-12-15')
[10, 20] + [30, 40] -> [10, 20, 30, 40]
toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS') + (days(1) + hours(2) - seconds(10)) ->
toTimestamp('2019-02-04 07:19:18.871', 'yyyy-MM-dd HH:mm:ss.SSS')

addDays

addDays(<date/timestamp> : datetime, <days to add> : integral) => datetime

Add days to a date or timestamp. Same as the + operator for date.


addDays(toDate('2016-08-08'), 1) -> toDate('2016-08-09')

addMonths

addMonths(<date/timestamp> : datetime, <months to add> : integral, [<value3> : string]) => datetime

Add months to a date or timestamp. You can optionally pass a timezone.


addMonths(toDate('2016-08-31'), 1) -> toDate('2016-09-30')
addMonths(toTimestamp('2016-09-30 10:10:10'), -1) -> toTimestamp('2016-08-31 10:10:10')

and

and(<value1> : boolean, <value2> : boolean) => boolean

Logical AND operator. Same as &&.


and(true, false) -> false
true && false -> false

asin

asin(<value1> : number) => double

Calculates an inverse sine value.


asin(0) -> 0.0

atan

atan(<value1> : number) => double

Calculates an inverse tangent value.


atan(0) -> 0.0

atan2

atan2(<value1> : number, <value2> : number) => double

Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates.
atan2(0, 0) -> 0.0

between

between(<value1> : any, <value2> : any, <value3> : any) => boolean

Checks if the first value is in between two other values inclusively. Numeric, string, and datetime values can be
compared.
between(10, 5, 24)
true
between(currentDate(), currentDate() + 10, currentDate() + 20)
false

bitwiseAnd

bitwiseAnd(<value1> : integral, <value2> : integral) => integral

Bitwise And operator across integral types. Same as & operator


bitwiseAnd(0xf4, 0xef)
0xe4
(0xf4 & 0xef)
0xe4

bitwiseOr

bitwiseOr(<value1> : integral, <value2> : integral) => integral

Bitwise Or operator across integral types. Same as | operator


bitwiseOr(0xf4, 0xef)
0xff
(0xf4 | 0xef)
0xff

bitwiseXor

bitwiseXor(<value1> : any, <value2> : any) => any

Bitwise Xor operator across integral types. Same as the ^ operator.


bitwiseXor(0xf4, 0xef)
0x1b
(0xf4 ^ 0xef)
0x1b
(true ^ false)
true
(true ^ true)
false

blake2b

blake2b(<value1> : integer, <value2> : any, ...) => string

Calculates the Blake2 digest of a set of columns of varying primitive datatypes, given a bit length which can only be
a multiple of 8 between 8 and 512. It can be used to calculate a fingerprint for a row.
blake2b(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4'))
'c9521a5080d8da30dffb430c50ce253c345cc4c4effc315dab2162dac974711d'

blake2bBinary

blake2bBinary(<value1> : integer, <value2> : any, ...) => binary

Calculates the Blake2 digest of a set of columns of varying primitive datatypes, given a bit length which can only be
a multiple of 8 between 8 and 512. It can be used to calculate a fingerprint for a row.
blake2bBinary(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4'))
unHex('c9521a5080d8da30dffb430c50ce253c345cc4c4effc315dab2162dac974711d')

case

case(<condition> : boolean, <true_expression> : any, <false_expression> : any, ...) => any

Based on alternating conditions, applies one value or the other. If the number of inputs is even, the other is
defaulted to NULL for the last condition.
case(10 + 20 == 30, 'dumbo', 'gumbo') -> 'dumbo'
case(10 + 20 == 25, 'bojjus', 'do' < 'go', 'gunchus') -> 'gunchus'
isNull(case(10 + 20 == 25, 'bojjus', 'do' > 'go', 'gunchus')) -> true
case(10 + 20 == 25, 'bojjus', 'do' > 'go', 'gunchus', 'dumbo') -> 'dumbo'

cbrt

cbrt(<value1> : number) => double

Calculates the cube root of a number.


cbrt(8) -> 2.0

ceil

ceil(<value1> : number) => number

Returns the smallest integer not smaller than the number.


ceil(-0.1) -> 0

coalesce

coalesce(<value1> : any, ...) => any

Returns the first not null value from a set of inputs. All inputs should be of the same type.
coalesce(10, 20) -> 10
coalesce(toString(null), toString(null), 'dumbo', 'bo', 'go') -> 'dumbo'

columnNames

columnNames(<value1> : string) => array

Gets the names of all output columns for a stream. You can pass an optional stream name as the second
argument.
columnNames()
columnNames('DeriveStream')

columns

columns([<stream name> : string]) => any

Gets the values of all output columns for a stream. You can pass an optional stream name as the second
argument.
columns()
columns('DeriveStream')

compare

compare(<value1> : any, <value2> : any) => integer

Compares two values of the same type. Returns negative integer if value1 < value2, 0 if value1 == value2,
positive value if value1 > value2.
(compare(12, 24) < 1) -> true
(compare('dumbo', 'dum') > 0) -> true

concat

concat(<this> : string, <that> : string, ...) => string

Concatenates a variable number of strings together. Same as the + operator with strings.
concat('dataflow', 'is', 'awesome') -> 'dataflowisawesome'
'dataflow' + 'is' + 'awesome' -> 'dataflowisawesome'
isNull('sql' + null) -> true

concatWS

concatWS(<separator> : string, <this> : string, <that> : string, ...) => string

Concatenates a variable number of strings together with a separator. The first parameter is the separator.
concatWS(' ', 'dataflow', 'is', 'awesome') -> 'dataflow is awesome'
isNull(concatWS(null, 'dataflow', 'is', 'awesome')) -> true
concatWS(' is ', 'dataflow', 'awesome') -> 'dataflow is awesome'

cos

cos(<value1> : number) => double

Calculates a cosine value.


cos(10) -> -0.8390715290764524
cosh

cosh(<value1> : number) => double

Calculates a hyperbolic cosine of a value.


cosh(0) -> 1.0

crc32

crc32(<value1> : any, ...) => long

Calculates the CRC32 hash of a set of columns of varying primitive datatypes, given a bit length which can only be
of values 0 (256), 224, 256, 384, or 512. It can be used to calculate a fingerprint for a row.
crc32(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 3630253689L

currentDate

currentDate([<value1> : string]) => date

Gets the current date when this job starts to run. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default. Refer to Java's SimpleDateFormat class for
available formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
currentDate() == toDate('2250-12-31') -> false
currentDate('PST') == toDate('2250-12-31') -> false
currentDate('America/New_York') == toDate('2250-12-31') -> false

currentTimestamp

currentTimestamp() => timestamp

Gets the current timestamp when the job starts to run with local time zone.
currentTimestamp() == toTimestamp('2250-12-31 12:12:12') -> false

currentUTC

currentUTC([<value1> : string]) => timestamp

Gets the current timestamp as UTC. If you want your current time to be interpreted in a different timezone than
your cluster time zone, you can pass an optional timezone in the form of 'GMT', 'PST', 'UTC', 'America/Cayman'.
It is defaulted to the current timezone. Refer to Java's SimpleDateFormat class for available formats.
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html. To convert the UTC time to a
different timezone use fromUTC() .
currentUTC() == toTimestamp('2050-12-12 19:18:12') -> false
currentUTC() != toTimestamp('2050-12-12 19:18:12') -> true
fromUTC(currentUTC(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true

dayOfMonth

dayOfMonth(<value1> : datetime) => integer

Gets the day of the month given a date.


dayOfMonth(toDate('2018-06-08')) -> 8

dayOfWeek

dayOfWeek(<value1> : datetime) => integer

Gets the day of the week given a date. 1 - Sunday, 2 - Monday ..., 7 - Saturday.
dayOfWeek(toDate('2018-06-08')) -> 6

dayOfYear

dayOfYear(<value1> : datetime) => integer

Gets the day of the year given a date.


dayOfYear(toDate('2016-04-09')) -> 100

days

days(<value1> : integer) => long

Duration in milliseconds for number of days.


days(2) -> 172800000L

degrees

degrees(<value1> : number) => double

Converts radians to degrees.


degrees(3.141592653589793) -> 180
divide

divide(<value1> : any, <value2> : any) => any

Divides pair of numbers. Same as the / operator.


divide(20, 10) -> 2
20 / 10 -> 2

dropLeft

dropLeft(<value1> : string, <value2> : integer) => string

Removes as many characters from the left of the string. If the drop requested exceeds the length of the string, an
empty string is returned.
dropLeft('bojjus', 2) => 'jjus'
dropLeft('cake', 10) => ''

dropRight

dropRight(<value1> : string, <value2> : integer) => string

Removes as many characters from the right of the string. If the drop requested exceeds the length of the string,
an empty string is returned.
dropRight('bojjus', 2) => 'bojj'
dropRight('cake', 10) => ''

endsWith

endsWith(<string> : string, <substring to check> : string) => boolean

Checks if the string ends with the supplied string.


endsWith('dumbo', 'mbo') -> true

equals

equals(<value1> : any, <value2> : any) => boolean

Comparison equals operator. Same as == operator.


equals(12, 24) -> false
12 == 24 -> false
'bad' == 'bad' -> true
isNull('good' == toString(null)) -> true
isNull(null == null) -> true

equalsIgnoreCase

equalsIgnoreCase(<value1> : string, <value2> : string) => boolean

Comparison equals operator ignoring case. Same as <=> operator.


'abc'<=>'Abc' -> true
equalsIgnoreCase('abc', 'Abc') -> true

escape

escape(<string_to_escape> : string, <format> : string) => string

Escapes a string according to a format. Literal values for acceptable format are 'json', 'xml', 'ecmascript', 'html',
'java'.

expr

expr(<expr> : string) => any

Results in an expression from a string. This is the same as writing the expression in a non-literal form. It can be
used to pass parameters as string representations.
expr('price * discount') => any

factorial

factorial(<value1> : number) => long

Calculates the factorial of a number.


factorial(5) -> 120

false

false() => boolean

Always returns a false value. Use the function syntax(false()) if there is a column named 'false'.
(10 + 20 > 30) -> false
(10 + 20 > 30) -> false()
floor

floor(<value1> : number) => number

Returns the largest integer not greater than the number.


floor(-0.1) -> -1

fromBase64

fromBase64(<value1> : string) => string

Decodes the given base64-encoded string.


fromBase64('Z3VuY2h1cw==') -> 'gunchus'

fromUTC

fromUTC(<value1> : timestamp, [<value2> : string]) => timestamp

Converts to the timestamp from UTC. You can optionally pass the timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone. Refer to Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
fromUTC(currentTimestamp()) == toTimestamp('2050-12-12 19:18:12') -> false
fromUTC(currentTimestamp(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true

greater

greater(<value1> : any, <value2> : any) => boolean

Comparison greater operator. Same as > operator.


greater(12, 24) -> false
('dumbo' > 'dum') -> true
(toTimestamp('2019-02-05 08:21:34.890', 'yyyy-MM-dd HH:mm:ss.SSS') > toTimestamp('2019-02-03
05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS')) -> true

greaterOrEqual

greaterOrEqual(<value1> : any, <value2> : any) => boolean

Comparison greater than or equal operator. Same as >= operator.


greaterOrEqual(12, 12) -> true
('dumbo' >= 'dum') -> true

greatest

greatest(<value1> : any, ...) => any

Returns the greatest value among the list of values as input skipping null values. Returns null if all inputs are
null.
greatest(10, 30, 15, 20) -> 30
greatest(10, toInteger(null), 20) -> 20
greatest(toDate('2010-12-12'), toDate('2011-12-12'), toDate('2000-12-12')) -> toDate('2011-12-12')
greatest(toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS'), toTimestamp('2019-02-05
08:21:34.890', 'yyyy-MM-dd HH:mm:ss.SSS')) -> toTimestamp('2019-02-05 08:21:34.890', 'yyyy-MM-dd
HH:mm:ss.SSS')

hasColumn

hasColumn(<column name> : string, [<stream name> : string]) => boolean

Checks for a column value by name in the stream. You can pass an optional stream name as the second
argument. Column names known at design time should be addressed just by their name. Computed inputs are
not supported, but you can use parameter substitutions.
hasColumn('parent')

hour

hour(<value1> : timestamp, [<value2> : string]) => integer

Gets the hour value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer to Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
hour(toTimestamp('2009-07-30 12:58:59')) -> 12
hour(toTimestamp('2009-07-30 12:58:59'), 'PST') -> 12

hours

hours(<value1> : integer) => long

Duration in milliseconds for number of hours.


hours(2) -> 7200000L
iif

iif(<condition> : boolean, <true_expression> : any, [<false_expression> : any]) => any

Based on a condition, applies one value or the other. If the other is unspecified, it is considered NULL. Both values
must be compatible (numeric, string...).
iif(10 + 20 == 30, 'dumbo', 'gumbo') -> 'dumbo'
iif(10 > 30, 'dumbo', 'gumbo') -> 'gumbo'
iif(month(toDate('2018-12-01')) == 12, 345.12, 102.67) -> 345.12

iifNull

iifNull(<value1> : any, [<value2> : any], ...) => any

Checks if the first parameter is null. If not null, the first parameter is returned. If null, the second parameter is
returned. If three parameters are specified, the behavior is the same as iif(isNull(value1), value2, value3) and the
third parameter is returned if the first value is not null.
iifNull(10, 20) -> 10
iifNull(null, 20, 40) -> 20
iifNull('azure', 'data', 'factory') -> 'factory'
iifNull(null, 'data', 'factory') -> 'data'

initCap

initCap(<value1> : string) => string

Converts the first letter of every word to uppercase. Words are identified as separated by whitespace.
initCap('cool iceCREAM') -> 'Cool Icecream'

instr

instr(<string> : string, <substring to find> : string) => integer

Finds the position(1 based) of the substring within a string. 0 is returned if not found.
instr('dumbo', 'mbo') -> 3
instr('microsoft', 'o') -> 5
instr('good', 'bad') -> 0

isDelete

isDelete([<value1> : integer]) => boolean

Checks if the row is marked for delete. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isDelete()
isDelete(1)

isError

isError([<value1> : integer]) => boolean

Checks if the row is marked as error. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isError()
isError(1)

isIgnore

isIgnore([<value1> : integer]) => boolean

Checks if the row is marked to be ignored. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isIgnore()
isIgnore(1)

isInsert

isInsert([<value1> : integer]) => boolean

Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isInsert()
isInsert(1)

isMatch

isMatch([<value1> : integer]) => boolean

Checks if the row is matched at lookup. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isMatch()
isMatch(1)

isNull

isNull(<value1> : any) => boolean

Checks if the value is NULL.


isNull(NULL()) -> true
isNull('') -> false

isUpdate

isUpdate([<value1> : integer]) => boolean

Checks if the row is marked for update. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isUpdate()
isUpdate(1)

isUpsert

isUpsert([<value1> : integer]) => boolean

Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isUpsert()
isUpsert(1)

jaroWinkler

jaroWinkler(<value1> : string, <value2> : string) => double

Gets the JaroWinkler distance between two strings.


jaroWinkler('frog', 'frog') => 1.0

lastDayOfMonth

lastDayOfMonth(<value1> : datetime) => date

Gets the last date of the month given a date.


lastDayOfMonth(toDate('2009-01-12')) -> toDate('2009-01-31')

least

least(<value1> : any, ...) => any

Returns the smallest value among the list of input values, skipping null values.


least(10, 30, 15, 20) -> 10
least(toDate('2010-12-12'), toDate('2011-12-12'), toDate('2000-12-12')) -> toDate('2000-12-12')

left

left(<string to subset> : string, <number of characters> : integral) => string

Extracts a substring starting at index 1 with the specified number of characters. Same as SUBSTRING(str, 1, n).
left('bojjus', 2) -> 'bo'
left('bojjus', 20) -> 'bojjus'

length

length(<value1> : string) => integer

Returns the length of the string.


length('dumbo') -> 5

lesser

lesser(<value1> : any, <value2> : any) => boolean

Comparison less operator. Same as < operator.


lesser(12, 24) -> true
('abcd' < 'abc') -> false
(toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS') < toTimestamp('2019-02-05
08:21:34.890', 'yyyy-MM-dd HH:mm:ss.SSS')) -> true

lesserOrEqual

lesserOrEqual(<value1> : any, <value2> : any) => boolean

Comparison lesser than or equal operator. Same as <= operator.


lesserOrEqual(12, 12) -> true
('dumbo' <= 'dum') -> false

levenshtein

levenshtein(<from string> : string, <to string> : string) => integer

Gets the levenshtein distance between two strings.


levenshtein('boys', 'girls') -> 4

like

like(<string> : string, <pattern match> : string) => boolean

The pattern is a string that is matched literally. The exceptions are the following special symbols: _ matches any
one character in the input (similar to . in posix regular expressions); % matches zero or more characters in the
input (similar to .* in posix regular expressions). The escape character is '\'. If an escape character precedes a
special symbol or another escape character, the following character is matched literally. It is invalid to escape any
other character.
like('icecream', 'ice%') -> true

locate

locate(<substring to find> : string, <string> : string, [<from index - 1-based> : integral]) => integer

Finds the position (1-based) of the substring within a string, starting at a certain position. If the position is omitted, it
is considered from the beginning of the string. 0 is returned if not found.
locate('mbo', 'dumbo') -> 3
locate('o', 'microsoft', 6) -> 7
locate('bad', 'good') -> 0

log

log(<value1> : number, [<value2> : number]) => double

Calculates a log value. An optional base can be supplied; otherwise Euler's number is used.
log(100, 10) -> 2

log10

log10(<value1> : number) => double

Calculates log value based on 10 base.


log10(100) -> 2

lower

lower(<value1> : string) => string

Lowercases a string.
lower('GunChus') -> 'gunchus'

lpad

lpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string

Left pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is trimmed to the length.
lpad('dumbo', 10, '-') -> '-----dumbo'
lpad('dumbo', 4, '-') -> 'dumb'
lpad('dumbo', 8, '<>') -> '<><dumbo'

ltrim

ltrim(<string to trim> : string, [<trim characters> : string]) => string

Left trims a string of leading characters. If second parameter is unspecified, it trims whitespace. Else it trims any
character specified in the second parameter.
ltrim(' dumbo ') -> 'dumbo '
ltrim('!--!du!mbo!', '-!') -> 'du!mbo!'

md5

md5(<value1> : any, ...) => string

Calculates the MD5 digest of set of column of varying primitive datatypes and returns a 32 character hex string.
It can be used to calculate a fingerprint for a row.
md5(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '4ce8a880bd621a1ffad0bca905e1bc5a'

millisecond
millisecond(<value1> : timestamp, [<value2> : string]) => integer

Gets the millisecond value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer to Java's SimpleDateFormat class for available
formats: https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
millisecond(toTimestamp('2009-07-30 12:58:59.871', 'yyyy-MM-dd HH:mm:ss.SSS')) -> 871

milliseconds

milliseconds(<value1> : integer) => long

Duration in milliseconds for number of milliseconds.


milliseconds(2) -> 2L

minus

minus(<value1> : any, <value2> : any) => any

Subtracts numbers. Can also subtract a number of days from a date, subtract a duration from a timestamp, or
subtract two timestamps to get the difference in milliseconds. Same as the - operator.
minus(20, 10) -> 10
20 - 10 -> 10
minus(toDate('2012-12-15'), 3) -> toDate('2012-12-12')
toDate('2012-12-15') - 3 -> toDate('2012-12-12')
toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS') + (days(1) + hours(2) - seconds(10)) ->
toTimestamp('2019-02-04 07:19:18.871', 'yyyy-MM-dd HH:mm:ss.SSS')
toTimestamp('2019-02-03 05:21:34.851', 'yyyy-MM-dd HH:mm:ss.SSS') - toTimestamp('2019-02-03
05:21:36.923', 'yyyy-MM-dd HH:mm:ss.SSS') -> -2072

minute

minute(<value1> : timestamp, [<value2> : string]) => integer

Gets the minute value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer to Java's SimpleDateFormat class for available
formats: https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
minute(toTimestamp('2009-07-30 12:58:59')) -> 58
minute(toTimestamp('2009-07-30 12:58:59'), 'PST') -> 58

minutes

minutes(<value1> : integer) => long

Duration in milliseconds for number of minutes.


minutes(2) -> 120000L

mod

mod(<value1> : any, <value2> : any) => any

Modulus of pair of numbers. Same as the % operator.


mod(20, 8) -> 4
20 % 8 -> 4

month

month(<value1> : datetime) => integer

Gets the month value of a date or timestamp.


month(toDate('2012-8-8')) -> 8

monthsBetween

monthsBetween(<from date/timestamp> : datetime, <to date/timestamp> : datetime, [<roundoff> : boolean], [


<time zone> : string]) => double

Gets the number of months between two dates. You can round off the calculation. You can pass an optional
timezone in the form of 'GMT', 'PST', 'UTC', 'America/Cayman'. The local timezone is used as the default. Refer to
Java's SimpleDateFormat class for available formats:
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
monthsBetween(toTimestamp('1997-02-28 10:30:00'), toDate('1996-10-30')) -> 3.94959677

multiply

multiply(<value1> : any, <value2> : any) => any

Multiplies pair of numbers. Same as the * operator.


multiply(20, 10) -> 200
20 * 10 -> 200
negate

negate(<value1> : number) => number

Negates a number. Turns positive numbers to negative and vice versa.


negate(13) -> -13

nextSequence

nextSequence() => long

Returns the next unique sequence. The number is consecutive only within a partition and is prefixed by the
partitionId.
nextSequence() == 12313112 -> false

normalize

normalize(<String to normalize> : string) => string

Normalizes the string value to separate accented unicode characters.


regexReplace(normalize('boýs'), `\p{M}`, '') -> 'boys'

not

not(<value1> : boolean) => boolean

Logical negation operator.


not(true) -> false
not(10 == 20) -> true

notEquals

notEquals(<value1> : any, <value2> : any) => boolean

Comparison not equals operator. Same as != operator.


12 != 24 -> true
'bojjus' != 'bo' + 'jjus' -> false

notNull

notNull(<value1> : any) => boolean

Checks if the value is not NULL.


notNull(NULL()) -> false
notNull('') -> true

null

null() => null

Returns a NULL value. Use the function syntax(null()) if there is a column named 'null'. Any operation that
uses it will result in a NULL.
isNull('dumbo' + null) -> true
isNull(10 * null) -> true
isNull('') -> false
isNull(10 + 20) -> false
isNull(10/0) -> true

or

or(<value1> : boolean, <value2> : boolean) => boolean

Logical OR operator. Same as ||.


or(true, false) -> true
true || false -> true

pMod

pMod(<value1> : any, <value2> : any) => any

Positive Modulus of pair of numbers.


pmod(-20, 8) -> 4

partitionId

partitionId() => integer

Returns the current partition ID the input row is in.


partitionId()
power

power(<value1> : number, <value2> : number) => double

Raises one number to the power of another.


power(10, 2) -> 100

radians

radians(<value1> : number) => double

Converts degrees to radians


radians(180) => 3.141592653589793

random

random(<value1> : integral) => long

Returns a random number given an optional seed within a partition. The seed should be a fixed value and is
used in conjunction with the partitionId to produce random values
random(1) == 1 -> false

regexExtract

regexExtract(<string> : string, <regex to find> : string, [<match group 1-based index> : integral]) => string

Extract a matching substring for a given regex pattern. The last parameter identifies the match group and is
defaulted to 1 if omitted. Use <regex> (back quote) to match a string without escaping.
regexExtract('Cost is between 600 and 800 dollars', '(\\d+) and (\\d+)', 2) -> '800'
regexExtract('Cost is between 600 and 800 dollars', `(\d+) and (\d+)`, 2) -> '800'

regexMatch

regexMatch(<string> : string, <regex to match> : string) => boolean

Checks if the string matches the given regex pattern. Use <regex> (back quote) to match a string without
escaping.
regexMatch('200.50', '(\\d+).(\\d+)') -> true
regexMatch('200.50', `(\d+).(\d+)`) -> true

regexReplace

regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string

Replaces all occurrences of a regex pattern with another substring in the given string. Use <regex> (back quote) to
match a string without escaping.
regexReplace('100 and 200', '(\\d+)', 'bojjus') -> 'bojjus and bojjus'
regexReplace('100 and 200', `(\d+)`, 'gunchus') -> 'gunchus and gunchus'

regexSplit

regexSplit(<string to split> : string, <regex expression> : string) => array

Splits a string based on a regex delimiter and returns an array of strings.
regexSplit('bojjusAgunchusBdumbo', `[CAB]`) -> ['bojjus', 'gunchus', 'dumbo']
regexSplit('bojjusAgunchusBdumboC', `[CAB]`) -> ['bojjus', 'gunchus', 'dumbo', '']
(regexSplit('bojjusAgunchusBdumboC', `[CAB]`)[1]) -> 'bojjus'
isNull(regexSplit('bojjusAgunchusBdumboC', `[CAB]`)[20]) -> true

replace

replace(<string> : string, <substring to find> : string, [<substring to replace> : string]) => string

Replaces all occurrences of a substring with another substring in the given string. If the last parameter is omitted,
it defaults to an empty string.
replace('doggie dog', 'dog', 'cat') -> 'catgie cat'
replace('doggie dog', 'dog', '') -> 'gie '
replace('doggie dog', 'dog') -> 'gie '

reverse

reverse(<value1> : string) => string

Reverses a string.
reverse('gunchus') -> 'suhcnug'

right

right(<string to subset> : string, <number of characters> : integral) => string


Extracts a substring with number of characters from the right. Same as SUBSTRING(str, LENGTH(str) - n, n).
right('bojjus', 2) -> 'us'
right('bojjus', 20) -> 'bojjus'

rlike

rlike(<string> : string, <pattern match> : string) => boolean

Checks if the string matches the given regex pattern.


rlike('200.50', `(\d+).(\d+)`) -> true
rlike('bogus', `M[0-9]+.*`) -> false

round

round(<number> : number, [<scale to round> : number], [<rounding option> : integral]) => double

Rounds a number given an optional scale and an optional rounding mode. If the scale is omitted, it defaults
to 0. If the mode is omitted, it defaults to ROUND_HALF_UP (5). The rounding modes are: 1 - ROUND_UP,
2 - ROUND_DOWN, 3 - ROUND_CEILING, 4 - ROUND_FLOOR, 5 - ROUND_HALF_UP, 6 - ROUND_HALF_DOWN,
7 - ROUND_HALF_EVEN, 8 - ROUND_UNNECESSARY.
round(100.123) -> 100.0
round(2.5, 0) -> 3.0
round(5.3999999999999995, 2, 7) -> 5.40
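As a hypothetical illustration of another rounding mode (not in the original reference), ROUND_DOWN (2) truncates toward zero at the given scale:
round(2.56, 1, 2) -> 2.5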

rpad

rpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string

Right pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater
than the length, then it is trimmed to the length.
rpad('dumbo', 10, '-') -> 'dumbo-----'
rpad('dumbo', 4, '-') -> 'dumb'
rpad('dumbo', 8, '<>') -> 'dumbo<><'

rtrim

rtrim(<string to trim> : string, [<trim characters> : string]) => string

Right trims a string of trailing characters. If second parameter is unspecified, it trims whitespace. Else it trims any
character specified in the second parameter.
rtrim(' dumbo ') -> ' dumbo'
rtrim('!--!du!mbo!', '-!') -> '!--!du!mbo'

second

second(<value1> : timestamp, [<value2> : string]) => integer

Gets the second value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer to Java's SimpleDateFormat class for available
formats: https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
second(toTimestamp('2009-07-30 12:58:59')) -> 59

seconds

seconds(<value1> : integer) => long

Duration in milliseconds for number of seconds.


seconds(2) -> 2000L

sha1

sha1(<value1> : any, ...) => string

Calculates the SHA-1 digest of a set of columns of varying primitive datatypes and returns a 40-character hex
string. It can be used to calculate a fingerprint for a row.
sha1(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '46d3b478e8ec4e1f3b453ac3d8e59d5854e282bb'

sha2

sha2(<value1> : integer, <value2> : any, ...) => string

Calculates the SHA-2 digest of a set of columns of varying primitive datatypes given a bit length, which can only be
one of 0 (256), 224, 256, 384, or 512. It can be used to calculate a fingerprint for a row.
sha2(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) ->
'afe8a553b1761c67d76f8c31ceef7f71b66a1ee6f4e6d3b5478bf68b47d06bd3'

sin

sin(<value1> : number) => double

Calculates a sine value.


sin(2) -> 0.9092974268256817

sinh

sinh(<value1> : number) => double

Calculates a hyperbolic sine value.


sinh(0) -> 0.0

soundex

soundex(<value1> : string) => string

Gets the soundex code for the string.


soundex('genius') -> 'G520'

split

split(<string to split> : string, <split characters> : string) => array

Splits a string based on a delimiter and returns an array of strings.


split('bojjus,guchus,dumbo', ',') -> ['bojjus', 'guchus', 'dumbo']
split('bojjus,guchus,dumbo', '|') -> ['bojjus,guchus,dumbo']
split('bojjus, guchus, dumbo', ', ') -> ['bojjus', 'guchus', 'dumbo']
split('bojjus, guchus, dumbo', ', ')[1] -> 'bojjus'
isNull(split('bojjus, guchus, dumbo', ', ')[0]) -> true
isNull(split('bojjus, guchus, dumbo', ', ')[20]) -> true
split('bojjusguchusdumbo', ',') -> ['bojjusguchusdumbo']

sqrt

sqrt(<value1> : number) => double

Calculates the square root of a number.


sqrt(9) -> 3

startsWith

startsWith(<string> : string, <substring to check> : string) => boolean

Checks if the string starts with the supplied string.


startsWith('dumbo', 'du') -> true

subDays

subDays(<date/timestamp> : datetime, <days to subtract> : integral) => datetime

Subtract days from a date or timestamp. Same as the - operator for date.
subDays(toDate('2016-08-08'), 1) -> toDate('2016-08-07')

subMonths

subMonths(<date/timestamp> : datetime, <months to subtract> : integral) => datetime

Subtract months from a date or timestamp.


subMonths(toDate('2016-09-30'), 1) -> toDate('2016-08-31')

substring

substring(<string to subset> : string, <from 1-based index> : integral, [<number of characters> : integral])
=> string

Extracts a substring of a certain length from a position. The position is 1-based. If the length is omitted, it defaults
to the end of the string.
substring('Cat in the hat', 5, 2) -> 'in'
substring('Cat in the hat', 5, 100) -> 'in the hat'
substring('Cat in the hat', 5) -> 'in the hat'
substring('Cat in the hat', 100, 100) -> ''

tan

tan(<value1> : number) => double

Calculates a tangent value.


tan(0) -> 0.0

tanh

tanh(<value1> : number) => double

Calculates a hyperbolic tangent value.


tanh(0) -> 0.0

translate

translate(<string to translate> : string, <lookup characters> : string, <replace characters> : string) =>
string

Replaces one set of characters with another set of characters in the string. Characters have a 1-to-1 replacement.
translate('(bojjus)', '()', '[]') -> '[bojjus]'
translate('(gunchus)', '()', '[') -> '[gunchus'

trim

trim(<string to trim> : string, [<trim characters> : string]) => string

Trims a string of leading and trailing characters. If second parameter is unspecified, it trims whitespace. Else it
trims any character specified in the second parameter.
trim(' dumbo ') -> 'dumbo'
trim('!--!du!mbo!', '-!') -> 'du!mbo'

true

true() => boolean

Always returns a true value. Use the function syntax(true()) if there is a column named 'true'.
(10 + 20 == 30) -> true
(10 + 20 == 30) -> true()

typeMatch

typeMatch(<type> : string, <base type> : string) => boolean

Matches the type of the column. Can only be used in pattern expressions. number matches short, integer, long,
double, float, or decimal; integral matches short, integer, or long; fractional matches double, float, or decimal; and
datetime matches the date or timestamp type.
typeMatch(type, 'number')
typeMatch('date', 'datetime')

unescape

unescape(<string_to_escape> : string, <format> : string) => string

Unescapes a string according to a format. Literal values for acceptable format are 'json', 'xml', 'ecmascript',
'html', 'java'.
unescape('{\\\\\"value\\\\\": 10}', 'json')
'{\\\"value\\\": 10}'

upper

upper(<value1> : string) => string

Uppercases a string.
upper('bojjus') -> 'BOJJUS'

uuid

uuid() => string

Returns the generated UUID.


uuid()

weekOfYear

weekOfYear(<value1> : datetime) => integer

Gets the week of the year given a date.


weekOfYear(toDate('2008-02-20')) -> 8

weeks

weeks(<value1> : integer) => long

Duration in milliseconds for number of weeks.


weeks(2) -> 1209600000L

xor

xor(<value1> : boolean, <value2> : boolean) => boolean

Logical XOR operator. Same as ^ operator.


xor(true, false) -> true
xor(true, true) -> false
true ^ false -> true

year

year(<value1> : datetime) => integer

Gets the year value of a date.


year(toDate('2012-8-8')) -> 2012

Aggregate functions
The following functions are only available in aggregate, pivot, unpivot, and window transformations.

approxDistinctCount

approxDistinctCount(<value1> : any, [ <value2> : double ]) => long

Gets the approximate aggregate count of distinct values for a column. The optional second parameter is to
control the estimation error.
approxDistinctCount(ProductID, .05) => long

avg

avg(<value1> : number) => number

Gets the average of values of a column.


avg(sales)

avgIf

avgIf(<value1> : boolean, <value2> : number) => number

Based on a criterion, gets the average of values of a column.


avgIf(region == 'West', sales)

collect

collect(<value1> : any) => array

Collects all values of the expression in the aggregated group into an array. Structures can be collected and
transformed to alternate structures during this process. The number of items will be equal to the number of
rows in that group and can contain null values. The number of collected items should be small.
collect(salesPerson)
collect(firstName + lastName)
collect(@(name = salesPerson, sales = salesAmount) )

count

count([<value1> : any]) => long

Gets the aggregate count of values. If the optional column(s) is specified, it ignores NULL values in the count.
count(custId)
count(custId, custName)
count()
count(iif(isNull(custId), 1, NULL))

countDistinct

countDistinct(<value1> : any, [<value2> : any], ...) => long

Gets the aggregate count of distinct values of a set of columns.


countDistinct(custId, custName)

countIf

countIf(<value1> : boolean, [<value2> : any]) => long

Based on a criterion, gets the aggregate count of values. If the optional column is specified, it ignores NULL values
in the count.
countIf(state == 'CA' && commission < 10000, name)

covariancePopulation

covariancePopulation(<value1> : number, <value2> : number) => double

Gets the population covariance between two columns.


covariancePopulation(sales, profit)
covariancePopulationIf

covariancePopulationIf(<value1> : boolean, <value2> : number, <value3> : number) => double

Based on a criterion, gets the population covariance of two columns.


covariancePopulationIf(region == 'West', sales, profit)

covarianceSample

covarianceSample(<value1> : number, <value2> : number) => double

Gets the sample covariance of two columns.


covarianceSample(sales, profit)

covarianceSampleIf

covarianceSampleIf(<value1> : boolean, <value2> : number, <value3> : number) => double

Based on a criterion, gets the sample covariance of two columns.


covarianceSampleIf(region == 'West', sales, profit)

first

first(<value1> : any, [<value2> : boolean]) => any

Gets the first value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false.
first(sales)
first(sales, false)

isDistinct

isDistinct(<value1> : any, <value2> : any) => boolean

Finds if a column or set of columns is distinct. It does not count null as a distinct value.
isDistinct(custId, custName) => boolean

kurtosis

kurtosis(<value1> : number) => double

Gets the kurtosis of a column.


kurtosis(sales)

kurtosisIf

kurtosisIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the kurtosis of a column.


kurtosisIf(region == 'West', sales)

last

last(<value1> : any, [<value2> : boolean]) => any

Gets the last value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false.
last(sales)
last(sales, false)

max

max(<value1> : any) => any

Gets the maximum value of a column.


max(sales)

maxIf

maxIf(<value1> : boolean, <value2> : any) => any

Based on a criterion, gets the maximum value of a column.


maxIf(region == 'West', sales)

mean

mean(<value1> : number) => number

Gets the mean of values of a column. Same as AVG.


mean(sales)

meanIf
meanIf(<value1> : boolean, <value2> : number) => number

Based on a criterion, gets the mean of values of a column. Same as avgIf.


meanIf(region == 'West', sales)

min

min(<value1> : any) => any

Gets the minimum value of a column.


min(sales)

minIf

minIf(<value1> : boolean, <value2> : any) => any

Based on a criterion, gets the minimum value of a column.


minIf(region == 'West', sales)

skewness

skewness(<value1> : number) => double

Gets the skewness of a column.


skewness(sales)

skewnessIf

skewnessIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the skewness of a column.


skewnessIf(region == 'West', sales)

stddev

stddev(<value1> : number) => double

Gets the standard deviation of a column.


stdDev(sales)

stddevIf

stddevIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the standard deviation of a column.


stddevIf(region == 'West', sales)

stddevPopulation

stddevPopulation(<value1> : number) => double

Gets the population standard deviation of a column.


stddevPopulation(sales)

stddevPopulationIf

stddevPopulationIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the population standard deviation of a column.


stddevPopulationIf(region == 'West', sales)

stddevSample

stddevSample(<value1> : number) => double

Gets the sample standard deviation of a column.


stddevSample(sales)

stddevSampleIf

stddevSampleIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the sample standard deviation of a column.


stddevSampleIf(region == 'West', sales)

sum

sum(<value1> : number) => number

Gets the aggregate sum of a numeric column.


sum(col)

sumDistinct

sumDistinct(<value1> : number) => number

Gets the aggregate sum of distinct values of a numeric column.


sumDistinct(col)

sumDistinctIf

sumDistinctIf(<value1> : boolean, <value2> : number) => number

Based on a criterion, gets the aggregate sum of distinct values of a numeric column. The condition can be based on any column.
sumDistinctIf(state == 'CA' && commission < 10000, sales)
sumDistinctIf(true, sales)

sumIf

sumIf(<value1> : boolean, <value2> : number) => number

Based on a criterion, gets the aggregate sum of a numeric column. The condition can be based on any column.
sumIf(state == 'CA' && commission < 10000, sales)
sumIf(true, sales)

variance

variance(<value1> : number) => double

Gets the variance of a column.


variance(sales)

varianceIf

varianceIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the variance of a column.


varianceIf(region == 'West', sales)

variancePopulation

variancePopulation(<value1> : number) => double

Gets the population variance of a column.


variancePopulation(sales)

variancePopulationIf

variancePopulationIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the population variance of a column.


variancePopulationIf(region == 'West', sales)

varianceSample

varianceSample(<value1> : number) => double

Gets the unbiased variance of a column.


varianceSample(sales)

varianceSampleIf

varianceSampleIf(<value1> : boolean, <value2> : number) => double

Based on a criterion, gets the unbiased variance of a column.


varianceSampleIf(region == 'West', sales)

Array functions
Array functions perform transformations on data structures that are arrays. These include special keywords to
address array elements and indexes (a short illustration follows this list):
#acc represents a value that you wish to include in your single output when reducing an array
#index represents the current array index, along with array index numbers #index2, #index3 ...
#item represents the current element value in the array
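As a brief illustrative sketch (hypothetical values, using functions described later in this section), the keywords look like this in use:
filter([1, 2, 3, 4], #item > 2) -> [3, 4]
mapIndex([10, 20], #item + #index) -> [11, 22]
toString(reduce(['a', 'b', 'c'], '', #acc + #item, #result)) -> 'abc'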

array

array([<value1> : any], ...) => array

Creates an array of items. All items should be of the same type. If no items are specified, an empty string array is
the default. Same as a [] creation operator.
array('Seattle', 'Washington')
['Seattle', 'Washington']
['Seattle', 'Washington'][1]
'Washington'

at

at(<value1> : array/map, <value2> : integer/key type) => array

Finds the element at an array index. The index is 1-based. Out of bounds index results in a null value. Finds a
value in a map given a key. If the key is not found it returns null.
at(['apples', 'pears'], 1) => 'apples'
at(['fruit' -> 'apples', 'vegetable' -> 'carrot'], 'fruit') => 'apples'

contains

contains(<value1> : array, <value2> : unaryfunction) => boolean

Returns true if any element in the provided array evaluates as true in the provided predicate. Contains expects a
reference to one element in the predicate function as #item.
contains([1, 2, 3, 4], #item == 3) -> true
contains([1, 2, 3, 4], #item > 5) -> false

distinct

distinct(<value1> : array) => array

Returns a distinct set of items from an array.


distinct([10, 20, 30, 10]) => [10, 20, 30]

except

except(<value1> : array, <value2> : array) => array

Returns a difference set of one array from another dropping duplicates.


except([10, 20, 30], [20, 40]) => [10, 30]

filter

filter(<value1> : array, <value2> : unaryfunction) => array

Filters elements out of the array that do not meet the provided predicate. Filter expects a reference to one
element in the predicate function as #item.
filter([1, 2, 3, 4], #item > 2) -> [3, 4]
filter(['a', 'b', 'c', 'd'], #item == 'a' || #item == 'b') -> ['a', 'b']

find

find(<value1> : array, <value2> : unaryfunction) => any

Finds the first item in an array that matches the condition. It takes a filter function where you can address the
item in the array as #item. For deeply nested maps, you can refer to the parent maps using the #item_n(#item_1,
#item_2...) notation.
find([10, 20, 30], #item > 10) -> 20
find(['azure', 'data', 'factory'], length(#item) > 4) -> 'azure'
find([ @( name = 'Daniel', types = [ @(mood = 'jovial', behavior = 'terrific'), @(mood = 'grumpy',
behavior = 'bad') ] ), @( name = 'Mark', types = [ @(mood = 'happy', behavior = 'awesome'), @(mood =
'calm', behavior = 'reclusive') ] ) ], contains(#item.types, #item.mood=='happy') /*Filter out the happy
kid*/ )
@( name = 'Mark', types = [ @(mood = 'happy', behavior = 'awesome'), @(mood = 'calm', behavior =
'reclusive') ] )

flatten

flatten(<array> : array, <value2> : array ..., <value2> : boolean) => array

Flattens an array or arrays into a single array. Arrays of atomic items are returned unaltered. The last argument is
optional and defaults to false; set it to true to flatten recursively more than one level deep.
flatten([['bojjus', 'girl'], ['gunchus', 'boy']]) => ['bojjus', 'girl', 'gunchus', 'boy']
flatten([[['bojjus', 'gunchus']]] , true) => ['bojjus', 'gunchus']

in

in(<array of items> : array, <item to find> : any) => boolean

Checks if an item is in the array.


in([10, 20, 30], 10) -> true
in(['good', 'kid'], 'bad') -> false

intersect

intersect(<value1> : array, <value2> : array) => array


Returns an intersection set of distinct items from 2 arrays.
intersect([10, 20, 30], [20, 40]) => [20]

map

map(<value1> : array, <value2> : unaryfunction) => any

Maps each element of the array to a new element using the provided expression. Map expects a reference to one
element in the expression function as #item.
map([1, 2, 3, 4], #item + 2) -> [3, 4, 5, 6]
map(['a', 'b', 'c', 'd'], #item + '_processed') -> ['a_processed', 'b_processed', 'c_processed',
'd_processed']

mapIf

mapIf (<value1> : array, <value2> : binaryfunction, <value3>: binaryFunction) => any

Conditionally maps an array to another array of same or smaller length. The values can be of any datatype
including structTypes. It takes a mapping function where you can address the item in the array as #item and
current index as #index. For deeply nested maps you can refer to the parent maps using the
#item_[n](#item_1, #index_1...) notation.

mapIf([10, 20, 30], #item > 10, #item + 5) -> [25, 35]
mapIf(['icecream', 'cake', 'soda'], length(#item) > 4, upper(#item)) -> ['ICECREAM', 'CAKE']

mapIndex

mapIndex(<value1> : array, <value2> : binaryfunction) => any

Maps each element of the array to a new element using the provided expression. Map expects a reference to one
element in the expression function as #item and a reference to the element index as #index.
mapIndex([1, 2, 3, 4], #item + 2 + #index) -> [4, 6, 8, 10]

mapLoop

mapLoop(<value1> : integer, <value2> : unaryfunction) => any

Loops through from 1 to length to create an array of that length. It takes a mapping function where you can
address the index in the array as #index. For deeply nested maps you can refer to the parent maps using the
#index_n(#index_1, #index_2...) notation.
mapLoop(3, #index * 10) -> [10, 20, 30]

reduce

reduce(<value1> : array, <value2> : any, <value3> : binaryfunction, <value4> : unaryfunction) => any

Accumulates elements in an array. Reduce expects a reference to an accumulator and one element in the first
expression function as #acc and #item and it expects the resulting value as #result to be used in the second
expression function.
toString(reduce(['1', '2', '3', '4'], '0', #acc + #item, #result)) -> '01234'

size

size(<value1> : any) => integer

Finds the size of an array or map type


size(['element1', 'element2']) -> 2
size([1,2,3]) -> 3

slice

slice(<array to slice> : array, <from 1-based index> : integral, [<number of items> : integral]) => array

Extracts a subset of an array from a position. The position is 1-based. If the length is omitted, it defaults to the end
of the array.
slice([10, 20, 30, 40], 1, 2) -> [10, 20]
slice([10, 20, 30, 40], 2) -> [20, 30, 40]
slice([10, 20, 30, 40], 2)[1] -> 20
isNull(slice([10, 20, 30, 40], 2)[0]) -> true
isNull(slice([10, 20, 30, 40], 2)[20]) -> true
slice(['a', 'b', 'c', 'd'], 8) -> []

sort

sort(<value1> : array, <value2> : binaryfunction) => array

Sorts the array using the provided predicate function. Sort expects a reference to two consecutive elements in
the expression function as #item1 and #item2.
sort([4, 8, 2, 3], compare(#item1, #item2)) -> [2, 3, 4, 8]
sort(['a3', 'b2', 'c1'], iif(right(#item1, 1) >= right(#item2, 1), 1, -1)) -> ['c1', 'b2', 'a3']
unfold

unfold (<value1>: array) => any

Unfolds an array into a set of rows and repeats the values for the remaining columns in every row.
unfold(addresses) => any
unfold( @(name = salesPerson, sales = salesAmount) ) => any

union

union(<value1>: array, <value2> : array) => array

Returns a union set of distinct items from 2 arrays.


union([10, 20, 30], [20, 40]) => [10, 20, 30, 40]

Cached lookup functions


The following functions are only available when using a cached lookup, that is, when you've included a cached sink.

lookup

lookup(key, key2, ...) => complex[]

Looks up the first row from the cached sink using the specified keys that match the keys from the cached sink.
cacheSink#lookup(movieId)

mlookup

mlookup(key, key2, ...) => complex[]

Looks up all matching rows from the cached sink using the specified keys that match the keys from the
cached sink.
cacheSink#mlookup(movieId)

output

output() => any

Returns the first row of the results of the cache sink


cacheSink#output()

outputs

outputs() => any

Returns the entire output row set of the results of the cache sink.
cacheSink#outputs()

Conversion functions
Conversion functions are used to convert data and test for data types
isBitSet

isBitSet (<value1> : array, <value2>:integer ) => boolean

Checks if a bit position is set in this bitset


isBitSet(toBitSet([10, 32, 98]), 10) => true

setBitSet

setBitSet (<value1>: array, <value2>:array) => array

Sets bit positions in this bitset


setBitSet(toBitSet([10, 32]), [98]) => [4294968320L, 17179869184L]

isBoolean

isBoolean(<value1>: string) => boolean

Checks if the string value is a boolean value according to the rules of toBoolean().

isBoolean('true') -> true
isBoolean('no') -> true
isBoolean('microsoft') -> false

isByte

isByte(<value1> : string) => boolean

Checks if the string value is a byte value given an optional format according to the rules of toByte()
isByte('123') -> true
isByte('chocolate') -> false

isDate

isDate (<value1> : string, [<format> : string]) => boolean

Checks if the input date string is a date using an optional input date format. Refer to Java's SimpleDateFormat for
available formats. If the input date format is omitted, the default format is yyyy-[M]M-[d]d. Accepted formats are
[ yyyy, yyyy-[M]M, yyyy-[M]M-[d]d, yyyy-[M]M-[d]dT* ].

isDate('2012-8-18') -> true
isDate('12/18--234234', 'MM/dd/yyyy') -> false

isShort

isShort (<value1> : string, [<format> : string]) => boolean

Checks if the string value is a short value given an optional format according to the rules of toShort().

isShort('123') -> true
isShort('$123', '$###') -> true
isShort('microsoft') -> false

isInteger

isInteger (<value1> : string, [<format> : string]) => boolean

Checks if the string value is an integer value given an optional format according to the rules of toInteger().

isInteger('123') -> true
isInteger('$123', '$###') -> true
isInteger('microsoft') -> false

isLong

isLong (<value1> : string, [<format> : string]) => boolean

Checks if the string value is a long value given an optional format according to the rules of toLong().

isLong('123') -> true
isLong('$123', '$###') -> true
isLong('gunchus') -> false

isNan

isNan (<value1> : integral) => boolean

Checks if the value is not a number.


isNan(10.2) => false

isFloat

isFloat (<value1> : string, [<format> : string]) => boolean

Checks if the string value is a float value given an optional format according to the rules of toFloat().

isFloat('123') -> true
isFloat('$123.45', '$###.00') -> true
isFloat('icecream') -> false

isDouble

isDouble (<value1> : string, [<format> : string]) => boolean

Checks if the string value is a double value given an optional format according to the rules of toDouble().

isDouble('123') -> true
isDouble('$123.45', '$###.00') -> true
isDouble('icecream') -> false

isDecimal

isDecimal (<value1> : string) => boolean

Checks if the string value is a decimal value given an optional format according to the rules of toDecimal().

isDecimal('123.45') -> true


isDecimal('12/12/2000') -> false

isTimestamp

isTimestamp (<value1> : string, [<format> : string]) => boolean

Checks if the input date string is a timestamp using an optional input timestamp format. Refer to Java's
SimpleDateFormat for available formats. If the format is omitted, the default pattern
yyyy-[M]M-[d]d hh:mm:ss[.f...] is used. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. Timestamp supports up to millisecond accuracy with a value of 999.
isTimestamp('2016-12-31 00:12:00') -> true
isTimestamp('2016-12-31T00:12:00', 'yyyy-MM-dd\'T\'HH:mm:ss', 'PST') -> true
isTimestamp('2012-8222.18') -> false

toBase64

toBase64(<value1> : string) => string

Encodes the given string in base64.


toBase64('bojjus') -> 'Ym9qanVz'

toBinary

toBinary(<value1> : any) => binary

Converts any numeric/date/timestamp/string to binary representation.


toBinary(3) -> [0x11]

toBoolean

toBoolean(<value1> : string) => boolean

Converts a value of ('t', 'true', 'y', 'yes', '1') to true and ('f', 'false', 'n', 'no', '0') to false and NULL for any other
value.
toBoolean('true') -> true
toBoolean('n') -> false
isNull(toBoolean('truthy')) -> true

toByte

toByte(<value> : any, [<format> : string], [<locale> : string]) => byte

Converts any numeric or string to a byte value. An optional Java decimal format can be used for the conversion.
toByte(123) -> 123
toByte(0xFF) -> -1
toByte('123') -> 123

toDate

toDate(<string> : any, [<date format> : string]) => date

Converts an input date string to a date using an optional input date format. Refer to Java's SimpleDateFormat class
for available formats. If the input date format is omitted, the default format is yyyy-[M]M-[d]d. Accepted formats
are: [ yyyy, yyyy-[M]M, yyyy-[M]M-[d]d, yyyy-[M]M-[d]dT* ].
toDate('2012-8-18') -> toDate('2012-08-18')
toDate('12/18/2012', 'MM/dd/yyyy') -> toDate('2012-12-18')

toDecimal

toDecimal(<value> : any, [<precision> : integral], [<scale> : integral], [<format> : string], [<locale> : string]) => decimal(10,0)

Converts any numeric or string to a decimal value. If precision and scale are not specified, it defaults to
(10,2). An optional Java decimal format can be used for the conversion. An optional locale format can be supplied
in the form of a BCP47 language tag like en-US, de, zh-CN.
toDecimal(123.45) -> 123.45
toDecimal('123.45', 8, 4) -> 123.4500
toDecimal('$123.45', 8, 4,'$###.00') -> 123.4500
toDecimal('Ç123,45', 10, 2, 'Ç###,##', 'de') -> 123.45

toDouble

toDouble(<value> : any, [<format> : string], [<locale> : string]) => double

Converts any numeric or string to a double value. An optional Java decimal format can be used for the
conversion. An optional locale format in the form of BCP47 language like en-US, de, zh-CN.
toDouble(123.45) -> 123.45
toDouble('123.45') -> 123.45
toDouble('$123.45', '$###.00') -> 123.45
toDouble('Ç123,45', 'Ç###,##', 'de') -> 123.45

toFloat
toFloat(<value> : any, [<format> : string], [<locale> : string]) => float

Converts any numeric or string to a float value. An optional Java decimal format can be used for the conversion.
Truncates any double.
toFloat(123.45) -> 123.45f
toFloat('123.45') -> 123.45f
toFloat('$123.45', '$###.00') -> 123.45f

toInteger

toInteger(<value> : any, [<format> : string], [<locale> : string]) => integer

Converts any numeric or string to an integer value. An optional Java decimal format can be used for the
conversion. Truncates any long, float, double.
toInteger(123) -> 123
toInteger('123') -> 123
toInteger('$123', '$###') -> 123

toLong

toLong(<value> : any, [<format> : string], [<locale> : string]) => long

Converts any numeric or string to a long value. An optional Java decimal format can be used for the conversion.
Truncates any float, double.
toLong(123) -> 123
toLong('123') -> 123
toLong('$123', '$###') -> 123

toShort

toShort(<value> : any, [<format> : string], [<locale> : string]) => short

Converts any numeric or string to a short value. An optional Java decimal format can be used for the
conversion. Truncates any integer, long, float, double.
toShort(123) -> 123
toShort('123') -> 123
toShort('$123', '$###') -> 123

toString

toString(<value> : any, [<number format/date format> : string]) => string

Converts a primitive datatype to a string. For numbers and dates, a format can be specified. If unspecified, the
system default is picked. Java decimal format is used for numbers. Refer to Java's SimpleDateFormat for all
possible date formats; the default format is yyyy-MM-dd.
toString(10) -> '10'
toString('engineer') -> 'engineer'
toString(123456.789, '##,###.##') -> '123,456.79'
toString(123.78, '000000.000') -> '000123.780'
toString(12345, '##0.#####E0') -> '12.345E3'
toString(toDate('2018-12-31')) -> '2018-12-31'
isNull(toString(toDate('2018-12-31', 'MM/dd/yy'))) -> true
toString(4 == 20) -> 'false'

toTimestamp

toTimestamp(<string> : any, [<timestamp format> : string], [<time zone> : string]) => timestamp

Converts a string to a timestamp given an optional timestamp format. If the format is omitted, the default
pattern yyyy-[M]M-[d]d hh:mm:ss[.f...] is used. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. Timestamp supports up to millisecond accuracy with a value of 999. Refer to Java's
SimpleDateFormat class for available formats:
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
toTimestamp('2016-12-31 00:12:00') -> toTimestamp('2016-12-31 00:12:00')
toTimestamp('2016-12-31T00:12:00', 'yyyy-MM-dd\'T\'HH:mm:ss', 'PST') -> toTimestamp('2016-12-31
00:12:00')
toTimestamp('12/31/2016T00:12:00', 'MM/dd/yyyy\'T\'HH:mm:ss') -> toTimestamp('2016-12-31 00:12:00')
millisecond(toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS')) -> 871

toUTC

toUTC(<value1> : timestamp, [<value2> : string]) => timestamp

Converts the timestamp to UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It defaults to the current timezone. Refer to Java's SimpleDateFormat class for available
formats: https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
toUTC(currentTimestamp()) == toTimestamp('2050-12-12 19:18:12') -> false
toUTC(currentTimestamp(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true
Map functions
Map functions perform operations on map data types
associate

associate(<value1> : any, ...) => map

Creates a map of key/values. All the keys & values should be of the same type. If no items are specified, it
defaults to a map of string to string type. Same as a [ -> ] creation operator. Keys and values should alternate
with each other.
associate('fruit', 'apple', 'vegetable', 'carrot') => ['fruit' -> 'apple', 'vegetable' -> 'carrot']

keyValues

keyValues(<value1> : array, <value2> : array) => map

Creates a map of key/values. The first parameter is an array of keys and second is the array of values. Both
arrays should have equal length.
keyValues(['bojjus', 'appa'], ['gunchus', 'ammi']) => ['bojjus' -> 'gunchus', 'appa' -> 'ammi']

mapAssociation

mapAssociation(<value1> : map, <value2> : binaryFunction) => array

Transforms a map by associating the keys to new values. Returns an array. It takes a mapping function where
you can address the item as #key and current value as #value.
mapAssociation(['bojjus' -> 'gunchus', 'appa' -> 'ammi'], @(key = #key, value = #value)) => [@(key =
'bojjus', value = 'gunchus'), @(key = 'appa', value = 'ammi')]

reassociate

reassociate(<value1> : map, <value2> : binaryFunction) => map

Transforms a map by associating the keys to new values. It takes a mapping function where you can address the
item as #key and current value as #value.
reassociate(['fruit' -> 'apple', 'vegetable' -> 'tomato'], substring(#key, 1, 1) + substring(#value, 1,
1)) => ['fruit' -> 'fa', 'vegetable' -> 'vt']

Metafunctions
Metafunctions primarily operate on the metadata in your data flow.
byItem

byItem(<parent column> : any, <column name> : string) => any

Finds a sub item within a structure or array of structures. If there are multiple matches, the first match is returned.
If there is no match, it returns a NULL value. The returned value has to be type converted by one of the type
conversion actions (? date, ? string ...). Column names known at design time should be addressed just by their
name. Computed inputs are not supported, but you can use parameter substitutions.
byItem( byName('customer'), 'orderItems') ? (itemName as string, itemQty as integer)

byItem( byItem( byName('customer'), 'orderItems'), 'itemName') ? string

byOrigin

byOrigin(<column name> : string, [<origin stream name> : string]) => any

Selects a column value by name in the origin stream. The second argument is the origin stream name. If there
are multiple matches, the first match is returned. If there is no match, it returns a NULL value. The returned value has to
be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...). Column names known at
design time should be addressed just by their name. Computed inputs are not supported but you can use
parameter substitutions.
toString(byOrigin('ancestor', 'ancestorStream'))

byOrigins

byOrigins(<column names> : array, [<origin stream name> : string]) => any

Selects an array of columns by name in the stream. The second argument is the stream where it originated from.
If there are multiple matches, the first match is returned. If there is no match, it returns a NULL value. The returned value
has to be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...) Column names
known at design time should be addressed just by their name. Computed inputs are not supported but you can
use parameter substitutions.
toString(byOrigins(['ancestor1', 'ancestor2'], 'ancestorStream'))
byName

byName(<column name> : string, [<stream name> : string]) => any

Selects a column value by name in the stream. You can pass a optional stream name as the second argument. If
there are multiple matches, the first match is returned. If there is no match, it returns a NULL value. The returned value
has to be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...). Column names
known at design time should be addressed just by their name. Computed inputs are not supported but you can
use parameter substitutions.
toString(byName('parent'))
toLong(byName('income'))
toBoolean(byName('foster'))
toLong(byName($debtCol))
toString(byName('Bogus Column'))
toString(byName('Bogus Column', 'DeriveStream'))

byNames

byNames(<column names> : array, [<stream name> : string]) => any

Select an array of columns by name in the stream. You can pass a optional stream name as the second
argument. If there are multiple matches, the first match is returned. If there are no matches for a column, the
entire output is a NULL value. The returned value requires a type conversion function (toDate, toString, ...).
Column names known at design time should be addressed just by their name. Computed inputs are not
supported but you can use parameter substitutions.
toString(byNames(['parent', 'child']))
byNames(['parent']) ? string
toLong(byNames(['income']))
byNames(['income']) ? long
toBoolean(byNames(['foster']))
toLong(byNames($debtCols))
toString(byNames(['a Column']))
toString(byNames(['a Column'], 'DeriveStream'))
byNames(['orderItem']) ? (itemName as string, itemQty as integer)

byPath

byPath(<value1> : string, [<streamName> : string]) => any

Finds a hierarchical path by name in the stream. You can pass an optional stream name as the second argument.
If no such path is found it returns null. Column names/paths known at design time should be addressed just by
their name or dot notation path. Computed inputs are not supported but you can use parameter substitutions.
byPath('grandpa.parent.child') => column

byPosition

byPosition(<position> : integer) => any

Selects a column value by its relative position (1-based) in the stream. If the position is out of bounds, it returns a
NULL value. The returned value has to be type converted by one of the type conversion functions (TO_DATE,
TO_STRING ...). Computed inputs are not supported, but you can use parameter substitutions.
toString(byPosition(1))
toDecimal(byPosition(2), 10, 2)
toBoolean(byPosition(4))
toString(byName($colName))
toString(byPosition(1234))

hasPath

hasPath(<value1> : string, [<streamName> : string]) => boolean

Checks if a certain hierarchical path exists by name in the stream. You can pass an optional stream name as the
second argument. Column names/paths known at design time should be addressed just by their name or dot
notation path. Computed inputs are not supported but you can use parameter substitutions.
hasPath('grandpa.parent.child') => boolean

originColumns

originColumns(<streamName> : string) => any

Gets all output columns for an origin stream where the columns were created. Must be enclosed in another function.
array(toString(originColumns('source1')))

hex

hex(<value1>: binary) => string

Returns a hex string representation of a binary value


hex(toBinary([toByte(0x1f), toByte(0xad), toByte(0xbe)])) -> '1fadbe'
unhex

unhex(<value1>: string) => binary

Unhexes a binary value from its string representation. This can be used in conjunction with sha2, md5 to
convert from a string to a binary representation.
unhex('1fadbe') -> toBinary([toByte(0x1f), toByte(0xad), toByte(0xbe)])
unhex(md5(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4'))) ->
toBinary([toByte(0x4c), toByte(0xe8), toByte(0xa8), toByte(0x80), toByte(0xbd), toByte(0x62), toByte(0x1a), toByte(0x1f), toByte(0xfa), toByte(0xd0), toByte(0xbc), toByte(0xa9), ...])

Window functions
The following functions are only available in window transformations.

cumeDist

cumeDist() => integer

The CumeDist function computes the position of a value relative to all values in the partition. The result is the
number of rows preceding or equal to the current row in the ordering of the partition divided by the total
number of rows in the window partition. Any tie values in the ordering will evaluate to the same position.
cumeDist()

denseRank

denseRank() => integer

Computes the rank of a value in a group of values specified in a window's order by clause. The result is one plus
the number of rows preceding or equal to the current row in the ordering of the partition. The values will not
produce gaps in the sequence. Dense Rank works even when data is not sorted and looks for change in values.
denseRank()

lag

lag(<value> : any, [<number of rows to look before> : number], [<default value> : any]) => any

Gets the value of the first parameter evaluated n rows before the current row. The second parameter is the
number of rows to look back and the default value is 1. If there are not as many rows a value of null is returned
unless a default value is specified.
lag(amount, 2)
lag(amount, 2000, 100)

lead

lead(<value> : any, [<number of rows to look after> : number], [<default value> : any]) => any

Gets the value of the first parameter evaluated n rows after the current row. The second parameter is the
number of rows to look forward and the default value is 1. If there are not as many rows a value of null is
returned unless a default value is specified.
lead(amount, 2)
lead(amount, 2000, 100)

nTile

nTile([<value1> : integer]) => integer

The NTile function divides the rows for each window partition into n buckets ranging from 1 to at most n.
Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the
number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket. The
NTile function is useful for the calculation of tertiles, quartiles, deciles, and other common summary statistics.
The function calculates two variables during initialization: the size of a regular bucket, and the number of buckets
that will have one extra row added to them. Both variables are based on the size of the current partition. During
the calculation process the function keeps track of the current row number, the current bucket number, and the
row number at which the bucket will change (bucketThreshold). When the current row number reaches the bucket
threshold, the bucket value is increased by one and the threshold is increased by the bucket size (plus one extra if
the current bucket is padded).
nTile()
nTile(numOfBuckets)
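As a hypothetical worked illustration (assumed values, not from the original reference): a window partition of 10 ordered rows with nTile(4) yields bucket sizes 3, 3, 2, 2, so the assigned bucket values are:
row:    1  2  3  4  5  6  7  8  9  10
bucket: 1  1  1  2  2  2  3  3  4  4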

rank

rank() => integer

Computes the rank of a value in a group of values specified in a window's order by clause. The result is one plus
the number of rows preceding or equal to the current row in the ordering of the partition. The values will
produce gaps in the sequence. Rank works even when data is not sorted and looks for change in values.
rank()
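To make the contrast with denseRank concrete, here is a hypothetical illustration (assumed ordering values, not from the original reference). For ordered values 10, 20, 20, 30 within a partition:
rank()      -> 1, 2, 2, 4   (the tie leaves a gap)
denseRank() -> 1, 2, 2, 3   (the tie leaves no gap)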

rowNumber

rowNumber() => integer


Assigns a sequential row numbering for rows in a window starting with 1.
rowNumber()

Next steps
Learn how to use Expression Builder.
What is data wrangling?

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Organizations need the ability to explore their critical business data for data preparation and wrangling in
order to provide accurate analysis of complex data that continues to grow every day. Data preparation is
required so that organizations can use the data in various business processes and reduce the time to value.
Data Factory empowers you with iterative, code-free data preparation at cloud scale using Power Query. Data
Factory integrates with Power Query Online and makes Power Query M functions available as a pipeline activity.
Data Factory translates the M generated by the Power Query Online Mashup Editor into Spark code for cloud-scale
execution by converting it into Azure Data Factory data flows. Wrangling data with Power Query and data flows
is especially useful for data engineers or 'citizen data integrators'.

NOTE
The Power Query activity in Azure Data Factory is currently available in public preview

Use cases

Fast interactive data exploration and preparation


Multiple data engineers and citizen data integrators can interactively explore and prepare datasets at cloud scale.
With the rise of volume, variety and velocity of data in data lakes, users need an effective way to explore and
prepare data sets. For example, you may need to create a dataset that 'has all customer demographic info for
new customers since 2017'. You aren't mapping to a known target. You're exploring, wrangling, and prepping
datasets to meet a requirement before publishing it in the lake. Wrangling is often used for less formal analytics
scenarios. The prepped datasets can be used for doing transformations and machine learning operations
downstream.
Code-free agile data preparation
Citizen data integrators spend more than 60% of their time looking for and preparing data. They're looking to
do it in a code free manner to improve operational productivity. Allowing citizen data integrators to enrich,
shape, and publish data using known tools like Power Query Online in a scalable manner drastically improves
their productivity. Wrangling in Azure Data Factory enables the familiar Power Query Online mashup editor to
allow citizen data integrators to fix errors quickly, standardize data, and produce high-quality data to support
business decisions.
Data validation and exploration
Visually scan your data in a code-free manner to remove any outliers, anomalies, and conform it to a shape for
fast analytics.

Supported sources
CONNECTOR                      DATA FORMAT     AUTHENTICATION TYPE

Azure Blob Storage             CSV, Parquet    Account Key
Azure Data Lake Storage Gen1   CSV             Service Principal
Azure Data Lake Storage Gen2   CSV, Parquet    Account Key, Service Principal
Azure SQL Database             -               SQL authentication
Azure Synapse Analytics        -               SQL authentication

The mashup editor


When you create a Power Query activity, all source datasets become dataset queries and are placed in the
ADFResource folder. By default, the UserQuery will point to the first dataset query. All transformations should
be done on the UserQuery as changes to dataset queries are not supported nor will they be persisted.
Renaming, adding and deleting queries is currently not supported.

Currently not all Power Query M functions are supported for data wrangling despite being available during
authoring. While building your Power Query activities, you'll be prompted with the following error message if a
function isn't supported:
The wrangling data flow is invalid. Expression.Error: The transformation logic isn't supported. Please try a
simpler expression

For more information on supported transformations, see data wrangling functions.

Next steps
Learn how to create a data wrangling Power Query mash-up.
Transformation functions in Power Query for data
wrangling

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Data Wrangling in Azure Data Factory allows you to do code-free agile data preparation and wrangling at cloud
scale by translating Power Query M scripts into Data Flow script. ADF integrates with Power Query Online and
makes Power Query M functions available for data wrangling via Spark execution using the data flow Spark
infrastructure.

NOTE
Power Query in ADF is currently available in public preview

Currently not all Power Query M functions are supported for data wrangling despite being available during
authoring. While building your mash-ups, you'll be prompted with the following error message if a function isn't
supported:
UserQuery : Expression.Error: The transformation logic is not supported as it requires dynamic access to rows
of data, which cannot be scaled out.

Below is a list of supported Power Query M functions.

Column Management
Selection: Table.SelectColumns
Removal: Table.RemoveColumns
Renaming: Table.RenameColumns, Table.PrefixColumns, Table.TransformColumnNames
Reordering: Table.ReorderColumns

Row Filtering
Use M function Table.SelectRows to filter on the following conditions:
Equality and inequality
Numeric, text, and date comparisons (but not DateTime)
Numeric information such as Number.IsEven/Odd
Text containment using Text.Contains, Text.StartsWith, or Text.EndsWith
Date ranges (including all the 'IsIn' Date functions)
Combinations of these using and, or, or not conditions

Adding and Transforming Columns


The following M functions add or transform columns: Table.AddColumn, Table.TransformColumns,
Table.ReplaceValue, Table.DuplicateColumn. Below are the supported transformation functions.
Numeric arithmetic
Text concatenation
Date and Time Arithmetic (Arithmetic operators, Date.AddDays, Date.AddMonths, Date.AddQuarters,
Date.AddWeeks, Date.AddYears)
Durations can be used for date and time arithmetic, but must be transformed into another type before
written to a sink (Arithmetic operators, #duration, Duration.Days, Duration.Hours, Duration.Minutes,
Duration.Seconds, Duration.TotalDays, Duration.TotalHours, Duration.TotalMinutes, Duration.TotalSeconds)
Most standard, scientific, and trigonometric numeric functions (All functions under Operations, Rounding,
and Trigonometry except Number.Factorial, Number.Permutations, and Number.Combinations)
Replacement (Replacer.ReplaceText, Replacer.ReplaceValue, Text.Replace, Text.Remove)
Positional text extraction (Text.PositionOf, Text.Length, Text.Start, Text.End, Text.Middle, Text.ReplaceRange,
Text.RemoveRange)
Basic text formatting (Text.Lower, Text.Upper, Text.Trim/Start/End, Text.PadStart/End, Text.Reverse)
Date/Time Functions (Date.Day, Date.Month, Date.Year, Time.Hour, Time.Minute, Time.Second,
Date.DayOfWeek, Date.DayOfYear, Date.DaysInMonth)
If expressions (but branches must have matching types)
Row filters as a logical column
Number, text, logical, date, and datetime constants

Merging/Joining tables
Power Query will generate a nested join (Table.NestedJoin; users can also manually write
Table.AddJoinColumn). Users must then expand the nested join column into a non-nested join
(Table.ExpandTableColumn, not supported in any other context).
The M function Table.Join can be written directly to avoid the need for an additional expansion step (a short
sketch follows this list), but the user must ensure that there are no duplicate column names among the joined tables.
Supported Join Kinds: Inner, LeftOuter, RightOuter, FullOuter
Both Value.Equals and Value.NullableEquals are supported as key equality comparers
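As a hedged sketch of the direct Table.Join form (the table and column names here are hypothetical, and the key columns are named differently so no duplicate column names result):
Table.Join(Orders, "CustomerID", Customers, "CustID", JoinKind.Inner)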

Group by
Use Table.Group to aggregate values.
Must be used with an aggregation function
Supported aggregation functions: List.Sum, List.Count, List.Average, List.Min, List.Max, List.StandardDeviation,
List.First, List.Last

Sorting
Use Table.Sort to sort values.

Reducing Rows
Keep and Remove Top, Keep Range (corresponding M functions, only supporting counts, not conditions:
Table.FirstN, Table.Skip, Table.RemoveFirstN, Table.Range, Table.MinN, Table.MaxN)

Known unsupported functions


Table.PromoteHeaders: Not supported. The same result can be achieved by setting "First row as header" in the dataset.

Table.CombineColumns: This is a common scenario that isn't directly supported but can be achieved by adding a new column that concatenates two given columns. For example, Table.AddColumn(RemoveEmailColumn, "Name", each [FirstName] & " " & [LastName]).

Table.TransformColumnTypes: This is supported in most cases. The following scenarios are unsupported: transforming string to currency type, transforming string to time type, transforming string to Percentage type.

Table.NestedJoin: Just doing a join will result in a validation error. The columns must be expanded for it to work.

Table.Distinct: Remove duplicate rows isn't supported.

Table.RemoveLastN: Remove bottom rows isn't supported.

Table.RowCount: Not supported, but can be achieved by adding a custom column containing the value 1, then aggregating that column with List.Sum. Table.Group is supported.

Row level error handling: Row level error handling is currently not supported. For example, to filter out non-numeric values from a column, one approach would be to transform the text column to a number. Every cell which fails to transform will be in an error state and need to be filtered. This scenario isn't possible in scaled-out M.

Table.Transpose: Not supported.

Table.Pivot: Not supported.

Table.SplitColumn: Partially supported.

M script workarounds
For Table.SplitColumn, there are alternatives for splitting by length and by position:
Table.AddColumn(Source, "First characters", each Text.Start([Email], 7), type text)
Table.AddColumn(#"Inserted first characters", "Text range", each Text.Middle([Email], 4, 9), type text)
This option is accessible from the Extract option in the ribbon
For Table.CombineColumns

Table.AddColumn(RemoveEmailColumn, "Name", each [FirstName] & " " & [LastName])

Next steps
Learn how to create a data wrangling Power Query in ADF.
Roles and permissions for Azure Data Factory
4/22/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes the roles required to create and manage Azure Data Factory resources, and the
permissions granted by those roles.

Roles and requirements


To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor role, the owner role, or an administrator of the Azure subscription. To view the permissions that you
have in the subscription, in the Azure portal, select your username in the upper-right corner, and then select
Permissions . If you have access to multiple subscriptions, select the appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the Resource Group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
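For example, the following is a minimal Azure PowerShell sketch (assuming the Az module is installed; the sign-in name and resource group name are placeholders) that assigns the built-in Data Factory Contributor role at the resource group scope:

# Sign in, then assign the built-in Data Factory Contributor role at the resource group scope.
# The sign-in name and resource group name below are placeholders.
Connect-AzAccount
New-AzRoleAssignment -SignInName "user@contoso.com" -RoleDefinitionName "Data Factory Contributor" -ResourceGroupName "adf-dev-rg"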

Set up permissions
After you create a Data Factory, you may want to let other users work with the data factory. To give this access to
other users, you have to add them to the built-in Data Factory Contributor role on the Resource Group that
contains the Data Factory.
Scope of the Data Factory Contributor role
Membership of the Data Factory Contributor role lets users do the following things:
Create, edit, and delete data factories and child resources including datasets, linked services, pipelines,
triggers, and integration runtimes.
Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by
Data Factory in the Azure portal.
Manage App Insights alerts for a data factory.
Create support tickets.
For more info about this role, see Data Factory Contributor role.
Resource Manager template deployment
The Data Factory Contributor role, at the resource group level or above, lets users deploy Resource Manager
templates. As a result, members of the role can use Resource Manager templates to deploy both data factories
and their child resources, including datasets, linked services, pipelines, triggers, and integration runtimes.
Membership in this role does not let the user create other resources.
Permissions on Azure Repos and GitHub are independent of Data Factory permissions. As a result, a user with
repo permissions who is only a member of the Reader role can edit Data Factory child resources and commit
changes to the repo, but can't publish these changes.
IMPORTANT
Resource Manager template deployment with the Data Factory Contributor role does not elevate your permissions.
For example, if you deploy a template that creates an Azure virtual machine, and you don't have permission to create
virtual machines, the deployment fails with an authorization error.

In the publish context, the Microsoft.DataFactory/factories/write permission applies to the following modes:

That permission is only required in Live mode when the customer modifies the global parameters.
That permission is always required in Git mode, since every time after the customer publishes, the factory
object with the last commit ID needs to be updated.
Custom scenarios and custom roles
Sometimes you may need to grant different access levels for different data factory users. For example:
You may need a group where users only have permissions on a specific data factory.
Or you may need a group where users can only monitor a data factory (or factories) but can't modify it.
You can achieve these custom scenarios by creating custom roles and assigning users to those roles. For more
info about custom roles, see Custom roles in Azure.
Here are a few examples that demonstrate what you can achieve with custom roles:
Let a user create, edit, or delete any data factory in a resource group from the Azure portal.
Assign the built-in Data Factory contributor role at the resource group level for the user. If you want to
allow access to any data factory in a subscription, assign the role at the subscription level.
Let a user view (read) and monitor a data factory, but not edit or change it.
Assign the built-in reader role on the data factory resource for the user.
Let a user edit a single data factory in the Azure portal.
This scenario requires two role assignments.
1. Assign the built-in contributor role at the data factory level.
2. Create a custom role with the permission Microsoft.Resources/deployments/ . Assign this custom
role to the user at resource group level.
Let a user be able to test connection in a linked service or preview data in a dataset
Create a custom role with permissions for the following actions:
Microsoft.DataFactory/factories/getFeatureValue/read and
Microsoft.DataFactory/factories/getDataPlaneAccess/action. Assign this custom role on the data
factory resource for the user.
Let a user update a data factory from PowerShell or the SDK, but not in the Azure portal.
Assign the built-in contributor role on the data factory resource for the user. This role lets the user see
the resources in the Azure portal, but the user can't access the Publish and Publish All buttons.
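As an illustrative sketch only (the role name, description, and subscription ID below are placeholder assumptions), a custom role such as the test-connection and preview-data example above could be created with Azure PowerShell by cloning and trimming an existing role definition:

# Clone an existing definition, strip it down to the two required data factory actions,
# and register it as a custom role.
$role = Get-AzRoleDefinition "Reader"
$role.Id = $null
$role.IsCustom = $true
$role.Name = "Data Factory Test Connection And Preview"      # hypothetical role name
$role.Description = "Can test linked service connections and preview dataset data."
$role.Actions.Clear()
$role.Actions.Add("Microsoft.DataFactory/factories/getFeatureValue/read")
$role.Actions.Add("Microsoft.DataFactory/factories/getDataPlaneAccess/action")
$role.AssignableScopes.Clear()
$role.AssignableScopes.Add("/subscriptions/<subscription-id>")   # placeholder subscription
New-AzRoleDefinition -Role $role

The new role can then be assigned on the data factory resource with New-AzRoleAssignment, as in the earlier example.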

Next steps
Learn more about roles in Azure - Understand role definitions
Learn more about the Data Factory contributor role - Data Factory Contributor role.
Azure Data Factory - naming rules
4/22/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The following table provides naming rules for Data Factory artifacts.

Data factory
Name uniqueness: Unique across Microsoft Azure. Names are case-insensitive; that is, MyDF and mydf refer to the same data factory.
Validation checks: Each data factory is tied to exactly one Azure subscription. Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container names. Names can be 3-63 characters long.

Linked services/Datasets/Pipelines/Data Flows
Name uniqueness: Unique within a data factory. Names are case-insensitive.
Validation checks: Object names must start with a letter. The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\". Dashes ("-") are not allowed in the names of linked services, data flows, and datasets.

Integration Runtime
Name uniqueness: Unique within a data factory. Names are case-insensitive.
Validation checks: Integration runtime names can contain only letters, numbers, and the dash (-) character. The first and last characters must be a letter or number. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in integration runtime names.

Data flow transformations
Name uniqueness: Unique within a data flow. Names are case-insensitive.
Validation checks: Data flow transformation names can only contain letters and numbers. The first character must be a letter.

Resource Group
Name uniqueness: Unique across Microsoft Azure. Names are case-insensitive.
Validation checks: For more info, see Azure naming rules and restrictions.

Pipeline parameters & variables
Name uniqueness: Unique within the pipeline. Names are case-insensitive.
Validation checks: Validation of parameter names and variable names is limited to uniqueness for backward compatibility reasons. When parameters or variables are used to reference entity names (for example, a linked service), the entity naming rules apply. A good practice is to follow the data flow transformation naming rules when naming your pipeline parameters and variables.
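As an informal aid (not a service API), the data factory naming rule above can be checked locally with a regular expression before attempting deployment; the pattern below is an assumption derived from the stated rules:

# Hypothetical local check of a candidate data factory name:
# 3-63 characters, letters/numbers/dashes only, starts and ends with a letter or number,
# and no consecutive dashes.
$name = "my-data-factory-01"
$pattern = '^[A-Za-z0-9](?!.*--)[A-Za-z0-9-]{1,61}[A-Za-z0-9]$'
if ($name -match $pattern) {
    Write-Output "'$name' looks like a valid data factory name."
} else {
    Write-Output "'$name' violates the data factory naming rules."
}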

Next steps
Learn how to create data factories by following step-by-step instructions in Quickstart: create a data factory
article.
Azure Data Factory data redundancy
3/5/2021 • 2 minutes to read

Azure Data Factory data includes metadata (pipeline, datasets, linked services, integration runtime and triggers)
and monitoring data (pipeline, trigger, and activity runs).
In all regions (except Brazil South and Southeast Asia), Azure Data Factory data is stored and replicated in the
paired region to protect against metadata loss. During regional datacenter failures, Microsoft may initiate a
regional failover of your Azure Data Factory instance. In most cases, no action is required on your part. When
the Microsoft-managed failover has completed, you will be able to access your Azure Data Factory in the failover
region.
Due to data residency requirements in Brazil South and Southeast Asia, Azure Data Factory data is stored in the
local region only. For Southeast Asia, all data is stored in Singapore. For Brazil South, all data is stored in
Brazil. If the region is lost due to a significant disaster, Microsoft will not be able to recover your Azure Data
Factory data.

NOTE
Microsoft-managed failover does not apply to self-hosted integration runtime (SHIR) since this infrastructure is typically
customer-managed. If the SHIR is set up on Azure VM, then the recommendation is to leverage Azure site recovery for
handling the Azure VM failover to another region.

Using source control in Azure Data Factory


To ensure that you are able to track and audit the changes made to your Azure Data Factory metadata, you
should consider setting up source control for your Azure Data Factory. It will also enable you to access your
metadata JSON files for pipelines, datasets, linked services, and triggers. Azure Data Factory enables you to work
with different Git repository types (Azure DevOps Git and GitHub).
Learn how to set up source control in Azure Data Factory.

NOTE
In case of a disaster (loss of region), a new data factory can be provisioned manually or in an automated fashion. Once the
new data factory has been created, you can restore your pipelines, datasets, and linked services JSON from the existing Git
repository.
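For example (a hedged sketch with placeholder names, assuming the Az.DataFactory module is installed), a replacement factory could be provisioned with Azure PowerShell and then reconnected to the existing Git repository:

# Create (or update) a data factory in the recovery region.
# The resource group, factory name, and location below are placeholders.
Set-AzDataFactoryV2 -ResourceGroupName "adf-recovery-rg" -Name "adf-recovered-factory" -Location "West US 2"

Once the factory exists, reconnecting it to the existing Git repository (or importing the JSON definitions) restores the pipelines, datasets, and linked services.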

Data stores
Azure Data Factory enables you to move data among data stores located on-premises and in the cloud. To
ensure business continuity with your data stores, you should refer to the business continuity recommendations
for each of these data stores.

See also
Azure Regional Pairs
Data residency in Azure
Visual authoring in Azure Data Factory
4/22/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Data Factory user interface experience (UX) lets you visually author and deploy resources for your
data factory without having to write any code. You can drag activities to a pipeline canvas, perform test runs,
debug iteratively, and deploy and monitor your pipeline runs.
Currently, the Azure Data Factory UX is only supported in Microsoft Edge and Google Chrome.

Authoring canvas
To open the authoring canvas , click on the pencil icon.

Here, you author the pipelines, activities, datasets, linked services, data flows, triggers, and integration runtimes
that comprise your factory. To get started building a pipeline using the authoring canvas, see Copy data using
the copy Activity.
The default visual authoring experience is directly working with the Data Factory service. Azure Repos Git or
GitHub integration is also supported to allow source control and collaboration for work on your data factory
pipelines. To learn more about the differences between these authoring experiences, see Source control in Azure
Data Factory.
Properties pane
For top-level resources such as pipelines, datasets, and data flows, high-level properties are editable in the
properties pane on the right-hand side of the canvas. The properties pane contains properties such as name,
description, annotations, and other high-level properties. Subresources such as pipeline activities and data flow
transformations are edited using the panel at the bottom of the canvas.
The properties pane only opens by default on resource creation. To edit it, click on the properties pane icon
located in the top-right corner of the canvas.
Related resources
In the properties pane, you can see what resources are dependent on the selected resource by selecting the
Related tab. Any resource that references the current resource will be listed here.
For example, in the above image, one pipeline and two data flows use the dataset currently selected.

Management hub
The management hub, accessed by the Manage tab in the Azure Data Factory UX, is a portal that hosts global
management actions for your data factory. Here, you can manage your connections to data stores and external
computes, source control configuration, and trigger settings. For more information, learn about the capabilities
of the management hub.

Expressions and functions


Expressions and functions can be used instead of static values to specify many properties in Azure Data Factory.
To specify an expression for a property value, select Add Dynamic Content or click Alt + P while focusing on
the field.

This opens the Data Factory Expression Builder where you can build expressions from supported system
variables, activity output, functions, and user-specified variables or parameters.
For information about the expression language, see Expressions and functions in Azure Data Factory.

Provide feedback
Select Feedback to comment about features or to notify Microsoft about issues with the tool:
Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines programmatically.
Iterative development and debugging with Azure
Data Factory
4/22/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory lets you iteratively develop and debug Data Factory pipelines as you are developing your
data integration solutions. These features allow you to test your changes before creating a pull request or
publishing them to the data factory service.
For an eight-minute introduction and demonstration of this feature, watch the following video:

Debugging a pipeline
As you author using the pipeline canvas, you can test your activities using the Debug capability. When you do
test runs, you don't have to publish your changes to the data factory before you select Debug . This feature is
helpful in scenarios where you want to make sure that the changes work as expected before you update the data
factory workflow.

As the pipeline is running, you can see the results of each activity in the Output tab of the pipeline canvas.
View the results of your test runs in the Output window of the pipeline canvas.
After a test run succeeds, add more activities to your pipeline and continue debugging in an iterative manner.
You can also Cancel a test run while it is in progress.

IMPORTANT
Selecting Debug actually runs the pipeline. For example, if the pipeline contains copy activity, the test run copies data
from source to destination. As a result, we recommend that you use test folders in your copy activities and other activities
when debugging. After you've debugged the pipeline, switch to the actual folders that you want to use in normal
operations.

Setting breakpoints
Azure Data Factory allows for you to debug a pipeline until you reach a particular activity on the pipeline canvas.
Put a breakpoint on the activity up to which you want to test, and select Debug. Data Factory ensures that the
test runs only until the breakpoint activity on the pipeline canvas. This Debug Until feature is useful when you
don't want to test the entire pipeline, but only a subset of activities inside the pipeline.
To set a breakpoint, select an element on the pipeline canvas. A Debug Until option appears as an empty red
circle at the upper right corner of the element.

After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled.

Monitoring debug runs


When you run a pipeline debug run, the results will appear in the Output window of the pipeline canvas. The
output tab will only contain the most recent run that occurred during the current browser session.
To view a historical view of debug runs or see a list of all active debug runs, you can go into the Monitor
experience.

NOTE
The Azure Data Factory service only persists debug run history for 15 days.

Debugging mapping data flows


Mapping data flows allow you to build code-free data transformation logic that runs at scale. When building
your logic, you can turn on a debug session to interactively work with your data using a live Spark cluster. To
learn more, read about mapping data flow debug mode.
You can monitor active data flow debug sessions across a factory in the Monitor experience.

Data preview in the data flow designer and pipeline debugging of data flows are intended to work best with
small samples of data. However, if you need to test your logic in a pipeline or data flow against large amounts of
data, increase the size of the Azure Integration Runtime being used in the debug session with more cores and a
minimum of general purpose compute.
Debugging a pipeline with a data flow activity
When executing a debug pipeline run with a data flow, you have two options on which compute to use. You can
either use an existing debug cluster or spin up a new just-in-time cluster for your data flows.
Using an existing debug session will greatly reduce the data flow start up time as the cluster is already running,
but is not recommended for complex or parallel workloads as it may fail when multiple jobs are run at once.
Using the activity runtime will create a new cluster using the settings specified in each data flow activity's
integration runtime. This allows each job to be isolated and should be used for complex workloads or
performance testing. You can also control the TTL in the Azure IR so that the cluster resources used for
debugging will still be available for that time period to serve additional job requests.

NOTE
If you have a pipeline with data flows executing in parallel or data flows that need to be tested with large datasets, choose
"Use Activity Runtime" so that Data Factory can use the Integration Runtime that you've selected in your data flow
activity. This will allow the data flows to execute on multiple clusters and can accommodate your parallel data flow
executions.

Next steps
After testing your changes, promote them to higher environments using continuous integration and
deployment in Azure Data Factory.
Management hub in Azure Data Factory
4/28/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The management hub, accessed by the Manage tab in the Azure Data Factory UX, is a portal that hosts global
management actions for your data factory. Here, you can manage your connections to data stores and external
computes, source control configuration, and trigger settings.

Manage connections
Linked services
Linked services define the connection information for Azure Data Factory to connect to external data stores and
compute environments. For more information, see linked services concepts. Linked service creation, editing, and
deletion is done in the management hub.

Integration runtimes
An integration runtime is a compute infrastructure used by Azure Data Factory to provide data integration
capabilities across different network environments. For more information, learn about integration runtime
concepts. In the management hub, you can create, delete, and monitor your integration runtimes.

Manage source control


Git configuration
You can view/edit all the Git-related information under the Git configuration settings in the management hub.
Last published commit information is listed as well and can help you understand the precise commit that was
last published/deployed across environments. It can also be helpful when doing hot fixes in production.
For more information, learn about source control in Azure Data Factory.
Parameterization template
To override the generated Resource Manager template parameters when publishing from the collaboration
branch, you can generate or edit a custom parameters file. For more information, learn how to use custom
parameters in the Resource Manager template. The parameterization template is only available when working in
a git repository. If the arm-template-parameters-definition.json file doesn't exist in the working branch, editing
the default template will generate it.

Manage authoring
Triggers
Triggers determine when a pipeline run should be kicked off. Currently triggers can be on a wall clock schedule,
operate on a periodic interval, or depend on an event. For more information, learn about trigger execution. In
the management hub, you can create, edit, delete, or view the current state of a trigger.
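Outside the UI, trigger state can also be inspected or toggled with Azure PowerShell. The following is a hedged sketch; the resource group, factory, and trigger names are placeholders:

# List all triggers in a factory along with their runtime state (Started/Stopped).
Get-AzDataFactoryV2Trigger -ResourceGroupName "adf-dev-rg" -DataFactoryName "my-adf" | Select-Object Name, RuntimeState

# Start or stop a specific trigger by name.
Start-AzDataFactoryV2Trigger -ResourceGroupName "adf-dev-rg" -DataFactoryName "my-adf" -Name "DailyTrigger" -Force
Stop-AzDataFactoryV2Trigger -ResourceGroupName "adf-dev-rg" -DataFactoryName "my-adf" -Name "DailyTrigger" -Force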
Global parameters
Global parameters are constants across a data factory that can be consumed by a pipeline in any expression. For
more information, learn about global parameters.

Next steps
Learn how to configure a git repository to your ADF
Source control in Azure Data Factory
7/2/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


By default, the Azure Data Factory user interface experience (UX) authors directly against the data factory
service. This experience has the following limitations:
The Data Factory service doesn't include a repository for storing the JSON entities for your changes. The only
way to save changes is via the Publish All button and all changes are published directly to the data factory
service.
The Data Factory service isn't optimized for collaboration and version control.
The Azure Resource Manager template required to deploy Data Factory itself is not included.
To provide a better authoring experience, Azure Data Factory allows you to configure a Git repository with either
Azure Repos or GitHub. Git is a version control system that allows for easier change tracking and collaboration.
This article will outline how to configure and work in a git repository along with highlighting best practices and
a troubleshooting guide.

NOTE
For Azure Government Cloud, only GitHub Enterprise Server is available.

To learn more about how Azure Data Factory integrates with Git, view the 15-minute tutorial video below:

Advantages of Git integration


Below is a list of some of the advantages git integration provides to the authoring experience:
Source control: As your data factory workloads become crucial, you would want to integrate your factory
with Git to leverage several source control benefits like the following:
Ability to track/audit changes.
Ability to revert changes that introduced bugs.
Partial saves: When authoring against the data factory service, you can't save changes as a draft and all
publishes must pass data factory validation. Whether your pipelines are not finished or you simply don't
want to lose changes if your computer crashes, git integration allows for incremental changes of data factory
resources regardless of what state they are in. Configuring a git repository allows you to save changes,
letting you only publish when you have tested your changes to your satisfaction.
Collaboration and control: If you have multiple team members contributing to the same factory, you may
want to let your teammates collaborate with each other via a code review process. You can also set up your
factory such that not every contributor has equal permissions. Some team members may only be allowed to
make changes via Git and only certain people in the team are allowed to publish the changes to the factory.
Better CI/CD: If you are deploying to multiple environments with a continuous delivery process, git
integration makes certain actions easier. Some of these actions include:
Configure your release pipeline to trigger automatically as soon as there are any changes made to
your 'dev' factory.
Customize the properties in your factory that are available as parameters in the Resource Manager
template. It can be useful to keep only the required set of properties as parameters, and have
everything else hard coded.
Better Performance: An average factory with git integration loads 10 times faster than one authoring
against the data factory service. This performance improvement is because resources are downloaded via
Git.

NOTE
Authoring directly with the Data Factory service is disabled in the Azure Data Factory UX when a Git repository is
configured. Changes made via PowerShell or an SDK are published directly to the Data Factory service, and are not
entered into Git.

Connect to a Git repository


There are four different ways to connect a Git repository to your data factory for both Azure Repos and GitHub.
After you connect to a Git repository, you can view and manage your configuration in the management hub
under Git configuration in the Source control section
Configuration method 1: Home page
In the Azure Data Factory home page, select Set up code repository at the top.

Configuration method 2: Authoring canvas


In the Azure Data Factory UX authoring canvas, select the Data Factory drop-down menu, and then select Set
up code repository.

Configuration method 3: Management hub


Go to the management hub in the ADF UX. Select Git configuration in the Source control section. If you
have no repository connected, click Configure .
Configuration method 4: During factory creation
When creating a new data factory in the Azure portal, you can configure Git repository information in the Git
configuration tab.

NOTE
When configuring Git in the Azure portal, settings like project name and repo name have to be manually entered instead
of being part of a dropdown.

Author with Azure Repos Git integration


Visual authoring with Azure Repos Git integration supports source control and collaboration for work on your
data factory pipelines. You can associate a data factory with an Azure Repos Git organization repository for
source control, collaboration, versioning, and so on. A single Azure Repos Git organization can have multiple
repositories, but an Azure Repos Git repository can be associated with only one data factory. If you don't have an
Azure Repos organization or repository, follow these instructions to create your resources.
NOTE
You can store script and data files in an Azure Repos Git repository. However, you have to upload the files manually to
Azure Storage. A data factory pipeline doesn't automatically upload script or data files stored in an Azure Repos Git
repository to Azure Storage.

Azure Repos settings

The configuration pane shows the following Azure Repos code repository settings:

Setting: Repository Type
Description: The type of the Azure Repos code repository.
Value: Azure DevOps Git or GitHub

Setting: Azure Active Directory
Description: Your Azure AD tenant name.
Value: <your tenant name>

Setting: Azure Repos Organization
Description: Your Azure Repos organization name. You can locate your Azure Repos organization name at https://{organization name}.visualstudio.com. You can sign in to your Azure Repos organization to access your Visual Studio profile and see your repositories and projects.
Value: <your organization name>

Setting: ProjectName
Description: Your Azure Repos project name. You can locate your Azure Repos project name at https://{organization name}.visualstudio.com/{project name}.
Value: <your Azure Repos project name>

Setting: RepositoryName
Description: Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that's already in your project.
Value: <your Azure Repos code repository name>

Setting: Collaboration branch
Description: Your Azure Repos collaboration branch that is used for publishing. By default, it's main. Change this setting in case you want to publish resources from another branch.
Value: <your collaboration branch name>

Setting: Root folder
Description: Your root folder in your Azure Repos collaboration branch.
Value: <your root folder name>

Setting: Import existing Data Factory resources to repository
Description: Specifies whether to import existing data factory resources from the UX authoring canvas into an Azure Repos Git repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported.
Value: Selected (default)

Setting: Branch to import resource into
Description: Specifies into which branch the data factory resources (pipelines, datasets, linked services, etc.) are imported. You can import resources into one of the following branches: a. Collaboration b. Create new c. Use Existing
NOTE
If you are using Microsoft Edge and do not see any values in your Azure DevOps Account dropdown, add
https://*.visualstudio.com to the trusted sites list.

Use a different Azure Active Directory tenant


The Azure Repos Git repo can be in a different Azure Active Directory tenant. To specify a different Azure AD
tenant, you have to have administrator permissions for the Azure subscription that you're using. For more info,
see change subscription administrator

IMPORTANT
To connect to another Azure Active Directory, the user logged in must be a part of that active directory.

Use your personal Microsoft account


To use a personal Microsoft account for Git integration, you can link your personal Azure Repo to your
organization's Active Directory.
1. Add your personal Microsoft account to your organization's Active Directory as a guest. For more info,
see Add Azure Active Directory B2B collaboration users in the Azure portal.
2. Log in to the Azure portal with your personal Microsoft account. Then switch to your organization's Active
Directory.
3. Go to the Azure DevOps section, where you now see your personal repo. Select the repo and connect
with Active Directory.
After these configuration steps, your personal repo is available when you set up Git integration in the Data
Factory UI.
For more info about connecting Azure Repos to your organization's Active Directory, see Connect your Azure
DevOps organization to Azure Active Directory.

Author with GitHub integration


Visual authoring with GitHub integration supports source control and collaboration for work on your data
factory pipelines. You can associate a data factory with a GitHub account repository for source control,
collaboration, versioning. A single GitHub account can have multiple repositories, but a GitHub repository can
be associated with only one data factory. If you don't have a GitHub account or repository, follow these
instructions to create your resources.
The GitHub integration with Data Factory supports both public GitHub (that is, https://github.com) and GitHub
Enterprise. You can use both public and private GitHub repositories with Data Factory as long you have read and
write permission to the repository in GitHub.
To configure a GitHub repo, you must have administrator permissions for the Azure subscription that you're
using.
GitHub settings
The configuration pane shows the following GitHub repository settings:

Setting: Repository Type
Description: The type of the Azure Repos code repository.
Value: GitHub

Setting: Use GitHub Enterprise
Description: Checkbox to select GitHub Enterprise.
Value: unselected (default)

Setting: GitHub Enterprise URL
Description: The GitHub Enterprise root URL (must be HTTPS for local GitHub Enterprise server). For example: https://github.mydomain.com. Required only if Use GitHub Enterprise is selected.
Value: <your GitHub enterprise url>

Setting: GitHub account
Description: Your GitHub account name. This name can be found from https://github.com/{account name}/{repository name}. Navigating to this page prompts you to enter GitHub OAuth credentials to your GitHub account.
Value: <your GitHub account name>

Setting: Repository Name
Description: Your GitHub code repository name. GitHub accounts contain Git repositories to manage your source code. You can create a new repository or use an existing repository that's already in your account.
Value: <your repository name>

Setting: Collaboration branch
Description: Your GitHub collaboration branch that is used for publishing. By default, it's main. Change this setting in case you want to publish resources from another branch.
Value: <your collaboration branch>

Setting: Root folder
Description: Your root folder in your GitHub collaboration branch.
Value: <your root folder name>

Setting: Import existing Data Factory resources to repository
Description: Specifies whether to import existing data factory resources from the UX authoring canvas into a GitHub repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported.
Value: Selected (default)

Setting: Branch to import resource into
Description: Specifies into which branch the data factory resources (pipelines, datasets, linked services, etc.) are imported. You can import resources into one of the following branches: a. Collaboration b. Create new c. Use Existing

GitHub organizations
Connecting to a GitHub organization requires the organization to grant permission to Azure Data Factory. A user
with ADMIN permissions on the organization must perform the below steps to allow data factory to connect.
Connecting to GitHub for the first time in Azure Data Factory
If you're connecting to GitHub from Azure Data Factory for the first time, follow these steps to connect to a
GitHub organization.
1. In the Git configuration pane, enter the organization name in the GitHub Account field. A prompt to log in to
GitHub will appear.
2. Log in using your user credentials.
3. You'll be asked to authorize Azure Data Factory as an application called AzureDataFactory. On this screen, you
will see an option to grant permission for ADF to access the organization. If you don't see the option to grant
permission, ask an admin to manually grant the permission through GitHub.
Once you follow these steps, your factory will be able to connect to both public and private repositories within
your organization. If you are unable to connect, try clearing the browser cache and retrying.
Already connected to GitHub using a personal account
If you have already connected to GitHub and only granted permission to access a personal account, follow the
below steps to grant permissions to an organization.
1. Go to GitHub and open Settings .

2. Select Applications . In the Authorized OAuth apps tab, you should see AzureDataFactory.
3. Select the application and grant the application access to your organization.

Once you follow these steps, your factory will be able to connect to both public and private repositories within
your organization.
Known GitHub limitations
You can store script and data files in a GitHub repository. However, you have to upload the files manually
to Azure Storage. A Data Factory pipeline does not automatically upload script or data files stored in a
GitHub repository to Azure Storage.
GitHub Enterprise with a version older than 2.14.0 doesn't work in the Microsoft Edge browser.
GitHub integration with the Data Factory visual authoring tools only works in the generally available
version of Data Factory.
A maximum of 1,000 entities per resource type (such as pipelines and datasets) can be fetched from a
single GitHub branch. If this limit is reached, it is suggested to split your resources into separate factories.
Azure DevOps Git does not have this limitation.

Version control
Version control systems (also known as source control) let developers collaborate on code and track changes
that are made to the code base. Source control is an essential tool for multi-developer projects.
Creating feature branches
Each Azure Repos Git repository that's associated with a data factory has a collaboration branch ( main is the
default collaboration branch). Users can also create feature branches by clicking + New Branch in the branch
dropdown. Once the new branch pane appears, enter the name of your feature branch.

When you are ready to merge the changes from your feature branch to your collaboration branch, click on the
branch dropdown and select Create pull request . This action takes you to Azure Repos Git where you can
raise pull requests, do code reviews, and merge changes to your collaboration branch ( main is the default). You
are only allowed to publish to the Data Factory service from your collaboration branch.

Configure publishing settings


By default, data factory generates the Resource Manager templates of the published factory and saves them into
a branch called adf_publish . To configure a custom publish branch, add a publish_config.json file to the root
folder in the collaboration branch. When publishing, ADF reads this file, looks for the field publishBranch , and
saves all Resource Manager templates to the specified location. If the branch doesn't exist, data factory will
automatically create it. An example of what this file looks like is below:

{
"publishBranch": "factory/adf_publish"
}

Azure Data Factory can only have one publish branch at a time. When you specify a new publish branch, Data
Factory doesn't delete the previous publish branch. If you want to remove the previous publish branch, delete it
manually.

NOTE
Data Factory only reads the publish_config.json file when it loads the factory. If you already have the factory loaded
in the portal, refresh the browser to make your changes take effect.

Publish code changes


After you have merged changes to the collaboration branch ( main is the default), click Publish to manually
publish your code changes in the main branch to the Data Factory service.

A side pane will open where you confirm that the publish branch and pending changes are correct. Once you
verify your changes, click OK to confirm the publish.
IMPORTANT
The main branch is not representative of what's deployed in the Data Factory service. The main branch must be published
manually to the Data Factory service.

Best practices for Git integration


Permissions
Typically you don't want every team member to have permissions to update the Data Factory. The following
permissions settings are recommended:
All team members should have read permissions to the Data Factory.
Only a select set of people should be allowed to publish to the Data Factory. To do so, they must have the
Data Factory contributor role on the Resource Group that contains the Data Factory. For more
information on permissions, see Roles and permissions for Azure Data Factory.
It's recommended to not allow direct check-ins to the collaboration branch. This restriction can help prevent
bugs as every check-in will go through a pull request review process described in Creating feature branches.
Using passwords from Azure Key Vault
It's recommended to use Azure Key Vault to store any connection strings or passwords, or to use managed identity
authentication, for Data Factory linked services. For security reasons, data factory doesn't store secrets in Git.
Any changes to linked services containing secrets such as passwords are published immediately to the Azure
Data Factory service.
Using Key Vault or MSI authentication also makes continuous integration and deployment easier as you won't
have to provide these secrets during Resource Manager template deployment.
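As a hedged illustration (the file name, factory name, and vault URL below are placeholders), an Azure Key Vault linked service can be defined in JSON and deployed with the Az.DataFactory module:

# AzureKeyVaultLinkedService.json - a minimal Key Vault linked service definition.
@'
{
    "name": "AzureKeyVaultLinkedService",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {
            "baseUrl": "https://<your-key-vault-name>.vault.azure.net"
        }
    }
}
'@ | Set-Content -Path ".\AzureKeyVaultLinkedService.json"

# Create or update the linked service in the factory (placeholder names).
Set-AzDataFactoryV2LinkedService -ResourceGroupName "adf-dev-rg" -DataFactoryName "my-adf" -Name "AzureKeyVaultLinkedService" -DefinitionFile ".\AzureKeyVaultLinkedService.json"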

Troubleshooting Git integration


Stale publish branch
If the publish branch is out of sync with the main branch and contains out-of-date resources despite a recent
publish, try following these steps:
1. Remove your current Git repository
2. Reconfigure Git with the same settings, but make sure Import existing Data Factory resources to
repository is selected and choose New branch
3. Create a pull request to merge the changes to the collaboration branch
Below are some examples of situations that can cause a stale publish branch:
A user has multiple branches. In one feature branch, they deleted a linked service that isn't AKV associated
(non-AKV linked services are published immediately regardless of whether they are in Git or not) and never merged
the feature branch into the collaboration branch.
A user modified the data factory using the SDK or PowerShell
A user moved all resources to a new branch and tried to publish for the first time. Linked services should be
created manually when importing resources.
A user uploads a non-AKV linked service or an Integration Runtime JSON manually. They reference that
resource from another resource such as a dataset, linked service, or pipeline. A non-AKV linked service
created through the UX is published immediately because the credentials need to be encrypted. If you upload
a dataset referencing that linked service and try to publish, the UX will allow it because it exists in the git
environment. It will be rejected at publish time since it does not exist in the data factory service.
Switch to a different Git repository
To switch to a different Git repository, go to Git configuration page in the management hub under Source
control . Select Disconnect .

Enter your data factory name and click confirm to remove the Git repository associated with your data factory.

After you remove the association with the current repo, you can configure your Git settings to use a different
repo and then import existing Data Factory resources to the new repo.

IMPORTANT
Removing Git configuration from a data factory doesn't delete anything from the repository. The factory will contain all
published resources. You can continue to edit the factory directly against the service.

Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines
programmatically.
To implement continuous integration and deployment, see Continuous integration and delivery (CI/CD) in
Azure Data Factory.
Continuous integration and delivery in Azure Data
Factory
6/10/2021 • 29 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Overview
Continuous integration is the practice of testing each change made to your codebase automatically and as early
as possible. Continuous delivery follows the testing that happens during continuous integration and pushes
changes to a staging or production system.
In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from
one environment (development, test, production) to another. Azure Data Factory utilizes Azure Resource
Manager templates to store the configuration of your various ADF entities (pipelines, datasets, data flows, and
so on). There are two suggested methods to promote a data factory to another environment:
Automated deployment using Data Factory's integration with Azure Pipelines
Manually upload a Resource Manager template using Data Factory UX integration with Azure Resource
Manager.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
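If the Az module isn't already installed, it can typically be installed from the PowerShell Gallery as shown below (a standard command rather than anything specific to Data Factory):

# Install the Az PowerShell module for the current user from the PowerShell Gallery.
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force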

CI/CD lifecycle
Below is a sample overview of the CI/CD lifecycle in an Azure data factory that's configured with Azure Repos
Git. For more information on how to configure a Git repository, see Source control in Azure Data Factory.
1. A development data factory is created and configured with Azure Repos Git. All developers should have
permission to author Data Factory resources like pipelines and datasets.
2. A developer creates a feature branch to make a change. They debug their pipeline runs with their most
recent changes. For more information on how to debug a pipeline run, see Iterative development and
debugging with Azure Data Factory.
3. After a developer is satisfied with their changes, they create a pull request from their feature branch to
the main or collaboration branch to get their changes reviewed by peers.
4. After a pull request is approved and changes are merged in the main branch, the changes get published
to the development factory.
5. When the team is ready to deploy the changes to a test or UAT (User Acceptance Testing) factory, the
team goes to their Azure Pipelines release and deploys the desired version of the development factory to
UAT. This deployment takes place as part of an Azure Pipelines task and uses Resource Manager template
parameters to apply the appropriate configuration.
6. After the changes have been verified in the test factory, deploy to the production factory by using the
next task of the pipelines release.

NOTE
Only the development factory is associated with a git repository. The test and production factories shouldn't have a git
repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource
Manager template.

The below image highlights the different steps of this lifecycle.

Automate continuous integration by using Azure Pipelines releases


The following is a guide for setting up an Azure Pipelines release that automates the deployment of a data
factory to multiple environments.
Requirements
An Azure subscription linked to Visual Studio Team Foundation Server or Azure Repos that uses the Azure
Resource Manager service endpoint.
A data factory configured with Azure Repos Git integration.
An Azure key vault that contains the secrets for each environment.
Set up an Azure Pipelines release
1. In Azure DevOps, open the project that's configured with your data factory.
2. On the left side of the page, select Pipelines , and then select Releases .
3. Select New pipeline , or, if you have existing pipelines, select New and then New release pipeline .
4. Select the Empty job template.
5. In the Stage name box, enter the name of your environment.
6. Select Add ar tifact , and then select the git repository configured with your development data factory.
Select the publish branch of the repository for the Default branch . By default, this publish branch is
adf_publish . For the Default version , select Latest from default branch .

7. Add an Azure Resource Manager Deployment task:


a. In the stage view, select View stage tasks .
b. Create a new task. Search for ARM Template Deployment , and then select Add .
c. In the Deployment task, select the subscription, resource group, and location for the target data factory.
Provide credentials if necessary.
d. In the Action list, select Create or update resource group .
e. Select the ellipsis button (… ) next to the Template box. Browse for the Azure Resource Manager
template that is generated in your publish branch of the configured git repository. Look for the file
ARMTemplateForFactory.json in the folder of the adf_publish branch.

f. Select … next to the Template parameters box to choose the parameters file. Look for the file
ARMTemplateParametersForFactory.json in the folder of the adf_publish branch.

g. Select … next to the Override template parameters box, and enter the desired parameter values for
the target data factory. For credentials that come from Azure Key Vault, enter the secret's name between
double quotation marks. For example, if the secret's name is cred1, enter "$(cred1)" for this value.
h. Select Incremental for the Deployment mode .

WARNING
In Complete deployment mode, resources that exist in the resource group but aren't specified in the new Resource
Manager template will be deleted . For more information, please refer to Azure Resource Manager Deployment
Modes
8. Save the release pipeline.
9. To trigger a release, select Create release . To automate the creation of releases, see Azure DevOps
release triggers

IMPORTANT
In CI/CD scenarios, the integration runtime (IR) type in different environments must be the same. For example, if you have
a self-hosted IR in the development environment, the same IR must also be of type self-hosted in other environments,
such as test and production. Similarly, if you're sharing integration runtimes across multiple stages, you have to configure
the integration runtimes as linked self-hosted in all environments, such as development, test, and production.

Get secrets from Azure Key Vault


If you have secrets to pass in an Azure Resource Manager template, we recommend that you use Azure Key
Vault with the Azure Pipelines release.
There are two ways to handle secrets:
1. Add the secrets to parameters file. For more info, see Use Azure Key Vault to pass secure parameter value
during deployment.
Create a copy of the parameters file that's uploaded to the publish branch. Set the values of the
parameters that you want to get from Key Vault by using this format:
{
    "parameters": {
        "azureSqlReportingDbPassword": {
            "reference": {
                "keyVault": {
                    "id": "/subscriptions/<subId>/resourceGroups/<resourcegroupId>/providers/Microsoft.KeyVault/vaults/<vault-name>"
                },
                "secretName": "<secret-name>"
            }
        }
    }
}

When you use this method, the secret is pulled from the key vault automatically.
The parameters file needs to be in the publish branch as well.
2. Add an Azure Key Vault task before the Azure Resource Manager Deployment task described in the
previous section:
a. On the Tasks tab, create a new task. Search for Azure Key Vault and add it.
b. In the Key Vault task, select the subscription in which you created the key vault. Provide credentials
if necessary, and then select the key vault.

Grant permissions to the Azure Pipelines agent


The Azure Key Vault task might fail with an Access Denied error if the correct permissions aren't set. Download
the logs for the release, and locate the .ps1 file that contains the command to give permissions to the Azure
Pipelines agent. You can run the command directly. Or you can copy the principal ID from the file and add the
access policy manually in the Azure portal. Get and List are the minimum permissions required.
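For reference, the same minimum permissions could be granted with Azure PowerShell; in this sketch the vault name is a placeholder and the object ID is the service principal used by the Azure Pipelines service connection:

# Grant the minimum secret permissions (Get and List) to the pipeline's service principal.
Set-AzKeyVaultAccessPolicy -VaultName "my-release-kv" -ObjectId "<service-principal-object-id>" -PermissionsToSecrets Get,List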
Updating active triggers
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
WARNING
If you do not use the latest versions of the PowerShell and Data Factory modules, you may run into deserialization errors while
running the commands.

Deployment can fail if you try to update active triggers. To update active triggers, you need to manually stop
them and then restart them after the deployment. You can do this by using an Azure PowerShell task:
1. On the Tasks tab of the release, add an Azure PowerShell task. For the task version, choose the latest Azure
PowerShell version.
2. Select the subscription your factory is in.
3. Select Script File Path as the script type. This requires you to save your PowerShell script in your
repository. The following PowerShell script can be used to stop triggers:

$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName

$triggersADF | ForEach-Object { Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.name -Force }

You can complete similar steps (with the Start-AzDataFactoryV2Trigger function) to restart the triggers after
deployment.
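For example, a post-deployment step mirroring the stop script above could look like the following sketch (using the same $ResourceGroupName and $DataFactoryName variables):

# Restart the triggers after the deployment completes.
$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName

$triggersADF | ForEach-Object { Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.name -Force }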
The data factory team has provided a sample pre- and post-deployment script located at the bottom of this
article.

Manually promote a Resource Manager template for each


environment
1. Go to the Manage hub in your data factory, and select ARM template in the "Source control" section. Under
the ARM template section, select Export ARM template to export the Resource Manager template for
your data factory in the development environment.

2. In your test and production data factories, select Import ARM Template. This action takes you to the
Azure portal, where you can import the exported template. Select Build your own template in the
editor to open the Resource Manager template editor.
3. Select Load file , and then select the generated Resource Manager template. This is the
arm_template.json file located in the .zip file exported in step 1.

4. In the settings section, enter the configuration values, like linked service credentials. When you're done,
select Purchase to deploy the Resource Manager template.
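Alternatively, the exported template can be deployed from the command line with Azure PowerShell. The following is a hedged sketch; the resource group name, file path, and the assumption that the exported template exposes a factoryName parameter reflect a typical export rather than guaranteed names:

# Deploy the exported factory template into the target resource group.
New-AzResourceGroupDeployment -ResourceGroupName "adf-test-rg" -TemplateFile ".\arm_template.json" -Mode Incremental -TemplateParameterObject @{ factoryName = "my-adf-test" }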
Use custom parameters with the Resource Manager template
If your development factory has an associated git repository, you can override the default Resource Manager
template parameters of the Resource Manager template generated by publishing or exporting the template. You
might want to override the default Resource Manager parameter configuration in these scenarios:
You use automated CI/CD and you want to change some properties during Resource Manager
deployment, but the properties aren't parameterized by default.
Your factory is so large that the default Resource Manager template is invalid because it has more than
the maximum allowed parameters (256).
To handle the 256 custom parameter limit, there are three options:
Use the custom parameter file and remove properties that don't need parameterization, that is, properties
that can keep a default value, which decreases the parameter count.
Refactor logic in the data flow to reduce parameters; for example, if pipeline parameters all have the
same value, you can just use global parameters instead.
Split one data factory into multiple data flows.
To override the default Resource Manager parameter configuration, go to the Manage hub and select ARM
template in the "Source control" section. Under ARM parameter configuration section, click Edit icon in
"Edit parameter configuration" to open the Resource Manager parameter configuration code editor.

NOTE
ARM parameter configuration is only enabled in "GIT mode". Currently it is disabled in "live mode" or "Data Factory"
mode.

Creating a custom Resource Manager parameter configuration creates a file named arm-template-
parameters-definition.json in the root folder of your git branch. You must use that exact file name.

When publishing from the collaboration branch, Data Factory will read this file and use its configuration to
generate which properties get parameterized. If no file is found, the default template is used.
When exporting a Resource Manager template, Data Factory reads this file from whichever branch you're
currently working on, not the collaboration branch. You can create or edit the file from a private branch, where
you can test your changes by selecting Expor t ARM Template in the UI. You can then merge the file into the
collaboration branch.
NOTE
A custom Resource Manager parameter configuration doesn't change the ARM template parameter limit of 256. It lets
you choose and decrease the number of parameterized properties.

Custom parameter syntax


The following are some guidelines to follow when you create the custom parameters file, arm-template-
parameters-definition.json . The file consists of a section for each entity type: trigger, pipeline, linked service,
dataset, integration runtime, and data flow.
Enter the property path under the relevant entity type.
Setting a property name to * indicates that you want to parameterize all properties under it (only down to
the first level, not recursively). You can also provide exceptions to this configuration.
Setting the value of a property as a string indicates that you want to parameterize the property. Use the
format <action>:<name>:<stype> .
<action> can be one of these characters:
= means keep the current value as the default value for the parameter.
- means don't keep the default value for the parameter.
| is a special case for secrets from Azure Key Vault for connection strings or keys.
<name> is the name of the parameter. If it's blank, it takes the name of the property. If the value starts
with a - character, the name is shortened. For example,
AzureStorage1_properties_typeProperties_connectionString would be shortened to
AzureStorage1_connectionString .
<stype> is the type of parameter. If <stype> is blank, the default type is string . Supported values:
string , securestring , int , bool , object , secureobject and array .
Specifying an array in the definition file indicates that the matching property in the template is an array. Data
Factory iterates through all the objects in the array by using the definition that's specified in the integration
runtime object of the array. The second object, a string, becomes the name of the property, which is used as
the name for the parameter for each iteration.
A definition can't be specific to a resource instance. Any definition applies to all resources of that type.
By default, all secure strings, like Key Vault secrets, connection strings, keys, and tokens, are parameterized.
Sample parameterization template
Here's an example of what a Resource Manager parameter configuration might look like:
{
"Microsoft.DataFactory/factories/pipelines": {
"properties": {
"activities": [{
"typeProperties": {
"waitTimeInSeconds": "-::int",
"headers": "=::object"
}
}]
}
},
"Microsoft.DataFactory/factories/integrationRuntimes": {
"properties": {
"typeProperties": {
"*": "="
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"typeProperties": {
"recurrence": {
"*": "=",
"interval": "=:triggerSuffix:int",
"frequency": "=:-freq"
},
"maxConcurrency": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"connectionString": "|:-connectionString:secureString",
"secretAccessKey": "|"
}
}
},
"AzureDataLakeStore": {
"properties": {
"typeProperties": {
"dataLakeStoreUri": "="
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"properties": {
"typeProperties": {
"*": "="
}
}
}
}

Here's an explanation of how the preceding template is constructed, broken down by resource type.
Pipelines
Any property in the path activities/typeProperties/waitTimeInSeconds is parameterized. Any activity in a
pipeline that has a code-level property named waitTimeInSeconds (for example, the Wait activity) is
parameterized as a number, with a default name. But it won't have a default value in the Resource Manager
template. It will be a mandatory input during the Resource Manager deployment.
Similarly, a property called headers (for example, in a Web activity) is parameterized with type object
(JObject). It has a default value, which is the same value as that of the source factory.
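As a sketch, such a mandatory parameter can be supplied at deployment time with Azure PowerShell; the pipeline and parameter names below are hypothetical, and the exact generated names should be taken from the exported template:

# Hypothetical parameter name following the <entityName>_properties_... pattern generated by Data Factory.
New-AzResourceGroupDeployment -ResourceGroupName "myTestRG" `
  -TemplateFile ".\ARMTemplateForFactory.json" `
  -TemplateParameterObject @{
      "factoryName" = "myTestDataFactory";
      "MyPipeline_properties_typeProperties_waitTimeInSeconds" = 30
  }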
IntegrationRuntimes
All properties under the path typeProperties are parameterized with their respective default values. For
example, there are two properties under IntegrationRuntimes type properties: computeProperties and
ssisProperties . Both property types are created with their respective default values and types (Object).

Triggers
Under typeProperties , two properties are parameterized. The first one is maxConcurrency , which is specified
to have a default value and is of type string . It has the default parameter name
<entityName>_properties_typeProperties_maxConcurrency .
The recurrence property also is parameterized. Under it, all properties at that level are specified to be
parameterized as strings, with default values and parameter names. An exception is the interval property,
which is parameterized as type int . The parameter name is suffixed with
<entityName>_properties_typeProperties_recurrence_triggerSuffix . Similarly, the freq property is a string
and is parameterized as a string. However, the freq property is parameterized without a default value. The
name is shortened and suffixed. For example, <entityName>_freq .
LinkedServices
Linked services are unique. Because linked services and datasets have a wide range of types, you can provide
type-specific customization. In this example, for all linked services of type AzureDataLakeStore , a specific
template will be applied. For all others (via * ), a different template will be applied.
The connectionString property will be parameterized as a securestring value. It won't have a default value.
It will have a shortened parameter name that's suffixed with connectionString .
The property secretAccessKey happens to be an AzureKeyVaultSecret (for example, in an Amazon S3 linked
service). It's automatically parameterized as an Azure Key Vault secret and fetched from the configured key
vault. You can also parameterize the key vault itself.
Datasets
Although type-specific customization is available for datasets, you can provide configuration without
explicitly having a *-level configuration. In the preceding example, all dataset properties under
typeProperties are parameterized.

NOTE
Azure alerts and matrices, if configured for a pipeline, are not currently supported as parameters for ARM deployments. To reapply the alerts and matrices in the new environment, please follow Data Factory Monitoring, Alerts and Matrices.

Default parameterization template


Below is the current default parameterization template. If you need to add only a few parameters, editing this
template directly might be a good idea because you won't lose the existing parameterization structure.

{
"Microsoft.DataFactory/factories": {
"properties": {
"globalParameters": {
"*": {
"value": "="
}
}
},
"location": "="
},
"Microsoft.DataFactory/factories/pipelines": {
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/dataflows": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
},
"computeProperties": {
"dataFlowProperties": {
"externalComputeInfo": [{
"accessToken": "-::secureString"
}
]
}
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"host": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"poolName": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"functionAppUrl":"=",
"environmentUrl": "=",
"aadResourceId": "=",
"sasUri": "|:-sasUri:secureString",
"sasToken": "|",
"connectionString": "|:-connectionString:secureString",
"hostKeyFingerprint": "="
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}
},
"Microsoft.DataFactory/factories/managedVirtualNetworks/managedPrivateEndpoints": {
"properties": {
"*": "="
}
}
}

Example: parameterizing an existing Azure Databricks interactive cluster ID


The following example shows how to add a single value to the default parameterization template. We only want
to add an existing Azure Databricks interactive cluster ID for a Databricks linked service to the parameters file.
Note that this file is the same as the previous file except for the addition of existingClusterId under the
properties field of Microsoft.DataFactory/factories/linkedServices .

{
"Microsoft.DataFactory/factories": {
"properties": {
"globalParameters": {
"*": {
"*": {
"value": "="
}
}
},
"location": "="
},
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/dataflows": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}

}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"poolName": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString",
"existingClusterId": "-"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}

Linked Resource Manager templates


If you've set up CI/CD for your data factories, you might exceed the Azure Resource Manager template limits as
your factory grows bigger. For example, one limit is the maximum number of resources in a Resource Manager
template. To accommodate large factories while generating the full Resource Manager template for a factory,
Data Factory now generates linked Resource Manager templates. With this feature, the entire factory payload is
broken down into several files so that you aren't constrained by the limits.
If you've configured Git, the linked templates are generated and saved alongside the full Resource Manager
templates in the adf_publish branch in a new folder called linkedTemplates:
The linked Resource Manager templates usually consist of a master template and a set of child templates that
are linked to the master. The parent template is called ArmTemplate_master.json, and child templates are named
with the pattern ArmTemplate_0.json, ArmTemplate_1.json, and so on.
To use linked templates instead of the full Resource Manager template, update your CI/CD task to point to
ArmTemplate_master.json instead of ArmTemplateForFactory.json (the full Resource Manager template).
Resource Manager also requires that you upload the linked templates into a storage account so Azure can
access them during deployment. For more info, see Deploying linked Resource Manager templates with VSTS.
Remember to add the Data Factory scripts in your CI/CD pipeline before and after the deployment task.
If you don't have Git configured, you can access the linked templates via Export ARM Template in the ARM Template list.
When deploying your resources, you specify that the deployment is either an incremental update or a complete
update. The difference between these two modes is how Resource Manager handles existing resources in the
resource group that aren't in the template. Please review Deployment Modes.
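As a rough sketch, the deployment step for linked templates with Azure PowerShell could look like the following; the storage account, container, and SAS token are illustrative, and the parameter names expected by ArmTemplate_master.json should be checked in its parameters section:

# Illustrative storage location for the uploaded linkedTemplates folder.
$containerUri = "https://mystorageaccount.blob.core.windows.net/adf-linked-templates"
$sasToken = "?sv=2020-08-04&ss=b&srt=co&sp=rl&sig=<placeholder>"

New-AzResourceGroupDeployment -ResourceGroupName "myProdRG" `
  -TemplateUri ($containerUri + "/ArmTemplate_master.json" + $sasToken) `
  -TemplateParameterFile ".\ArmTemplateParametersForFactory.json" `
  -Mode Incremental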

Hotfix production environment


If you deploy a factory to production and realize there's a bug that needs to be fixed right away, but you can't deploy the current collaboration branch, you might need to deploy a hotfix. This approach is also known as quick-fix engineering or QFE.
1. In Azure DevOps, go to the release that was deployed to production. Find the last commit that was
deployed.
2. From the commit message, get the commit ID of the collaboration branch.
3. Create a new hotfix branch from that commit (see the command-line sketch after this list).
4. Go to the Azure Data Factory UX and switch to the hotfix branch.
5. By using the Azure Data Factory UX, fix the bug. Test your changes.
6. After the fix is verified, select Export ARM Template to get the hotfix Resource Manager template.
7. Manually check this build into the adf_publish branch.
8. If you've configured your release pipeline to automatically trigger based on adf_publish check-ins, a new
release will start automatically. Otherwise, manually queue a release.
9. Deploy the hotfix release to the test and production factories. This release contains the previous
production payload plus the fix that you made in step 5.
10. Add the changes from the hotfix to the development branch so that later releases won't include the same
bug.
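A minimal command-line sketch of steps 2 and 3, run from a PowerShell terminal; the commit ID and branch name are illustrative:

# Create and publish a hotfix branch from the commit that was deployed to production.
git fetch origin
git checkout -b hotfix/prod-fix abc1234
git push --set-upstream origin hotfix/prod-fix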
See the video below for an in-depth video tutorial on how to hotfix your environments.

Exposure control and feature flags


When working on a team, there are instances where you may merge changes, but don't want them to run in elevated environments such as PROD and QA. To handle this scenario, the ADF team recommends the DevOps concept of using feature flags. In ADF, you can combine global parameters and the If Condition activity to hide sets of logic based upon these environment flags.
To learn how to set up a feature flag, see the below video tutorial:

Best practices for CI/CD


If you're using Git integration with your data factory and have a CI/CD pipeline that moves your changes from
development into test and then to production, we recommend these best practices:
Git integration . Configure only your development data factory with Git integration. Changes to test and
production are deployed via CI/CD and don't need Git integration.
Pre- and post-deployment script . Before the Resource Manager deployment step in CI/CD, you need
to complete certain tasks, like stopping and restarting triggers and performing cleanup. We recommend
that you use PowerShell scripts before and after the deployment task. For more information, see Update
active triggers. The data factory team has provided a script to use located at the bottom of this page.
Integration runtimes and sharing . Integration runtimes don't change often and are similar across all
stages in your CI/CD. So Data Factory expects you to have the same name and type of integration runtime
across all stages of CI/CD. If you want to share integration runtimes across all stages, consider using a
ternary factory just to contain the shared integration runtimes. You can use this shared factory in all of
your environments as a linked integration runtime type (see the sketch after this list).
Managed private endpoint deployment . If a private endpoint already exists in a factory and you try
to deploy an ARM template that contains a private endpoint with the same name but with modified
properties, the deployment will fail. In other words, you can successfully deploy a private endpoint as
long as it has the same properties as the one that already exists in the factory. If any property is different
between environments, you can override it by parameterizing that property and providing the respective
value during deployment.
Key Vault . When you use linked services whose connection information is stored in Azure Key Vault, it is
recommended to keep separate key vaults for different environments. You can also configure separate
permission levels for each key vault. For example, you might not want your team members to have
permissions to production secrets. If you follow this approach, we recommend that you keep the same
secret names across all stages. If you keep the same secret names, you don't need to parameterize each
connection string across CI/CD environments because the only thing that changes is the key vault name,
which is a separate parameter.
Resource naming . Due to ARM template constraints, deployment issues may arise if your resource names contain spaces. The Azure Data Factory team recommends using '_' or '-' characters instead of spaces in resource names. For example, 'Pipeline_1' is preferable to 'Pipeline 1'.
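As a sketch, a linked integration runtime that points to a shared self-hosted integration runtime in such a ternary factory can be created with Azure PowerShell; all names and IDs below are illustrative, and the shared factory must first grant your factory permission on that integration runtime:

# Resource ID of the shared self-hosted IR in the ternary (shared) factory; illustrative values.
$sharedIrId = "/subscriptions/<subId>/resourceGroups/<sharedRG>/providers/Microsoft.DataFactory/factories/<sharedFactory>/integrationruntimes/<sharedIRName>"

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<yourRG>" `
  -DataFactoryName "<yourFactory>" `
  -Name "LinkedSelfHostedIR" `
  -Type SelfHosted `
  -SharedIntegrationRuntimeResourceId $sharedIrId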
Unsupported features
By design, Data Factory doesn't allow cherry-picking of commits or selective publishing of resources.
Publishes will include all changes made in the data factory.
Data factory entities depend on each other. For example, triggers depend on pipelines, and pipelines
depend on datasets and other pipelines. Selective publishing of a subset of resources could lead to
unexpected behaviors and errors.
On rare occasions when you need selective publishing, consider using a hotfix. For more information,
see Hotfix production environment.
The Azure Data Factory team doesn’t recommend assigning Azure RBAC controls to individual entities
(pipelines, datasets, etc.) in a data factory. For example, if a developer has access to a pipeline or a dataset,
they should be able to access all pipelines or datasets in the data factory. If you feel that you need to
implement many Azure roles within a data factory, look at deploying a second data factory.
You can't publish from private branches.
You can't currently host projects on Bitbucket.
You can't currently export and import alerts and matrices as parameters.

Sample pre- and post-deployment script


Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.

WARNING
If you do not use the latest versions of the PowerShell and Data Factory modules, you may run into deserialization errors while running the commands.
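A minimal sketch of installing the modules from the PowerShell Gallery:

# Installs the Az module (which includes the Az.DataFactory submodule) for the current user.
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force
Install-Module -Name Az.DataFactory -Scope CurrentUser -Repository PSGallery -Force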

The following sample script can be used to stop triggers before deployment and restart them afterward. The script also includes code to delete resources that have been removed. Save the script in an Azure DevOps git repository and reference it via an Azure PowerShell task using the latest Azure PowerShell version.
When running a pre-deployment script, you will need to specify a variation of the following parameters in the
Script Arguments field.
-armTemplate "$(System.DefaultWorkingDirectory)/<your-arm-template-location>" -ResourceGroupName <your-resource-group-name> -DataFactoryName <your-data-factory-name> -predeployment $true -deleteDeployment $false

When running a post-deployment script, you will need to specify a variation of the following parameters in the
Script Arguments field.
-armTemplate "$(System.DefaultWorkingDirectory)/<your-arm-template-location>" -ResourceGroupName <your-resource-group-name> -DataFactoryName <your-data-factory-name> -predeployment $false -deleteDeployment $true

NOTE
The -deleteDeployment flag is used to specify the deletion of the ADF deployment entry from the deployment history
in ARM.
Here is the script that can be used for pre- and post-deployment. It accounts for deleted resources and resource
references.

param
(
[parameter(Mandatory = $false)] [String] $armTemplate,
[parameter(Mandatory = $false)] [String] $ResourceGroupName,
[parameter(Mandatory = $false)] [String] $DataFactoryName,
[parameter(Mandatory = $false)] [Bool] $predeployment=$true,
[parameter(Mandatory = $false)] [Bool] $deleteDeployment=$false
)

function getPipelineDependencies {
param([System.Object] $activity)
if ($activity.Pipeline) {
return @($activity.Pipeline.ReferenceName)
} elseif ($activity.Activities) {
$result = @()
$activity.Activities | ForEach-Object{ $result += getPipelineDependencies -activity $_ }
return $result
} elseif ($activity.ifFalseActivities -or $activity.ifTrueActivities) {
$result = @()
$activity.ifFalseActivities | Where-Object {$_ -ne $null} | ForEach-Object{ $result +=
getPipelineDependencies -activity $_ }
$activity.ifTrueActivities | Where-Object {$_ -ne $null} | ForEach-Object{ $result +=
getPipelineDependencies -activity $_ }
return $result
} elseif ($activity.defaultActivities) {
$result = @()
$activity.defaultActivities | ForEach-Object{ $result += getPipelineDependencies -activity $_ }
if ($activity.cases) {
$activity.cases | ForEach-Object{ $_.activities } | ForEach-Object{$result +=
getPipelineDependencies -activity $_ }
}
return $result
} else {
return @()
}
}

function pipelineSortUtil {
param([Microsoft.Azure.Commands.DataFactoryV2.Models.PSPipeline]$pipeline,
[Hashtable] $pipelineNameResourceDict,
    [Hashtable] $visited,
[System.Collections.Stack] $sortedList)
if ($visited[$pipeline.Name] -eq $true) {
return;
}
$visited[$pipeline.Name] = $true;
    $pipeline.Activities | ForEach-Object{ getPipelineDependencies -activity $_ } | ForEach-Object{
        pipelineSortUtil -pipeline $pipelineNameResourceDict[$_] -pipelineNameResourceDict $pipelineNameResourceDict -visited $visited -sortedList $sortedList
    }
    $sortedList.Push($pipeline)
}

function Get-SortedPipelines {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
    $pipelines = Get-AzDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName
$ppDict = @{}
$visited = @{}
$stack = new-object System.Collections.Stack
$pipelines | ForEach-Object{ $ppDict[$_.Name] = $_ }
    $pipelines | ForEach-Object{ pipelineSortUtil -pipeline $_ -pipelineNameResourceDict $ppDict -visited $visited -sortedList $stack }
    $sortedList = new-object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSPipeline]
    while ($stack.Count -gt 0) {
        $sortedList.Add($stack.Pop())
    }
$sortedList
}

function triggerSortUtil {
param([Microsoft.Azure.Commands.DataFactoryV2.Models.PSTrigger]$trigger,
[Hashtable] $triggerNameResourceDict,
[Hashtable] $visited,
[System.Collections.Stack] $sortedList)
if ($visited[$trigger.Name] -eq $true) {
return;
}
$visited[$trigger.Name] = $true;
if ($trigger.Properties.DependsOn) {
        $trigger.Properties.DependsOn | Where-Object {$_ -and $_.ReferenceTrigger} | ForEach-Object{
            triggerSortUtil -trigger $triggerNameResourceDict[$_.ReferenceTrigger.ReferenceName] -triggerNameResourceDict $triggerNameResourceDict -visited $visited -sortedList $sortedList
        }
}
$sortedList.Push($trigger)
}

function Get-SortedTriggers {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
    $triggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
    $triggerDict = @{}
    $visited = @{}
    $stack = new-object System.Collections.Stack
    $triggers | ForEach-Object{ $triggerDict[$_.Name] = $_ }
    $triggers | ForEach-Object{ triggerSortUtil -trigger $_ -triggerNameResourceDict $triggerDict -visited $visited -sortedList $stack }
    $sortedList = new-object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSTrigger]
    while ($stack.Count -gt 0) {
        $sortedList.Add($stack.Pop())
    }
$sortedList
}

function Get-SortedLinkedServices {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
    $linkedServices = Get-AzDataFactoryV2LinkedService -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
    $LinkedServiceHasDependencies = @('HDInsightLinkedService', 'HDInsightOnDemandLinkedService', 'AzureBatchLinkedService')
    $Akv = 'AzureKeyVaultLinkedService'
    $HighOrderList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]
    $RegularList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]
    $AkvList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]

$linkedServices | ForEach-Object {
if ($_.Properties.GetType().Name -in $LinkedServiceHasDependencies) {
$HighOrderList.Add($_)
}
elseif ($_.Properties.GetType().Name -eq $Akv) {
$AkvList.Add($_)
}
else {
$RegularList.Add($_)
}
}

    $SortedList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]($HighOrderList.Count + $RegularList.Count + $AkvList.Count)
$SortedList.AddRange($HighOrderList)
$SortedList.AddRange($RegularList)
$SortedList.AddRange($AkvList)
$SortedList
}

$templateJson = Get-Content $armTemplate | ConvertFrom-Json


$resources = $templateJson.resources

#Triggers
Write-Host "Getting triggers"
$triggersInTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/triggers" }
$triggerNamesInTemplate = $triggersInTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}

$triggersDeployed = Get-SortedTriggers -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName

$triggersToStop = $triggersDeployed | Where-Object { $triggerNamesInTemplate -contains $_.Name } | ForEach-Object {
    New-Object PSObject -Property @{
        Name = $_.Name
        TriggerType = $_.Properties.GetType().Name
    }
}
$triggersToDelete = $triggersDeployed | Where-Object { $triggerNamesInTemplate -notcontains $_.Name } | ForEach-Object {
    New-Object PSObject -Property @{
        Name = $_.Name
        TriggerType = $_.Properties.GetType().Name
    }
}
$triggersToStart = $triggersInTemplate | Where-Object { $_.properties.runtimeState -eq "Started" -and ($_.properties.pipelines.Count -gt 0 -or $_.properties.pipeline.pipelineReference -ne $null)} | ForEach-Object {
    New-Object PSObject -Property @{
        Name = $_.name.Substring(37, $_.name.Length-40)
        TriggerType = $_.Properties.type
    }
}

if ($predeployment -eq $true) {


#Stop all triggers
Write-Host "Stopping deployed triggers`n"
$triggersToStop | ForEach-Object {
if ($_.TriggerType -eq "BlobEventsTrigger" -or $_.TriggerType -eq "CustomEventsTrigger") {
Write-Host "Unsubscribing" $_.Name "from events"
$status = Remove-AzDataFactoryV2TriggerSubscription -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Name $_.Name
while ($status.Status -ne "Disabled"){
Start-Sleep -s 15
$status = Get-AzDataFactoryV2TriggerSubscriptionStatus -ResourceGroupName $ResourceGroupName
-DataFactoryName $DataFactoryName -Name $_.Name
}
}
Write-Host "Stopping trigger" $_.Name
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
-Name $_.Name -Force
}
}
else {
#Deleted resources
#pipelines
Write-Host "Getting pipelines"
$pipelinesADF = Get-SortedPipelines -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$pipelinesTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/pipelines"
}
$pipelinesNames = $pipelinesTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$deletedpipelines = $pipelinesADF | Where-Object { $pipelinesNames -notcontains $_.Name }
#dataflows
$dataflowsADF = Get-AzDataFactoryV2DataFlow -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$dataflowsTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/dataflows"
}
$dataflowsNames = $dataflowsTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40) }
$deleteddataflow = $dataflowsADF | Where-Object { $dataflowsNames -notcontains $_.Name }
#datasets
Write-Host "Getting datasets"
$datasetsADF = Get-AzDataFactoryV2Dataset -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$datasetsTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/datasets" }
$datasetsNames = $datasetsTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40) }
$deleteddataset = $datasetsADF | Where-Object { $datasetsNames -notcontains $_.Name }
#linkedservices
Write-Host "Getting linked services"
$linkedservicesADF = Get-SortedLinkedServices -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$linkedservicesTemplate = $resources | Where-Object { $_.type -eq
"Microsoft.DataFactory/factories/linkedservices" }
$linkedservicesNames = $linkedservicesTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-
40)}
$deletedlinkedservices = $linkedservicesADF | Where-Object { $linkedservicesNames -notcontains $_.Name }
#Integrationruntimes
Write-Host "Getting integration runtimes"
$integrationruntimesADF = Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -
ResourceGroupName $ResourceGroupName
$integrationruntimesTemplate = $resources | Where-Object { $_.type -eq
"Microsoft.DataFactory/factories/integrationruntimes" }
    $integrationruntimesNames = $integrationruntimesTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
$deletedintegrationruntimes = $integrationruntimesADF | Where-Object { $integrationruntimesNames -
notcontains $_.Name }

#Delete resources
Write-Host "Deleting triggers"
$triggersToDelete | ForEach-Object {
Write-Host "Deleting trigger " $_.Name
$trig = Get-AzDataFactoryV2Trigger -name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName
if ($trig.RuntimeState -eq "Started") {
if ($_.TriggerType -eq "BlobEventsTrigger" -or $_.TriggerType -eq "CustomEventsTrigger") {
Write-Host "Unsubscribing trigger" $_.Name "from events"
$status = Remove-AzDataFactoryV2TriggerSubscription -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Name $_.Name
while ($status.Status -ne "Disabled"){
Start-Sleep -s 15
$status = Get-AzDataFactoryV2TriggerSubscriptionStatus -ResourceGroupName
$ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.Name
}
}
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Name $_.Name -Force
}
Remove-AzDataFactoryV2Trigger -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting pipelines"
$deletedpipelines | ForEach-Object {
Write-Host "Deleting pipeline " $_.Name
Remove-AzDataFactoryV2Pipeline -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting dataflows"
$deleteddataflow | ForEach-Object {
Write-Host "Deleting dataflow " $_.Name
Remove-AzDataFactoryV2DataFlow -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting datasets"
$deleteddataset | ForEach-Object {
Write-Host "Deleting dataset " $_.Name
Remove-AzDataFactoryV2Dataset -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting linked services"
$deletedlinkedservices | ForEach-Object {
Write-Host "Deleting Linked Service " $_.Name
Remove-AzDataFactoryV2LinkedService -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
Write-Host "Deleting integration runtimes"
$deletedintegrationruntimes | ForEach-Object {
Write-Host "Deleting integration runtime " $_.Name
Remove-AzDataFactoryV2IntegrationRuntime -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}

if ($deleteDeployment -eq $true) {


Write-Host "Deleting ARM deployment ... under resource group: " $ResourceGroupName
$deployments = Get-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName
$deploymentsToConsider = $deployments | Where { $_.DeploymentName -like "ArmTemplate_master*" -or
$_.DeploymentName -like "ArmTemplateForFactory*" } | Sort-Object -Property Timestamp -Descending
$deploymentName = $deploymentsToConsider[0].DeploymentName

Write-Host "Deployment to be deleted: " $deploymentName


$deploymentOperations = Get-AzResourceGroupDeploymentOperation -DeploymentName $deploymentName -
ResourceGroupName $ResourceGroupName
$deploymentsToDelete = $deploymentOperations | Where { $_.properties.targetResource.id -like
"*Microsoft.Resources/deployments*" }

$deploymentsToDelete | ForEach-Object {
Write-host "Deleting inner deployment: " $_.properties.targetResource.id
Remove-AzResourceGroupDeployment -Id $_.properties.targetResource.id
}
Write-Host "Deleting deployment: " $deploymentName
Remove-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName -Name $deploymentName
}

#Start active triggers - after cleanup efforts


Write-Host "Starting active triggers"
$triggersToStart | ForEach-Object {
if ($_.TriggerType -eq "BlobEventsTrigger" -or $_.TriggerType -eq "CustomEventsTrigger") {
Write-Host "Subscribing" $_.Name "to events"
$status = Add-AzDataFactoryV2TriggerSubscription -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Name $_.Name
while ($status.Status -ne "Enabled"){
Start-Sleep -s 15
$status = Get-AzDataFactoryV2TriggerSubscriptionStatus -ResourceGroupName $ResourceGroupName
-DataFactoryName $DataFactoryName -Name $_.Name
}
}
Write-Host "Starting trigger" $_.Name
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
-Name $_.Name -Force
}
}
Automated publishing for continuous integration and delivery
6/24/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Overview
Continuous integration is the practice of testing each change made to your codebase automatically and as early as possible. Continuous delivery follows the testing that happens during continuous integration and pushes changes to a staging or production system.
In Azure Data Factory, continuous integration and continuous delivery (CI/CD) means moving Data Factory
pipelines from one environment, such as development, test, and production, to another. Data Factory uses Azure
Resource Manager templates (ARM templates) to store the configuration of your various Data Factory entities,
such as pipelines, datasets, and data flows.
There are two suggested methods to promote a data factory to another environment:
Automated deployment using the integration of Data Factory with Azure Pipelines.
Manually uploading an ARM template by using Data Factory user experience integration with Azure Resource
Manager.
For more information, see Continuous integration and delivery in Azure Data Factory.
This article focuses on the continuous deployment improvements and the automated publish feature for CI/CD.

Continuous deployment improvements


The automated publish feature takes the Validate all and Export ARM template features from the Data
Factory user experience and makes the logic consumable via a publicly available npm package
@microsoft/azure-data-factory-utilities. For this reason, you can programmatically trigger these actions instead
of having to go to the Data Factory UI and select a button manually. This capability will give your CI/CD pipelines
a truer continuous integration experience.
Current CI/CD flow
1. Each user makes changes in their private branches.
2. Push to master isn't allowed. Users must create a pull request to make changes.
3. Users must load the Data Factory UI and select Publish to deploy changes to Data Factory and generate the
ARM templates in the publish branch.
4. The DevOps Release pipeline is configured to create a new release and deploy the ARM template each time a
new change is pushed to the publish branch.
Manual step
In the current CI/CD flow, the user experience is the intermediary to create the ARM template. As a result, a user
must go to the Data Factory UI and manually select Publish to start the ARM template generation and drop it in
the publish branch.
The new CI/CD flow
1. Each user makes changes in their private branches.
2. Push to master isn't allowed. Users must create a pull request to make changes.
3. The Azure DevOps pipeline build is triggered every time a new commit is made to master. It validates the
resources and generates an ARM template as an artifact if validation succeeds.
4. The DevOps Release pipeline is configured to create a new release and deploy the ARM template each time a
new build is available.
What changed?
We now have a build process that uses a DevOps build pipeline.
The build pipeline uses the ADFUtilities NPM package, which will validate all the resources and generate the
ARM templates. These templates can be single and linked.
The build pipeline is responsible for validating Data Factory resources and generating the ARM template
instead of the Data Factory UI (Publish button).
The DevOps release definition will now consume this new build pipeline instead of the Git artifact.

NOTE
You can continue to use the existing mechanism, which is the adf_publish branch, or you can use the new flow. Both
are supported.

Package overview
Two commands are currently available in the package:
Export ARM template
Validate
Export ARM template
Run npm run start export <rootFolder> <factoryId> [outputFolder] to export the ARM template by using the
resources of a given folder. This command also runs a validation check prior to generating the ARM template.
Here's an example:

npm run start export C:\DataFactories\DevDataFactory /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/DevDataFactory ArmTemplateOutput

RootFolder is a mandatory field that represents where the Data Factory resources are located.
FactoryId is a mandatory field that represents the Data Factory resource ID in the format
/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName> .
OutputFolder is an optional parameter that specifies the relative path to save the generated ARM template.

NOTE
The ARM template generated isn't published to the live version of the factory. Deployment should be done by using a
CI/CD pipeline.

Validate
Run npm run start validate <rootFolder> <factoryId> to validate all the resources of a given folder. Here's an
example:

npm run start validate C:\DataFactories\DevDataFactory /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/DevDataFactory

RootFolder is a mandatory field that represents where the Data Factory resources are located.
FactoryId is a mandatory field that represents the Data Factory resource ID in the format
/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName> .

Create an Azure pipeline


While npm packages can be consumed in various ways, one of the primary benefits is being consumed via
Azure Pipeline. On each merge into your collaboration branch, a pipeline can be triggered that first validates all
of the code and then exports the ARM template into a build artifact that can be consumed by a release pipeline.
How it differs from the current CI/CD process is that you will point your release pipeline at this artifact instead
of the existing adf_publish branch.
Follow these steps to get started:
1. Open an Azure DevOps project, and go to Pipelines . Select New Pipeline .

2. Select the repository where you want to save your pipeline YAML script. We recommend saving it in a
build folder in the same repository of your Data Factory resources. Ensure there's a package.json file in
the repository that contains the package name, as shown in the following example:

{
"scripts":{
"build":"node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
},
"dependencies":{
"@microsoft/azure-data-factory-utilities":"^0.1.5"
}
}

3. Select Starter pipeline . If you've uploaded or merged the YAML file, as shown in the following example,
you can also point directly at that and edit it.
# Sample YAML file to validate and export an ARM template into a build artifact
# Requires a package.json file located in the target repository

trigger:
- main #collaboration branch

pool:
  vmImage: 'ubuntu-latest'

steps:

# Installs Node and the npm packages saved in your package.json file in the build

- task: NodeTool@0
  inputs:
    versionSpec: '10.x'
  displayName: 'Install Node.js'

- task: Npm@1
  inputs:
    command: 'install'
    workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the package.json folder
    verbose: true
  displayName: 'Install npm package'

# Validates all of the Data Factory resources in the repository. You'll get the same validation errors as when "Validate All" is selected.
# Enter the appropriate subscription and name for the source factory.

- task: Npm@1
  inputs:
    command: 'custom'
    workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the package.json folder
    customCommand: 'run build validate $(Build.Repository.LocalPath) /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/yourFactoryName'
  displayName: 'Validate'

# Validate and then generate the ARM template into the destination folder, which is the same as selecting "Publish" from the UX.
# The ARM template generated isn't published to the live version of the factory. Deployment should be done by using a CI/CD pipeline.

- task: Npm@1
  inputs:
    command: 'custom'
    workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the package.json folder
    customCommand: 'run build export $(Build.Repository.LocalPath) /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/yourFactoryName "ArmTemplate"'
  displayName: 'Validate and Generate ARM template'

# Publish the artifact to be used as a source for a release pipeline.

- task: PublishPipelineArtifact@1
  inputs:
    targetPath: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>/ArmTemplate' #replace with the package.json folder
    artifact: 'ArmTemplates'
    publishLocation: 'pipeline'
4. Enter your YAML code. We recommend that you use the YAML file as a starting point.
5. Save and run. If you used the YAML, it gets triggered every time the main branch is updated.

Next steps
Learn more information about continuous integration and delivery in Data Factory: Continuous integration and
delivery in Azure Data Factory.
Azure Data Factory connector overview
6/1/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory supports the following data stores and formats via the Copy, Data Flow, Lookup, Get Metadata, and Delete activities. Click each data store to learn the supported capabilities and the corresponding configurations in detail.

Supported data stores


CATEGORY | DATA STORE | COPY ACTIVITY (SOURCE/SINK) | MAPPING DATA FLOW (SOURCE/SINK) | LOOKUP ACTIVITY | GET METADATA ACTIVITY/VALIDATION ACTIVITY | DELETE ACTIVITY

Azure Azure Blob ✓/✓ ✓/✓ ✓ ✓ ✓


Storage

Azure −/✓
Cognitive
Search Index

Azure ✓/✓ ✓/✓ ✓


Cosmos DB
(SQL API)

Azure ✓/✓
Cosmos DB's
API for
MongoDB

Azure Data ✓/✓ ✓


Explorer

Azure Data ✓/✓ ✓/✓ ✓ ✓ ✓


Lake Storage
Gen1

Azure Data ✓/✓ ✓/✓ ✓ ✓ ✓


Lake Storage
Gen2

Azure ✓/− ✓
Database for
MariaDB

Azure ✓/✓ ✓/✓ ✓


Database for
MySQL

Azure ✓/✓ ✓/✓ ✓


Database for
PostgreSQL

Azure ✓/✓ ✓/✓ Use delta ✓


Databricks format
Delta Lake

Azure File ✓/✓ ✓ ✓ ✓


Storage

Azure SQL ✓/✓ ✓/✓ ✓ ✓


Database

Azure SQL ✓/✓ ✓/✓ ✓ ✓


Managed
Instance

Azure ✓/✓ ✓/✓ ✓ ✓


Synapse
Analytics

Azure Table ✓/✓ ✓


Storage

Database Amazon ✓/− ✓


Redshift

DB2 ✓/− ✓

Drill ✓/− ✓

Google ✓/− ✓
BigQuery

Greenplum ✓/− ✓

HBase ✓/− ✓

Hive ✓/− ✓/− ✓

Apache ✓/− ✓
Impala

Informix ✓/✓ ✓

MariaDB ✓/− ✓

Microsoft ✓/✓ ✓
Access

MySQL ✓/− ✓

Netezza ✓/− ✓

Oracle ✓/✓ ✓

Phoenix ✓/− ✓

PostgreSQL ✓/− ✓

Presto ✓/− ✓
(Preview)

SAP Business ✓/− ✓


Warehouse
Open Hub

SAP Business ✓/− ✓


Warehouse
via MDX

SAP HANA ✓/✓ ✓

SAP Table ✓/− ✓

Snowflake ✓/✓ ✓/✓ ✓

Spark ✓/− ✓

SQL Server ✓/✓ ✓/✓ Use ✓ ✓


Managed VNET

Sybase ✓/− ✓

Teradata ✓/− ✓

Vertica ✓/− ✓

NoSQL Cassandra ✓/− ✓

Couchbase ✓/− ✓
(Preview)

MongoDB ✓/✓

MongoDB ✓/✓
Atlas

File Amazon S3 ✓/− ✓ ✓ ✓



Amazon S3 ✓/− ✓ ✓ ✓
Compatible
Storage

File System ✓/✓ ✓ ✓ ✓

FTP ✓/− ✓ ✓ ✓

Google Cloud ✓/− ✓ ✓ ✓


Storage

HDFS ✓/− ✓ ✓

Oracle Cloud ✓/− ✓ ✓ ✓


Storage

SFTP ✓/✓ ✓ ✓ ✓

Generic Generic HTTP ✓/− ✓


protocol

Generic ✓/− ✓
OData

Generic ✓/✓ ✓
ODBC

Generic REST ✓/✓

Ser vices Amazon ✓/− ✓


and apps Marketplace
Web Service

Concur ✓/− ✓
(Preview)

Dataverse ✓/✓ ✓

Dynamics ✓/✓ ✓
365

Dynamics AX ✓/− ✓

Dynamics ✓/✓ ✓
CRM

GitHub For Common


Data Model
entity reference

Google ✓/− ✓
AdWords

HubSpot ✓/− ✓
(Preview)

Jira ✓/− ✓

Magento ✓/− ✓
(Preview)

Marketo ✓/− ✓
(Preview)

Microsoft 365 ✓/−

Oracle Eloqua ✓/− ✓


(Preview)

Oracle ✓/− ✓
Responsys
(Preview)

Oracle Service ✓/− ✓


Cloud
(Preview)

PayPal ✓/− ✓
(Preview)

QuickBooks ✓/− ✓
(Preview)

Salesforce ✓/✓ ✓

Salesforce ✓/✓ ✓
Service Cloud

Salesforce ✓/− ✓
Marketing
Cloud

SAP Cloud for ✓/✓ ✓


Customer
(C4C)

SAP ECC ✓/− ✓

ServiceNow ✓/− ✓

SharePoint ✓/− ✓
Online List

Shopify ✓/− ✓
(Preview)

Square ✓/− ✓
(Preview)

Web Table ✓/− ✓


(HTML table)

Xero ✓/− ✓

Zoho ✓/− ✓
(Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a dependency
on preview connectors in your solution, please contact Azure support.

Integrate with more data stores


Azure Data Factory can reach a broader set of data stores than the list mentioned above. If you need to move data to or from a data store that is not in the Azure Data Factory built-in connector list, here are some extensible options:
For databases and data warehouses, you can usually find a corresponding ODBC driver, with which you can use the generic ODBC connector.
For SaaS applications:
If it provides RESTful APIs, you can use the generic REST connector.
If it has an OData feed, you can use the generic OData connector.
If it provides SOAP APIs, you can use the generic HTTP connector.
If it has an ODBC driver, you can use the generic ODBC connector.
For others, check if you can load data to or expose data as any ADF-supported data store (for example, Azure Blob/File/FTP/SFTP), and then let ADF pick it up from there. You can invoke a custom data loading mechanism via Azure Function, Custom activity, Databricks/HDInsight, Web activity, and so on.

Supported file formats


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Common Data Model format
Delimited text format
Delta format
Excel format
JSON format
ORC format
Parquet format
XML format

Next steps
Copy activity
Mapping Data Flow
Lookup Activity
Get Metadata Activity
Delete Activity
Copy data from Amazon Marketplace Web Service using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Marketplace
Web Service. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Amazon Marketplace Web Service connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Amazon Marketplace Web Service to any supported sink data store. For a list of data
stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any driver to use this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Marketplace Web Service connector.

Linked service properties


The following properties are supported for Amazon Marketplace Web Service linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AmazonMWS | Yes
endpoint | The endpoint of the Amazon MWS server (that is, mws.amazonservices.com) | Yes
marketplaceID | The Amazon Marketplace ID you want to retrieve data from. To retrieve data from multiple Marketplace IDs, separate them with a comma ( , ). (that is, A2EUQ1WTGCTBG2) | Yes
sellerID | The Amazon seller ID. | Yes
mwsAuthToken | The Amazon MWS authentication token. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
accessKeyId | The access key ID used to access data. | Yes
secretKey | The secret key used to access data. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No

Example:
{
"name": "AmazonMWSLinkedService",
"properties": {
"type": "AmazonMWS",
"typeProperties": {
"endpoint" : "mws.amazonservices.com",
"marketplaceID" : "A2EUQ1WTGCTBG2",
"sellerID" : "<sellerID>",
"mwsAuthToken": {
"type": "SecureString",
"value": "<mwsAuthToken>"
},
"accessKeyId" : "<accessKeyId>",
"secretKey": {
"type": "SecureString",
"value": "<secretKey>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Marketplace Web Service dataset.
To copy data from Amazon Marketplace Web Service, set the type property of the dataset to
AmazonMWSObject . The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: AmazonMWSObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "AmazonMWSDataset",
"properties": {
"type": "AmazonMWSObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<AmazonMWS linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon Marketplace Web Service source.
Amazon MWS as source
To copy data from Amazon Marketplace Web Service, set the source type in the copy activity to
AmazonMWSSource . The following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: AmazonMWSSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM Orders where Amazon_Order_Id = 'xx'". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromAmazonMWS",
"type": "Copy",
"inputs": [
{
"referenceName": "<AmazonMWS input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonMWSSource",
"query": "SELECT * FROM Orders where Amazon_Order_Id = 'xx'"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Redshift using Azure Data Factory
5/6/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Redshift. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Amazon Redshift connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Amazon Redshift connector supports retrieving data from Redshift using query or built-in
Redshift UNLOAD support.

TIP
To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift
UNLOAD through Amazon S3. See Use UNLOAD to copy data from Amazon Redshift section for details.

Prerequisites
If you are copying data to an on-premises data store using a Self-hosted Integration Runtime, grant the Integration Runtime (using the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the cluster for instructions.
If you are copying data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Redshift connector.
Linked service properties
The following properties are supported for Amazon Redshift linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AmazonRedshift | Yes
server | IP address or host name of the Amazon Redshift server. | Yes
port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default is 5439
database | Name of the Amazon Redshift database. | Yes
username | Name of user who has access to the database. | Yes
password | Password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "<server name>",
"database": "<database name>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Redshift dataset.
To copy data from Amazon Redshift, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: AmazonRedshiftTable. | Yes
schema | Name of the schema. | No (if "query" in activity source is specified)
table | Name of the table. | No (if "query" in activity source is specified)
tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. | No (if "query" in activity source is specified)

Example

{
"name": "AmazonRedshiftDataset",
"properties":
{
"type": "AmazonRedshiftTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Amazon Redshift linked service name>",
"type": "LinkedServiceReference"
}
}
}
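
If you reference a specific table from the dataset instead of using a query in the activity source, the schema and table names go under typeProperties. A minimal sketch with placeholder names:

{
    "name": "AmazonRedshiftDataset",
    "properties":
    {
        "type": "AmazonRedshiftTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}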

If you were using a RelationalTable typed dataset, it is still supported as-is, but you are encouraged to use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Amazon Redshift source.
Amazon Redshift as source
To copy data from Amazon Redshift, set the source type in the copy activity to AmazonRedshiftSource . The
following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: AmazonRedshiftSource. | Yes
query | Use the custom query to read data. For example: select * from MyTable. | No (if "tableName" in dataset is specified)
redshiftUnloadSettings | Property group when using Amazon Redshift UNLOAD. | No
s3LinkedServiceName | Refers to an Amazon S3 store to be used as an interim store, by specifying a linked service name of "AmazonS3" type. | Yes if using UNLOAD
bucketName | Indicate the S3 bucket to store the interim data. If not provided, the Data Factory service generates it automatically. | Yes if using UNLOAD

Example: Amazon Redshift source in copy activity using UNLOAD

"source": {
"type": "AmazonRedshiftSource",
"query": "<SQL query>",
"redshiftUnloadSettings": {
"s3LinkedServiceName": {
"referenceName": "<Amazon S3 linked service>",
"type": "LinkedServiceReference"
},
"bucketName": "bucketForUnload"
}
}
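
When UNLOAD is not needed, the source section reduces to the type and an optional query; a minimal sketch with a placeholder table name:

"source": {
    "type": "AmazonRedshiftSource",
    "query": "SELECT * FROM MyTable"
}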

Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.

Use UNLOAD to copy data from Amazon Redshift


UNLOAD is a mechanism provided by Amazon Redshift, which can unload the results of a query to one or more
files on Amazon Simple Storage Service (Amazon S3). It is the approach recommended by Amazon for copying large
data sets from Redshift.
Example: copy data from Amazon Redshift to Azure Synapse Analytics using UNLOAD, staged copy
and PolyBase
For this sample use case, the copy activity unloads data from Amazon Redshift to Amazon S3 as configured in
"redshiftUnloadSettings", then copies the data from Amazon S3 to Azure Blob storage as specified in "stagingSettings",
and finally uses PolyBase to load the data into Azure Synapse Analytics. All the interim formats are handled by the copy
activity properly.
"activities":[
{
"name": "CopyFromAmazonRedshiftToSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "AmazonRedshiftDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonRedshiftSource",
"query": "select * from MyTable",
"redshiftUnloadSettings": {
"s3LinkedServiceName": {
"referenceName": "AmazonS3LinkedService",
"type": "LinkedServiceReference"
},
"bucketName": "bucketForUnload"
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "AzureStorageLinkedService",
"path": "adfstagingcopydata"
},
"dataIntegrationUnits": 32
}
}
]

Data type mapping for Amazon Redshift


When copying data from Amazon Redshift, the following mappings are used from Amazon Redshift data types
to Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

AMAZON REDSHIFT DATA TYPE | DATA FACTORY INTERIM DATA TYPE
BIGINT | Int64
BOOLEAN | String
CHAR | String
DATE | DateTime
DECIMAL | Decimal
DOUBLE PRECISION | Double
INTEGER | Int32
REAL | Single
SMALLINT | Int16
TEXT | String
TIMESTAMP | DateTime
VARCHAR | String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Simple Storage Service by
using Azure Data Factory
5/14/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure
Data Factory, read the introductory article.

TIP
To learn more about the data migration scenario from Amazon S3 to Azure Storage, see Use Azure Data Factory to
migrate data from Amazon S3 to Azure Storage.

Supported capabilities
This Amazon S3 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Amazon S3 connector supports copying files as is or parsing files with the supported file
formats and compression codecs. You can also choose to preserve file metadata during copy. The connector
uses AWS Signature Version 4 to authenticate requests to S3.

TIP
If you want to copy data from any S3-compatible storage provider, see Amazon S3 Compatible Storage.

Required permissions
To copy data from Amazon S3, make sure you've been granted the following permissions for Amazon S3 object
operations: s3:GetObject and s3:GetObjectVersion .
If you use Data Factory UI to author, additional s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation
permissions are required for operations like testing connection to linked service and browsing from root. If you
don't want to grant these permissions, you can choose "Test connection to file path" or "Browse from specified
path" options from the UI.
For the full list of Amazon S3 permissions, see Specifying Permissions in a Policy on the AWS site.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3.

Linked service properties


The following properties are supported for an Amazon S3 linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AmazonS3. | Yes
authenticationType | Specify the authentication type used to connect to Amazon S3. You can choose to use access keys for an AWS Identity and Access Management (IAM) account, or temporary security credentials. Allowed values are: AccessKey (default) and TemporarySecurityCredentials. | No
accessKeyId | ID of the secret access key. | Yes
secretAccessKey | The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
sessionToken | Applicable when using temporary security credentials authentication. Learn how to request temporary security credentials from AWS. Note that AWS temporary credentials expire after between 15 minutes and 36 hours, based on settings. Make sure your credential is valid when the activity executes, especially for operationalized workloads; for example, you can refresh it periodically and store it in Azure Key Vault. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
serviceUrl | Specify the custom S3 endpoint https://<service url>. | No
connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. | No

Example: using access key authentication

{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: using temporar y security credential authentication

{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"authenticationType": "TemporarySecurityCredentials",
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"sessionToken": {
"type": "SecureString",
"value": "<session token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 under location settings in a format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under location in a dataset must be set to AmazonS3Location. | Yes
bucketName | The S3 bucket name. | Yes
folderPath | The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify that in the activity source settings. | No
fileName | The file name under the given bucket and folder path. If you want to use a wildcard to filter files, skip this setting and specify that in the activity source settings. | No
version | The version of the S3 object, if S3 versioning is enabled. If it's not specified, the latest version will be fetched. | No

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
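
If the dataset should point at a single object, and optionally a specific version of it, fileName and version can be added under location. A sketch using placeholder values (it assumes S3 versioning is enabled on the bucket):

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AmazonS3Location",
                "bucketName": "bucketname",
                "folderPath": "folder/subfolder",
                "fileName": "<file name>",
                "version": "<object version id>"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}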

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties that the Amazon S3 source supports.
Amazon S3 as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 under storeSettings settings in a format-based copy
source:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under storeSettings must be set to AmazonS3ReadSettings. | Yes

Locate the files to copy:

OPTION 1: static path | Copy from the given bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket or folder, additionally specify wildcardFileName as *. |
OPTION 2: S3 prefix - prefix | Prefix for the S3 key name under the given bucket configured in a dataset to filter source S3 files. S3 keys whose names start with bucket_in_dataset/this_prefix are selected. It utilizes S3's service-side filter, which provides better performance than a wildcard filter. When you use prefix and choose to copy to a file-based sink with preserving hierarchy, note that the sub-path after the last "/" in the prefix will be preserved. For example, if you have source bucket/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt. | No
OPTION 3: wildcard - wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in a dataset to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | No
OPTION 3: wildcard - wildcardFileName | The file name with wildcard characters under the given bucket and folder path (or wildcard folder path) to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | Yes
OPTION 4: a list of files - fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When you're using this option, do not specify a file name in the dataset. See more examples in File list examples. | No

Additional settings:

recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath. | No
deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is only valid in the binary files copy scenario. The default value: false. | No
modifiedDatetimeStart | Files are filtered based on the attribute: last modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to a UTC time zone in the format of "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected. This property doesn't apply when you configure fileListPath. | No
modifiedDatetimeEnd | Same as above. | No
enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true. | No
partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default: when you use the file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset; when you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard; when you use prefix, the partition root path is the sub-path before the last "/". For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27": if you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns, month and day, with values "08" and "27" respectively, in addition to the columns inside the files; if the partition root path is not specified, no extra column will be generated. | No
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
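
As a variation, the source could hypothetically filter objects with the S3 prefix option instead of wildcards; a minimal sketch of only the source section, with a placeholder prefix value:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "prefix": "folder/myprefix"
    }
}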

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

Source folder structure:
    bucket
        FolderA
            File1.csv
            File2.json
            Subfolder1
                File3.csv
                File4.json
                File5.csv
        AnotherFolderB
            File6.csv

BUCKET | KEY | RECURSIVE | RETRIEVED FILES
bucket | Folder*/* | false | FolderA/File1.csv, FolderA/File2.json, AnotherFolderB/File6.csv
bucket | Folder*/* | true | FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv, AnotherFolderB/File6.csv
bucket | Folder*/*.csv | false | FolderA/File1.csv, AnotherFolderB/File6.csv
bucket | Folder*/*.csv | true | FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv, AnotherFolderB/File6.csv

File list examples


This section describes the resulting behavior of using a file list path in a Copy activity source.
Assume that you have the following source folder structure and want to copy the files in bold:

Sample source structure:
    bucket
        FolderA
            File1.csv
            File2.json
            Subfolder1
                File3.csv
                File4.json
                File5.csv
        Metadata
            FileListToCopy.txt

Content in FileListToCopy.txt:
    File1.csv
    Subfolder1/File3.csv
    Subfolder1/File5.csv

Data Factory configuration:
    In dataset: Bucket: bucket; Folder path: FolderA
    In Copy activity source: File list path: bucket/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.
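
In the configuration above, the Copy activity source references the list file through the fileListPath property under storeSettings. A minimal sketch of just the source section, reusing the path from the table (the dataset and sink are omitted):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "fileListPath": "bucket/Metadata/FileListToCopy.txt"
    }
}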

Preserve metadata during copy


When you copy files from Amazon S3 to Azure Data Lake Storage Gen2 or Azure Blob storage, you can choose
to preserve the file metadata along with data. Learn more from Preserve metadata.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity.

Delete activity properties


To learn details about the properties, check Delete activity.

Legacy models
NOTE
The following models are still supported as is for backward compatibility. We suggest that you use the new model
mentioned earlier. The Data Factory authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AmazonS3Object. | Yes
bucketName | The S3 bucket name. The wildcard filter is not supported. | Yes for the Copy or Lookup activity, no for the GetMetadata activity
key | The name or wildcard filter of the S3 object key under the specified bucket. Applies only when the prefix property is not specified. The wildcard filter is supported for both the folder part and the file name part. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Example 1: "key": "rootfolder/subfolder/*.csv". Example 2: "key": "rootfolder/subfolder/???20180427.txt". See more examples in Folder and file filter examples. Use ^ to escape if your actual folder or file name has a wildcard or this escape character inside. | No
prefix | Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when the key property is not specified. | No
version | The version of the S3 object, if S3 versioning is enabled. If a version is not specified, the latest version will be fetched. | No
modifiedDatetimeStart | Files are filtered based on the attribute: last modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z". Be aware that enabling this setting will affect the overall performance of data movement when you want to filter huge amounts of files. The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected. | No
modifiedDatetimeEnd | Files are filtered based on the attribute: last modified. The behavior is the same as for modifiedDatetimeStart above. | No
format | If you want to copy files as is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, Orc format, and Parquet format sections. | No (only for binary copy scenario)
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are GZip, Deflate, BZip2, and ZipDeflate. Supported levels are Optimal and Fastest. | No
TIP
To copy all files under a folder, specify bucketName for the bucket and prefix for the folder part.
To copy a single file with a given name, specify bucketName for the bucket and key for the folder part plus file name.
To copy a subset of files under a folder, specify bucketName for the bucket and key for the folder part plus wildcard
filter.

Example: using prefix

{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3Object",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"prefix": "testFolder/test",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Example: using key and version (optional)


{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"key": "testFolder/testfile.csv.gz",
"version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy source model for the Copy activity


PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy activity source must be set to FileSystemSource. | Yes
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder will not be copied or created at the sink. Allowed values are true (default) and false. | No
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Amazon S3 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Amazon S3 Compatible Storage by
using Azure Data Factory
5/14/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3) Compatible Storage.
To learn about Azure Data Factory, read the introductory article.

Supported capabilities
This Amazon S3 Compatible Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Amazon S3 Compatible Storage connector supports copying files as is or parsing files with the
supported file formats and compression codecs. The connector uses AWS Signature Version 4 to authenticate
requests to S3. You can use this Amazon S3 Compatible Storage connector to copy data from any S3-compatible
storage provider. Specify the corresponding service URL in the linked service configuration.

Required permissions
To copy data from Amazon S3 Compatible Storage, make sure you've been granted the following permissions
for Amazon S3 object operations: s3:GetObject and s3:GetObjectVersion .
If you use Data Factory UI to author, additional s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation
permissions are required for operations like testing connection to linked service and browsing from root. If you
don't want to grant these permissions, you can choose "Test connection to file path" or "Browse from specified
path" options from the UI.
For the full list of Amazon S3 permissions, see Specifying Permissions in a Policy on the AWS site.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3 Compatible Storage.
Linked service properties
The following properties are supported for an Amazon S3 Compatible linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AmazonS3Compatible. | Yes
accessKeyId | ID of the secret access key. | Yes
secretAccessKey | The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
serviceUrl | Specify the custom S3 endpoint https://<service url>. | No
forcePathStyle | Indicates whether to use S3 path-style access instead of virtual hosted-style access. Allowed values are: false (default), true. Check each data store's documentation to see whether path-style access is needed. | No
connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. | No

Example:

{
"name": "AmazonS3CompatibleLinkedService",
"properties": {
"type": "AmazonS3Compatible",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
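
The example above relies on the default endpoint behavior. When your storage provider requires a custom endpoint or path-style addressing, add serviceUrl and forcePathStyle from the table above; a sketch with placeholder values:

{
    "name": "AmazonS3CompatibleLinkedService",
    "properties": {
        "type": "AmazonS3Compatible",
        "typeProperties": {
            "serviceUrl": "https://<service url>",
            "forcePathStyle": true,
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}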

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 Compatible under location settings in a format-based
dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under location in a dataset must be set to AmazonS3CompatibleLocation. | Yes
bucketName | The S3 Compatible Storage bucket name. | Yes
folderPath | The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify that in the activity source settings. | No
fileName | The file name under the given bucket and folder path. If you want to use a wildcard to filter files, skip this setting and specify that in the activity source settings. | No
version | The version of the S3 Compatible Storage object, if S3 Compatible Storage versioning is enabled. If it's not specified, the latest version will be fetched. | No

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 Compatible Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3CompatibleLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties that the Amazon S3 Compatible Storage source supports.
Amazon S3 Compatible Storage as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 Compatible Storage under storeSettings settings in a
format-based copy source:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under storeSettings must be set to AmazonS3CompatibleReadSettings. | Yes

Locate the files to copy:

OPTION 1: static path | Copy from the given bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket or folder, additionally specify wildcardFileName as *. |
OPTION 2: S3 Compatible Storage prefix - prefix | Prefix for the S3 Compatible Storage key name under the given bucket configured in a dataset to filter source S3 Compatible Storage files. S3 Compatible Storage keys whose names start with bucket_in_dataset/this_prefix are selected. It utilizes S3 Compatible Storage's service-side filter, which provides better performance than a wildcard filter. When you use prefix and choose to copy to a file-based sink with preserving hierarchy, note that the sub-path after the last "/" in the prefix will be preserved. For example, if you have source bucket/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt. | No
OPTION 3: wildcard - wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in a dataset to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | No
OPTION 3: wildcard - wildcardFileName | The file name with wildcard characters under the given bucket and folder path (or wildcard folder path) to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | Yes
OPTION 4: a list of files - fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When you're using this option, do not specify a file name in the dataset. See more examples in File list examples. | No

Additional settings:

recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath. | No
deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is only valid in the binary files copy scenario. The default value: false. | No
modifiedDatetimeStart | Files are filtered based on the attribute: last modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to a UTC time zone in the format of "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected. This property doesn't apply when you configure fileListPath. | No
modifiedDatetimeEnd | Same as above. | No
enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true. | No
partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default: when you use the file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset; when you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard; when you use prefix, the partition root path is the sub-path before the last "/". For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27": if you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns, month and day, with values "08" and "27" respectively, in addition to the columns inside the files; if the partition root path is not specified, no extra column will be generated. | No
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example:
"activities":[
{
"name": "CopyFromAmazonS3CompatibleStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3CompatibleReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
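
To filter by last-modified time instead of wildcards, the same storeSettings block can carry a datetime window, as described in the table above. A minimal sketch of only the storeSettings section, with placeholder timestamps:

"storeSettings": {
    "type": "AmazonS3CompatibleReadSettings",
    "recursive": true,
    "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
    "modifiedDatetimeEnd": "2018-12-01T06:00:00Z"
}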

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

Source folder structure:
    bucket
        FolderA
            File1.csv
            File2.json
            Subfolder1
                File3.csv
                File4.json
                File5.csv
        AnotherFolderB
            File6.csv

BUCKET | KEY | RECURSIVE | RETRIEVED FILES
bucket | Folder*/* | false | FolderA/File1.csv, FolderA/File2.json, AnotherFolderB/File6.csv
bucket | Folder*/* | true | FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv, AnotherFolderB/File6.csv
bucket | Folder*/*.csv | false | FolderA/File1.csv, AnotherFolderB/File6.csv
bucket | Folder*/*.csv | true | FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv, AnotherFolderB/File6.csv

File list examples


This section describes the resulting behavior of using a file list path in a Copy activity source.
Assume that you have the following source folder structure and want to copy the files in bold:

Sample source structure:
    bucket
        FolderA
            File1.csv
            File2.json
            Subfolder1
                File3.csv
                File4.json
                File5.csv
        Metadata
            FileListToCopy.txt

Content in FileListToCopy.txt:
    File1.csv
    Subfolder1/File3.csv
    Subfolder1/File5.csv

Data Factory configuration:
    In dataset: Bucket: bucket; Folder path: FolderA
    In Copy activity source: File list path: bucket/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity.

Delete activity properties


To learn details about the properties, check Delete activity.

Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Avro format in Azure Data Factory
5/14/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse Avro files or write data into Avro format.
Avro format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob,
Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Avro dataset.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to Avro. | Yes
location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. | Yes
avroCompressionCodec | The compression codec to use when writing to Avro files. When reading from Avro files, Data Factory automatically determines the compression codec based on the file metadata. Supported types are "none" (default), "deflate", "snappy". Note that currently the Copy activity doesn't support Snappy when reading/writing Avro files. | No

NOTE
White space in column names is not supported for Avro files.

Below is an example of an Avro dataset on Azure Blob Storage:


{
"name": "AvroDataset",
"properties": {
"type": "Avro",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"avroCompressionCodec": "snappy"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Avro source and sink.
Avro as source
The following properties are supported in the copy activity source section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to AvroSource. | Yes
storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. | No

Avro as sink
The following properties are supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to AvroSink. | Yes
formatSettings | A group of properties. Refer to the Avro write settings table below. | No
storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See details in the connector article -> Copy activity properties section. | No

Supported Avro write settings under formatSettings:

PROPERTY | DESCRIPTION | REQUIRED
type | The type of formatSettings must be set to AvroWriteSettings. | Yes
maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the max rows per file. | No
fileNamePrefix | Applicable when maxRowsPerFile is configured. Specify the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension>. If not specified, the file name prefix will be auto generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No
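
For illustration, a copy activity sink section that writes Avro files to Azure Blob storage in chunks of at most one million rows might look like the following sketch; the maxRowsPerFile and fileNamePrefix values are placeholders, and the write-settings type shown is the one for Azure Blob storage:

"sink": {
    "type": "AvroSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "AvroWriteSettings",
        "maxRowsPerFile": 1000000,
        "fileNamePrefix": "output"
    }
}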

Mapping data flow properties


In mapping data flows, you can read and write to Avro format in the following data stores: Azure Blob Storage,
Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Source properties
The table below lists the properties supported by an Avro source. You can edit these properties in the Source
options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths
Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | no | String | partitionRootPath
List of files | Whether your source is pointing to a text file that lists files to process. | no | true or false | fileList
Column to store file name | Create a new column with the source file name and path. | no | String | rowUrlColumn
After completion | Delete or move the files after processing. File path starts from the container root. | no | Delete: true or false; Move: ['<from>', '<to>'] | purgeFiles, moveFiles
Filter by last modified | Choose to filter files based upon when they were last altered. | no | Timestamp | modifiedAfter, modifiedBefore
Allow no files found | If true, an error is not thrown if no files are found. | no | true or false | ignoreNoFilesFound

Sink properties
The table below lists the properties supported by an Avro sink. You can edit these properties in the Settings tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Clear the folder | If the destination folder is cleared prior to write. | no | true or false | truncate
File name option | The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid>. | no | Pattern: String; Per partition: String[]; As data in column: String; Output to single file: ['<fileName>'] | filePattern; partitionFileNames; rowUrlColumn; partitionFileNames
Quote all | Enclose all values in quotes. | no | true or false | quoteAll

Data type support


Copy activity
Avro complex data types (records, enums, arrays, maps, unions, and fixed) are not supported in the Copy activity.
Data flows
When working with Avro files in data flows, you can read and write complex data types, but be sure to clear the
physical schema from the dataset first. In data flows, you can set your logical projection and derive columns that
are complex structures, then auto-map those fields to an Avro file.

Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Copy and transform data in Azure Blob storage by
using Azure Data Factory
7/15/2021 • 32 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Blob
storage. It also describes how to use the Data Flow activity to transform data in Azure Blob storage. To learn
about Azure Data Factory, read the introductory article.

TIP
To learn about a migration scenario for a data lake or a data warehouse, see Use Azure Data Factory to migrate data from
your data lake or data warehouse to Azure.

Supported capabilities
This Azure Blob storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
For the Copy activity, this Blob storage connector supports:
Copying blobs to and from general-purpose Azure storage accounts and hot/cool blob storage.
Copying blobs by using an account key, a service shared access signature (SAS), a service principal, or
managed identities for Azure resource authentications.
Copying blobs from block, append, or page blobs and copying data to only block blobs.
Copying blobs as is, or parsing or generating blobs with supported file formats and compression codecs.
Preserving file metadata during copy.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Blob storage.
Linked service properties
This Blob storage connector supports the following authentication types. See the corresponding sections for
details.
Account key authentication
Shared access signature authentication
Service principal authentication
Managed identities for Azure resource authentication

NOTE
If you want to use the public Azure integration runtime to connect to your Blob storage by leveraging the Allow trusted
Microsoft services to access this storage account option enabled on the Azure Storage firewall, you must use
managed identity authentication.
When you use the PolyBase or COPY statement to load data into Azure Synapse Analytics, if your source or staging Blob
storage is configured with an Azure Virtual Network endpoint, you must use managed identity authentication as
required by Synapse. See the Managed identity authentication section for more configuration prerequisites.

NOTE
Azure HDInsight and Azure Machine Learning activities only support authentication that uses Azure Blob storage account
keys.

Account key authentication


Data Factory supports the following properties for storage account key authentication:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AzureBlobStorage (suggested) or AzureStorage (see the following notes). | Yes
connectionString | Specify the information needed to connect to Storage for the connectionString property. You can also put the account key in Azure Key Vault and pull the accountKey configuration out of the connection string. For more information, see the following samples and the Store credentials in Azure Key Vault article. | Yes
connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. | No
NOTE
A secondary Blob service endpoint is not supported when you're using account key authentication. You can use other
authentication types.

NOTE
If you're using the AzureStorage type linked service, it's still supported as is. But we suggest that you use the new
AzureBlobStorage linked service type going forward.

Example:

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store the account key in Azure Key Vault

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Shared access signature authentication


A shared access signature provides delegated access to resources in your storage account. You can use a shared
access signature to grant a client limited permissions to objects in your storage account for a specified time.
You don't have to share your account access keys. The shared access signature is a URI that encompasses in its
query parameters all the information necessary for authenticated access to a storage resource. To access storage
resources with the shared access signature, the client only needs to pass in the shared access signature to the
appropriate constructor or method.
For more information about shared access signatures, see Shared access signatures: Understand the shared
access signature model.

NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about shared access signatures, see Grant limited access to Azure Storage resources using shared access
signatures.
In later dataset configurations, the folder path is the absolute path starting from the container level. You need to
configure one aligned with the path in your SAS URI.

Data Factory supports the following properties for using shared access signature authentication:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AzureBlobStorage (suggested) or AzureStorage (see the following note). | Yes
sasUri | Specify the shared access signature URI to the Storage resources such as blob or container. Mark this field as SecureString to store it securely in Data Factory. You can also put the SAS token in Azure Key Vault to use auto-rotation and remove the token portion. For more information, see the following samples and Store credentials in Azure Key Vault. | Yes
connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. | No

NOTE
If you're using the AzureStorage type linked service, it's still supported as is. But we suggest that you use the new
AzureBlobStorage linked service type going forward.

Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<accountname>.blob.core.windows.net/?sv=<storage version>&st=<start time>&se=<expire time>&sr=
<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store the SAS token in Azure Key Vault

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<accountname>.blob.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName with value of SAS token e.g. ?sv=<storage version>&st=<start
time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right container or blob based on the need. A shared access signature URI to
a blob allows Data Factory to access that particular blob. A shared access signature URI to a Blob storage
container allows Data Factory to iterate through blobs in that container. To provide access to more or fewer
objects later, or to update the shared access signature URI, remember to update the linked service with the
new URI.
Service principal authentication
For general information about Azure Storage service principal authentication, see Authenticate access to Azure
Storage using Azure Active Directory.
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of these values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission in Azure Blob storage. For more information on the roles,
see Use the Azure portal to assign an Azure role for access to blob and queue data.
As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.
These properties are supported for an Azure Blob storage linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureBlobStorage .

serviceEndpoint Specify the Azure Blob storage service Yes


endpoint with the pattern of
https://<accountName>.blob.core.windows.net/
.

accountKind Specify the kind of your storage No


account. Allowed values are: Storage
(general purpose v1), StorageV2
(general purpose v2), BlobStorage , or
BlockBlobStorage .

When you use the Azure Blob linked service in data flow, managed identity or service principal authentication is not supported when the account kind is empty or "Storage". Specify the proper account kind, choose a different authentication method, or upgrade your storage account to general purpose v2.

servicePrincipalId Specify the application's client ID. Yes

servicePrincipalKey Specify the application's key. Mark this Yes


field as SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

tenant Specify the tenant information (domain Yes


name or tenant ID) under which your
application resides. Retrieve it by
hovering over the upper-right corner
of the Azure portal.

azureCloudType For service principal authentication, No


specify the type of Azure cloud
environment, to which your Azure
Active Directory application is
registered.
Allowed values are AzurePublic,
AzureChina , AzureUsGovernment ,
and AzureGermany . By default, the
data factory's cloud environment is
used.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or the
self-hosted integration runtime (if your
data store is in a private network). If
this property isn't specified, the service
uses the default Azure integration
runtime.

NOTE
If your blob account enables soft delete, service principal authentication is not supported in Data Flow.
If you access the blob storage through a private endpoint using Data Flow, note that when service principal authentication is used, Data Flow connects to the ADLS Gen2 endpoint instead of the Blob endpoint. Make sure you create the corresponding private endpoint in ADF to enable access.

NOTE
Service principal authentication is supported only by the "AzureBlobStorage" type linked service, not the previous
"AzureStorage" type linked service.

Example:

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"accountKind": "StorageV2",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resource authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Blob storage authentication, which is similar to using your
own service principal. It allows this designated factory to access and copy data from or to Blob storage.
For general information about Azure Storage authentication, see Authenticate access to Azure Storage using
Azure Active Directory. To use managed identities for Azure resource authentication, follow these steps:
1. Retrieve Data Factory managed identity information by copying the value of the managed identity object
ID generated along with your factory.
2. Grant the managed identity permission in Azure Blob storage. For more information on the roles, see Use
the Azure portal to assign an Azure role for access to blob and queue data.
As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.

IMPORTANT
If you use PolyBase or COPY statement to load data from Blob storage (as a source or as staging) into Azure Synapse
Analytics, when you use managed identity authentication for Blob storage, make sure you also follow steps 1 to 3 in this
guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your
server. Data Factory handles the rest. If you configure Blob storage with an Azure Virtual Network endpoint, you also
need to have Allow trusted Microsoft services to access this storage account turned on under the Azure Storage
account Firewalls and Virtual networks settings menu, as required by Synapse.

These properties are supported for an Azure Blob storage linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureBlobStorage .

serviceEndpoint Specify the Azure Blob storage service Yes


endpoint with the pattern of
https://<accountName>.blob.core.windows.net/
.

accountKind Specify the kind of your storage No


account. Allowed values are: Storage
(general purpose v1), StorageV2
(general purpose v2), BlobStorage , or
BlockBlobStorage .

When you use the Azure Blob linked service in data flow, managed identity or service principal authentication is not supported when the account kind is empty or "Storage". Specify the proper account kind, choose a different authentication method, or upgrade your storage account to general purpose v2.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or the
self-hosted integration runtime (if your
data store is in a private network). If
this property isn't specified, the service
uses the default Azure integration
runtime.

NOTE
If your blob account enables soft delete, managed identity authentication is not supported in Data Flow.
If you access the blob storage through a private endpoint using Data Flow, note that when managed identity authentication is used, Data Flow connects to the ADLS Gen2 endpoint instead of the Blob endpoint. Make sure you create the corresponding private endpoint in ADF to enable access.

NOTE
Managed identities for Azure resource authentication are supported only by the "AzureBlobStorage" type linked service,
not the previous "AzureStorage" type linked service.

Example:

{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"accountKind": "StorageV2"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Blob storage under location settings in a format-based
dataset:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the location in Yes


the dataset must be set to
AzureBlobStorageLocation .

container The blob container. Yes

folderPath The path to the folder under the given No


container. If you want to use a wildcard
to filter the folder, skip this setting and
specify that in activity source settings.

fileName The file name under the given No


container and folder path. If you want
to use wildcard to filter files, skip this
setting and specify that in activity
source settings.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties that the Blob storage source and sink support.
Blob storage as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Blob storage under storeSettings settings in a format-based
copy source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureBlobStorageReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given container or


folder/file path specified in the dataset.
If you want to copy all blobs from a
container or folder, additionally specify
wildcardFileName as * .

OPTION 2: blob prefix Prefix for the blob name under the No
- prefix given container configured in a dataset
to filter source blobs. Blobs whose
names start with
container_in_dataset/this_prefix
are selected. It utilizes the service-side
filter for Blob storage, which provides
better performance than a wildcard
filter.

When you use prefix and choose to


copy to file-based sink with preserving
hierarchy, note the sub-path after the
last "/" in prefix will be preserved. For
example, you have source
container/folder/subfolder/file.txt
, and configure prefix as folder/sub ,
then the preserved file path is
subfolder/file.txt .

OPTION 3: wildcard The folder path with wildcard No


- wildcardFolderPath characters under the given container
configured in a dataset to filter source
folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your folder name has
wildcard or this escape character
inside.
See more examples in Folder and file
filter examples.

OPTION 3: wildcard The file name with wildcard characters Yes


- wildcardFileName under the given container and folder
path (or wildcard folder path) to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your file name has a
wildcard or this escape character
inside. See more examples in Folder
and file filter examples.

OPTION 4: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When you're using this option, do not
specify a file name in the dataset. See
more examples in File list examples.

Additional settings:

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files are filtered based on the attribute: No


last modified.
The files will be selected if their last
modified time is within the time range
between modifiedDatetimeStart
and modifiedDatetimeEnd . The time
is applied to a UTC time zone in the
format of "2018-12-01T05:00:00Z".
The properties can be NULL , which
means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL , the
files whose last modified attribute is
greater than or equal to the datetime
value will be selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL , the files whose last modified
attribute is less than the datetime
value will be selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.
- When you use prefix, partition root
path is sub-path before the last "/".

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

NOTE
For Parquet/delimited text format, the BlobSource type for the Copy activity source mentioned in the next section is still
supported as is for backward compatibility. We suggest that you use the new model going forward; the Data Factory authoring UI has already switched to generating these new types.

Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
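
The example above uses the OPTION 3 wildcard settings. As a rough sketch of OPTION 2 (blob prefix) combined with the last-modified filter described in the table, the source section could instead look like the following; the prefix value and the time window are placeholders, and the container comes from the referenced dataset:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "prefix": "folder/sub",
        "modifiedDatetimeStart": "2021-01-01T00:00:00Z",
        "modifiedDatetimeEnd": "2021-01-02T00:00:00Z"
    }
}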

NOTE
The $logs container, which is automatically created when Storage Analytics is enabled for a storage account, isn't shown
when a container listing operation is performed via the Data Factory UI. The file path must be provided directly for Data
Factory to consume files from the $logs container.

Blob storage as a sink type


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for Azure Blob storage under storeSettings settings in a format-based
copy sink:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureBlobStorageWriteSettings .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of the
source file to the source folder is
identical to the relative path of the
target file to the target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
or blob name is specified, the merged
file name is the specified name.
Otherwise, it's an autogenerated file
name.

blockSizeInMB Specify the block size, in megabytes, No


used to write data to block blobs.
Learn more about Block Blobs.
Allowed value is between 4 MB and
100 MB.
By default, Data Factory automatically
determines the block size based on
your source store type and data. For
nonbinary copy into Blob storage, the
default block size is 100 MB so it can
fit in (at most) 4.95 TB of data. It might
be not optimal when your data is not
large, especially when you use the self-
hosted integration runtime with poor
network connections that result in
operation timeout or performance
issues. You can explicitly specify a block
size, while ensuring that
blockSizeInMB*50000 is big enough
to store the data. Otherwise, the Copy
activity run will fail.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

metadata Set custom metadata when copy to No


sink. Each object under the metadata
array represents an extra column. The
name defines the metadata key name,
and the value indicates the data
value of that key. If preserve attributes
feature is used, the specified metadata
will union/overwrite with the source
file metadata.

Allowed data values are:


- $$LASTMODIFIED : a reserved
variable indicates to store the source
files' last modified time. Apply to file-
based source with binary format only.
- Expression
- Static value

Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobStorageWriteSettings",
"copyBehavior": "PreserveHierarchy",
"metadata": [
{
"name": "testKey1",
"value": "value1"
},
{
"name": "testKey2",
"value": "value2"
},
{
"name": "lastModifiedKey",
"value": "$$LASTMODIFIED"
}
]
}
}
}
}
]
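
If you need to control the block size explicitly, for example when writing over a slow connection through a self-hosted integration runtime, the sink storeSettings can carry blockSizeInMB as sketched below. The 8-MB value is only illustrative; make sure blockSizeInMB*50000 is still big enough to hold your data:

"storeSettings": {
    "type": "AzureBlobStorageWriteSettings",
    "copyBehavior": "PreserveHierarchy",
    "blockSizeInMB": 8
}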

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH    FILENAME    RECURSIVE    SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

container/Folder* (empty, use default) false container


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

container/Folder* (empty, use default) true container


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

container/Folder* *.csv false container


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

container/Folder* *.csv true container


FolderA
File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

File list examples


This section describes the resulting behavior of using a file list path in the Copy activity source.
Assume that you have the following source folder structure and want to copy the files in bold:

SAMPLE SOURCE STRUCTURE    CONTENT IN FILELISTTOCOPY.TXT    DATA FACTORY CONFIGURATION

container File1.csv In dataset:


FolderA Subfolder1/File3.csv - Container: container
File1.csv Subfolder1/File5.csv - Folder path: FolderA
File2.json
Subfolder1 In Copy activity source:
File3.csv - File list path:
File4.json container/Metadata/FileListToCopy.txt
File5.csv
Metadata The file list path points to a text file in
FileListToCopy.txt the same data store that includes a list
of files you want to copy, one file per
line, with the relative path to the path
configured in the dataset.
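
A sketch of the Copy activity source settings that match the configuration described in this table follows. The DelimitedTextSource type is an assumption and depends on your dataset format; the container and folder path come from the dataset itself:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "fileListPath": "container/Metadata/FileListToCopy.txt"
    }
}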

Some recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

RECURSIVE    COPYBEHAVIOR    SOURCE FOLDER STRUCTURE    RESULTING TARGET

true preserveHierarchy Folder1 The target folder, Folder1, is


File1 created with the same
File2 structure as the source:
Subfolder1
File3 Folder1
File4 File1
File5 File2
Subfolder1
File3
File4
File5

true flattenHierarchy Folder1 The target folder, Folder1, is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2
autogenerated name for
File3
autogenerated name for
File4
autogenerated name for
File5

true mergeFiles Folder1 The target folder, Folder1, is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 + File3 +
File5 File4 + File5 contents are
merged into one file with an
autogenerated file name.

false preserveHierarchy Folder1 The target folder, Folder1, is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1
File5 File2

Subfolder1 with File3, File4,


and File5 is not picked up.

false flattenHierarchy Folder1 The target folder, Folder1, is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2

Subfolder1 with File3, File4,


and File5 is not picked up.

false mergeFiles Folder1 The target folder, Folder1, is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 contents are
File5 merged into one file with an
autogenerated file name.
autogenerated name for
File1

Subfolder1 with File3, File4,


and File5 is not picked up.

Preserving metadata during copy


When you copy files from Amazon S3, Azure Blob storage, or Azure Data Lake Storage Gen2 to Azure Data Lake
Storage Gen2 or Azure Blob storage, you can choose to preserve the file metadata along with data. Learn more
from Preserve metadata.

Mapping data flow properties


When you're transforming data in mapping data flows, you can read and write files from Azure Blob storage in
the following formats:
Avro
Delimited text
Delta
Excel
JSON
Parquet
Format specific settings are located in the documentation for that format. For more information, see Source
transformation in mapping data flow and Sink transformation in mapping data flow.
Source transformation
In source transformation, you can read from a container, folder, or individual file in Azure Blob storage. Use the
Source options tab to manage how the files are read.
Wildcard paths: Using a wildcard pattern will instruct Data Factory to loop through each matching folder and
file in a single source transformation. This is an effective way to process multiple files within a single flow. Add
multiple wildcard matching patterns with the plus sign that appears when you hover over your existing wildcard
pattern.
From your source container, choose a series of files that match a pattern. Only a container can be specified in the
dataset. Your wildcard path must therefore also include your folder path from the root folder.
Wildcard examples:
* Represents any set of characters.
** Represents recursive directory nesting.
? Replaces one character.
[] Matches one or more characters in the brackets.
/data/sales/**/*.csv Gets all .csv files under /data/sales.
/data/sales/20??/**/ Gets all files recursively under year folders whose names start with 20.
/data/sales/*/*/*.csv Gets .csv files two levels under /data/sales.
/data/sales/2004/*/12/[XY]1?.csv Gets all .csv files in December 2004 starting with X or Y prefixed by a
two-digit number.
Partition root path: If you have partitioned folders in your file source with a key=value format (for example,
year=2019 ), then you can assign the top level of that partition folder tree to a column name in your data flow's
data stream.
First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you want to read.
Use the Partition root path setting to define what the top level of the folder structure is. When you view the
contents of your data via a data preview, you'll see that Data Factory will add the resolved partitions found in
each of your folder levels.

List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with wildcard, your syntax will look like this:
/data/sales/20??/**/*.csv

You can specify "from" as:


/data/sales

And you can specify "to" as:


/backup/priorSales

In this case, all files that were sourced under /data/sales are moved to /backup/priorSales .
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.

Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All datetimes are in UTC.
Sink properties
In the sink transformation, you can write to either a container or a folder in Azure Blob storage. Use the
Settings tab to manage how the files get written.

Clear the folder: Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name options are:
Default: Allow Spark to name files based on PART defaults.
Pattern: Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will create loans1.csv, loans2.csv, and so on.
Per partition: Enter one file name per partition.
As data in column: Set the output file to the value of a column. The path is relative to the dataset container, not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file: Combine the partitioned output files into a single named file. The path is relative to the dataset folder. Be aware that the merge operation can possibly fail based on node size. We don't recommend this option for large datasets.
Quote all: Determines whether to enclose all values in quotation marks.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity.

Delete activity properties


To learn details about the properties, check Delete activity.

Legacy models
NOTE
The following models are still supported as is for backward compatibility. We suggest that you use the new model
mentioned earlier. The Data Factory authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset Yes


must be set to AzureBlob .

folderPath Path to the container and folder in Yes for the Copy or Lookup activity, No
Blob storage. for the GetMetadata activity

A wildcard filter is supported for the


path, excluding container name.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your folder name has a
wildcard or this escape character
inside.

An example is:
myblobcontainer/myblobfolder/ .
See more examples in Folder and file
filter examples.

fileName Name or wildcard filter for the blobs No


under the specified folderPath
value. If you don't specify a value for
this property, the dataset points to all
blobs in the folder.

For the filter, allowed wildcards are: *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your file name has
a wildcard or this escape character
inside.

When fileName isn't specified for an


output dataset and
preserveHierarchy isn't specified in
the activity sink, the Copy activity
automatically generates the blob name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if configured].
[compression if configured]". For
example: "Data.0a405f8a-93ff-4c6f-
b3be-f69616f1df7a.txt.gz".

If you copy from a tabular source by


using a table name instead of a query,
the name pattern is
[table name].[format].
[compression if configured]
. For example: "MyTable.csv".

modifiedDatetimeStart Files are filtered based on the attribute: No


last modified. The files will be selected
if their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

Be aware that enabling this setting will


affect the overall performance of data
movement when you want to filter
huge amounts of files.

The properties can be NULL , which


means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL , the
files whose last modified attribute is
greater than or equal to the datetime
value will be selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL , the files whose last modified
attribute is less than the datetime
value will be selected.

modifiedDatetimeEnd Files are filtered based on the attribute: No


last modified. The files will be selected
if their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

Be aware that enabling this setting will


affect the overall performance of data
movement when you want to filter
huge amounts of files.

The properties can be NULL , which


means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL , the
files whose last modified attribute is
greater than or equal to the datetime
value will be selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL , the files whose last modified
attribute is less than the datetime
value will be selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both the input and
output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat , JsonFormat ,
AvroFormat , OrcFormat , and
ParquetFormat . Set the type
property under format to one of
these values. For more information,
see the Text format, JSON format, Avro
format, Orc format, and Parquet
format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are Optimal and
Fastest .

TIP
To copy all blobs under a folder, specify folderPath only.
To copy a single blob with a given name, specify folderPath for the folder part and fileName for the file name.
To copy a subset of blobs under a folder, specify folderPath for the folder part and fileName with a wildcard filter.

Example:
{
"name": "AzureBlobDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "<Azure Blob storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy source model for the Copy activity


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy Yes


activity source must be set to
BlobSource .

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and
the sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Blob input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Legacy sink model for the Copy activity


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy Yes


activity sink must be set to BlobSink .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of
source file to source folder is identical
to the relative path of target file to
target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
or blob name is specified, the merged
file name is the specified name.
Otherwise, it's an autogenerated file
name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.
Example:

"activities":[
{
"name": "CopyToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Blob output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Next steps
For a list of data stores that the Copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Copy data to an Azure Cognitive Search index
using Azure Data Factory
5/6/2021

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data into an Azure Cognitive Search
index. It builds on the copy activity overview article that presents a general overview of the Copy activity.

Supported capabilities
You can copy data from any supported source data store into a search index. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Cognitive Search connector.

Linked service properties


The following properties are supported for Azure Cognitive Search linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureSearch

url URL for the search service. Yes

key Admin key for the search service. Mark Yes


this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

IMPORTANT
When copying data from a cloud data store into a search index, in the Azure Cognitive Search linked service, you need to reference an Azure Integration Runtime with an explicit region in connectVia. Set the region to the one where your search service resides. Learn more from Azure Integration Runtime.

Example:

{
"name": "AzureSearchLinkedService",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://<service>.search.windows.net",
"key": {
"type": "SecureString",
"value": "<AdminKey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
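
The admin key can also reference a secret stored in Azure Key Vault instead of an inline SecureString, following the same Azure Key Vault reference pattern shown for other connectors in this article. This is a sketch; the secret name is a placeholder:

{
    "name": "AzureSearchLinkedService",
    "properties": {
        "type": "AzureSearch",
        "typeProperties": {
            "url": "https://<service>.search.windows.net",
            "key": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name that holds the admin key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}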

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Cognitive Search dataset.
To copy data into Azure Cognitive Search, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: AzureSearchIndex

indexName Name of the search index. Data Yes


Factory does not create the index. The
index must exist in Azure Cognitive
Search.

Example:
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"typeProperties" : {
"indexName": "products"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Cognitive Search linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Cognitive Search source.
Azure Cognitive Search as sink
To copy data into Azure Cognitive Search, set the sink type in the copy activity to AzureSearchIndexSink .
The following properties are supported in the copy activity sink section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


sink must be set to:
AzureSearchIndexSink

writeBehavior Specifies whether to merge or replace No


when a document already exists in the
index. See the WriteBehavior property.

Allowed values are: Merge (default),


and Upload .

writeBatchSize Uploads data into the search index No


when the buffer size reaches
writeBatchSize. See the WriteBatchSize
property for details.

Allowed values are: integer 1 to 1,000;


default is 1000.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

WriteBehavior property
AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key
already exists in the search index, Azure Cognitive Search updates the existing document rather than throwing a
conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using AzureSearch SDK):
Merge : combine all the columns in the new document with the existing one. For columns with null value in
the new document, the value in the existing one is preserved.
Upload : The new document replaces the existing one. For columns not specified in the new document, the
value is set to null whether there is a non-null value in the existing document or not.
The default behavior is Merge .
WriteBatchSize property
Azure Cognitive Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions.
An action handles one document to perform the upload/merge operation.
Example:

"activities":[
{
"name": "CopyToAzureSearch",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Cognitive Search output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSearchIndexSink",
"writeBehavior": "Merge"
}
}
}
]
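
If you prefer full document replacement and want to tune the batch size, the sink section might instead look like the following sketch; the batch size of 500 is illustrative and must stay within the allowed range of 1 to 1,000:

"sink": {
    "type": "AzureSearchIndexSink",
    "writeBehavior": "Upload",
    "writeBatchSize": 500
}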

Data type support


The following table specifies whether an Azure Cognitive Search data type is supported or not.

AZURE COGNITIVE SEARCH DATA TYPE    SUPPORTED IN AZURE COGNITIVE SEARCH SINK

String Y

Int32 Y

Int64 Y

Double Y

Boolean Y

DateTimeOffset Y

String Array N

GeographyPoint N

Currently, other data types (for example, ComplexType) are not supported. For a full list of Azure Cognitive Search
supported data types, see Supported data types (Azure Cognitive Search).

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Cosmos DB (SQL
API) by using Azure Data Factory
5/25/2021

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB
(SQL API), and use Data Flow to transform data in Azure Cosmos DB (SQL API). To learn about Azure Data
Factory, read the introductory article.

NOTE
This connector only supports the Cosmos DB SQL API. For the MongoDB API, refer to the connector for Azure Cosmos DB's API for MongoDB. Other API types are not supported now.

Supported capabilities
This Azure Cosmos DB (SQL API) connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
For Copy activity, this Azure Cosmos DB (SQL API) connector supports:
Copy data from and to the Azure Cosmos DB SQL API using key, service principal, or managed identities for Azure resources authentication.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import and export JSON documents.
Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when
you write to Azure Cosmos DB.

TIP
The Data Migration video walks you through the steps of copying data from Azure Blob storage to Azure Cosmos DB.
The video also describes performance-tuning considerations for ingesting data to Azure Cosmos DB in general.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB (SQL API).

Linked service properties


The Azure Cosmos DB (SQL API) connector supports the following authentication types. See the corresponding
sections for details:
Key authentication
Service principal authentication (Preview)
Managed identities for Azure resources authentication (Preview)
Key authentication
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


CosmosDb .

connectionString Specify information that's required to Yes


connect to the Azure Cosmos DB
database.
Note : You must specify database
information in the connection string as
shown in the examples that follow.
You can also put account key in Azure
Key Vault and pull the accountKey
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key
Vault article with more details.

connectVia The Integration Runtime to use to No


connect to the data store. You can use
the Azure Integration Runtime or a
self-hosted integration runtime (if your
data store is located in a private
network). If this property isn't
specified, the default Azure Integration
Runtime is used.

Example

{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store account key in Azure Key Vault

{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;Database=<Database>",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication (Preview)

NOTE
Currently, the service principal authentication is not supported in data flow.

To use service principal authentication, follow these steps.


1. Register an application entity in Azure Active Directory (Azure AD) by following the steps in Register your
application with an Azure AD tenant. Make note of the following values, which you use to define the
linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission. See examples on how permission works in Cosmos DB
from Access control lists on files and directories. More specifically, create a role definition, and assign the
role to the service principal via the service principal object ID.
These properties are supported for the linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


CosmosDb .

accountEndpoint Specify the account endpoint URL for Yes


the Azure Cosmos DB.

database Specify the name of the database. Yes

servicePrincipalId Specify the application's client ID. Yes



servicePrincipalCredentialType The credential type to use for service Yes


principal authentication. Allowed
values are ServicePrincipalKey and
ServicePrincipalCert.

servicePrincipalCredential The service principal credential. Yes


When you use ServicePrincipalKey
as the credential type, specify the
application's key. Mark this field as
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.
When you use ServicePrincipalCert
as the credential, reference a certificate
in Azure Key Vault.

tenant Specify the tenant information (domain Yes


name or tenant ID) under which your
application resides. Retrieve it by
hovering the mouse in the upper-right
corner of the Azure portal.

azureCloudType For service principal authentication, No


specify the type of Azure cloud
environment to which your Azure
Active Directory application is
registered.
Allowed values are AzurePublic,
AzureChina , AzureUsGovernment ,
and AzureGermany . By default, the
data factory's cloud environment is
used.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is in a private network. If not
specified, the default Azure integration
runtime is used.

Example: using service principal key authentication


You can also store service principal key in Azure Key Vault.
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"accountEndpoint": "<account endpoint>",
"database": "<database name>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: using service principal certificate authentication

{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"accountEndpoint": "<account endpoint>",
"database": "<database name>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication (Preview)

NOTE
Currently, the managed identity authentication is not supported in data flow.

A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Cosmos DB authentication, similar to using your own
service principal. It allows this designated factory to access and copy data to or from your Cosmos DB.
To use managed identities for Azure resource authentication, follow these steps.
1. Retrieve the Data Factory managed identity information by copying the value of the managed identity
object ID generated along with your factory.
2. Grant the managed identity proper permission. See examples on how permission works in Cosmos DB
from Access control lists on files and directories. More specifically, create a role definition, and assign the
role to the managed identity.
These properties are supported for the linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


CosmosDb .

accountEndpoint Specify the account endpoint URL for Yes


the Azure Cosmos DB.

database Specify the name of the database. Yes

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is in a private network. If not
specified, the default Azure integration
runtime is used.

Example:

{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"accountEndpoint": "<account endpoint>",
"database": "<database name>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB (SQL API) dataset:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to
CosmosDbSqlApiCollection .

collectionName The name of the Azure Cosmos DB Yes


document collection.
If you use "DocumentDbCollection" type dataset, it is still supported as-is for backward compatibility for Copy
and Lookup activity, it's not supported for Data Flow. You are suggested to use the new model going forward.
Example

{
"name": "CosmosDbSQLAPIDataset",
"properties": {
"type": "CosmosDbSqlApiCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB linked service name>",
"type": "LinkedServiceReference"
},
"schema": [],
"typeProperties": {
"collectionName": "<collection name>"
}
}
}

Copy Activity properties


This section provides a list of properties that the Azure Cosmos DB (SQL API) source and sink support. For a full
list of sections and properties that are available for defining activities, see Pipelines.
Azure Cosmos DB (SQL API) as source
To copy data from Azure Cosmos DB (SQL API), set the source type in Copy Activity to
CosmosDbSqlApiSource .
The following properties are supported in the Copy Activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to
CosmosDbSqlApiSource .

query Specify the Azure Cosmos DB query to No


read data.
If not specified, this SQL statement is
Example: executed:
SELECT c.BusinessEntityID, select <columns defined in
c.Name.First AS FirstName, structure> from mycollection
c.Name.Middle AS MiddleName,
c.Name.Last AS LastName,
c.Suffix, c.EmailPromotion FROM
c WHERE c.ModifiedDate > \"2009-
01-01T00:00:00\"

preferredRegions The preferred list of regions to connect No


to when retrieving data from Cosmos
DB.

pageSize The number of documents per page of No


the query result. Default is "-1" which
means uses the service side dynamic
page size up to 1000.

detectDatetime Whether to detect datetime from the No


string values in the documents.
Allowed values are: true (default),
false .

If you use "DocumentDbCollectionSource" type source, it is still supported as-is for backward compatibility. You
are suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos
DB.
Example

"activities":[
{
"name": "CopyFromCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cosmos DB SQL API input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbSqlApiSource",
"query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName,
c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"",
"preferredRegions": [
"East US"
]
},
"sink": {
"type": "<sink type>"
}
}
}
]

When copying data from Cosmos DB, unless you want to export JSON documents as-is, the best practice is to
specify the mapping in copy activity. Data Factory honors the mapping you specified on the activity - if a row
doesn't contain a value for a column, a null value is provided for the column value. If you don't specify a
mapping, Data Factory infers the schema by using the first row in the data. If the first row doesn't contain the full
schema, some columns will be missing in the result of the activity operation.
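
As a sketch of what an explicit mapping could look like, you can add a translator block to the copy activity's typeProperties alongside source and sink. The column names below are illustrative and follow the query example earlier in this section; see the schema mapping article for the full syntax:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "path": "$.BusinessEntityID" },
            "sink": { "name": "BusinessEntityID" }
        },
        {
            "source": { "path": "$.Name.First" },
            "sink": { "name": "FirstName" }
        }
    ]
}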
Azure Cosmos DB (SQL API) as sink
To copy data to Azure Cosmos DB (SQL API), set the sink type in Copy Activity to
CosmosDbSqlApiSink .
The following properties are supported in the Copy Activity sink section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy Yes


Activity sink must be set to
CosmosDbSqlApiSink .

writeBehavior Describes how to write data to Azure No


Cosmos DB. Allowed values: insert (the default is insert)
and upsert.

The behavior of upsert is to replace


the document if a document with the
same ID already exists; otherwise,
insert the document.

Note : Data Factory automatically


generates an ID for a document if an
ID isn't specified either in the original
document or by column mapping. This
means that you must ensure that, for
upsert to work as expected, your
document has an ID.

writeBatchSize Data Factory uses the Azure Cosmos No


DB bulk executor library to write data (the default is 10,000 )
to Azure Cosmos DB. The
writeBatchSize property controls the
size of documents that ADF provides
to the library. You can try increasing
the value for writeBatchSize to
improve performance and decreasing
the value if your document size is
large. See the tips below.

disableMetricsCollection Data Factory collects metrics such as No (default is false )


Cosmos DB RUs for copy performance
optimization and recommendations. If
you are concerned with this behavior,
specify true to turn it off.

maxConcurrentConnections The upper limit of concurrent connecti No


ons established to the data store durin
g the activity run. Specify a value only
when you want to limit concurrent con
nections.

TIP
To import JSON documents as-is, refer to the Import or export JSON documents section; to copy from tabular-shaped data, refer to Migrate from relational database to Cosmos DB.

TIP
Cosmos DB limits a single request's size to 2 MB. The formula is Request Size = Single Document Size * Write Batch Size. If you hit an error saying "Request size is too large", reduce the writeBatchSize value in the copy sink configuration.
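For example, if your documents are relatively large, a hedged sketch of a sink with a reduced batch size might look like the following; the value 1,000 is only an illustration, so tune it for your own document sizes and throughput.

"sink": {
    "type": "CosmosDbSqlApiSink",
    "writeBehavior": "upsert",
    "writeBatchSize": 1000
}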

If you use "DocumentDbCollectionSink" type source, it is still supported as-is for backward compatibility. You are
suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos DB.
Example

"activities":[
{
"name": "CopyToCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbSqlApiSink",
"writeBehavior": "upsert"
}
}
}
]

Schema mapping
To copy data from Azure Cosmos DB to a tabular sink, or the reverse, refer to schema mapping.

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to collections in Cosmos DB. For more
information, see the source transformation and sink transformation in mapping data flows.
Source transformation
Settings specific to Azure Cosmos DB are available in the Source Options tab of the source transformation.
Include system columns: If true, id, _ts, and other system columns will be included in your data flow metadata from Cosmos DB. When updating collections, it is important to include these so that you can grab the existing row ID.
Page size: The number of documents per page of the query result. Default is "-1", which uses the service-side dynamic page size, up to 1,000.
Throughput: Set an optional value for the number of RUs you'd like to apply to your Cosmos DB collection for each execution of this data flow during the read operation. Minimum is 400.
Preferred regions: Choose the preferred read regions for this process.
JSON Settings
Single document: Select this option if ADF is to treat the entire file as a single JSON doc.
Unquoted column names: Select this option if column names in the JSON are not quoted.
Has comments: Use this selection if your JSON documents have comments in the data.
Single quoted: Select this option if the columns and values in your document are quoted with single quotes.
Backslash escaped: If you use backslashes to escape characters in your JSON, choose this option.
Sink transformation
Settings specific to Azure Cosmos DB are available in the Settings tab of the sink transformation.
Update method: Determines what operations are allowed on your database destination. The default is to only
allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those
actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.
Collection action: Determines whether to recreate the destination collection prior to writing.
None: No action will be done to the collection.
Recreate: The collection will get dropped and recreated.
Batch size: An integer that represents how many objects are being written to the Cosmos DB collection in each batch. Usually, starting with the default batch size is sufficient. To further tune this value, note:
Cosmos DB limits a single request's size to 2 MB. The formula is "Request Size = Single Document Size * Batch Size". If you hit an error saying "Request size is too large", reduce the batch size value.
The larger the batch size, the better the throughput ADF can achieve; make sure you allocate enough RUs to support your workload.
Partition key: Enter a string that represents the partition key for your collection. Example: /movies/title

Throughput: Set an optional value for the number of RUs you'd like to apply to your Cosmos DB collection for each execution of this data flow. Minimum is 400.
Write throughput budget: An integer that represents the RUs you want to allocate for this Data Flow write operation, out of the total throughput allocated to the collection.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Import and export JSON documents


You can use this Azure Cosmos DB (SQL API) connector to easily:
Copy documents between two Azure Cosmos DB collections as-is.
Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage,
Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
To achieve schema-agnostic copy:
When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
When you use activity authoring, choose JSON format with the corresponding file store for source or sink.
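For the second option, a hedged sketch of the copy activity type properties is shown below: it reads JSON files from Azure Blob storage as-is and writes the documents to a Cosmos DB (SQL API) collection. The JSON dataset and the Blob storage read settings are assumptions for illustration; adjust them to your own source store.

"typeProperties": {
    "source": {
        "type": "JsonSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "CosmosDbSqlApiSink",
        "writeBehavior": "insert"
    }
}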

Migrate from relational database to Cosmos DB


When you migrate from a relational database, such as SQL Server, to Azure Cosmos DB, the copy activity can easily map tabular data from the source to flat JSON documents in Cosmos DB. In some cases, you may want to redesign the data model to optimize it for NoSQL use cases, according to Data modeling in Azure Cosmos DB, for example, to denormalize the data by embedding all of the related sub-items within one JSON document. For such cases, refer to this article for a walkthrough of how to achieve it by using the Azure Data Factory copy activity.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Cosmos DB's API for
MongoDB by using Azure Data Factory
5/14/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB's
API for MongoDB. The article builds on Copy Activity in Azure Data Factory, which presents a general overview
of Copy Activity.

NOTE
This connector only supports copying data to and from Azure Cosmos DB's API for MongoDB. For the SQL API, refer to the Cosmos DB SQL API connector. Other API types are not supported yet.

Supported capabilities
You can copy data from Azure Cosmos DB's API for MongoDB to any supported sink data store, or copy data
from any supported source data store to Azure Cosmos DB's API for MongoDB. For a list of data stores that
Copy Activity supports as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB's API for MongoDB connector to:
Copy data from and to the Azure Cosmos DB's API for MongoDB.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB's API for MongoDB.

Linked service properties


The following properties are supported for the Azure Cosmos DB's API for MongoDB linked service:
type: The type property must be set to CosmosDbMongoDbApi. Required: Yes.

connectionString: Specify the connection string for your Azure Cosmos DB's API for MongoDB. You can find it in the Azure portal -> your Cosmos DB blade -> primary or secondary connection string, with the pattern of mongodb://<cosmosdb-name>:<password>@<cosmosdb-name>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb. You can also put a password in Azure Key Vault and pull the password configuration out of the connection string. Refer to Store credentials in Azure Key Vault for more details. Required: Yes.

database: Name of the database that you want to access. Required: Yes.

connectVia: The Integration Runtime to use to connect to the data store. You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If this property isn't specified, the default Azure Integration Runtime is used. Required: No.

Example

{
"name": "CosmosDbMongoDBAPILinkedService",
"properties": {
"type": "CosmosDbMongoDbApi",
"typeProperties": {
"connectionString": "mongodb://<cosmosdb-name>:<password>@<cosmosdb-
name>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB's API for MongoDB dataset:
type: The type property of the dataset must be set to CosmosDbMongoDbApiCollection. Required: Yes.

collectionName: The name of the Azure Cosmos DB collection. Required: Yes.

Example

{
"name": "CosmosDbMongoDBAPIDataset",
"properties": {
"type": "CosmosDbMongoDbApiCollection",
"typeProperties": {
"collectionName": "<collection name>"
},
"schema": [],
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB's API for MongoDB linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy Activity properties


This section provides a list of properties that the Azure Cosmos DB's API for MongoDB source and sink support.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Azure Cosmos DB's API for MongoDB as source
The following properties are supported in the Copy Activity source section:

type: The type property of the copy activity source must be set to CosmosDbMongoDbApiSource. Required: Yes.

filter: Specifies the selection filter using query operators. To return all documents in a collection, omit this parameter or pass an empty document ({}). Required: No.

cursorMethods.project: Specifies the fields to return in the documents for projection. To return all fields in the matching documents, omit this parameter. Required: No.

cursorMethods.sort: Specifies the order in which the query returns matching documents. Refer to cursor.sort(). Required: No.

cursorMethods.limit: Specifies the maximum number of documents the server returns. Refer to cursor.limit(). Required: No.

cursorMethods.skip: Specifies the number of documents to skip and from where MongoDB begins to return results. Refer to cursor.skip(). Required: No.

batchSize: Specifies the number of documents to return in each batch of the response from the MongoDB instance. In most cases, modifying the batch size will not affect the user or the application. Cosmos DB limits each batch to no more than 40 MB in size (the sum of the sizes of the batchSize number of documents), so decrease this value if your documents are large. Required: No (the default is 100).

TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell mode. More details can be found in the MongoDB manual.

Example
"activities":[
{
"name": "CopyFromCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Cosmos DB's API for MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbMongoDbApiSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Cosmos DB's API for MongoDB as sink


The following properties are supported in the Copy Activity sink section:

type: The type property of the copy activity sink must be set to CosmosDbMongoDbApiSink. Required: Yes.

writeBehavior: Describes how to write data to Azure Cosmos DB. Allowed values: insert and upsert. The behavior of upsert is to replace the document if a document with the same _id already exists; otherwise, insert the document. Note: Data Factory automatically generates an _id for a document if an _id isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID. Required: No (the default is insert).

writeBatchSize: The writeBatchSize property controls the size of documents to write in each batch. You can try increasing the value for writeBatchSize to improve performance and decreasing the value if your documents are large. Required: No (the default is 10,000).

writeBatchTimeout: The wait time for the batch insert operation to finish before it times out. The allowed value is timespan. Required: No (the default is 00:30:00, that is, 30 minutes).

TIP
To import JSON documents as-is, refer to the Import or export JSON documents section; to copy from tabular-shaped data, refer to Schema mapping.

Example

"activities":[
{
"name": "CopyToCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbMongoDbApiSink",
"writeBehavior": "upsert"
}
}
}
]
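If your documents are large, a hedged variant of the sink above reduces the batch size and extends the batch timeout; the specific values are illustrative only and should be tuned for your workload.

"sink": {
    "type": "CosmosDbMongoDbApiSink",
    "writeBehavior": "upsert",
    "writeBatchSize": 1000,
    "writeBatchTimeout": "01:00:00"
}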

Import and export JSON documents


You can use this Azure Cosmos DB connector to easily:
Copy documents between two Azure Cosmos DB collections as-is.
Import JSON documents from various sources to Azure Cosmos DB, including from MongoDB, Azure Blob
storage, Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
To achieve schema-agnostic copy:
When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
When you use activity authoring, choose JSON format with the corresponding file store for source or sink.

Schema mapping
To copy data from Azure Cosmos DB's API for MongoDB to a tabular sink, or the reverse, refer to schema mapping.
Specifically, when you write into Cosmos DB and want to populate it with the right object ID from your source data (for example, you have an "id" column in a SQL database table and want to use its value as the document ID in MongoDB for insert/upsert), you need to set the proper schema mapping according to the MongoDB strict mode definition ( _id.$oid ), as shown in the sketch below.
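The original article illustrates this mapping in the copy activity authoring UI. As a rough JSON equivalent, the source column name and the hierarchical sink path below are illustrative assumptions only; see the schema mapping article for the exact path syntax supported by your Data Factory version.

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "id" }, "sink": { "path": "$['_id']['$oid']" } }
    ]
}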

After the copy activity runs, the following BSON ObjectId is generated in the sink:

{
"_id": ObjectId("592e07800000000000000000")
}

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Data Explorer by using
Azure Data Factory
5/6/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to use the copy activity in Azure Data Factory to copy data to or from Azure Data
Explorer. It builds on the copy activity overview article, which offers a general overview of copy activity.

TIP
For Azure Data Factory and Azure Data Explorer integration in general, learn more from Integrate Azure Data Explorer
with Azure Data Factory.

Supported capabilities
This Azure Data Explorer connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from any supported source data store to Azure Data Explorer. You can also copy data from
Azure Data Explorer to any supported sink data store. For a list of data stores that the copy activity supports as
sources or sinks, see the Supported data stores table.

NOTE
Copying data to or from Azure Data Explorer through an on-premises data store by using self-hosted integration runtime
is supported in version 3.14 and later.

With the Azure Data Explorer connector, you can do the following:
Copy data by using Azure Active Directory (Azure AD) application token authentication with a service principal.
As a source, retrieve data by using a KQL (Kusto) query.
As a sink, append data to a destination table.

Getting started
TIP
For a walkthrough of Azure Data Explorer connector, see Copy data to/from Azure Data Explorer using Azure Data Factory
and Bulk copy from a database to Azure Data Explorer.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Explorer connector.

Linked service properties


The Azure Data Explorer connector supports the following authentication types. See the corresponding sections
for details:
Service principal authentication
Managed identities for Azure resources authentication
Service principal authentication
To use service principal authentication, follow these steps to get a service principal and to grant permissions:
1. Register an application entity in Azure Active Directory by following the steps in Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal the correct permissions in Azure Data Explorer. See Manage Azure Data
Explorer database permissions for detailed information about roles and permissions and about managing
permissions. In general, you must:
As source, grant at least the Database viewer role to your database.
As sink, grant at least the Database ingestor role to your database.

NOTE
When you use the Data Factory UI to author, by default your login user account is used to list Azure Data Explorer
clusters, databases, and tables. You can choose to list the objects using the service principal by clicking the dropdown next
to the refresh button, or manually enter the name if you don't have permission for these operations.

The following properties are supported for the Azure Data Explorer linked service:

type: The type property must be set to AzureDataExplorer. Required: Yes.

endpoint: Endpoint URL of the Azure Data Explorer cluster, in the format https://<clusterName>.<regionName>.kusto.windows.net. Required: Yes.

database: Name of the database. Required: Yes.

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. This is known as "Authority ID" in the Kusto connection string. Retrieve it by hovering the mouse pointer in the upper-right corner of the Azure portal. Required: Yes.

servicePrincipalId: Specify the application's client ID. This is known as "AAD application client ID" in the Kusto connection string. Required: Yes.

servicePrincipalKey: Specify the application's key. This is known as "AAD application key" in the Kusto connection string. Mark this field as a SecureString to store it securely in Data Factory, or reference secure data stored in Azure Key Vault. Required: Yes.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is in a private network. If not specified, the default Azure integration runtime is used. Required: No.

Example: using service principal key authentication

{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
"tenant": "<tenant name/id e.g. microsoft.onmicrosoft.com>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
}
}
}

Managed identities for Azure resources authentication


To use managed identities for Azure resource authentication, follow these steps to grant permissions:
1. Retrieve the Data Factory managed identity information by copying the value of the managed identity
object ID generated along with your factory.
2. Grant the managed identity the correct permissions in Azure Data Explorer. See Manage Azure Data
Explorer database permissions for detailed information about roles and permissions and about managing
permissions. In general, you must:
As source, grant at least the Database viewer role to your database.
As sink, grant at least the Database ingestor role to your database.
NOTE
When you use the Data Factory UI to author, your login user account is used to list Azure Data Explorer clusters,
databases, and tables. Manually enter the name if you don't have permission for these operations.

The following properties are supported for the Azure Data Explorer linked service:

type: The type property must be set to AzureDataExplorer. Required: Yes.

endpoint: Endpoint URL of the Azure Data Explorer cluster, in the format https://<clusterName>.<regionName>.kusto.windows.net. Required: Yes.

database: Name of the database. Required: Yes.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is in a private network. If not specified, the default Azure integration runtime is used. Required: No.

Example: using managed identity authentication

{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets in Azure Data Factory. This
section lists properties that the Azure Data Explorer dataset supports.
To copy data to Azure Data Explorer, set the type property of the dataset to AzureDataExplorerTable .
The following properties are supported:

type: The type property must be set to AzureDataExplorerTable. Required: Yes.

table: The name of the table that the linked service refers to. Required: Yes for sink; No for source.

Dataset properties example:

{
"name": "AzureDataExplorerDataset",
"properties": {
"type": "AzureDataExplorerTable",
"typeProperties": {
"table": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Data Explorer linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see Pipelines and activities in Azure Data
Factory. This section provides a list of properties that Azure Data Explorer sources and sinks support.
Azure Data Explorer as source
To copy data from Azure Data Explorer, set the type property in the Copy activity source to
AzureDataExplorerSource . The following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to AzureDataExplorerSource. Required: Yes.

query: A read-only request given in a KQL format. Use the custom KQL query as a reference. Required: Yes.

queryTimeout: The wait time before the query request times out. The default value is 10 minutes (00:10:00); the allowed maximum value is 1 hour (01:00:00). Required: No.

noTruncation: Indicates whether to truncate the returned result set. By default, the result is truncated after 500,000 records or 64 megabytes (MB). Truncation is strongly recommended to ensure the correct behavior of the activity. Required: No.
NOTE
By default, Azure Data Explorer source has a size limit of 500,000 records or 64 MB. To retrieve all the records without
truncation, you can specify set notruncation; at the beginning of your query. For more information, see Query limits.

Example:

"activities":[
{
"name": "CopyFromAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataExplorerSource",
"query": "TestTable1 | take 10",
"queryTimeout": "00:10:00"
},
"sink": {
"type": "<sink type>"
}
},
"inputs": [
{
"referenceName": "<Azure Data Explorer input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
]
}
]
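If your query result can exceed the default limit of 500,000 records or 64 MB, you can disable truncation in the query itself, as described in the note above. A hedged sketch of such a source follows; the table name and timeout are illustrative.

"source": {
    "type": "AzureDataExplorerSource",
    "query": "set notruncation;\nTestTable1",
    "queryTimeout": "00:30:00"
}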

Azure Data Explorer as sink


To copy data to Azure Data Explorer, set the type property in the copy activity sink to AzureDataExplorerSink .
The following properties are supported in the copy activity sink section:

type: The type property of the copy activity sink must be set to AzureDataExplorerSink. Required: Yes.

ingestionMappingName: Name of a pre-created mapping on a Kusto table. To map the columns from source to Azure Data Explorer (which applies to all supported source stores and formats, including CSV/JSON/Avro formats), you can use the copy activity column mapping (implicitly by name or explicitly as configured) and/or Azure Data Explorer mappings. Required: No.

additionalProperties: A property bag that can be used for specifying any of the ingestion properties that aren't already being set by the Azure Data Explorer sink. Specifically, it can be useful for specifying ingestion tags. Learn more from the Azure Data Explorer data ingestion doc. Required: No.

Example:

"activities":[
{
"name": "CopyToAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataExplorerSink",
"ingestionMappingName": "<optional Azure Data Explorer mapping name>",
"additionalProperties": {<additional settings for data ingestion>}
}
},
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Explorer output dataset name>",
"type": "DatasetReference"
}
]
}
]

Lookup activity properties


For more information about the properties, see Lookup activity.

Next steps
For a list of data stores that the copy activity in Azure Data Factory supports as sources and sinks, see
supported data stores.
Learn more about how to copy data from Azure Data Factory to Azure Data Explorer.
Copy data to or from Azure Data Lake Storage
Gen1 using Azure Data Factory
5/6/2021 • 25 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data to and from Azure Data Lake Storage Gen1. To learn about Azure Data
Factory, read the introductory article.

Supported capabilities
This Azure Data Lake Storage Gen1 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
Specifically, with this connector you can:
Copy files by using one of the following methods of authentication: service principal or managed identities
for Azure resources.
Copy files as is or parse or generate files with the supported file formats and compression codecs.
Preserve ACLs when copying into Azure Data Lake Storage Gen2.

IMPORTANT
If you copy data by using the self-hosted integration runtime, configure the corporate firewall to allow outbound traffic to
<ADLS account name>.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token on port
443. The latter is the Azure Security Token Service that the integration runtime needs to communicate with to get the
access token.

Get started
TIP
For a walk-through of how to use the Azure Data Lake Store connector, see Load data into Azure Data Lake Store.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide information about properties that are used to define Data Factory entities
specific to Azure Data Lake Store.

Linked service properties


The following properties are supported for the Azure Data Lake Store linked service:

type: The type property must be set to AzureDataLakeStore. Required: Yes.

dataLakeStoreUri: Information about the Azure Data Lake Store account. This information takes one of the following formats: https://[accountname].azuredatalakestore.net/webhdfs/v1 or adl://[accountname].azuredatalakestore.net/. Required: Yes.

subscriptionId: The Azure subscription ID to which the Data Lake Store account belongs. Required for sink.

resourceGroupName: The Azure resource group name to which the Data Lake Store account belongs. Required for sink.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is located in a private network. If this property isn't specified, the default Azure integration runtime is used. Required: No.

Use service principal authentication


To use service principal authentication, follow these steps.
1. Register an application entity in Azure Active Directory and grant it access to Data Lake Store. For detailed
steps, see Service-to-service authentication. Make note of the following values, which you use to define
the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission. See examples on how permission works in Data Lake
Storage Gen1 from Access control in Azure Data Lake Storage Gen1.
As source: In Data explorer > Access, grant at least Execute permission for ALL upstream folders including the root, along with Read permission for the files to copy. You can choose to add to This folder and all children for recursive, and add as an access permission and a default permission entry. There's no requirement on account-level access control (IAM).
As sink: In Data explorer > Access, grant at least Execute permission for ALL upstream folders including the root, along with Write permission for the sink folder. You can choose to add to This folder and all children for recursive, and add as an access permission and a default permission entry.
The following properties are supported:

servicePrincipalId: Specify the application's client ID. Required: Yes.

servicePrincipalKey: Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

tenant: Specify the tenant information, such as the domain name or tenant ID, under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. Required: Yes.

azureCloudType: For service principal authentication, specify the type of Azure cloud environment to which your Azure Active Directory application is registered. Allowed values are AzurePublic, AzureChina, AzureUsGovernment, and AzureGermany. By default, the data factory's cloud environment is used. Required: No.

Example:

{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Data Lake Store authentication, similar to using your own
service principal. It allows this designated factory to access and copy data to or from Data Lake Store.
To use managed identities for Azure resources authentication, follow these steps.
1. Retrieve the data factory managed identity information by copying the value of the "Service Identity
Application ID" generated along with your factory.
2. Grant the managed identity access to Data Lake Store. See examples on how permission works in Data
Lake Storage Gen1 from Access control in Azure Data Lake Storage Gen1.
As source: In Data explorer > Access, grant at least Execute permission for ALL upstream folders including the root, along with Read permission for the files to copy. You can choose to add to This folder and all children for recursive, and add as an access permission and a default permission entry. There's no requirement on account-level access control (IAM).
As sink: In Data explorer > Access, grant at least Execute permission for ALL upstream folders including the root, along with Write permission for the sink folder. You can choose to add to This folder and all children for recursive, and add as an access permission and a default permission entry.
In Azure Data Factory, you don't need to specify any properties besides the general Data Lake Store information
in the linked service.
Example:

{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Data Lake Store Gen1 under location settings in the format-
based dataset:

type: The type property under location in the dataset must be set to AzureDataLakeStoreLocation. Required: Yes.

folderPath: The path to a folder. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings. Required: No.

fileName: The file name under the given folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. Required: No.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen1 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureDataLakeStoreLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see Pipelines. This section provides a list
of properties supported by Azure Data Lake Store source and sink.
Azure Data Lake Store as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Data Lake Store Gen1 under storeSettings settings in the
format-based copy source:

type: The type property under storeSettings must be set to AzureDataLakeStoreReadSettings. Required: Yes.

Locate the files to copy:

OPTION 1: static path: Copy from the given folder/file path specified in the dataset. If you want to copy all files from a folder, additionally specify wildcardFileName as *.

OPTION 2: name range - listAfter: Retrieve the folders/files whose name is after this value alphabetically (exclusive). It utilizes the service-side filter for ADLS Gen1, which provides better performance than a wildcard filter. Data Factory applies this filter to the path defined in the dataset, and only one entity level is supported. See more examples in Name range filter examples. Required: No.

OPTION 2: name range - listBefore: Retrieve the folders/files whose name is before this value alphabetically (inclusive). It utilizes the service-side filter for ADLS Gen1, which provides better performance than a wildcard filter. Data Factory applies this filter to the path defined in the dataset, and only one entity level is supported. See more examples in Name range filter examples. Required: No.

OPTION 3: wildcard - wildcardFolderPath: The folder path with wildcard characters to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. Required: No.

OPTION 3: wildcard - wildcardFileName: The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape char inside. See more examples in Folder and file filter examples. Required: Yes.

OPTION 4: a list of files - fileListPath: Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When you use this option, do not specify a file name in the dataset. See more examples in File list examples. Required: No.

Additional settings:

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath. Required: No.

deleteFilesAfterCompletion: Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is only valid in the binary files copy scenario. The default value is false. Required: No.

modifiedDatetimeStart: Files filter based on the attribute Last Modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected. This property doesn't apply when you configure fileListPath. Required: No.

modifiedDatetimeEnd: Same as above. Required: No.

enablePartitionDiscovery: For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true. Required: No.

partitionRootPath: When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default:
- When you use the file path in the dataset or the list of files on the source, the partition root path is the path configured in the dataset.
- When you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard.
For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27":
- If you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns, month and day, with values "08" and "27" respectively, in addition to the columns inside the files.
- If the partition root path is not specified, no extra column will be generated.
Required: No.

maxConcurrentConnections: The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. Required: No.

Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureDataLakeStoreReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
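The example above filters by wildcard. As a hedged variant, the same store settings can instead select files by last-modified time; the timestamps below are illustrative only.

"storeSettings": {
    "type": "AzureDataLakeStoreReadSettings",
    "recursive": true,
    "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
    "modifiedDatetimeEnd": "2018-12-01T06:00:00Z"
}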

Azure Data Lake Store as sink


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for Azure Data Lake Store Gen1 under storeSettings settings in the
format-based copy sink:

type: The type property under storeSettings must be set to AzureDataLakeStoreWriteSettings. Required: Yes.

copyBehavior: Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
Required: No.

expiryDateTime: Specifies the expiry time of the written files. The time is applied to the UTC time in the format of "2020-03-01T08:00:00Z". By default it is NULL, which means the written files are never expired. Required: No.

maxConcurrentConnections: The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. Required: No.

Example:
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureDataLakeStoreWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
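If you want the written files to expire automatically, a hedged variant of the sink store settings above adds expiryDateTime; the timestamp is illustrative only and follows the format described in the table.

"storeSettings": {
    "type": "AzureDataLakeStoreWriteSettings",
    "copyBehavior": "PreserveHierarchy",
    "expiryDateTime": "2020-03-01T08:00:00Z"
}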

Name range filter examples


This section describes the resulting behavior of name range filters.

Sample source structure:
root
    a
        file.csv
    ax
        file2.csv
    ax.csv
    b
        file3.csv
    bx.csv
    c
        file4.csv
    cx.csv

ADF configuration:
In the dataset: Folder path: root
In the copy activity source: List after: a; List before: b

Result: the following files will be copied:
root
    ax
        file2.csv
    ax.csv
    b
        file3.csv
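In JSON, this name range configuration corresponds to a copy activity source sketch like the following; the delimited text source type is assumed for illustration.

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureDataLakeStoreReadSettings",
        "listAfter": "a",
        "listBefore": "b"
    }
}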

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

All of the examples below use the following source folder structure; the Retrieved line lists the files that the filter picks up.

FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

folderPath: Folder*; fileName: (empty, use default); recursive: false
Retrieved: FolderA/File1.csv and FolderA/File2.json.

folderPath: Folder*; fileName: (empty, use default); recursive: true
Retrieved: FolderA/File1.csv, FolderA/File2.json, Subfolder1/File3.csv, Subfolder1/File4.json, and Subfolder1/File5.csv.

folderPath: Folder*; fileName: *.csv; recursive: false
Retrieved: FolderA/File1.csv.

folderPath: Folder*; fileName: *.csv; recursive: true
Retrieved: FolderA/File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv.

File list examples


This section describes the resulting behavior of using a file list path in the copy activity source.
Assume you have the following source folder structure and want to copy File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv:

Sample source structure:
root
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:
File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

ADF configuration:
In the dataset: Folder path: root/FolderA
In the copy activity source: File list path: root/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line with the relative path to the path configured in the dataset.
Examples of behavior of the copy operation


This section describes the resulting behavior of the copy operation for different combinations of recursive and
copyBehavior values.

All of the combinations below assume a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive: true, copyBehavior: preserveHierarchy
The target Folder1 is created with the same structure as the source:
Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive: true, copyBehavior: flattenHierarchy
The target Folder1 is created with the following structure:
Folder1
    autogenerated name for File1
    autogenerated name for File2
    autogenerated name for File3
    autogenerated name for File4
    autogenerated name for File5

recursive: true, copyBehavior: mergeFiles
The target Folder1 is created with the following structure:
Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file, with an autogenerated file name.

recursive: false, copyBehavior: preserveHierarchy
The target Folder1 is created with the following structure:
Folder1
    File1
    File2
Subfolder1 with File3, File4, and File5 aren't picked up.

recursive: false, copyBehavior: flattenHierarchy
The target Folder1 is created with the following structure:
Folder1
    autogenerated name for File1
    autogenerated name for File2
Subfolder1 with File3, File4, and File5 aren't picked up.

recursive: false, copyBehavior: mergeFiles
The target Folder1 is created with the following structure:
Folder1
    File1 + File2 contents are merged into one file with an autogenerated file name.
Subfolder1 with File3, File4, and File5 aren't picked up.

Preserve ACLs to Data Lake Storage Gen2


TIP
To copy data from Azure Data Lake Storage Gen1 into Gen2 in general, see Copy data from Azure Data Lake Storage
Gen1 to Gen2 with Azure Data Factory for a walk-through and best practices.

If you want to replicate the access control lists (ACLs) along with data files when you upgrade from Data Lake
Storage Gen1 to Data Lake Storage Gen2, see Preserve ACLs from Data Lake Storage Gen1.

Mapping data flow properties


When you're transforming data in mapping data flows, you can read and write files from Azure Data Lake
Storage Gen1 in the following formats:
Avro
Delimited text
Excel
JSON
Parquet
Format-specific settings are located in the documentation for that format. For more information, see Source
transformation in mapping data flow and Sink transformation in mapping data flow.
Source transformation
In the source transformation, you can read from a container, folder, or individual file in Azure Data Lake Storage
Gen1. The Source options tab lets you manage how the files get read.

Wildcard path: Using a wildcard pattern will instruct ADF to loop through each matching folder and file in a
single Source transformation. This is an effective way to process multiple files within a single flow. Add multiple
wildcard matching patterns with the + sign that appears when hovering over your existing wildcard pattern.
From your source container, choose a series of files that match a pattern. Only container can be specified in the
dataset. Your wildcard path must therefore also include your folder path from the root folder.
Wildcard examples:
* Represents any set of characters
** Represents recursive directory nesting
? Replaces one character
[] Matches one or more characters in the brackets
/data/sales/**/*.csv Gets all csv files under /data/sales
/data/sales/20??/**/ Gets all files in the 20th century
/data/sales/*/*/*.csv Gets csv files two levels under /data/sales
/data/sales/2004/*/12/[XY]1?.csv Gets all csv files in 2004 in December starting with X or Y prefixed by a
two-digit number
Par tition Root Path: If you have partitioned folders in your file source with a key=value format (for example,
year=2019), then you can assign the top level of that partition folder tree to a column name in your data flow
data stream.
First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you wish to read.
Use the Partition Root Path setting to define what the top level of the folder structure is. When you view the
contents of your data via a data preview, you'll see that ADF will add the resolved partitions found in each of
your folder levels.

List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with wildcard, your syntax will look like this below:
/data/sales/20??/**/*.csv

You can specify "from" as


/data/sales

And "to" as
/backup/priorSales

In this case, all files that were sourced under /data/sales are moved to /backup/priorSales.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.

Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All date-times are in UTC.
Sink properties
In the sink transformation, you can write to either a container or a folder in Azure Data Lake Storage Gen1. The Settings tab lets you manage how the files get written.

Clear the folder: Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name options are:
Default: Allow Spark to name files based on PART defaults.
Pattern: Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will create loans1.csv, loans2.csv, and so on.
Per partition: Enter one file name per partition.
As data in column: Set the output file to the value of a column. The path is relative to the dataset container, not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file: Combine the partitioned output files into a single named file. The path is relative to the dataset folder. Be aware that the merge operation can fail based upon node size. This option is not recommended for large datasets.
Quote all: Determines whether to enclose all values in quotes.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity

Delete activity properties


To learn details about the properties, check Delete activity

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model mentioned in the sections above going forward; the ADF authoring UI has switched to generating the new model.

Legacy dataset model


P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to AzureDataLakeStoreFile .

folderPath Path to the folder in Data Lake Store. If No


not specified, it points to the root.

Wildcard filter is supported. Allowed


wildcards are * (matches zero or
more characters) and ? (matches
zero or single character). Use ^ to
escape if your actual folder name has a
wildcard or this escape char inside.

For example: rootfolder/subfolder/. See


more examples in Folder and file filter
examples.

fileName Name or wildcard filter for the files No


under the specified "folderPath". If you
don't specify a value for this property,
the dataset points to all files in the
folder.

For filter, the wildcards allowed are *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has a wildcard or this escape
char inside.

When fileName isn't specified for an


output dataset and
preser veHierarchy isn't specified in
the activity sink, the copy activity
automatically generates the file name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if configured].
[compression if configured]", for
example, "Data.0a405f8a-93ff-4c6f-
b3be-f69616f1df7a.txt.gz". If you copy
from a tabular source by using a table
name instead of a query, the name
pattern is "[table name].[format].
[compression if configured]", for
example, "MyTable.csv".
P RO P ERT Y DESC RIP T IO N REQ UIRED

modifiedDatetimeStart Files filter based on the attribute Last No


Modified. The files are selected if their
last modified time is within the time
range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

The overall performance of data


movement is affected by enabling this
setting when you want to do file filter
with huge amounts of files.

The properties can be NULL, which


means no file attribute filter is applied
to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal to
the datetime value are selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means the files whose last
modified attribute is less than the
datetime value are selected.

modifiedDatetimeEnd Files filter based on the attribute Last No


Modified. The files are selected if their
last modified time is within the time
range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

The overall performance of data


movement is affected by enabling this
setting when you want to do file filter
with huge amounts of files.

The properties can be NULL, which


means no file attribute filter is applied
to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal to
the datetime value are selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means the files whose last
modified attribute is less than the
datetime value are selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both input and
output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat , JsonFormat ,
AvroFormat , OrcFormat , and
ParquetFormat . Set the type
property under format to one of
these values. For more information,
see the Text format, JSON format, Avro
format, Orc format, and Parquet
format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are Optimal and
Fastest .

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a particular name, specify folderPath with a folder part and fileName with a file name.
To copy a subset of files under a folder, specify folderPath with a folder part and fileName with a wildcard filter.

Example:
{
"name": "ADLSDataset",
"properties": {
"type": "AzureDataLakeStoreFile",
"linkedServiceName":{
"referenceName": "<ADLS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "datalake/myfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy copy activity source model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy Yes


activity source must be set to
AzureDataLakeStoreSource .

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink
is a file-based store, an empty folder or
subfolder isn't copied or created at the
sink. Allowed values are true (default)
and false .

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Legacy copy activity sink model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy Yes


activity sink must be set to
AzureDataLakeStoreSink .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of the
source file to the source folder is
identical to the relative path of the
target file to the target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
the file name is autogenerated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.
Example:

"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen1 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Data Lake
Storage Gen2 using Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics built into
Azure Blob storage. You can use it to interface with your data by using both file system and object storage
paradigms.
This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Data Lake
Storage Gen2, and use Data Flow to transform data in Azure Data Lake Storage Gen2. To learn about Azure Data
Factory, read the introductory article.

TIP
For data lake or data warehouse migration scenario, learn more from Use Azure Data Factory to migrate data from your
data lake or data warehouse to Azure.

Supported capabilities
This Azure Data Lake Storage Gen2 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
For Copy activity, with this connector you can:
Copy data from/to Azure Data Lake Storage Gen2 by using account key, service principal, or managed
identities for Azure resources authentications.
Copy files as-is or parse or generate files with supported file formats and compression codecs.
Preserve file metadata during copy.
Preserve ACLs when copying from Azure Data Lake Storage Gen1/Gen2.

Get started
TIP
For a walk-through of how to use the Data Lake Storage Gen2 connector, see Load data into Azure Data Lake Storage
Gen2.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide information about properties that are used to define Data Factory entities
specific to Data Lake Storage Gen2.

Linked service properties


The Azure Data Lake Storage Gen2 connector supports the following authentication types. See the
corresponding sections for details:
Account key authentication
Service principal authentication
Managed identities for Azure resources authentication

NOTE
If you want to use the public Azure integration runtime to connect to Data Lake Storage Gen2 by leveraging the Allow
trusted Microsoft services to access this storage account option enabled on the Azure Storage firewall, you must
use managed identity authentication.
When you use PolyBase or the COPY statement to load data into Azure Synapse Analytics, if your source or staging Data
Lake Storage Gen2 is configured with an Azure Virtual Network endpoint, you must use managed identity
authentication as required by Synapse. See the managed identity authentication section for more configuration
prerequisites.

Account key authentication


To use storage account key authentication, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureBlobFS.

url Endpoint for Data Lake Storage Gen2 Yes


with the pattern of
https://<accountname>.dfs.core.windows.net
.

accountKey Account key for Data Lake Storage Yes


Gen2. Mark this field as a SecureString
to store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is in a private network. If this
property isn't specified, the default
Azure integration runtime is used.
NOTE
Secondary ADLS file system endpoint is not supported when using account key authentication. You can use other
authentication types.

Example:

{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"accountkey": {
"type": "SecureString",
"value": "<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
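If you prefer not to keep the account key inline, the same linked service can reference a secret stored in Azure Key Vault, as mentioned in the accountKey description above. The following is a minimal sketch; the Key Vault linked service name and the secret name are placeholders for your own values:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.windows.net",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret that stores the account key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}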

Service principal authentication


To use service principal authentication, follow these steps.
1. Register an application entity in Azure Active Directory (Azure AD) by following the steps in Register your
application with an Azure AD tenant. Make note of the following values, which you use to define the
linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission. See examples on how permission works in Data Lake
Storage Gen2 from Access control lists on files and directories
As source : In Storage Explorer, grant at least Execute permission for ALL upstream folders and the
file system, along with Read permission for the files to copy. Alternatively, in Access control (IAM),
grant at least the Storage Blob Data Reader role.
As sink : In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file
system, along with Write permission for the sink folder. Alternatively, in Access control (IAM), grant at
least the Storage Blob Data Contributor role.

NOTE
If you use the Data Factory UI to author and the service principal isn't assigned the "Storage Blob Data Reader/Contributor" role
in IAM, then when you test the connection or browse folders, choose "Test connection to file path" or "Browse from
specified path", and specify a path with Read + Execute permission to continue.

These properties are supported for the linked service:


PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureBlobFS.

url Endpoint for Data Lake Storage Gen2 Yes


with the pattern of
https://<accountname>.dfs.core.windows.net
.

servicePrincipalId Specify the application's client ID. Yes

servicePrincipalCredentialType The credential type to use for service Yes


principal authentication. Allowed
values are ServicePrincipalKey and
ServicePrincipalCert.

servicePrincipalCredential The service principal credential. Yes


When you use ServicePrincipalKey
as the credential type, specify the
application's key. Mark this field as
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.
When you use ServicePrincipalCert
as the credential, reference a certificate
in Azure Key Vault.

servicePrincipalKey Specify the application's key. Mark this No


field as SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.
This property is still supported as-is
for servicePrincipalId +
servicePrincipalKey . As ADF adds
new service principal certificate
authentication, the new model for
service principal authentication is
servicePrincipalId +
servicePrincipalCredentialType +
servicePrincipalCredential .

tenant Specify the tenant information (domain Yes


name or tenant ID) under which your
application resides. Retrieve it by
hovering the mouse in the upper-right
corner of the Azure portal.

azureCloudType For service principal authentication, No


specify the type of Azure cloud
environment to which your Azure
Active Directory application is
registered.
Allowed values are AzurePublic,
AzureChina , AzureUsGovernment ,
and AzureGermany . By default, the
data factory's cloud environment is
used.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is in a private network. If not
specified, the default Azure integration
runtime is used.

Example: using service principal key authentication


You can also store service principal key in Azure Key Vault.

{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: using service principal certificate authentication

{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Data Lake Storage Gen2 authentication, similar to using
your own service principal. It allows this designated factory to access and copy data to or from your Data Lake
Storage Gen2.
To use managed identities for Azure resource authentication, follow these steps.
1. Retrieve the Data Factory managed identity information by copying the value of the managed identity
object ID generated along with your factory.
2. Grant the managed identity proper permission. See examples on how permission works in Data Lake
Storage Gen2 from Access control lists on files and directories.
As source : In Storage Explorer, grant at least Execute permission for ALL upstream folders and the
file system, along with Read permission for the files to copy. Alternatively, in Access control (IAM),
grant at least the Storage Blob Data Reader role.
As sink : In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file
system, along with Write permission for the sink folder. Alternatively, in Access control (IAM), grant at
least the Storage Blob Data Contributor role.

NOTE
If you use the Data Factory UI to author and the managed identity isn't assigned the "Storage Blob Data Reader/Contributor"
role in IAM, then when you test the connection or browse folders, choose "Test connection to file path" or "Browse
from specified path", and specify a path with Read + Execute permission to continue.

IMPORTANT
If you use PolyBase or COPY statement to load data from Data Lake Storage Gen2 into Azure Synapse Analytics, when
you use managed identity authentication for Data Lake Storage Gen2, make sure you also follow steps 1 to 3 in this
guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your
server. Data Factory handles the rest. If you configure Blob storage with an Azure Virtual Network endpoint, you also
need to have Allow trusted Microsoft services to access this storage account turned on under the Azure Storage
account Firewalls and Virtual networks settings menu, as required by Synapse.

These properties are supported for the linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureBlobFS.

url Endpoint for Data Lake Storage Gen2 Yes


with the pattern of
https://<accountname>.dfs.core.windows.net
.

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is in a private network. If not
specified, the default Azure integration
runtime is used.

Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Data Lake Storage Gen2 under location settings in the format-
based dataset:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under location in Yes


the dataset must be set to
AzureBlobFSLocation .

fileSystem The Data Lake Storage Gen2 file No


system name.

folderPath The path to a folder under the given No


file system. If you want to use a
wildcard to filter folders, skip this
setting and specify it in activity source
settings.

fileName The file name under the given No


fileSystem + folderPath. If you want to
use a wildcard to filter files, skip this
setting and specify it in activity source
settings.

Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobFSLocation",
"fileSystem": "filesystemname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see Copy activity configurations and
Pipelines and activities. This section provides a list of properties supported by the Data Lake Storage Gen2
source and sink.
Azure Data Lake Storage Gen2 as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
You have several options to copy data from ADLS Gen2:
Copy from the given path specified in the dataset.
Wildcard filter against folder path or file name, see wildcardFolderPath and wildcardFileName .
Copy the files defined in a given text file as file set, see fileListPath .

The following properties are supported for Data Lake Storage Gen2 under storeSettings settings in format-
based copy source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureBlobFSReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given file system or


folder/file path specified in the dataset.
If you want to copy all files from a file
system/folder, additionally specify
wildcardFileName as * .

OPTION 2: wildcard The folder path with wildcard No


- wildcardFolderPath characters under the given file system
configured in dataset to filter source
folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder
name has wildcard or this escape char
inside.
See more examples in Folder and file
filter examples.

OPTION 2: wildcard The file name with wildcard characters Yes


- wildcardFileName under the given file system +
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual file name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

OPTION 3: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When using this option, do not specify
file name in dataset. See more
examples in File list examples.

Additional settings:

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
the source, while others still remain in
the source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified.
The files will be selected if their last
modified time is within the time range
between modifiedDatetimeStart
and modifiedDatetimeEnd . The time
is applied to UTC time zone in the
format of "2018-12-01T05:00:00Z".
The properties can be NULL, which
means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal to
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
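If you use the partition discovery settings described in the table above, the source side of the copy activity might look like the following sketch; the folder layout (root/folder/year=2020/month=.../day=...) and the wildcard values are illustrative only:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "root/folder/year=2020/*/*",
        "wildcardFileName": "*.csv",
        "enablePartitionDiscovery": true,
        "partitionRootPath": "root/folder/year=2020"
    }
}

With this configuration, the copy activity adds month and day as extra source columns resolved from the folder names, as described in the partitionRootPath row above.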

Azure Data Lake Storage Gen2 as a sink type


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for Data Lake Storage Gen2 under storeSettings settings in format-
based copy sink:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureBlobFSWriteSettings .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of the
source file to the source folder is
identical to the relative path of the
target file to the target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
it's an autogenerated file name.

blockSizeInMB Specify the block size in MB used to No


write data to ADLS Gen2. Learn more
about Block Blobs.
Allowed value is between 4 MB and
100 MB .
By default, ADF automatically
determines the block size based on
your source store type and data. For
non-binary copy into ADLS Gen2, the
default block size is 100 MB so as to fit
in at most 4.95-TB data. It may not be
optimal when your data is not large,
especially when you use a self-hosted
integration runtime over a poor
network, resulting in operation timeouts
or performance issues. You can explicitly
specify a block size; ensure that
blockSizeInMB*50000 is big enough to
store the data, otherwise the copy activity
run will fail.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

metadata Set custom metadata when copy to No


sink. Each object under the metadata
array represents an extra column. The
name defines the metadata key name,
and the value indicates the data
value of that key. If preserve attributes
feature is used, the specified metadata
will union/overwrite with the source
file metadata.

Allowed data values are:


- $$LASTMODIFIED : a reserved
variable that stores the source
files' last modified time. Applies to file-
based sources with binary format only.
- Expression
- Static value

Example:
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "PreserveHierarchy",
"metadata": [
{
"name": "testKey1",
"value": "value1"
},
{
"name": "testKey2",
"value": "value2"
},
{
"name": "lastModifiedKey",
"value": "$$LASTMODIFIED"
}
]
}
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH    FILENAME    RECURSIVE    SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

Folder* (Empty, use default) false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* (Empty, use default) true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

File list examples


This section describes the resulting behavior of using file list path in copy activity source.
Assuming you have the following source folder structure and want to copy the files in bold:

SAMPLE SOURCE STRUCTURE    CONTENT IN FILELISTTOCOPY.TXT    ADF CONFIGURATION

filesystem File1.csv In dataset:


FolderA Subfolder1/File3.csv - File system: filesystem
File1.csv Subfolder1/File5.csv - Folder path: FolderA
File2.json
Subfolder1 In copy activity source:
File3.csv - File list path:
File4.json filesystem/Metadata/FileListToCopy.txt
File5.csv
Metadata The file list path points to a text file in
FileListToCopy.txt the same data store that includes a list
of files you want to copy, one file per
line with the relative path to the path
configured in the dataset.
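Expressed as copy activity source JSON, the file-list configuration sketched above might look like the following; it assumes the dataset points at file system filesystem and folder FolderA, and that FileListToCopy.txt contains the relative paths listed in the middle column:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "fileListPath": "filesystem/Metadata/FileListToCopy.txt"
    }
}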

Some recursive and copyBehavior examples


This section describes the resulting behavior of the copy operation for different combinations of recursive and
copyBehavior values.
RECURSIVE    COPYBEHAVIOR    SOURCE FOLDER STRUCTURE    RESULTING TARGET

true preserveHierarchy Folder1 The target Folder1 is


File1 created with the same
File2 structure as the source:
Subfolder1
File3 Folder1
File4 File1
File5 File2
Subfolder1
File3
File4
File5

true flattenHierarchy Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2
autogenerated name for
File3
autogenerated name for
File4
autogenerated name for
File5

true mergeFiles Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 + File3 +
File5 File4 + File5 contents are
merged into one file with an
autogenerated file name.

false preserveHierarchy Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1
File5 File2

Subfolder1 with File3, File4,


and File5 isn't picked up.

false flattenHierarchy Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2

Subfolder1 with File3, File4,


and File5 isn't picked up.

false mergeFiles Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 contents are
File5 merged into one file with an
autogenerated file name.
autogenerated name for
File1

Subfolder1 with File3, File4,


and File5 isn't picked up.

Preserve metadata during copy


When you copy files from Amazon S3/Azure Blob/Azure Data Lake Storage Gen2 to Azure Data Lake Storage
Gen2/Azure Blob, you can choose to preserve the file metadata along with data. Learn more from Preserve
metadata.

Preserve ACLs from Data Lake Storage Gen1/Gen2


When you copy files from Azure Data Lake Storage Gen1/Gen2 to Gen2, you can choose to preserve the POSIX
access control lists (ACLs) along with data. Learn more from Preserve ACLs from Data Lake Storage Gen1/Gen2
to Gen2.
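As a rough sketch of where this setting sits in a copy activity definition (the exact options and prerequisites are covered in the Preserve metadata and Preserve ACLs articles referenced above; the binary source and sink shown here are illustrative assumptions), the copy activity typeProperties can carry a preserve list:

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "preserve": ["ACL", "Owner", "Group"]
}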

TIP
To copy data from Azure Data Lake Storage Gen1 into Gen2 in general, see Copy data from Azure Data Lake Storage
Gen1 to Gen2 with Azure Data Factory for a walk-through and best practices.

Mapping data flow properties


When you're transforming data in mapping data flows, you can read and write files from Azure Data Lake
Storage Gen2 in the following formats:
Avro
Common Data Model (preview)
Delimited text
Delta
Excel
JSON
Parquet
Format specific settings are located in the documentation for that format. For more information, see Source
transformation in mapping data flow and Sink transformation in mapping data flow.
Source transformation
In the source transformation, you can read from a container, folder, or individual file in Azure Data Lake Storage
Gen2. The Source options tab lets you manage how the files get read.

Wildcard path: Using a wildcard pattern will instruct ADF to loop through each matching folder and file in a
single Source transformation. This is an effective way to process multiple files within a single flow. Add multiple
wildcard matching patterns with the + sign that appears when hovering over your existing wildcard pattern.
From your source container, choose a series of files that match a pattern. Only the container can be specified in the
dataset. Your wildcard path must therefore also include your folder path from the root folder.
Wildcard examples:
* Represents any set of characters
** Represents recursive directory nesting
? Replaces one character
[] Matches one of the characters in the brackets
/data/sales/**/*.csv Gets all csv files under /data/sales
/data/sales/20??/**/ Gets all files recursively under folders whose names start with 20 (for example, year folders 2000 through 2099)
/data/sales/*/*/*.csv Gets csv files two levels under /data/sales
/data/sales/2004/*/12/[XY]1?.csv Gets all csv files in 2004 in December starting with X or Y prefixed by a
two-digit number
Partition Root Path: If you have partitioned folders in your file source with a key=value format (for example,
year=2019), then you can assign the top level of that partition folder tree to a column name in your data flow
data stream.
First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you wish to read.
Use the Partition Root Path setting to define what the top level of the folder structure is. When you view the
contents of your data via a data preview, you'll see that ADF will add the resolved partitions found in each of
your folder levels.
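The two settings above also appear in the underlying data flow script. The following is a minimal sketch in the same script format used for the source script examples elsewhere in this article; the paths and the ADLSGen2Source stream name are illustrative assumptions, so verify against the script generated by your own data flow:

source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['data/sales/**/*.csv'],
    partitionRootPath: 'data/sales') ~> ADLSGen2Source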

List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with a wildcard, your syntax looks like the following:
/data/sales/20??/**/*.csv

You can specify "from" as


/data/sales

And "to" as
/backup/priorSales

In this case, all files that were sourced under /data/sales are moved to /backup/priorSales.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.

Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All date-times are in UTC.
Sink properties
In the sink transformation, you can write to either a container or folder in Azure Data Lake Storage Gen2. The
Settings tab lets you manage how the files get written.

Clear the folder : Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name
options are:
Default : Allow Spark to name files based on PART defaults.
Pattern : Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will
create loans1.csv, loans2.csv, and so on.
Per partition : Enter one file name per partition.
As data in column : Set the output file to the value of a column. The path is relative to the dataset container,
not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file : Combine the partitioned output files into a single named file. The path is relative to
the dataset folder. Be aware that the merge operation can fail depending on node size. This
option is not recommended for large datasets.
Quote all: Determines whether to enclose all values in quotes

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity

Delete activity properties


To learn details about the properties, check Delete activity

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We recommend that you use the new model
described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to AzureBlobFSFile .

folderPath Path to the folder in Data Lake Storage No


Gen2. If not specified, it points to the
root.

Wildcard filter is supported. Allowed


wildcards are * (matches zero or
more characters) and ? (matches
zero or single character). Use ^ to
escape if your actual folder name has a
wildcard or this escape char is inside.

Examples: filesystem/folder/. See more


examples in Folder and file filter
examples.

fileName Name or wildcard filter for the files No


under the specified "folderPath". If you
don't specify a value for this property,
the dataset points to all files in the
folder.

For filter, the wildcards allowed are *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has a wildcard or this escape
char is inside.

When fileName isn't specified for an


output dataset and
preserveHierarchy isn't specified in
the activity sink, the copy activity
automatically generates the file name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if configured].
[compression if configured]", for
example, "Data.0a405f8a-93ff-4c6f-
b3be-f69616f1df7a.txt.gz". If you copy
from a tabular source using a table
name instead of a query, the name
pattern is "[table name].[format].
[compression if configured]", for
example, "MyTable.csv".

modifiedDatetimeStart Files filter based on the attribute Last No


Modified. The files are selected if their
last modified time is within the time
range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

The overall performance of data


movement is affected by enabling this
setting when you want to do file filter
with huge amounts of files.

The properties can be NULL, which


means no file attribute filter is applied
to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal to
the datetime value are selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means the files whose last
modified attribute is less than the
datetime value are selected.

modifiedDatetimeEnd Files filter based on the attribute Last No


Modified. The files are selected if their
last modified time is within the time
range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".

The overall performance of data


movement is affected by enabling this
setting when you want to do file filter
with huge amounts of files.

The properties can be NULL, which


means no file attribute filter is applied
to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal to
the datetime value are selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means the files whose last
modified attribute is less than the
datetime value are selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both the input and
output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat , JsonFormat ,
AvroFormat , OrcFormat , and
ParquetFormat . Set the type
property under format to one of
these values. For more information,
see the Text format, JSON format, Avro
format, ORC format, and Parquet
format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are Optimal and
Fastest .

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with a folder part and fileName with a file name.
To copy a subset of files under a folder, specify folderPath with a folder part and fileName with a wildcard filter.

Example:
{
"name": "ADLSGen2Dataset",
"properties": {
"type": "AzureBlobFSFile",
"linkedServiceName": {
"referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "myfilesystem/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy copy activity source model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to
AzureBlobFSSource .

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink is a
file-based store, an empty folder or
subfolder isn't copied or created at the
sink.
Allowed values are true (default) and
false .

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureBlobFSSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Legacy copy activity sink model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


sink must be set to
AzureBlobFSSink .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of the
source file to the source folder is
identical to the relative path of the
target file to the target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
it's an autogenerated file name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.
Example:

"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen2 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Azure Database for MariaDB using
Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Azure Database for
MariaDB. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Azure Database for MariaDB connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Azure Database for MariaDB to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver when using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MariaDB connector.

Linked service properties


The following properties are supported for Azure Database for MariaDB linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureMariaDB

connectionString A connection string to connect to Yes


Azure Database for MariaDB. You can
find it from the Azure portal -> your
Azure Database for MariaDB ->
Connection strings -> ADO.NET one.
You can also put the password in Azure
Key Vault and pull the pwd
configuration out of the connection
string. Refer to the following samples
and the Store credentials in Azure Key
Vault article for more details.

connectVia The Integration Runtime to be used to No


connect to the data store. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "AzureMariaDB",
"typeProperties": {
"connectionString": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; Pwd={your_password}; SslMode=Preferred;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "AzureMariaDB",
"typeProperties": {
"connectionString": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; SslMode=Preferred;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MariaDB dataset.
To copy data from Azure Database for MariaDB, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: AzureMariaDBTable

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "AzureDatabaseForMariaDBDataset",
"properties": {
"type": "AzureMariaDBTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Database for MariaDB linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Database for MariaDB source.
Azure Database for MariaDB as source
To copy data from Azure Database for MariaDB, the following properties are supported in the copy activity
source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
AzureMariaDBSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Database for MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Database for
MySQL by using Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Database for
MySQL, and use Data Flow to transform data in Azure Database for MySQL. To learn about Azure Data Factory,
read the introductory article.
This connector is specialized for the Azure Database for MySQL service. To copy data from a generic MySQL database
located on-premises or in the cloud, use the MySQL connector.

Supported capabilities
This Azure Database for MySQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MySQL connector.

Linked service properties


The following properties are supported for Azure Database for MySQL linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureMySql

connectionString Specify information needed to connect Yes


to the Azure Database for MySQL
instance.
You can also put the password in Azure
Key Vault and pull the password
configuration out of the connection
string. Refer to the following samples
and the Store credentials in Azure Key
Vault article for more details.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

A typical connection string is


Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=<username>;PWD=<password> . You can set
more properties depending on your scenario:

PROPERTY    DESCRIPTION    OPTIONS    REQUIRED

SSLMode This option specifies DISABLED (0) / PREFERRED No


whether the driver uses TLS (1) (Default) / REQUIRED
encryption and verification (2) / VERIFY_CA (3) /
when connecting to VERIFY_IDENTITY (4)
MySQL. E.g.
SSLMode=<0/1/2/3/4>

UseSystemTrustStore This option specifies Enabled (1) / Disabled (0) No


whether to use a CA (Default)
certificate from the system
trust store, or from a
specified PEM file. E.g.
UseSystemTrustStore=
<0/1>;

Example:

{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=
<database>;UID=<username>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
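If you need to control TLS behavior, the SSLMode and UseSystemTrustStore options from the table above can be appended to the same connection string. The following is a sketch with illustrative values (SSLMode=1 is the PREFERRED default; UseSystemTrustStore=0 is Disabled):

{
    "name": "AzureDatabaseForMySQLLinkedService",
    "properties": {
        "type": "AzureMySql",
        "typeProperties": {
            "connectionString": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=<database>;UID=<username>;PWD=<password>;SSLMode=1;UseSystemTrustStore=0;"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}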

Example: store password in Azure Key Vault


{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=
<database>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MySQL dataset.
To copy data from Azure Database for MySQL, set the type property of the dataset to AzureMySqlTable . The
following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: AzureMySqlTable

tableName Name of the table in the MySQL No (if "query" in activity source is
database. specified)

Example

{
"name": "AzureMySQLDataset",
"properties": {
"type": "AzureMySqlTable",
"linkedServiceName": {
"referenceName": "<Azure MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure Database for MySQL source and sink.
Azure Database for MySQL as source
To copy data from Azure Database for MySQL, the following properties are supported in the copy activity
source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
AzureMySqlSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

queryCommandTimeout The wait time before the query request No


times out. Default is 120 minutes
(02:00:00)

Example:

"activities":[
{
"name": "CopyFromAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMySqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Database for MySQL as sink


To copy data to Azure Database for MySQL, the following properties are supported in the copy activity sink
section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


sink must be set to: AzureMySqlSink

preCopyScript Specify a SQL query for the copy No


activity to execute before writing data
into Azure Database for MySQL in
each run. You can use this property to
clean up the preloaded data.

writeBatchSize Inserts data into the Azure Database No (default is 10,000)


for MySQL table when the buffer size
reaches writeBatchSize.
Allowed value is integer representing
number of rows.

writeBatchTimeout Wait time for the batch insert No (default is 00:00:30)


operation to complete before it times
out.
Allowed values are Timespan. An
example is 00:30:00 (30 minutes).

Example:

"activities":[
{
"name": "CopyToAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure MySQL output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureMySqlSink",
"preCopyScript": "<custom SQL script>",
"writeBatchSize": 100000
}
}
}
]

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from Azure Database for
MySQL. For more information, see the source transformation and sink transformation in mapping data flows.
You can choose to use an Azure Database for MySQL dataset or an inline dataset as source and sink type.
Source transformation
The below table lists the properties supported by Azure Database for MySQL source. You can edit these
properties in the Source options tab.
NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Table | If you select Table as input, data flow fetches all the data from the table specified in the dataset. | No | - | (for inline dataset only) tableName
Query | If you select Query as input, specify a SQL query to fetch data from source, which overrides any table you specify in dataset. Using queries is a great way to reduce rows for testing or lookups. Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow. Query example: select * from mytable where customerId > 1000 and customerId < 2000 or select * from "MyTable". | No | String | query
Batch size | Specify a batch size to chunk large data into batches. | No | Integer | batchSize
Isolation Level | Choose one of the following isolation levels: Read Committed, Read Uncommitted (default), Repeatable Read, Serializable, None (ignore isolation level). | No | READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE, NONE | isolationLevel

Azure Database for MySQL source script example


When you use Azure Database for MySQL as source type, the associated data flow script is:
source(allowSchemaDrift: true,
validateSchema: false,
isolationLevel: 'READ_UNCOMMITTED',
query: 'select * from mytable',
format: 'query') ~> AzureMySQLSource

Sink transformation
The below table lists the properties supported by Azure Database for MySQL sink. You can edit these properties
in the Sink options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Update method | Specify what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts and deletes, key column(s) must be set to determine which row to alter. The column name that you pick as the key will be used as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. | No | Array | keys
Skip writing key columns | If you wish to not write the value to the key column, select "Skip writing key columns". | No | true or false | skipKeyWrites
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. None: No action will be done to the table. Recreate: The table will get dropped and recreated. Required if creating a new table dynamically. Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate
Batch size | Specify how many rows are being written in each batch. Larger batch sizes improve compression and memory optimization, but risk out of memory exceptions when caching data. | No | Integer | batchSize
Pre and Post SQL scripts | Specify multi-line SQL scripts that will execute before (pre-processing) and after (post-processing) data is written to your Sink database. | No | String | preSQLs, postSQLs

Azure Database for MySQL sink script example


When you use Azure Database for MySQL as sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,


validateSchema: false,
deletable:false,
insertable:true,
updateable:true,
upsertable:true,
keys:['keyColumn'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> AzureMySQLSink

Lookup activity properties


To learn details about the properties, check Lookup activity.
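For illustration, a Lookup activity that runs a query against Azure Database for MySQL could be sketched as follows; the activity name, the query, and the use of firstRowOnly are illustrative rather than values taken from this article:

"activities":[
    {
        "name": "LookupFromAzureDatabaseForMySQL",
        "type": "Lookup",
        "typeProperties": {
            "source": {
                "type": "AzureMySqlSource",
                "query": "SELECT COUNT(*) AS rowCount FROM MyTable"
            },
            "dataset": {
                "referenceName": "<Azure MySQL input dataset name>",
                "type": "DatasetReference"
            },
            "firstRowOnly": true
        }
    }
]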

Data type mapping for Azure Database for MySQL


When copying data from Azure Database for MySQL, the following mappings are used from MySQL data types
to Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

AZURE DATABASE FOR MYSQL DATA TYPE    DATA FACTORY INTERIM DATA TYPE

bigint Int64

bigint unsigned Decimal

bit Boolean

bit(M), M>1 Byte[]

blob Byte[]

bool Int16

char String

date Datetime

datetime Datetime

decimal Decimal, String

double Double

double precision Double

enum String

float Single

int Int32

int unsigned Int64

integer Int32

integer unsigned Int64

long varbinary Byte[]

long varchar String

longblob Byte[]

longtext String

mediumblob Byte[]

mediumint Int32

mediumint unsigned Int64

mediumtext String

numeric Decimal

real Double

set String

smallint Int16

smallint unsigned Int32

text String

time TimeSpan

timestamp Datetime

tinyblob Byte[]

tinyint Int16

tinyint unsigned Int16

tinytext String

varchar String

year Int32
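If you need to control how individual source columns map to sink columns and types, you can add an explicit mapping to the copy activity. The snippet below is a minimal sketch assuming hypothetical source columns id and name and sink columns CustomerId and CustomerName; it uses the general-purpose copy activity column mapping (TabularTranslator) rather than anything specific to this connector:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "id", "type": "Int32" },
            "sink": { "name": "CustomerId" }
        },
        {
            "source": { "name": "name" },
            "sink": { "name": "CustomerName" }
        }
    ]
}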

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Database for
PostgreSQL by using Azure Data Factory
6/16/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Database for
PostgreSQL, and use Data Flow to transform data in Azure Database for PostgreSQL. To learn about Azure Data
Factory, read the introductory article.
This connector is specialized for the Azure Database for PostgreSQL service. To copy data from a generic
PostgreSQL database located on-premises or in the cloud, use the PostgreSQL connector.

Supported capabilities
This Azure Database for PostgreSQL connector is supported for the following activities:
Copy activity with a supported source/sink matrix
Mapping data flow
Lookup activity
Currently, data flow in Azure Data Factory supports Azure Database for PostgreSQL Single Server but not Flexible Server or Hyperscale (Citus); data flow in Azure Synapse Analytics supports all PostgreSQL flavors.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections offer details about properties that are used to define Data Factory entities specific to
Azure Database for PostgreSQL connector.

Linked service properties


The following properties are supported for the Azure Database for PostgreSQL linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzurePostgreSql. | Yes
connectionString | An ODBC connection string to connect to Azure Database for PostgreSQL. You can also put a password in Azure Key Vault and pull the password configuration out of the connection string. See the following samples and Store credentials in Azure Key Vault for more details. | Yes
connectVia | This property represents the integration runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

A typical connection string is


Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=<username>;Password=
<Password>
. Here are more properties you can set per your case:

PROPERTY | DESCRIPTION | OPTIONS | REQUIRED
EncryptionMethod (EM) | The method the driver uses to encrypt data sent between the driver and the database server. For example, EncryptionMethod=<0/1/6>; | 0 (No Encryption) (Default) / 1 (SSL) / 6 (RequestSSL) | No
ValidateServerCertificate (VSC) | Determines whether the driver validates the certificate that's sent by the database server when SSL encryption is enabled (EncryptionMethod=1). For example, ValidateServerCertificate=<0/1>; | 0 (Disabled) (Default) / 1 (Enabled) | No

Example :

{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=
<port>;UID=<username>;Password=<Password>"
}
}
}
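For example, to require SSL encryption and server certificate validation, you could append the settings described above to the connection string; the values shown are illustrative and depend on your server configuration:

"connectionString": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;ValidateServerCertificate=1;"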

Example :
Store password in Azure Key Vault

{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=
<port>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets in Azure Data Factory. This
section provides a list of properties that Azure Database for PostgreSQL supports in datasets.
To copy data from Azure Database for PostgreSQL, set the type property of the dataset to
AzurePostgreSqlTable . The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AzurePostgreSqlTable | Yes
tableName | Name of the table | No (if "query" in activity source is specified)

Example :

{
"name": "AzurePostgreSqlDataset",
"properties": {
"type": "AzurePostgreSqlTable",
"linkedServiceName": {
"referenceName": "<AzurePostgreSql linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
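If you prefer to point the dataset at a specific table rather than supplying a query in the copy activity, you can set tableName under typeProperties, as in this minimal sketch (the table name is a placeholder):

"typeProperties": {
    "tableName": "<table name>"
}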

Copy activity properties


For a full list of sections and properties available for defining activities, see Pipelines and activities in Azure Data
Factory. This section provides a list of properties supported by an Azure Database for PostgreSQL source.
Azure Database for PostgreSql as source
To copy data from Azure Database for PostgreSQL, set the source type in the copy activity to
AzurePostgreSqlSource . The following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to AzurePostgreSqlSource | Yes
query | Use the custom SQL query to read data. For example: SELECT * FROM mytable or SELECT * FROM "MyTable". Note in PostgreSQL, the entity name is treated as case-insensitive if not quoted. | No (if the tableName property in the dataset is specified)

Example :

"activities":[
{
"name": "CopyFromAzurePostgreSql",
"type": "Copy",
"inputs": [
{
"referenceName": "<AzurePostgreSql input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzurePostgreSqlSource",
"query": "<custom query e.g. SELECT * FROM mytable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Azure Database for PostgreSQL as sink


To copy data to Azure Database for PostgreSQL, the following properties are supported in the copy activity sink
section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to AzurePostgreSQLSink. | Yes
preCopyScript | Specify a SQL query for the copy activity to execute before you write data into Azure Database for PostgreSQL in each run. You can use this property to clean up the preloaded data. | No
writeMethod | The method used to write data into Azure Database for PostgreSQL. Allowed values are: CopyCommand (default, which is more performant), BulkInsert. | No
writeBatchSize | The number of rows loaded into Azure Database for PostgreSQL per batch. Allowed value is an integer that represents the number of rows. | No (default is 1,000,000)
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are Timespan strings. An example is 00:30:00 (30 minutes). | No (default is 00:30:00)

Example :

"activities":[
{
"name": "CopyToAzureDatabaseForPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure PostgreSQL output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzurePostgreSQLSink",
"preCopyScript": "<custom SQL script>",
"writeMethod": "CopyCommand",
"writeBatchSize": 1000000
}
}
}
]

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from Azure Database for
PostgreSQL. For more information, see the source transformation and sink transformation in mapping data
flows. You can choose to use an Azure Database for PostgreSQL dataset or an inline dataset as source and sink
type.
Source transformation
The below table lists the properties supported by Azure Database for PostgreSQL source. You can edit these
properties in the Source options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Table | If you select Table as input, data flow fetches all the data from the table specified in the dataset. | No | - | (for inline dataset only) tableName
Query | If you select Query as input, specify a SQL query to fetch data from source, which overrides any table you specify in dataset. Using queries is a great way to reduce rows for testing or lookups. Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow. Query example: select * from mytable where customerId > 1000 and customerId < 2000 or select * from "MyTable". Note in PostgreSQL, the entity name is treated as case-insensitive if not quoted. | No | String | query
Batch size | Specify a batch size to chunk large data into batches. | No | Integer | batchSize
Isolation Level | Choose one of the following isolation levels: Read Committed, Read Uncommitted (default), Repeatable Read, Serializable, None (ignore isolation level). | No | READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE, NONE | isolationLevel

Azure Database for PostgreSQL source script example


When you use Azure Database for PostgreSQL as source type, the associated data flow script is:

source(allowSchemaDrift: true,
validateSchema: false,
isolationLevel: 'READ_UNCOMMITTED',
query: 'select * from mytable',
format: 'query') ~> AzurePostgreSQLSource

Sink transformation
The below table lists the properties supported by Azure Database for PostgreSQL sink. You can edit these
properties in the Sink options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Update method | Specify what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts and deletes, key column(s) must be set to determine which row to alter. The column name that you pick as the key will be used as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. | No | Array | keys
Skip writing key columns | If you wish to not write the value to the key column, select "Skip writing key columns". | No | true or false | skipKeyWrites
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. None: No action will be done to the table. Recreate: The table will get dropped and recreated. Required if creating a new table dynamically. Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate
Batch size | Specify how many rows are being written in each batch. Larger batch sizes improve compression and memory optimization, but risk out of memory exceptions when caching data. | No | Integer | batchSize
Pre and Post SQL scripts | Specify multi-line SQL scripts that will execute before (pre-processing) and after (post-processing) data is written to your Sink database. | No | String | preSQLs, postSQLs

Azure Database for PostgreSQL sink script example


When you use Azure Database for PostgreSQL as sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,


validateSchema: false,
deletable:false,
insertable:true,
updateable:true,
upsertable:true,
keys:['keyColumn'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> AzurePostgreSQLSink

Lookup activity properties


For more information about the properties, see Lookup activity in Azure Data Factory.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data to and from Azure Databricks Delta Lake
by using Azure Data Factory
6/17/2021 • 10 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy activity in Azure Data Factory to copy data to and from Azure
Databricks Delta Lake. It builds on the Copy activity in Azure Data Factory article, which presents a general
overview of copy activity.

Supported capabilities
This Azure Databricks Delta Lake connector is supported for the following activities:
Copy activity with a supported source/sink matrix table
Lookup activity
In general, Azure Data Factory supports Delta Lake with the following capabilities to meet your various needs.
Copy activity supports the Azure Databricks Delta Lake connector to copy data from any supported source data store to an Azure Databricks delta lake table, and from a delta lake table to any supported sink data store. It leverages your Databricks cluster to perform the data movement; see details in the Prerequisites section.
Mapping Data Flow supports generic Delta format on Azure Storage as source and sink to read and write Delta files for code-free ETL, and runs on managed Azure Integration Runtime.
Databricks activities support orchestrating your code-centric ETL or machine learning workloads on top of delta lake.

Prerequisites
To use this Azure Databricks Delta Lake connector, you need to set up a cluster in Azure Databricks.
To copy data to delta lake, the Copy activity invokes the Azure Databricks cluster to read data from an Azure Storage account, which is either your original source or a staging area where Data Factory first writes the source data via built-in staged copy. Learn more from Delta lake as the sink.
Similarly, to copy data from delta lake, the Copy activity invokes the Azure Databricks cluster to write data to an Azure Storage account, which is either your original sink or a staging area from which Data Factory continues to write data to the final sink via built-in staged copy. Learn more from Delta lake as the source.
The Databricks cluster needs access to an Azure Blob storage or Azure Data Lake Storage Gen2 account: both the storage container/file system used for source/sink/staging and the container/file system where you want to write the Delta Lake tables.
To use Azure Data Lake Storage Gen2 , you can configure a ser vice principal on the Databricks
cluster as part of the Apache Spark configuration. Follow the steps in Access directly with service
principal.
To use Azure Blob storage , you can configure a storage account access key or SAS token on the
Databricks cluster as part of the Apache Spark configuration. Follow the steps in Access Azure Blob
storage using the RDD API.
During copy activity execution, if the cluster you configured has been terminated, Data Factory automatically starts it. If you author a pipeline by using the Data Factory authoring UI, operations like data preview require a live cluster; Data Factory won't start the cluster on your behalf.
Specify the cluster configuration
1. In the Cluster Mode drop-down, select Standard .
2. In the Databricks Runtime Version drop-down, select a Databricks runtime version.
3. Turn on Auto Optimize by adding the following properties to your Spark configuration:

spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true

4. Configure your cluster depending on your integration and scaling needs.


For cluster configuration details, see Configure clusters.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure
Databricks Delta Lake connector.

Linked service properties


The following properties are supported for an Azure Databricks Delta Lake linked service.

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


AzureDatabricksDeltaLake .

domain Specify the Azure Databricks


workspace URL, e.g.
https://adb-
xxxxxxxxx.xx.azuredatabricks.net
.

clusterId Specify the cluster ID of an existing


cluster. It should be an already created
Interactive Cluster.
You can find the Cluster ID of an
Interactive Cluster on Databricks
workspace -> Clusters -> Interactive
Cluster Name -> Configuration ->
Tags. Learn more.

accessToken Access token is required for Data


Factory to authenticate to Azure
Databricks. Access token needs to be
generated from the databricks
workspace. More detailed steps to find
the access token can be found here.

connectVia The integration runtime that is used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime (if your
data store is located in a private
network). If not specified, it uses the
default Azure integration runtime.

Example:

{
"name": "AzureDatabricksDeltaLakeLinkedService",
"properties": {
"type": "AzureDatabricksDeltaLake",
"typeProperties": {
"domain": "https://adb-xxxxxxxxx.xx.azuredatabricks.net",
"clusterId": "<cluster id>",
"accessToken": {
"type": "SecureString",
"value": "<access token>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for the Azure Databricks Delta Lake dataset.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AzureDatabricksDeltaLakeDataset. | Yes
database | Name of the database. | No for source, yes for sink
table | Name of the delta table. | No for source, yes for sink

Example:
{
"name": "AzureDatabricksDeltaLakeDataset",
"properties": {
"type": "AzureDatabricksDeltaLakeDataset",
"typeProperties": {
"database": "<database name>",
"table": "<delta table name>"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure Databricks Delta Lake source and sink.
Delta lake as source
To copy data from Azure Databricks Delta Lake, the following properties are supported in the Copy activity
source section.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy activity Yes


source must be set to
AzureDatabricksDeltaLakeSource .

query Specify the SQL query to read data. No


For the time travel control, follow the
below pattern:
-
SELECT * FROM events TIMESTAMP
AS OF timestamp_expression
-
SELECT * FROM events VERSION AS
OF version

exportSettings Advanced settings used to retrieve No


data from delta table.

Under exportSettings :

type The type of export command, set to Yes


AzureDatabricksDeltaLakeExpor tC
ommand .

dateFormat Format date type to string with a date No


format. Custom date formats follow
the formats at datetime pattern. If not
specified, it uses the default value
yyyy-MM-dd .

timestampFormat Format timestamp type to string with No


a timestamp format. Custom date
formats follow the formats at datetime
pattern. If not specified, it uses the
default value
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] .
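As an illustration, a Copy activity source that combines a time travel query with customized export formats could look like the following sketch; the query, the version number, and the format strings are illustrative only:

"source": {
    "type": "AzureDatabricksDeltaLakeSource",
    "query": "SELECT * FROM events VERSION AS OF 2",
    "exportSettings": {
        "type": "AzureDatabricksDeltaLakeExportCommand",
        "dateFormat": "yyyy-MM-dd",
        "timestampFormat": "yyyy-MM-dd HH:mm:ss"
    }
}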

Direct copy from delta lake


If your sink data store and format meet the criteria described in this section, you can use the Copy activity to copy directly from an Azure Databricks Delta table to the sink. Data Factory checks the settings and fails the Copy activity run if the following criteria are not met:
The sink linked ser vice is Azure Blob storage or Azure Data Lake Storage Gen2. The account credential
should be pre-configured in Azure Databricks cluster configuration, learn more from Prerequisites.
The sink data format is of Parquet , delimited text , or Avro with the following configurations, and
points to a folder instead of file.
For Parquet format, the compression codec is none , snappy , or gzip .
For delimited text format:
rowDelimiter is any single character.
compression can be none , bzip2 , gzip .
encodingName UTF-7 is not supported.
For Avro format, the compression codec is none , deflate , or snappy .
In the Copy activity source, additionalColumns is not specified.
If copying data to delimited text, in the copy activity sink, fileExtension needs to be ".csv".
In the Copy activity mapping, type conversion is not enabled.
Example:
"activities":[
{
"name": "CopyFromDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delta lake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDatabricksDeltaLakeSource",
"sqlReaderQuery": "SELECT * FROM events TIMESTAMP AS OF timestamp_expression"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Staged copy from delta lake


When your sink data store or format does not match the direct copy criteria, as mentioned in the last section,
enable the built-in staged copy using an interim Azure storage instance. The staged copy feature also provides
you better throughput. Data Factory exports data from Azure Databricks Delta Lake into staging storage, then
copies the data to sink, and finally cleans up your temporary data from the staging storage. See Staged copy for
details about copying data by using staging.
To use this feature, create an Azure Blob storage linked service or Azure Data Lake Storage Gen2 linked service
that refers to the storage account as the interim staging. Then specify the enableStaging and stagingSettings
properties in the Copy activity.

NOTE
The staging storage account credential should be pre-configured in Azure Databricks cluster configuration, learn more
from Prerequisites.

Example:
"activities":[
{
"name": "CopyFromDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delta lake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDatabricksDeltaLakeSource",
"sqlReaderQuery": "SELECT * FROM events TIMESTAMP AS OF timestamp_expression"
},
"sink": {
"type": "<sink type>"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]

Delta lake as sink


To copy data to Azure Databricks Delta Lake, the following properties are supported in the Copy activity sink
section.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy activity Yes


sink, set to
AzureDatabricksDeltaLakeSink .

preCopyScript Specify a SQL query for the Copy No


activity to run before writing data into
Databricks delta table in each run.
Example :
VACUUM eventsTable DRY RUN You
can use this property to clean up the
preloaded data, or add a truncate table
or Vacuum statement.

importSettings Advanced settings used to write data No


into delta table.

Under importSettings :
PROPERTY    DESCRIPTION    REQUIRED

type The type of import command, set to Yes


AzureDatabricksDeltaLakeImpor t
Command .

dateFormat Format string to date type with a date No


format. Custom date formats follow
the formats at datetime pattern. If not
specified, it uses the default value
yyyy-MM-dd .

timestampFormat Format string to timestamp type with No


a timestamp format. Custom date
formats follow the formats at datetime
pattern. If not specified, it uses the
default value
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] .

Direct copy to delta lake


If your source data store and format meet the criteria described in this section, you can use the Copy activity to copy directly from the source to Azure Databricks Delta Lake. Azure Data Factory checks the settings and fails the Copy activity run if the following criteria are not met:
The source linked ser vice is Azure Blob storage or Azure Data Lake Storage Gen2. The account
credential should be pre-configured in Azure Databricks cluster configuration, learn more from
Prerequisites.
The source data format is of Parquet , delimited text , or Avro with the following configurations, and
points to a folder instead of file.
For Parquet format, the compression codec is none , snappy , or gzip .
For delimited text format:
rowDelimiter is default, or any single character.
compression can be none , bzip2 , gzip .
encodingName UTF-7 is not supported.
For Avro format, the compression codec is none , deflate , or snappy .
In the Copy activity source:
wildcardFileName only contains wildcard * but not ? , and wildcardFolderName is not specified.
prefix , modifiedDateTimeStart , modifiedDateTimeEnd , and enablePartitionDiscovery are not
specified.
additionalColumns is not specified.
In the Copy activity mapping, type conversion is not enabled.
Example:
"activities":[
{
"name": "CopyToDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Delta lake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDatabricksDeltaLakeSink",
"sqlReadrQuery": "VACUUM eventsTable DRY RUN"
}
}
}
]

Staged copy to delta lake


When your source data store or format does not match the direct copy criteria, as mentioned in the last section,
enable the built-in staged copy using an interim Azure storage instance. The staged copy feature also provides
you better throughput. Data Factory automatically converts the data to meet the data format requirements into
staging storage, then load data into delta lake from there. Finally, it cleans up your temporary data from the
storage. See Staged copy for details about copying data using staging.
To use this feature, create an Azure Blob storage linked service or Azure Data Lake Storage Gen2 linked service
that refers to the storage account as the interim staging. Then specify the enableStaging and stagingSettings
properties in the Copy activity.

NOTE
The staging storage account credential should be pre-configured in Azure Databricks cluster configuration, learn more
from Prerequisites.

Example:
"activities":[
{
"name": "CopyToDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Delta lake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDatabricksDeltaLakeSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]

Monitoring
Azure Data Factory provides the same copy activity monitoring experience as other connectors. In addition, because loading data from/to delta lake runs on your Azure Databricks cluster, you can further view detailed cluster logs and monitor performance.

Lookup activity properties


For more information about the properties, see Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by Copy activity in Data Factory, see supported data
stores and formats.
Copy data from or to Azure File Storage by using
Azure Data Factory
5/6/2021 • 19 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data to and from Azure File Storage. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This Azure File Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
You can copy data from Azure File Storage to any supported sink data store, or copy data from any supported
source data store to Azure File Storage. For a list of data stores that Copy Activity supports as sources and sinks,
see Supported data stores and formats.
Specifically, this Azure File Storage connector supports:
Copying files by using account key or service shared access signature (SAS) authentications.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure File Storage.

Linked service properties


This Azure File Storage connector supports the following authentication types. See the corresponding sections
for details.
Account key authentication
Shared access signature authentication
NOTE
If you were using the Azure File Storage linked service with the legacy model, which the ADF authoring UI shows as "Basic authentication", it is still supported as-is, but we suggest that you use the new model going forward. The legacy model transfers data from/to storage over Server Message Block (SMB), while the new model uses the storage SDK, which has better throughput. To upgrade, edit your linked service to switch the authentication method to "Account key" or "SAS URI"; no change is needed on the dataset or copy activity.

Account key authentication


Data Factory supports the following properties for Azure File Storage account key authentication:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureFileStorage .

connectionString Specify the information needed to Yes


connect to Azure File Storage.
You can also put the account key in
Azure Key Vault and pull the
accountKey configuration out of the
connection string. For more
information, see the following samples
and the Store credentials in Azure Key
Vault article.

fileShare Specify the file share. Yes

snapshot Specify the date of the file share No


snapshot if you want to copy from a
snapshot.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

Example:

{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>;EndpointSuffix=core.windows.net;",
"fileShare": "<file share name>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store the account key in Azure Key Vault

{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;",
"fileShare": "<file share name>",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Shared access signature authentication


A shared access signature provides delegated access to resources in your storage account. You can use a shared
access signature to grant a client limited permissions to objects in your storage account for a specified time. For
more information about shared access signatures, see Shared access signatures: Understand the shared access
signature model.
Data Factory supports the following properties for using shared access signature authentication:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureFileStorage .

sasUri Specify the shared access signature Yes


URI to the resources.
Mark this field as SecureString to
store it securely in Data Factory. You
can also put the SAS token in Azure
Key Vault to use auto-rotation and
remove the token portion. For more
information, see the following samples
and Store credentials in Azure Key
Vault.

fileShare Specify the file share. Yes

snapshot Specify the date of the file share No


snapshot if you want to copy from a
snapshot.

connectVia The Integration Runtime to be used to No


connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

Example:

{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the resource e.g. https://<accountname>.file.core.windows.net/?sv=
<storage version>&st=<start time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=
<protocol>&sig=<signature>>"
},
"fileShare": "<file share name>",
"snapshot": "<snapshot version>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store the SAS token in Azure Key Vault

{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<accountname>.file.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName with value of SAS token e.g. ?sv=<storage version>&st=<start
time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Legacy model
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


AzureFileStorage .

host Specifies the Azure File Storage Yes


endpoint as:
-Using UI: specify
\\<storage
name>.file.core.windows.net\
<file service name>
- Using JSON:
"host": "\\\\<storage
name>.file.core.windows.net\\
<file service name>"
.

userid Specify the user to access the Azure Yes


File Storage as:
-Using UI: specify
AZURE\<storage name>
-Using JSON:
"userid": "AZURE\\<storage
name>"
.

password Specify the storage access key. Mark Yes


this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No for source, Yes for sink
connect to the data store. You can use
Azure Integration Runtime or Self-
hosted Integration Runtime (if your
data store is located in private
network). If not specified, it uses the
default Azure Integration Runtime.

Example:

{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"host": "\\\\<storage name>.file.core.windows.net\\<file service name>",
"userid": "AZURE\\<storage name>",
"password": {
"type": "SecureString",
"value": "<storage access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure File Storage under location settings in format-based dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property under location in dataset must be set to AzureFileStorageLocation. | Yes
folderPath | The path to folder. If you want to use wildcard to filter folder, skip this setting and specify in activity source settings. | No
fileName | The file name under the given folderPath. If you want to use wildcard to filter files, skip this setting and specify in activity source settings. | No

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureFileStorageLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Azure File Storage source and sink.
Azure File Storage as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure File Storage under storeSettings settings in format-based
copy source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureFileStorageReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given folder/file path


specified in the dataset. If you want to
copy all files from a folder, additionally
specify wildcardFileName as * .

OPTION 2: file prefix Prefix for the file name under the given No
- prefix file share configured in a dataset to
filter source files. Files with name
starting with
fileshare_in_linked_service/this_prefix
are selected. It utilizes the service-side
filter for Azure File Storage, which
provides better performance than a
wildcard filter. This feature is not
supported when using a legacy linked
service model.

OPTION 3: wildcard The folder path with wildcard No


- wildcardFolderPath characters to filter source folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder
name has wildcard or this escape char
inside.
See more examples in Folder and file
filter examples.

OPTION 3: wildcard The file name with wildcard characters Yes


- wildcardFileName under the given
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual file name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

OPTION 4: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When using this option, do not specify
file name in dataset. See more
examples in File list examples.

Additional settings:

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink is a
file-based store, an empty folder or
subfolder isn't copied or created at the
sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified.
The files will be selected if their last
modified time is within the time range
between modifiedDatetimeStart
and modifiedDatetimeEnd . The time
is applied to UTC time zone in the
format of "2018-12-01T05:00:00Z".
The properties can be NULL, which
means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureFileStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
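As a variant, the same source can filter files by their last-modified time instead of, or in addition to, wildcards. The following sketch shows only the storeSettings portion; the time window is illustrative:

"storeSettings":{
    "type": "AzureFileStorageReadSettings",
    "recursive": true,
    "wildcardFileName": "*.csv",
    "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
    "modifiedDatetimeEnd": "2018-12-01T06:00:00Z"
}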

Azure File Storage as sink


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for Azure File Storage under storeSettings settings in format-based
copy sink:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
AzureFileStorageWriteSettings .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- Preser veHierarchy (default) :
Preserves the file hierarchy in the
target folder. The relative path of
source file to source folder is identical
to the relative path of target file to
target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
it's an autogenerated file name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureFileStorageWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH    FILENAME    RECURSIVE    SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

Folder* (empty, use default) false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* (empty, use default) true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

File list examples


This section describes the resulting behavior of using file list path in copy activity source.
Assuming you have the following source folder structure and want to copy the files in bold:
SAMPLE SOURCE STRUCTURE    CONTENT IN FILELISTTOCOPY.TXT    ADF CONFIGURATION

root File1.csv In dataset:


FolderA Subfolder1/File3.csv - Folder path: root/FolderA
File1.csv Subfolder1/File5.csv
File2.json In copy activity source:
Subfolder1 - File list path:
File3.csv root/Metadata/FileListToCopy.txt
File4.json
File5.csv The file list path points to a text file in
Metadata the same data store that includes a list
FileListToCopy.txt of files you want to copy, one file per
line with the relative path to the path
configured in the dataset.
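In JSON, the file list option in the table above corresponds to setting fileListPath under storeSettings in the copy activity source, as in this sketch; the DelimitedTextSource wrapper is illustrative and depends on your file format:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureFileStorageReadSettings",
        "fileListPath": "root/Metadata/FileListToCopy.txt"
    }
}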

recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

RECURSIVE    COPYBEHAVIOR    SOURCE FOLDER STRUCTURE    RESULTING TARGET

true preserveHierarchy Folder1 The target folder Folder1 is


File1 created with the same
File2 structure as the source:
Subfolder1
File3 Folder1
File4 File1
File5 File2
Subfolder1
File3
File4
File5.

true flattenHierarchy Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2
autogenerated name for
File3
autogenerated name for
File4
autogenerated name for
File5

true mergeFiles Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 + File3 +
File5 File4 + File 5 contents are
merged into one file with
autogenerated file name

false preserveHierarchy Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 File1
File5 File2

Subfolder1 with File3, File4,


and File5 are not picked up.

false flattenHierarchy Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2

Subfolder1 with File3, File4,


and File5 are not picked up.

false mergeFiles Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 File1 + File2 contents are
File5 merged into one file with
autogenerated file name.
autogenerated name for
File1

Subfolder1 with File3, File4,


and File5 are not picked up.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity
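For instance, a GetMetadata activity that returns a folder's child items and last modified time from the file share could be sketched as follows; the activity name, dataset reference, and field list are illustrative:

"activities":[
    {
        "name": "GetFileShareMetadata",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": {
                "referenceName": "<Azure File Storage dataset name>",
                "type": "DatasetReference"
            },
            "fieldList": [ "childItems", "lastModified" ]
        }
    }
]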

Delete activity properties


To learn details about the properties, check Delete activity

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model mentioned in the sections above going forward; the ADF authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: FileShare

folderPath Path to the folder. Yes

Wildcard filter is supported, allowed


wildcards are: * (matches zero or
more characters) and ? (matches
zero or single character); use ^ to
escape if your actual folder name has
wildcard or this escape char inside.

Examples: rootfolder/subfolder/, see


more examples in Folder and file filter
examples.

fileName Name or wildcard filter for the No


file(s) under the specified "folderPath".
If you don't specify a value for this
property, the dataset points to all files
in the folder.

For filter, allowed wildcards are: *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has wildcard or this escape char
inside.

When fileName isn't specified for an


output dataset and
preser veHierarchy isn't specified in
the activity sink, the copy activity
automatically generates the file name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if configured].
[compression if configured]", for
example "Data.0a405f8a-93ff-4c6f-
b3be-f69616f1df7a.txt.gz"; if you copy
from tabular source using table name
instead of query, the name pattern is
"[table name].[format].[compression if
configured]", for example
"MyTable.csv".

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware that the overall performance of data movement will be affected by enabling this setting when you want to filter huge numbers of files.

The properties can be NULL, which means no file attribute filter will be applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.
P RO P ERT Y DESC RIP T IO N REQ UIRED

modifiedDatetimeEnd Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware that enabling this setting affects the overall performance of data movement when you filter large numbers of files.

The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected.

format If you want to copy files as-is No (only for binary copy scenario)
between file-based stores (binary
copy), skip the format section in both
input and output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat , JsonFormat ,
AvroFormat , OrcFormat ,
ParquetFormat . Set the type
property under format to one of these
values. For more information, see Text
Format, Json Format, Avro Format, Orc
Format, and Parquet Format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are: GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are: Optimal and
Fastest .
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using the "fileFilter" property for file filtering, it is still supported as-is, but we recommend that you use the new filter
capability added to "fileName" going forward.

Example:

{
"name": "AzureFileStorageDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy copy activity source model


P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
FileSystemSource

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that when
recursive is set to true and the sink is a
file-based store, an empty folder or
subfolder is not copied or created at the sink.
Allowed values are: true (default),
false
P RO P ERT Y DESC RIP T IO N REQ UIRED

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure File Storage input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Legacy copy activity sink model


P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


sink must be set to: FileSystemSink
P RO P ERT Y DESC RIP T IO N REQ UIRED

copyBehavior Defines the copy behavior when the No


source is files from file-based data
store.

Allowed values are:


- Preser veHierarchy (default) :
preserves the file hierarchy in the
target folder. The relative path of
source file to source folder is identical
to the relative path of target file to
target folder.
- FlattenHierarchy : all files from the
source folder are in the first level of
target folder. The target files have
autogenerated name.
- MergeFiles : merges all files from
the source folder to one file. If the File
Name is specified, the merged file
name would be the specified name;
otherwise, would be autogenerated file
name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure File Storage output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure SQL Database
by using Azure Data Factory
7/16/2021 • 29 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure SQL
Database, and use Data Flow to transform data in Azure SQL Database. To learn about Azure Data Factory, read
the introductory article.

Supported capabilities
This Azure SQL Database connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this Azure SQL Database connector supports these functions:
Copying data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to copy in parallel from an Azure SQL Database source; see the Parallel copy from SQL database section for details.
As a sink, automatically creating the destination table if it doesn't exist based on the source schema, appending data to a table, or invoking a stored procedure with custom logic during the copy.
If you use the Azure SQL Database serverless tier, note that when the server is paused, an activity run fails instead of waiting for auto-resume to complete. You can add an activity retry or chain additional activities to make sure the server is live when the copy actually runs, as in the sketch below.
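For illustration, a minimal sketch of a Copy activity with a retry policy that gives a paused serverless database time to resume; the retry count and interval below are assumptions, not prescribed values:

{
    "name": "CopyFromAzureSQLDatabase",
    "type": "Copy",
    "policy": {
        "retry": 3,
        "retryIntervalInSeconds": 120
    },
    "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "sink": { "type": "<sink type>" }
    }
}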

IMPORTANT
If you copy data by using the Azure integration runtime, configure a server-level firewall rule so that Azure services can
access the server. If you copy data by using a self-hosted integration runtime, configure the firewall to allow the
appropriate IP range. This range includes the machine's IP that's used to connect to Azure SQL Database.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Azure Data Factory entities
specific to an Azure SQL Database connector.

Linked service properties


These properties are supported for an Azure SQL Database linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to Yes


AzureSqlDatabase .

connectionString Specify information needed to connect Yes


to the Azure SQL Database instance
for the connectionString property.
You also can put a password or service
principal key in Azure Key Vault. If it's
SQL authentication, pull the
password configuration out of the
connection string. For more
information, see the JSON example
following the table and Store
credentials in Azure Key Vault.

servicePrincipalId Specify the application's client ID. Yes, when you use Azure AD
authentication with a service principal

servicePrincipalKey Specify the application's key. Mark this Yes, when you use Azure AD
field as SecureString to store it authentication with a service principal
securely in Azure Data Factory or
reference a secret stored in Azure Key
Vault.

tenant Specify the tenant information, like the Yes, when you use Azure AD
domain name or tenant ID, under authentication with a service principal
which your application resides. Retrieve
it by hovering the mouse in the upper-
right corner of the Azure portal.

azureCloudType For service principal authentication, No


specify the type of Azure cloud
environment to which your Azure AD
application is registered.
Allowed values are AzurePublic,
AzureChina , AzureUsGovernment ,
and AzureGermany . By default, the
data factory's cloud environment is
used.

alwaysEncryptedSettings Specify alwaysencr yptedsettings No


information that's needed to enable
Always Encrypted to protect sensitive
data stored in SQL server by using
either managed identity or service
principal. For more information, see
the JSON example following the table
and Using Always Encrypted section. If
not specified, the default always
encrypted setting is disabled.
P RO P ERT Y DESC RIP T IO N REQ UIRED

connectVia This integration runtime is used to No


connect to the data store. You can use
the Azure integration runtime or a self-
hosted integration runtime if your data
store is located in a private network. If
not specified, the default Azure
integration runtime is used.

NOTE
Azure SQL Database Always Encr ypted is not supported in data flow.

For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources

TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached," add Pooling=false to your connection string and try again. Pooling=false is
also recommended for linked services that use a self-hosted integration runtime (SHIR). Pooling and other
connection parameters can be added as new parameter names and values in the Additional connection proper ties
section of the linked service creation form.
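For reference, a sketch of a connection string with pooling disabled; the server, database, and credential values are placeholders:

"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Pooling=false;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"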

SQL authentication
Example: using SQL authentication

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: password in Azure Key Vault


{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Use Always Encr ypted

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication


To use a service principal-based Azure AD application token authentication, follow these steps:
1. Create an Azure Active Directory application from the Azure portal. Make note of the application name
and the following values that define the linked service:
Application ID
Application key
Tenant ID
2. Provision an Azure Active Directory administrator for your server on the Azure portal if you haven't
already done so. The Azure AD administrator must be an Azure AD user or Azure AD group, but it can't be
a service principal. This step is done so that, in the next step, you can use an Azure AD identity to create a
contained database user for the service principal.
3. Create contained database users for the service principal. Connect to the database from or to which you
want to copy data by using tools like SQL Server Management Studio, with an Azure AD identity that has
at least ALTER ANY USER permission. Run the following T-SQL:

CREATE USER [your application name] FROM EXTERNAL PROVIDER;

4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the
following code. For more options, see this document.

ALTER ROLE [role name] ADD MEMBER [your application name];

5. Configure an Azure SQL Database linked service in Azure Data Factory.


Linked service example that uses service principal authentication

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;Connection Timeout=30",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources that represents the specific data
factory. You can use this managed identity for Azure SQL Database authentication. The designated factory can
access and copy data from or to your database by using this identity.
To use managed identity authentication, follow these steps.
1. Provision an Azure Active Directory administrator for your server on the Azure portal if you haven't
already done so. The Azure AD administrator can be an Azure AD user or an Azure AD group. If you grant
the group with managed identity an admin role, skip steps 3 and 4. The administrator has full access to
the database.
2. Create contained database users for the Azure Data Factory managed identity. Connect to the database
from or to which you want to copy data by using tools like SQL Server Management Studio, with an
Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL:

CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER;

3. Grant the Data Factory managed identity needed permissions as you normally do for SQL users and
others. Run the following code. For more options, see this document.

ALTER ROLE [role name] ADD MEMBER [your Data Factory name];

4. Configure an Azure SQL Database linked service in Azure Data Factory.


Example

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available to define datasets, see Datasets.
The following properties are supported for Azure SQL Database dataset:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to AzureSqlTable .

schema Name of the schema. No for source, Yes for sink

table Name of the table/view. No for source, Yes for sink

tableName Name of the table/view with schema. No for source, Yes for sink
This property is supported for
backward compatibility. For new
workload, use schema and table .

Dataset properties example


{
"name": "AzureSQLDbDataset",
"properties":
{
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Database linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see Pipelines. This section provides a list
of properties supported by the Azure SQL Database source and sink.
Azure SQL Database as the source

TIP
To load data from Azure SQL Database efficiently by using data partitioning, learn more from Parallel copy from SQL
database.

To copy data from Azure SQL Database, the following properties are supported in the copy activity source
section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to
AzureSqlSource . "SqlSource" type is
still supported for backward
compatibility.

sqlReaderQuery This property uses the custom SQL No


query to read data. An example is
select * from MyTable .

sqlReaderStoredProcedureName The name of the stored procedure that No


reads data from the source table. The
last SQL statement must be a SELECT
statement in the stored procedure.

storedProcedureParameters Parameters for the stored procedure. No


Allowed values are name or value
pairs. The names and casing of
parameters must match the names
and casing of the stored procedure
parameters.
P RO P ERT Y DESC RIP T IO N REQ UIRED

isolationLevel Specifies the transaction locking No


behavior for the SQL source. The
allowed values are: ReadCommitted ,
ReadUncommitted ,
RepeatableRead , Serializable ,
Snapshot . If not specified, the
database's default isolation level is
used. Refer to this doc for more details.

partitionOptions Specifies the data partitioning options No


used to load data from Azure SQL
Database.
Allowed values are: None (default),
PhysicalPar titionsOfTable , and
DynamicRange .
When a partition option is enabled
(that is, not None ), the degree of
parallelism to concurrently load data
from an Azure SQL Database is
controlled by the parallelCopies
setting on the copy activity.

partitionSettings Specify the group of the settings for No


data partitioning.
Apply when the partition option isn't
None .

Under partitionSettings :

partitionColumnName Specify the name of the source column No


in integer or date/datetime type (
int , smallint , bigint , date ,
smalldatetime , datetime ,
datetime2 , or datetimeoffset )
that will be used by range partitioning
for parallel copy. If not specified, the
index or the primary key of the table is
autodetected and used as the partition
column.
Apply when the partition option is
DynamicRange . If you use a query to
retrieve the source data, hook
?
AdfDynamicRangePartitionCondition
in the WHERE clause. For an example,
see the Parallel copy from SQL
database section.
P RO P ERT Y DESC RIP T IO N REQ UIRED

partitionUpperBound The maximum value of the partition No


column for partition range splitting.
This value is used to decide the
partition stride, not for filtering the
rows in the table. All rows in the table or
query result will be partitioned and
copied. If not specified, the copy activity
automatically detects the value.
Apply when the partition option is
DynamicRange . For an example, see
the Parallel copy from SQL database
section.

partitionLowerBound The minimum value of the partition No


column for partition range splitting.
This value is used to decide the
partition stride, not for filtering the
rows in the table. All rows in the table or
query result will be partitioned and
copied. If not specified, the copy activity
automatically detects the value.
Apply when the partition option is
DynamicRange . For an example, see
the Parallel copy from SQL database
section.

Note the following points:


If sqlReaderQuer y is specified for AzureSqlSource , the copy activity runs this query against the Azure
SQL Database source to get the data. You also can specify a stored procedure by specifying
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
When using a stored procedure in the source to retrieve data, note that if your stored procedure is designed to return a
different schema when a different parameter value is passed in, you may encounter a failure or see an unexpected result
when importing the schema from the UI or when copying data to the SQL database with auto table creation.
SQL query example
"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure example

"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != stringData
and dbo.UnitTestSrcTable.identifier != identifier
END
GO

Azure SQL Database as the sink

TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into Azure SQL Database.

To copy data to Azure SQL Database, the following properties are supported in the copy activity sink section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


sink must be set to AzureSqlSink .
"SqlSink" type is still supported for
backward compatibility.

preCopyScript Specify a SQL query for the copy No


activity to run before writing data into
Azure SQL Database. It's invoked only
once per copy run. Use this property
to clean up the preloaded data.

tableOption Specifies whether to automatically No


create the sink table if not exists based
on the source schema.
Auto table creation is not supported
when sink specifies stored procedure.
Allowed values are: none (default),
autoCreate .

sqlWriterStoredProcedureName The name of the stored procedure that No


defines how to apply source data into
a target table.
This stored procedure is invoked per
batch. For operations that run only
once and have nothing to do with
source data, for example, delete or
truncate, use the preCopyScript
property.
See example from Invoke a stored
procedure from a SQL sink.

storedProcedureTableTypeParameterNa The parameter name of the table type No


me specified in the stored procedure.
P RO P ERT Y DESC RIP T IO N REQ UIRED

sqlWriterTableType The table type name to be used in the No


stored procedure. The copy activity
makes the data being moved available
in a temp table with this table type.
Stored procedure code can then merge
the data that's being copied with
existing data.

storedProcedureParameters Parameters for the stored procedure. No


Allowed values are name and value
pairs. Names and casing of parameters
must match the names and casing of
the stored procedure parameters.

writeBatchSize Number of rows to insert into the SQL No


table per batch.
The allowed value is integer (number
of rows). By default, Azure Data
Factory dynamically determines the
appropriate batch size based on the
row size.

writeBatchTimeout The wait time for the batch insert No


operation to finish before it times out.
The allowed value is timespan . An
example is "00:30:00" (30 minutes).

disableMetricsCollection Data Factory collects metrics such as No (default is false )


Azure SQL Database DTUs for copy
performance optimization and
recommendations, which introduces
additional master DB access. If you are
concerned with this behavior, specify
true to turn it off.

maxConcurrentConnections The upper limit of concurrent connecti No


ons established to the data store durin
g the activity run. Specify a value only
when you want to limit concurrent con
nections.

Example 1: Append data


"activities":[
{
"name": "CopyToAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure SQL Database output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSqlSink",
"tableOption": "autoCreate",
"writeBatchSize": 100000
}
}
}
]

Example 2: Invoke a stored procedure during copy


Learn more details from Invoke a stored procedure from a SQL sink.
"activities":[
{
"name": "CopyToAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure SQL Database output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSqlSink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"storedProcedureTableTypeParameterName": "MyTable",
"sqlWriterTableType": "MyTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Parallel copy from SQL database


The Azure SQL Database connector in copy activity provides built-in data partitioning to copy data in parallel.
You can find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, copy activity runs parallel queries against your Azure SQL Database source
to load data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity.
For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based
on your specified partition option and settings, and each query retrieves a portion of data from your Azure SQL
Database.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of data
from your Azure SQL Database. The following are suggested configurations for different scenarios. When
copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the
folder name), in which case the performance is better than writing to a single file.
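As a sketch of where these settings live in the copy activity JSON, the partition option goes on the source and the degree of parallelism on the activity itself; the value of 4 below is only an illustrative assumption:

"typeProperties": {
    "source": {
        "type": "AzureSqlSource",
        "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
        "type": "<sink type>"
    },
    "parallelCopies": 4
}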

SC EN A RIO SUGGEST ED SET T IN GS

Full load from large table, with physical partitions. Par tition option : Physical partitions of table.

During execution, Data Factory automatically detects the


physical partitions, and copies data by partitions.

To check if your table has physical partition or not, you can


refer to this query.

Full load from large table, without physical partitions, while Par tition options : Dynamic range partition.
with an integer or datetime column for data partitioning. Par tition column (optional): Specify the column used to
partition data. If not specified, the index or primary key
column is used.
Par tition upper bound and par tition lower bound
(optional): Specify if you want to determine the partition
stride. This is not for filtering the rows in the table; all rows in the
table will be partitioned and copied. If not specified, the copy
activity automatically detects the values.

For example, if your partition column "ID" has values range


from 1 to 100, and you set the lower bound as 20 and the
upper bound as 80, with parallel copy as 4, Data Factory
retrieves data by 4 partitions - IDs in range <=20, [21, 50],
[51, 80], and >=81, respectively.
SC EN A RIO SUGGEST ED SET T IN GS

Load a large amount of data by using a custom query, Par tition options : Dynamic range partition.
without physical partitions, while with an integer or Quer y :
date/datetime column for data partitioning. SELECT * FROM <TableName> WHERE ?
AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>
.
Par tition column : Specify the column used to partition
data.
Par tition upper bound and par tition lower bound
(optional): Specify if you want to determine the partition
stride. This is not for filtering the rows in the table; all rows in the
query result will be partitioned and copied. If not specified,
the copy activity automatically detects the value.

During execution, Data Factory replaces


?AdfRangePartitionColumnName with the actual column
name and value ranges for each partition, and sends to
Azure SQL Database.
For example, if your partition column "ID" has values range
from 1 to 100, and you set the lower bound as 20 and the
upper bound as 80, with parallel copy as 4, Data Factory
retrieves data by 4 partitions- IDs in range <=20, [21, 50],
[51, 80], and >=81, respectively.

Here are more sample queries for different scenarios:


1. Query the whole table:
SELECT * FROM <TableName> WHERE ?
AdfDynamicRangePartitionCondition
2. Query from a table with column selection and additional
where-clause filters:
SELECT <column_list> FROM <TableName> WHERE ?
AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>
3. Query with subqueries:
SELECT <column_list> FROM (<your_sub_query>) AS T
WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>
4. Query with partition in subquery:
SELECT <column_list> FROM (SELECT
<your_sub_query_column_list> FROM <TableName> WHERE
?AdfDynamicRangePartitionCondition) AS T

Best practices to load data with partition option:


1. Choose a distinctive column as the partition column (like a primary key or unique key) to avoid data skew.
2. If the table has a built-in partition, use the partition option "Physical partitions of table" to get better performance.
3. If you use Azure Integration Runtime to copy data, you can set larger "Data Integration Units (DIU)" (>4) to
utilize more computing resources. Check the applicable scenarios there.
4. "Degree of copy parallelism" controls the number of partitions; setting this number too large can sometimes hurt
performance. We recommend setting it to (DIU or number of self-hosted IR nodes) * (2 to 4).
Example: full load from large table with physical par titions

"source": {
"type": "AzureSqlSource",
"partitionOption": "PhysicalPartitionsOfTable"
}

Example: quer y with dynamic range par tition


"source": {
"type": "AzureSqlSource",
"query":"SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride,
not as data filter>",
"partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride,
not as data filter>"
}
}

Sample query to check physical partition

SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, pf.name AS PartitionFunctionName, c.name AS


ColumnName, iif(pf.name is null, 'no', 'yes') AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id
= t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
LEFT JOIN sys.partition_functions pf ON pf.function_id = ps.function_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'

If the table has a physical partition, "HasPartition" is reported as "yes" for it in the query output.

Best practice for loading data into Azure SQL Database


When you copy data into Azure SQL Database, you might require different write behavior:
Append: My source data has only new records.
Upsert: My source data has both inserts and updates.
Overwrite: I want to reload an entire dimension table each time.
Write with custom logic: I need extra processing before the final insertion into the destination table.
Refer to the respective sections about how to configure in Azure Data Factory and best practices.
Append data
Appending data is the default behavior of this Azure SQL Database sink connector. Azure Data Factory does a
bulk insert to write to your table efficiently. You can configure the source and sink accordingly in the copy
activity.
Upsert data
Option 1: When you have a large amount of data to copy, you can bulk load all records into a staging table by
using the copy activity, then run a stored procedure activity to apply a MERGE or INSERT/UPDATE statement in
one shot.
The copy activity currently doesn't natively support loading data into a database temporary table. There is an
advanced way to set it up with a combination of multiple activities; refer to Optimize Azure SQL Database Bulk
Upsert scenarios. The following shows a sample that uses a permanent table as staging.
As an example, in Azure Data Factory, you can create a pipeline with a Copy activity chained with a Stored
Procedure activity . The former copies data from your source store into an Azure SQL Database staging table,
for example, Upser tStagingTable , as the table name in the dataset. Then the latter invokes a stored procedure
to merge source data from the staging table into the target table and clean up the staging table.

In your database, define a stored procedure with MERGE logic, like the following example, which is pointed to
from the previous stored procedure activity. Assume that the target is the Marketing table with three columns:
ProfileID , State , and Categor y . Do the upsert based on the ProfileID column.

CREATE PROCEDURE [dbo].[spMergeData]


AS
BEGIN
MERGE TargetTable AS target
USING UpsertStagingTable AS source
ON (target.[ProfileID] = source.[ProfileID])
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT matched THEN
INSERT ([ProfileID], [State], [Category])
VALUES (source.ProfileID, source.State, source.Category);
TRUNCATE TABLE UpsertStagingTable
END

Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Option 3: You can use Mapping Data Flow which offers built-in insert/upsert/update methods.
Overwrite the entire table
You can configure the preCopyScript property in the copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
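A minimal sketch of such a sink configuration; the table name in the pre-copy script is a hypothetical placeholder:

"sink": {
    "type": "AzureSqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.DimCustomer",
    "writeBatchSize": 100000
}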
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you
need to apply extra processing before the final insertion of source data into the destination table, you can load
to a staging table then invoke stored procedure activity, or invoke a stored procedure in copy activity sink to
apply data, or use Mapping Data Flow.

Invoke a stored procedure from a SQL sink


When you copy data into Azure SQL Database, you also can configure and invoke a user-specified stored
procedure with additional parameters on each batch of the source table. The stored procedure feature takes
advantage of table-valued parameters.
You can use a stored procedure when built-in copy mechanisms don't serve the purpose. An example is when
you want to apply extra processing before the final insertion of source data into the destination table. Some
extra processing examples are when you want to merge columns, look up additional values, and insert into
more than one table.
The following sample shows how to use a stored procedure to do an upsert into a table in Azure SQL Database.
Assume that the input data and the sink Marketing table each have three columns: ProfileID , State , and
Categor y . Do the upsert based on the ProfileID column, and only apply it for a specific category called
"ProductA".
1. In your database, define the table type with the same name as sqlWriterTableType . The schema of the
table type is the same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(


[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category


varchar(256)
AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

3. In Azure Data Factory, define the SQL sink section in the copy activity as follows:

"sink": {
"type": "AzureSqlSink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from Azure SQL Database. For
more information, see the source transformation and sink transformation in mapping data flows.
Source transformation
Settings specific to Azure SQL Database are available in the Source Options tab of the source transformation.
Input: Select whether you point your source at a table (equivalent of Select * from <table-name> ) or enter a
custom SQL query.
Quer y : If you select Query in the input field, enter a SQL query for your source. This setting overrides any table
that you've chosen in the dataset. Order By clauses aren't supported here, but you can set a full SELECT FROM
statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that
returns a table. This query will produce a source table that you can use in your data flow. Using queries is also a
great way to reduce rows for testing or for lookups.
Stored procedure : Choose this option if you wish to generate a projection and source data from a stored
procedure that is executed from your source database. You can type in the schema, procedure name, and
parameters, or click on Refresh to ask ADF to discover the schemas and procedure names. Then you can click on
Import to import all procedure parameters using the form @paraName .

SQL Example: Select * from MyTable where customerId > 1000 and customerId < 2000
Parameterized SQL Example: "select * from {$tablename} where orderyear > {$year}"
Batch size : Enter a batch size to chunk large data into reads.
Isolation Level : The default for SQL sources in mapping data flow is read uncommitted. You can change the
isolation level here to one of these values:
Read Committed
Read Uncommitted
Repeatable Read
Serializable
None (ignore isolation level)
Sink transformation
Settings specific to Azure SQL Database are available in the Settings tab of the sink transformation.
Update method: Determines what operations are allowed on your database destination. The default is to only
allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those
actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.

The column name that you pick as the key here will be used by ADF as part of the subsequent update, upsert,
or delete. Therefore, you must pick a column that exists in the sink mapping. If you don't wish to write the value to
this key column, click "Skip writing key columns".
You can parameterize the key column used here for updating your target Azure SQL Database table. If you have
multiple columns for a composite key, click on "Custom Expression" and you will be able to add dynamic
content using the ADF data flow expression language, which can include an array of strings with column names
for a composite key, as in the sketch below.
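For example, a hypothetical custom expression for a two-column composite key, written in the data flow expression language (the column names are placeholders):

['ProfileID', 'Category']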
Table action: Determines whether to recreate or remove all rows from the destination table prior to writing.
None: No action will be done to the table.
Recreate: The table will get dropped and recreated. Required if creating a new table dynamically.
Truncate: All rows from the target table will get removed.
Batch size : Controls how many rows are being written in each bucket. Larger batch sizes improve compression
and memory optimization, but risk out of memory exceptions when caching data.
Use TempDB: By default, Data Factory will use a global temporary table to store data as part of the loading
process. You can alternatively uncheck the "Use TempDB" option and instead ask Data Factory to store the
temporary holding table in the user database that is being used for this sink.
Pre and Post SQL scripts : Enter multi-line SQL scripts that will execute before (pre-processing) and after
(post-processing) data is written to your sink database, as in the sketch below.
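A hypothetical example, assuming a staging table dbo.StagingSales and a merge procedure dbo.spMergeStagingIntoSales that are not part of this article:

-- Pre SQL script: clear the staging table before the data flow writes to it
TRUNCATE TABLE dbo.StagingSales;

-- Post SQL script: fold the freshly loaded rows into the target table
EXEC dbo.spMergeStagingIntoSales;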

Error row handling


When writing to Azure SQL DB, certain rows of data may fail due to constraints set by the destination. Some
common errors include:
String or binary data would be truncated in table
Cannot insert the value NULL into column
The INSERT statement conflicted with the CHECK constraint
By default, a data flow run will fail on the first error it gets. You can choose Continue on error , which allows
your data flow to complete even if individual rows have errors. Azure Data Factory provides different options for
you to handle these error rows.
Transaction Commit: Choose whether your data gets written in a single transaction or in batches. Single
transaction will provide worse performance but no data written will be visible to others until the transaction
completes.
Output rejected data: If enabled, you can output the error rows into a csv file in Azure Blob Storage or an
Azure Data Lake Storage Gen2 account of your choosing. This will write the error rows with three additional
columns: the SQL operation like INSERT or UPDATE, the data flow error code, and the error message on the row.
Repor t success on error : If enabled, the data flow will be marked as a success even if error rows are found.

Data type mapping for Azure SQL Database


When data is copied from or to Azure SQL Database, the following mappings are used from Azure SQL
Database data types to Azure Data Factory interim data types. To learn how the copy activity maps the source
schema and data type to the sink, see Schema and data type mappings.

A Z URE SQ L DATA B A SE DATA T Y P E A Z URE DATA FA C TO RY IN T ERIM DATA T Y P E

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]


A Z URE SQ L DATA B A SE DATA T Y P E A Z URE DATA FA C TO RY IN T ERIM DATA T Y P E

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

ntext String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

sql_variant Object

text String, Char[]

time TimeSpan

timestamp Byte[]

tinyint Byte

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

xml String

NOTE
For data types that map to the Decimal interim type, the Copy activity currently supports precision up to 28. If you have data
with precision larger than 28, consider converting it to a string in the SQL query, as in the sketch below.
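For example, a hypothetical source query that casts a high-precision decimal column to a string; the table and column names are placeholders:

SELECT CAST(HighPrecisionValue AS VARCHAR(50)) AS HighPrecisionValue, OtherColumn
FROM dbo.MyTable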
Lookup activity properties
To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity

Using Always Encrypted


When you copy data from/to SQL Server with Always Encrypted, follow these steps:
1. Store the Column Master Key (CMK) in an Azure Key Vault. Learn more on how to configure Always
Encrypted by using Azure Key Vault.
2. Make sure you grant access to the key vault where the Column Master Key (CMK) is stored. Refer to this
article for required permissions.
3. Create a linked service to connect to your SQL database and enable the 'Always Encrypted' function by using
either managed identity or service principal.

NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either the source or the sink data store uses managed identity or service principal as the key provider authentication type.
2. Both the source and sink data stores use managed identity as the key provider authentication type.
3. Both the source and sink data stores use the same service principal as the key provider authentication type.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores and formats.
Copy and transform data in Azure SQL Managed
Instance by using Azure Data Factory
7/16/2021 • 27 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure SQL
Managed Instance, and use Data Flow to transform data in Azure SQL Managed Instance. To learn about Azure
Data Factory, read the introductory article.

Supported capabilities
This SQL Managed Instance connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this SQL Managed Instance connector supports these functions:
Copying data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to copy in parallel
from a SQL MI source; see the Parallel copy from SQL MI section for details.
As a sink, automatically creating the destination table if it doesn't exist based on the source schema, appending data
to a table, or invoking a stored procedure with custom logic during copy.

Prerequisites
To access the SQL Managed Instance public endpoint, you can use an Azure Data Factory managed Azure
integration runtime. Make sure that you enable the public endpoint and also allow public endpoint traffic on the
network security group so that Azure Data Factory can connect to your database. For more information, see this
guidance.
To access the SQL Managed Instance private endpoint, set up a self-hosted integration runtime that can access
the database. If you provision the self-hosted integration runtime in the same virtual network as your managed
instance, make sure that your integration runtime machine is in a different subnet than your managed instance.
If you provision your self-hosted integration runtime in a different virtual network than your managed instance,
you can use either a virtual network peering or a virtual network to virtual network connection. For more
information, see Connect your application to SQL Managed Instance.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Azure Data Factory entities
specific to the SQL Managed Instance connector.

Linked service properties


The following properties are supported for the SQL Managed Instance linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to Yes


AzureSqlMI .

connectionString This property specifies the Yes


connectionString information that's
needed to connect to SQL Managed
Instance by using SQL authentication.
For more information, see the
following examples.
The default port is 1433. If you're
using SQL Managed Instance with a
public endpoint, explicitly specify port
3342.
You also can put a password in Azure
Key Vault. If it's SQL authentication,
pull the password configuration out
of the connection string. For more
information, see the JSON example
following the table and Store
credentials in Azure Key Vault.

servicePrincipalId Specify the application's client ID. Yes, when you use Azure AD
authentication with a service principal

servicePrincipalKey Specify the application's key. Mark this Yes, when you use Azure AD
field as SecureString to store it authentication with a service principal
securely in Azure Data Factory or
reference a secret stored in Azure Key
Vault.

tenant Specify the tenant information, like the Yes, when you use Azure AD
domain name or tenant ID, under authentication with a service principal
which your application resides. Retrieve
it by hovering the mouse in the upper-
right corner of the Azure portal.

azureCloudType For service principal authentication, No


specify the type of Azure cloud
environment to which your Azure AD
application is registered.
Allowed values are AzurePublic,
AzureChina , AzureUsGovernment ,
and AzureGermany . By default, the
data factory's cloud environment is
used.
P RO P ERT Y DESC RIP T IO N REQ UIRED

alwaysEncryptedSettings Specify alwaysencr yptedsettings No


information that's needed to enable
Always Encrypted to protect sensitive
data stored in SQL server by using
either managed identity or service
principal. For more information, see
the JSON example following the table
and Using Always Encrypted section. If
not specified, the default always
encrypted setting is disabled.

connectVia This integration runtime is used to Yes


connect to the data store. You can use
a self-hosted integration runtime or an
Azure integration runtime if your
managed instance has a public
endpoint and allows Azure Data
Factory to access it. If not specified, the
default Azure integration runtime is
used.

NOTE
SQL Managed Instance Always Encr ypted is not supported in data flow.

For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
SQL authentication
Example 1: use SQL authentication

{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: use SQL authentication with a password in Azure Key Vault


{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: use SQL authentication with Always Encr ypted

{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication


To use a service principal-based Azure AD application token authentication, follow these steps:
1. Follow the steps to Provision an Azure Active Directory administrator for your Managed Instance.
2. Create an Azure Active Directory application from the Azure portal. Make note of the application name
and the following values that define the linked service:
Application ID
Application key
Tenant ID
3. Create logins for the Azure Data Factory managed identity. In SQL Server Management Studio (SSMS),
connect to your managed instance using a SQL Server account that is a sysadmin . In master database,
run the following T-SQL:

CREATE LOGIN [your application name] FROM EXTERNAL PROVIDER

4. Create contained database users for the Azure Data Factory managed identity. Connect to the database
from or to which you want to copy data, run the following T-SQL:

CREATE USER [your application name] FROM EXTERNAL PROVIDER

5. Grant the Data Factory managed identity needed permissions as you normally do for SQL users and
others. Run the following code. For more options, see this document.

ALTER ROLE [role name e.g. db_owner] ADD MEMBER [your application name]

6. Configure a SQL Managed Instance linked service in Azure Data Factory.


Example: use ser vice principal authentication

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources that represents the specific data
factory. You can use this managed identity for SQL Managed Instance authentication. The designated factory can
access and copy data from or to your database by using this identity.
To use managed identity authentication, follow these steps.
1. Follow the steps to Provision an Azure Active Directory administrator for your Managed Instance.
2. Create logins for the Azure Data Factory managed identity. In SQL Server Management Studio (SSMS),
connect to your managed instance using a SQL Server account that is a sysadmin . In master database,
run the following T-SQL:

CREATE LOGIN [your Data Factory name] FROM EXTERNAL PROVIDER

3. Create contained database users for the Azure Data Factory managed identity. Connect to the database
from or to which you want to copy data, run the following T-SQL:
CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER

4. Grant the Data Factory managed identity needed permissions as you normally do for SQL users and
others. Run the following code. For more options, see this document.

ALTER ROLE [role name e.g. db_owner] ADD MEMBER [your Data Factory name]

5. Configure a SQL Managed Instance linked service in Azure Data Factory.


Example: use managed identity authentication

{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for use to define datasets, see the datasets article. This section
provides a list of properties supported by the SQL Managed Instance dataset.
To copy data to and from SQL Managed Instance, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AzureSqlMITable. | Yes
schema | Name of the schema. | No for source, Yes for sink
table | Name of the table/view. | No for source, Yes for sink
tableName | Name of the table/view with schema. This property is supported for backward compatibility. For new workload, use schema and table. | No for source, Yes for sink

Example
{
"name": "AzureSqlMIDataset",
"properties":
{
"type": "AzureSqlMITable",
"linkedServiceName": {
"referenceName": "<SQL Managed Instance linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for use to define activities, see the Pipelines article. This section
provides a list of properties supported by the SQL Managed Instance source and sink.
SQL Managed Instance as a source

TIP
To load data from SQL MI efficiently by using data partitioning, learn more from Parallel copy from SQL MI.

To copy data from SQL Managed Instance, the following properties are supported in the copy activity source
section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to SqlMISource. | Yes
sqlReaderQuery | This property uses the custom SQL query to read data. An example is select * from MyTable. | No
sqlReaderStoredProcedureName | This property is the name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No
storedProcedureParameters | These parameters are for the stored procedure. Allowed values are name or value pairs. The names and casing of the parameters must match the names and casing of the stored procedure parameters. | No
isolationLevel | Specifies the transaction locking behavior for the SQL source. The allowed values are: ReadCommitted, ReadUncommitted, RepeatableRead, Serializable, Snapshot. If not specified, the database's default isolation level is used. Refer to this doc for more details. | No
partitionOptions | Specifies the data partitioning options used to load data from SQL MI. Allowed values are: None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from SQL MI is controlled by the parallelCopies setting on the copy activity. | No
partitionSettings | Specify the group of the settings for data partitioning. Apply when the partition option isn't None. | No

Under partitionSettings:

PROPERTY | DESCRIPTION | REQUIRED
partitionColumnName | Specify the name of the source column in integer or date/datetime type (int, smallint, bigint, date, smalldatetime, datetime, datetime2, or datetimeoffset) that will be used by range partitioning for parallel copy. If not specified, the index or the primary key of the table is auto-detected and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfDynamicRangePartitionCondition in the WHERE clause. For an example, see the Parallel copy from SQL MI section. | No
partitionUpperBound | The maximum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from SQL MI section. | No
partitionLowerBound | The minimum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from SQL MI section. | No

Note the following points:

If sqlReaderQuery is specified for SqlMISource, the copy activity runs this query against the SQL
Managed Instance source to get the data. You also can specify a stored procedure by specifying
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
When using a stored procedure in the source to retrieve data, note that if your stored procedure is designed
to return a different schema when a different parameter value is passed in, you may encounter a failure or
see an unexpected result when importing schema from the UI or when copying data to a SQL database with
auto table creation.
Example: Use a SQL query
"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlMISource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example: Use a stored procedure

"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlMISource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

The stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

SQL Managed Instance as a sink

TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into SQL Managed Instance.

To copy data to SQL Managed Instance, the following properties are supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to SqlMISink. | Yes
preCopyScript | This property specifies a SQL query for the copy activity to run before writing data into SQL Managed Instance. It's invoked only once per copy run. You can use this property to clean up preloaded data. | No
tableOption | Specifies whether to automatically create the sink table if not exists based on the source schema. Auto table creation is not supported when sink specifies stored procedure. Allowed values are: none (default), autoCreate. | No
sqlWriterStoredProcedureName | The name of the stored procedure that defines how to apply source data into a target table. This stored procedure is invoked per batch. For operations that run only once and have nothing to do with source data, for example, delete or truncate, use the preCopyScript property. See example from Invoke a stored procedure from a SQL sink. | No
storedProcedureTableTypeParameterName | The parameter name of the table type specified in the stored procedure. | No
sqlWriterTableType | The table type name to be used in the stored procedure. The copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data that's being copied with existing data. | No
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name and value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No
writeBatchSize | Number of rows to insert into the SQL table per batch. Allowed values are integers for the number of rows. By default, Azure Data Factory dynamically determines the appropriate batch size based on the row size. | No
writeBatchTimeout | This property specifies the wait time for the batch insert operation to complete before it times out. Allowed values are for the timespan. An example is "00:30:00," which is 30 minutes. | No
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example 1: Append data


"activities":[
{
"name": "CopyToAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Managed Instance output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlMISink",
"tableOption": "autoCreate",
"writeBatchSize": 100000
}
}
}
]

Example 2: Invoke a stored procedure during copy


Learn more details from Invoke a stored procedure from a SQL MI sink.
"activities":[
{
"name": "CopyToAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Managed Instance output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlMISink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"storedProcedureTableTypeParameterName": "MyTable",
"sqlWriterTableType": "MyTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Parallel copy from SQL MI


The Azure SQL Managed Instance connector in copy activity provides built-in data partitioning to copy data in
parallel. You can find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, copy activity runs parallel queries against your SQL MI source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your SQL MI.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of
data from your SQL MI. The following are suggested configurations for different scenarios. When copying data
into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder
name), in which case the performance is better than writing to a single file.
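The parallelCopies setting sits directly on the copy activity, next to the source and sink. As a minimal sketch (the values are illustrative only, not a recommendation), a partitioned copy might be configured like this:

"typeProperties": {
    "parallelCopies": 4,
    "source": {
        "type": "SqlMISource",
        "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
        "type": "<sink type>"
    }
}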

SCENARIO | SUGGESTED SETTINGS
Full load from large table, with physical partitions. | Partition option: Physical partitions of table. During execution, Data Factory automatically detects the physical partitions, and copies data by partitions. To check if your table has physical partitions or not, you can refer to this query.
Full load from large table, without physical partitions, while with an integer or datetime column for data partitioning. | Partition options: Dynamic range partition. Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values. For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.
Load a large amount of data by using a custom query, without physical partitions, while with an integer or date/datetime column for data partitioning. | Partition options: Dynamic range partition. Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value. During execution, Data Factory replaces ?AdfRangePartitionColumnName with the actual column name and value ranges for each partition, and sends to SQL MI. For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.

Here are more sample queries for different scenarios:

1. Query the whole table:
SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition
2. Query from a table with column selection and additional where-clause filters:
SELECT <column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
3. Query with subqueries:
SELECT <column_list> FROM (<your_sub_query>) AS T WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
4. Query with partition in subquery:
SELECT <column_list> FROM (SELECT <your_sub_query_column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition) AS T

Best practices to load data with partition option:


1. Choose a distinctive column as the partition column (like a primary key or unique key) to avoid data skew.
2. If the table has a built-in partition, use the partition option "Physical partitions of table" to get better performance.
3. If you use Azure Integration Runtime to copy data, you can set larger "Data Integration Units (DIU)" (>4) to
utilize more computing resources. Check the applicable scenarios there.
4. "Degree of copy parallelism" controls the partition numbers; setting this number too large can sometimes hurt
performance. We recommend setting this number as (DIU or number of Self-hosted IR nodes) * (2 to 4).
Example: full load from large table with physical partitions

"source": {
"type": "SqlMISource",
"partitionOption": "PhysicalPartitionsOfTable"
}

Example: query with dynamic range partition


"source": {
"type": "SqlMISource",
"query":"SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride,
not as data filter>",
"partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride,
not as data filter>"
}
}

Sample query to check physical partition

SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, pf.name AS PartitionFunctionName, c.name AS ColumnName, iif(pf.name is null, 'no', 'yes') AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id = t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
LEFT JOIN sys.partition_functions pf ON pf.function_id = ps.function_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'

If the table has physical partitions, you would see "HasPartition" as "yes" in the query result.

Best practice for loading data into SQL Managed Instance


When you copy data into SQL Managed Instance, you might require different write behavior:
Append: My source data has only new records.
Upsert: My source data has both inserts and updates.
Overwrite: I want to reload the entire dimension table each time.
Write with custom logic: I need extra processing before the final insertion into the destination table.
See the respective sections for how to configure in Azure Data Factory and best practices.
Append data
Appending data is the default behavior of the SQL Managed Instance sink connector. Azure Data Factory does a
bulk insert to write to your table efficiently. You can configure the source and sink accordingly in the copy
activity.
Upsert data
Option 1: When you have a large amount of data to copy, you can bulk load all records into a staging table by
using the copy activity, then run a stored procedure activity to apply a MERGE or INSERT/UPDATE statement in
one shot.
Copy activity currently doesn't natively support loading data into a database temporary table. There is an
advanced way to set it up with a combination of multiple activities; refer to Optimize SQL Database Bulk Upsert
scenarios. The following shows a sample that uses a permanent table as staging.
As an example, in Azure Data Factory, you can create a pipeline with a Copy activity chained with a Stored
Procedure activity. The former copies data from your source store into an Azure SQL Managed Instance
staging table, for example, UpsertStagingTable, as the table name in the dataset. Then the latter invokes a
stored procedure to merge source data from the staging table into the target table and clean up the staging
table.

In your database, define a stored procedure with MERGE logic, like the following example, which is pointed to
from the previous stored procedure activity. Assume that the target is the Marketing table with three columns:
ProfileID, State, and Category. Do the upsert based on the ProfileID column.

CREATE PROCEDURE [dbo].[spMergeData]


AS
BEGIN
MERGE TargetTable AS target
USING UpsertStagingTable AS source
ON (target.[ProfileID] = source.[ProfileID])
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT matched THEN
INSERT ([ProfileID], [State], [Category])
VALUES (source.ProfileID, source.State, source.Category);

TRUNCATE TABLE UpsertStagingTable


END
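The pipeline wiring for this pattern can be sketched as follows. This is an illustration only: the activity names and dataset references are placeholders, the spMergeData procedure is taken from the example above, and your pipeline may need additional properties.

"activities":[
    {
        "name": "CopyToUpsertStagingTable",
        "type": "Copy",
        "inputs": [ { "referenceName": "<source dataset name>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<staging table dataset name>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "<source type>" },
            "sink": { "type": "SqlMISink" }
        }
    },
    {
        "name": "MergeFromUpsertStagingTable",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [ { "activity": "CopyToUpsertStagingTable", "dependencyConditions": [ "Succeeded" ] } ],
        "linkedServiceName": { "referenceName": "<SQL Managed Instance linked service name>", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "spMergeData" }
    }
]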

Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Overwrite the entire table
You can configure the preCopyScript property in a copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
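For example, a sink that reloads a hypothetical dbo.Marketing table on every run could be sketched as follows (the table name and batch size are placeholders):

"sink": {
    "type": "SqlMISink",
    "preCopyScript": "TRUNCATE TABLE dbo.Marketing",
    "writeBatchSize": 100000
}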
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you
need to apply extra processing before the final insertion of source data into the destination table, you can load
to a staging table then invoke stored procedure activity, or invoke a stored procedure in copy activity sink to
apply data.

Invoke a stored procedure from a SQL sink


When you copy data into SQL Managed Instance, you also can configure and invoke a user-specified stored
procedure with additional parameters on each batch of the source table. The stored procedure feature takes
advantage of table-valued parameters.
You can use a stored procedure when built-in copy mechanisms don't serve the purpose. An example is when
you want to apply extra processing before the final insertion of source data into the destination table. Some
extra processing examples are when you want to merge columns, look up additional values, and insert into
more than one table.
The following sample shows how to use a stored procedure to do an upsert into a table in the SQL Server
database. Assume that the input data and the sink Marketing table each have three columns: ProfileID , State ,
and Category. Do the upsert based on the ProfileID column, and only apply it for a specific category called
"ProductA".
1. In your database, define the table type with the same name as sqlWriterTableType . The schema of the
table type is the same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(


[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category


varchar(256)
AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

3. In Azure Data Factory, define the SQL MI sink section in the copy activity as follows:

"sink": {
"type": "SqlMISink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from Azure SQL Managed
Instance. For more information, see the source transformation and sink transformation in mapping data flows.
Source transformation
The below table lists the properties supported by Azure SQL Managed Instance source. You can edit these
properties in the Source options tab.
NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Table | If you select Table as input, data flow fetches all the data from the table specified in the dataset. | No | - | -
Query | If you select Query as input, specify a SQL query to fetch data from source, which overrides any table you specify in dataset. Using queries is a great way to reduce rows for testing or lookups. Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow. Query example: Select * from MyTable where customerId > 1000 and customerId < 2000 | No | String | query
Batch size | Specify a batch size to chunk large data into reads. | No | Integer | batchSize
Isolation Level | Choose one of the following isolation levels: Read Committed, Read Uncommitted (default), Repeatable Read, Serializable, None (ignore isolation level) | No | READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE, NONE | isolationLevel

Azure SQL Managed Instance source script example


When you use Azure SQL Managed Instance as source type, the associated data flow script is:
source(allowSchemaDrift: true,
validateSchema: false,
isolationLevel: 'READ_UNCOMMITTED',
query: 'select * from MYTABLE',
format: 'query') ~> SQLMISource

Sink transformation
The below table lists the properties supported by Azure SQL Managed Instance sink. You can edit these
properties in the Sink options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Update method | Specify what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts and deletes, key column(s) must be set to determine which row to alter. The column name that you pick as the key will be used as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. | No | Array | keys
Skip writing key columns | If you wish to not write the value to the key column, select "Skip writing key columns". | No | true or false | skipKeyWrites
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. - None: No action will be done to the table. - Recreate: The table will get dropped and recreated. Required if creating a new table dynamically. - Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate
Batch size | Specify how many rows are being written in each batch. Larger batch sizes improve compression and memory optimization, but risk out of memory exceptions when caching data. | No | Integer | batchSize
Pre and Post SQL scripts | Specify multi-line SQL scripts that will execute before (pre-processing) and after (post-processing) data is written to your Sink database. | No | String | preSQLs, postSQLs

Azure SQL Managed Instance sink script example


When you use Azure SQL Managed Instance as sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,


validateSchema: false,
deletable:false,
insertable:true,
updateable:true,
upsertable:true,
keys:['keyColumn'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> SQLMISink

Lookup activity properties


To learn details about the properties, check Lookup activity.
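As a brief sketch (the query, column, and dataset name are hypothetical placeholders), a Lookup activity reuses the same SqlMISource type to return, for example, a single watermark row:

{
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlMISource",
            "sqlReaderQuery": "SELECT MAX(LastModifiedTime) AS Watermark FROM <table_name>"
        },
        "dataset": {
            "referenceName": "<SQL Managed Instance dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}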

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity
Data type mapping for SQL Managed Instance
When data is copied to and from SQL Managed Instance using copy activity, the following mappings are used
from SQL Managed Instance data types to Azure Data Factory interim data types. To learn how the copy activity
maps from the source schema and data type to the sink, see Schema and data type mappings.

SQL MANAGED INSTANCE DATA TYPE                    AZURE DATA FACTORY INTERIM DATA TYPE

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

ntext String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

sql_variant Object

text String, Char[]

time TimeSpan

timestamp Byte[]

tinyint Int16

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

xml String

NOTE
For data types that map to the Decimal interim type, currently Copy activity supports precision up to 28. If you have data
that requires precision larger than 28, consider converting to a string in a SQL query.
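For instance, a source query along the following lines (the column and table names are hypothetical) reads a high-precision value as a string so the precision is preserved end to end:

"source": {
    "type": "SqlMISource",
    "sqlReaderQuery": "SELECT CAST(HighPrecisionColumn AS VARCHAR(50)) AS HighPrecisionColumn, OtherColumn FROM MyTable"
}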

Using Always Encrypted


When you copy data from/to SQL Managed Instance with Always Encrypted, follow these steps:
1. Store the Column Master Key (CMK) in an Azure Key Vault. Learn more on how to configure Always
Encrypted by using Azure Key Vault.
2. Make sure to grant access to the key vault where the Column Master Key (CMK) is stored. Refer to this
article for the required permissions.
3. Create a linked service to connect to your SQL database and enable the 'Always Encrypted' function by using
either managed identity or service principal.
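As a minimal sketch, when the key provider authentication type is managed identity, the alwaysEncryptedSettings block only needs the authentication type; the shape below mirrors the service principal example shown earlier and uses placeholders throughout:

{
    "name": "AzureSqlMILinkedService",
    "properties": {
        "type": "AzureSqlMI",
        "typeProperties": {
            "connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
        },
        "alwaysEncryptedSettings": {
            "alwaysEncryptedAkvAuthType": "ManagedIdentity"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}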

NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either the source or sink data store is using managed identity or service principal as the key provider authentication type.
2. Both source and sink data stores are using managed identity as the key provider authentication type.
3. Both source and sink data stores are using the same service principal as the key provider authentication type.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy and transform data in Azure Synapse
Analytics by using Azure Data Factory
5/11/2021 • 34 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Synapse
Analytics, and use Data Flow to transform data in Azure Data Lake Storage Gen2. To learn about Azure Data
Factory, read the introductory article.

Supported capabilities
This Azure Synapse Analytics connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this Azure Synapse Analytics connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure. You can also choose to parallel copy
from an Azure Synapse Analytics source, see the Parallel copy from Azure Synapse Analytics section for
details.
As a sink, load data by using PolyBase, COPY statement, or bulk insert. We recommend PolyBase or COPY
statement for better copy performance. The connector also supports automatically creating the destination
table, if it does not exist, based on the source schema.

IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure a server-level firewall rule so that Azure
services can access the logical SQL server. If you copy data by using a self-hosted integration runtime, configure the
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure Synapse
Analytics.

Get started
TIP
To achieve best performance, use PolyBase or COPY statement to load data into Azure Synapse Analytics. The Use
PolyBase to load data into Azure Synapse Analytics and Use COPY statement to load data into Azure Synapse Analytics
sections have details. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics under 15 minutes
with Azure Data Factory.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure
Synapse Analytics connector.

Linked service properties


The following properties are supported for an Azure Synapse Analytics linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to AzureSqlDW. | Yes
connectionString | Specify the information needed to connect to the Azure Synapse Analytics instance for the connectionString property. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password/service principal key in Azure Key Vault, and if it's SQL authentication pull the password configuration out of the connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article with more details. | Yes
servicePrincipalId | Specify the application's client ID. | Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes, when you use Azure AD authentication with a service principal.
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the top-right corner of the Azure portal. | Yes, when you use Azure AD authentication with a service principal.
azureCloudType | For service principal authentication, specify the type of Azure cloud environment to which your Azure AD application is registered. Allowed values are AzurePublic, AzureChina, AzureUsGovernment, and AzureGermany. By default, the data factory's cloud environment is used. | No
connectVia | The integration runtime to be used to connect to the data store. You can use Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No

For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources

TIP
When creating a linked service for Azure Synapse serverless SQL pool from the UI, choose "enter manually" instead of
browsing from the subscription.

TIP
If you hit an error with error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached.", add Pooling=false to your connection string and try again.
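For example (a sketch only, with placeholder values), the connection string with pooling disabled looks like:

"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=false"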

SQL authentication
Linked service example that uses SQL authentication

{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Password in Azure Key Vault:


{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Service principal authentication


To use service principal-based Azure AD application token authentication, follow these steps:
1. Create an Azure Active Directory application from the Azure portal. Make note of the application
name and the following values that define the linked service:
Application ID
Application key
Tenant ID
2. Provision an Azure Active Directory administrator for your server in the Azure portal if you
haven't already done so. The Azure AD administrator can be an Azure AD user or Azure AD group. If you
grant the group with managed identity an admin role, skip steps 3 and 4. The administrator will have full
access to the database.
3. Create contained database users for the service principal. Connect to the data warehouse from or to
which you want to copy data by using tools like SSMS, with an Azure AD identity that has at least ALTER
ANY USER permission. Run the following T-SQL:

CREATE USER [your application name] FROM EXTERNAL PROVIDER;

4. Grant the service principal needed permissions as you normally do for SQL users or others. Run
the following code, or refer to more options here. If you want to use PolyBase to load the data, learn the
required database permission.

EXEC sp_addrolemember db_owner, [your application name];

5. Configure an Azure Synapse Analytics linked service in Azure Data Factory.
Linked service example that uses service principal authentication
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Managed identities for Azure resources authentication


A data factory can be associated with a managed identity for Azure resources that represents the specific
factory. You can use this managed identity for Azure Synapse Analytics authentication. The designated factory
can access and copy data from or to your data warehouse by using this identity.
To use managed identity authentication, follow these steps:
1. Provision an Azure Active Directory administrator for your server on the Azure portal if you
haven't already done so. The Azure AD administrator can be an Azure AD user or Azure AD group. If you
grant the group with managed identity an admin role, skip steps 3 and 4. The administrator will have full
access to the database.
2. Create contained database users for the Data Factory Managed Identity. Connect to the data
warehouse from or to which you want to copy data by using tools like SSMS, with an Azure AD identity
that has at least ALTER ANY USER permission. Run the following T-SQL.

CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER;

3. Grant the Data Factory Managed Identity needed permissions as you normally do for SQL users
and others. Run the following code, or refer to more options here. If you want to use PolyBase to load the
data, learn the required database permission.

EXEC sp_addrolemember db_owner, [your Data Factory name];

4. Configure an Azure Synapse Analytics linked service in Azure Data Factory.
Example:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for Azure Synapse Analytics dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to AzureSqlDWTable. | Yes
schema | Name of the schema. | No for source, Yes for sink
table | Name of the table/view. | No for source, Yes for sink
tableName | Name of the table/view with schema. This property is supported for backward compatibility. For new workload, use schema and table. | No for source, Yes for sink

Dataset properties example

{
"name": "AzureSQLDWDataset",
"properties":
{
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "<Azure Synapse Analytics linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}

Copy Activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure Synapse Analytics source and sink.
Azure Synapse Analytics as the source

TIP
To load data from Azure Synapse Analytics efficiently by using data partitioning, learn more from Parallel copy from Azure
Synapse Analytics.

To copy data from Azure Synapse Analytics, set the type property in the Copy Activity source to SqlDWSource .
The following properties are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity source must be set to SqlDWSource. | Yes
sqlReaderQuery | Use the custom SQL query to read data. Example: select * from MyTable. | No
sqlReaderStoredProcedureName | The name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name or value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No
isolationLevel | Specifies the transaction locking behavior for the SQL source. The allowed values are: ReadCommitted, ReadUncommitted, RepeatableRead, Serializable, Snapshot. If not specified, the database's default isolation level is used. For more information, see system.data.isolationlevel. | No
partitionOptions | Specifies the data partitioning options used to load data from Azure Synapse Analytics. Allowed values are: None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from an Azure Synapse Analytics is controlled by the parallelCopies setting on the copy activity. | No
partitionSettings | Specify the group of the settings for data partitioning. Apply when the partition option isn't None. | No

Under partitionSettings:

PROPERTY | DESCRIPTION | REQUIRED
partitionColumnName | Specify the name of the source column in integer or date/datetime type (int, smallint, bigint, date, smalldatetime, datetime, datetime2, or datetimeoffset) that will be used by range partitioning for parallel copy. If not specified, the index or the primary key of the table is detected automatically and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfDynamicRangePartitionCondition in the WHERE clause. For an example, see the Parallel copy from Azure Synapse Analytics section. | No
partitionUpperBound | The maximum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from Azure Synapse Analytics section. | No
partitionLowerBound | The minimum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from Azure Synapse Analytics section. | No

Note the following point:

When using a stored procedure in the source to retrieve data, note that if your stored procedure is designed
to return a different schema when a different parameter value is passed in, you may encounter a failure or
see an unexpected result when importing schema from the UI or when copying data to a SQL database with
auto table creation.
Example: using SQL query
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Synapse Analytics input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example: using stored procedure

"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Synapse Analytics input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample stored procedure:


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

Azure Synapse Analytics as sink


Azure Data Factory supports three ways to load data into Azure Synapse Analytics.

Use PolyBase
Use COPY statement
Use bulk insert
The fastest and most scalable way to load data is through PolyBase or the COPY statement.
To copy data to Azure Synapse Analytics, set the sink type in Copy Activity to SqlDWSink . The following
properties are supported in the Copy Activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity sink must be set to SqlDWSink. | Yes
allowPolyBase | Indicates whether to use PolyBase to load data into Azure Synapse Analytics. allowCopyCommand and allowPolyBase cannot be both true. See the Use PolyBase to load data into Azure Synapse Analytics section for constraints and details. Allowed values are True and False (default). | No. Apply when using PolyBase.
polyBaseSettings | A group of properties that can be specified when the allowPolybase property is set to true. | No. Apply when using PolyBase.
allowCopyCommand | Indicates whether to use COPY statement to load data into Azure Synapse Analytics. allowCopyCommand and allowPolyBase cannot be both true. See the Use COPY statement to load data into Azure Synapse Analytics section for constraints and details. Allowed values are True and False (default). | No. Apply when using COPY.
copyCommandSettings | A group of properties that can be specified when the allowCopyCommand property is set to TRUE. | No. Apply when using COPY.
writeBatchSize | Number of rows to insert into the SQL table per batch. The allowed value is integer (number of rows). By default, Data Factory dynamically determines the appropriate batch size based on the row size. | No. Apply when using bulk insert.
writeBatchTimeout | Wait time for the batch insert operation to finish before it times out. The allowed value is timespan. Example: "00:30:00" (30 minutes). | No. Apply when using bulk insert.
preCopyScript | Specify a SQL query for Copy Activity to run before writing data into Azure Synapse Analytics in each run. Use this property to clean up the preloaded data. | No
tableOption | Specifies whether to automatically create the sink table if not exists based on the source schema. Allowed values are: none (default), autoCreate. | No
disableMetricsCollection | Data Factory collects metrics such as Azure Synapse Analytics DWUs for copy performance optimization and recommendations, which introduce additional master DB access. If you are concerned with this behavior, specify true to turn it off. | No (default is false)
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Azure Synapse Analytics sink example

"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
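For comparison, a sink that loads through the COPY statement instead of PolyBase only needs allowCopyCommand turned on; the following is a minimal sketch (the optional copyCommandSettings group is omitted):

"sink": {
    "type": "SqlDWSink",
    "allowCopyCommand": true,
    "tableOption": "autoCreate"
}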

Parallel copy from Azure Synapse Analytics


The Azure Synapse Analytics connector in copy activity provides built-in data partitioning to copy data in
parallel. You can find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, copy activity runs parallel queries against your Azure Synapse Analytics
source to load data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy
activity. For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four
queries based on your specified partition option and settings, and each query retrieves a portion of data from
your Azure Synapse Analytics.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of
data from your Azure Synapse Analytics. The following are suggested configurations for different scenarios. When
copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify
the folder name), in which case the performance is better than writing to a single file.

SCENARIO | SUGGESTED SETTINGS
Full load from large table, with physical partitions. | Partition option: Physical partitions of table. During execution, Data Factory automatically detects the physical partitions, and copies data by partitions. To check if your table has physical partitions or not, you can refer to this query.
Full load from large table, without physical partitions, while with an integer or datetime column for data partitioning. | Partition options: Dynamic range partition. Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values. For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.
Load a large amount of data by using a custom query, without physical partitions, while with an integer or date/datetime column for data partitioning. | Partition options: Dynamic range partition. Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value. During execution, Data Factory replaces ?AdfRangePartitionColumnName with the actual column name and value ranges for each partition, and sends to Azure Synapse Analytics. For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.

Here are more sample queries for different scenarios:

1. Query the whole table:
SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition
2. Query from a table with column selection and additional where-clause filters:
SELECT <column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
3. Query with subqueries:
SELECT <column_list> FROM (<your_sub_query>) AS T WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
4. Query with partition in subquery:
SELECT <column_list> FROM (SELECT <your_sub_query_column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition) AS T

Best practices to load data with partition option:


1. Choose a distinctive column as the partition column (like a primary key or unique key) to avoid data skew.
2. If the table has a built-in partition, use the partition option "Physical partitions of table" to get better performance.
3. If you use Azure Integration Runtime to copy data, you can set larger "Data Integration Units (DIU)" (>4) to
utilize more computing resources. Check the applicable scenarios there.
4. "Degree of copy parallelism" controls the partition numbers; setting this number too large can sometimes hurt
performance. We recommend setting this number as (DIU or number of Self-hosted IR nodes) * (2 to 4).
5. Note that Azure Synapse Analytics can execute a maximum of 32 queries at a moment; setting "Degree of copy
parallelism" too large may cause a Synapse throttling issue.
Example: full load from large table with physical partitions
"source": {
"type": "SqlDWSource",
"partitionOption": "PhysicalPartitionsOfTable"
}

Example: query with dynamic range partition

"source": {
"type": "SqlDWSource",
"query":"SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride,
not as data filter>",
"partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride,
not as data filter>"
}
}

Sample query to check physical partition

SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, c.name AS ColumnName, CASE WHEN c.name IS NULL THEN 'no' ELSE 'yes' END AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id = t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.types AS y ON c.system_type_id = y.system_type_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'

If the table has physical partition, you would see "HasPartition" as "yes".

Use PolyBase to load data into Azure Synapse Analytics


Using PolyBase is an efficient way to load a large amount of data into Azure Synapse Analytics with high
throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT
mechanism. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics.
If your source data is in Azure Blob, Azure Data Lake Storage Gen1 or Azure Data Lake Storage
Gen2 , and the format is PolyBase compatible , you can use copy activity to directly invoke PolyBase to let
Azure Synapse Analytics pull the data from source. For details, see Direct copy by using PolyBase .
If your source data store and format isn't originally supported by PolyBase, use the Staged copy by using
PolyBase feature instead. The staged copy feature also provides you better throughput. It automatically
converts the data into PolyBase-compatible format, stores the data in Azure Blob storage, then calls PolyBase
to load data into Azure Synapse Analytics.
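A staged copy is switched on at the copy activity level rather than in the sink; the following minimal sketch assumes an Azure Blob Storage linked service and path that you supply as placeholders:

"typeProperties": {
    "source": { "type": "<source type>" },
    "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": true
    },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "path": "<container/path>"
    }
}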

TIP
Learn more on Best practices for using PolyBase. When using PolyBase with Azure Integration Runtime, effective Data
Integration Units (DIU) for direct or staged storage-to-Synapse is always 2. Tuning the DIU doesn't impact the
performance, as loading data from storage is powered by Synapse engine.
The following PolyBase settings are supported under polyBaseSettings in copy activity:

rejectValue: Specifies the number or percentage of rows that can be rejected before the query fails. Learn more about PolyBase's reject options in the Arguments section of CREATE EXTERNAL TABLE (Transact-SQL). Allowed values are 0 (default), 1, 2, and so on. Required: No.

rejectType: Specifies whether the rejectValue option is a literal value or a percentage. Allowed values are Value (default) and Percentage. Required: No.

rejectSampleValue: Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. Allowed values are 1, 2, and so on. Required: Yes, if the rejectType is Percentage.

useTypeDefault: Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). Allowed values are True and False (default). Required: No.
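A minimal sketch of how these settings might be combined on a SqlDWSink; the reject thresholds are illustrative placeholders.

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}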

Direct copy by using PolyBase


Azure Synapse Analytics PolyBase directly supports Azure Blob, Azure Data Lake Storage Gen1 and Azure Data
Lake Storage Gen2. If your source data meets the criteria described in this section, use PolyBase to copy directly
from the source data store to Azure Synapse Analytics. Otherwise, use Staged copy by using PolyBase.

TIP
To copy data efficiently to Azure Synapse Analytics, learn more from Azure Data Factory makes it even easier and
convenient to uncover insights from data when using Data Lake Store with Azure Synapse Analytics.

If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the
BULKINSERT mechanism for the data movement.
1. The source linked service uses one of the following types and authentication methods:

Azure Blob: account key authentication or managed identity authentication
Azure Data Lake Storage Gen1: service principal authentication
Azure Data Lake Storage Gen2: account key authentication or managed identity authentication

IMPORTANT
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service
Endpoints with Azure storage.

2. The source data format is Parquet , ORC , or Delimited text , with the following configurations:
a. Folder path doesn't contain wildcard filter.
b. File name is empty, or points to a single file. If you specify wildcard file name in copy activity, it can
only be * or *.* .
c. rowDelimiter is default , \n , \r\n , or \r .
d. nullValue is left as default or set to empty string (""), and treatEmptyAsNull is left as default or set
to true.
e. encodingName is left as default or set to utf-8 .
f. quoteChar , escapeChar , and skipLineCount aren't specified. PolyBase supports skipping the header row, which can be configured as firstRowAsHeader in ADF.
g. compression can be no compression , GZip , or Deflate .
3. If your source is a folder, recursive in copy activity must be set to true.
4. wildcardFolderPath , wildcardFilename , modifiedDateTimeStart , modifiedDateTimeEnd , prefix ,
enablePartitionDiscovery , and additionalColumns are not specified.

NOTE
If your source is a folder, note PolyBase retrieves files from the folder and all of its subfolders, and it doesn't retrieve data
from files for which the file name begins with an underline (_) or a period (.), as documented here - LOCATION argument.
"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "ParquetDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings":{
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
}
}
]

Staged copy by using PolyBase


When your source data is not natively compatible with PolyBase, enable data copying via an interim staging
Azure Blob or Azure Data Lake Storage Gen2 (it can't be Azure Premium Storage). In this case, Azure Data
Factory automatically converts the data to meet the data format requirements of PolyBase. Then it invokes
PolyBase to load data into Azure Synapse Analytics. Finally, it cleans up your temporary data from the storage.
See Staged copy for details about copying data via a staging store.
To use this feature, create an Azure Blob Storage linked service or Azure Data Lake Storage Gen2 linked service
with account key or managed identity authentication that refers to the Azure storage account as the
interim storage.

IMPORTANT
When you use managed identity authentication for your staging linked service, learn the needed configurations for
Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your staging Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service Endpoints
with Azure storage.

IMPORTANT
If your staging Azure Storage is configured with Managed Private Endpoint and has the storage firewall enabled, you
must use managed identity authentication and grant Storage Blob Data Reader permissions to the Synapse SQL Server to
ensure it can access the staged files during the PolyBase load.
"activities":[
{
"name": "CopyFromSQLServerToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "SQLServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
}
}
}
}
]

Best practices for using PolyBase


The following sections provide best practices in addition to those practices mentioned in Best practices for Azure
Synapse Analytics.
Required database permission
To use PolyBase, the user that loads data into Azure Synapse Analytics must have "CONTROL" permission on the
target database. One way to achieve that is to add the user as a member of the db_owner role. Learn how to do
that in the Azure Synapse Analytics overview.
Row size and data type limits
PolyBase loads are limited to rows smaller than 1 MB. It can't be used to load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX) columns. For more information, see Azure Synapse Analytics service capacity limits.
When your source data has rows greater than 1 MB, you might want to vertically split the source tables into
several small ones. Make sure that the largest size of each row doesn't exceed the limit. The smaller tables can
then be loaded by using PolyBase and merged together in Azure Synapse Analytics.
Alternatively, for data with such wide columns, you can load the data without PolyBase in ADF by turning off the "Allow PolyBase" setting, as in the sketch below.
Azure Synapse Analytics resource class
To achieve the best possible throughput, assign a larger resource class to the user that loads data into Azure
Synapse Analytics via PolyBase.
PolyBase troubleshooting
Loading to Decimal column
If your source data is in text format or another non-PolyBase-compatible store (using staged copy and PolyBase), and it contains an empty value to be loaded into an Azure Synapse Analytics Decimal column, you may get the following error:

ErrorCode=FailedDbOperation, ......HadoopSqlException: Error converting data type VARCHAR to


DECIMAL.....Detailed Message=Empty string can't be converted to DECIMAL.....

The solution is to unselect the "Use type default" option (that is, set it to false) in the copy activity sink -> PolyBase settings, as in the sketch below. "USE_TYPE_DEFAULT" is a native PolyBase configuration, which specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.
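A minimal sketch of the corresponding sink JSON with the option turned off:

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "useTypeDefault": false
    }
}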
Check the tableName property in Azure Synapse Analytics
The following table gives examples of how to specify the tableName property in the JSON dataset. It shows
several combinations of schema and table names.

DB schema dbo, table MyTable: tableName can be MyTable, dbo.MyTable, or [dbo].[MyTable]
DB schema dbo1, table MyTable: tableName can be dbo1.MyTable or [dbo1].[MyTable]
DB schema dbo, table My.Table: tableName can be [My.Table] or [dbo].[My.Table]
DB schema dbo1, table My.Table: tableName must be [dbo1].[My.Table]

If you see the following error, the problem might be the value you specified for the tableName property. See
the preceding table for the correct way to specify values for the tableName JSON property.

Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net


SqlClient Data Provider
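As a hedged illustration, assuming a dataset that uses the legacy tableName property described above, a schema-qualified table whose name contains a dot could be referenced as follows; the dataset and linked service names are placeholders.

{
    "name": "AzureSQLDWDataset",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {
            "referenceName": "<Azure Synapse Analytics linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "[dbo1].[My.Table]"
        }
    }
}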

Columns with default values


Currently, the PolyBase feature in Data Factory accepts only the same number of columns as in the target table.
An example is a table with four columns where one of them is defined with a default value. The input data still
needs to have four columns. A three-column input dataset yields an error similar to the following message:

All columns of the table must be specified in the INSERT BULK statement.

The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that
column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values
in Azure Synapse Analytics.
External file access failed
If you receive the following error, ensure that you are using managed identity authentication and have granted
Storage Blob Data Reader permissions to the Azure Synapse workspace's managed identity.

Job failed due to reason: at Sink '[SinkName]':


shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: External file access failed due to
internal error: 'Error occurred while accessing HDFS: Java exception raised on call to
HdfsBridge_IsDirExist. Java exception message:\r\nHdfsBridge::isDirExist

For more information, see Grant permissions to managed identity after workspace creation.
Use COPY statement to load data into Azure Synapse Analytics
Azure Synapse Analytics COPY statement directly supports loading data from Azure Blob and Azure Data
Lake Storage Gen2 . If your source data meets the criteria described in this section, you can choose to use
COPY statement in ADF to load data into Azure Synapse Analytics. Azure Data Factory checks the settings and
fails the copy activity run if the criteria aren't met.

NOTE
Currently, Data Factory only supports copying from the COPY statement compatible sources mentioned below.

TIP
When using COPY statement with Azure Integration Runtime, effective Data Integration Units (DIU) is always 2. Tuning
the DIU doesn't impact the performance, as loading data from storage is powered by Synapse engine.

Using COPY statement supports the following configuration:

1. The source linked service and format use the following types and authentication methods:

Azure Blob
- Delimited text: account key authentication, shared access signature authentication, service principal authentication, or managed identity authentication
- Parquet: account key authentication or shared access signature authentication
- ORC: account key authentication or shared access signature authentication

Azure Data Lake Storage Gen2
- Delimited text, Parquet, or ORC: account key authentication, service principal authentication, or managed identity authentication

IMPORTANT
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service
Endpoints with Azure storage.

2. Format settings are as follows:


a. For Parquet : compression can be no compression , Snappy , or GZip .
b. For ORC : compression can be no compression , zlib , or Snappy .
c. For Delimited text :
a. rowDelimiter is explicitly set as a single character or "\r\n "; the default value is not supported.
b. nullValue is left as default or set to empty string ("").
c. encodingName is left as default or set to utf-8 or utf-16 .
d. escapeChar must be same as quoteChar , and is not empty.
e. skipLineCount is left as default or set to 0.
f. compression can be no compression or GZip .
3. If your source is a folder, recursive in copy activity must be set to true, and wildcardFilename needs to be * .

4. wildcardFolderPath , wildcardFilename (other than * ), modifiedDateTimeStart , modifiedDateTimeEnd , prefix , enablePartitionDiscovery and additionalColumns are not specified.

The following COPY statement settings are supported under allowCopyCommand in copy activity:

defaultValues: Specifies the default values for each target column in Azure Synapse Analytics. The default values in the property overwrite the DEFAULT constraint set in the data warehouse, and identity columns cannot have a default value. Required: No.

additionalOptions: Additional options that are passed directly to the WITH clause of the Azure Synapse Analytics COPY statement. Quote each value as needed to align with the COPY statement requirements. Required: No.
"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaCOPY",
"type": "Copy",
"inputs": [
{
"referenceName": "ParquetDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings":{
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "SqlDWSink",
"allowCopyCommand":true,
"copyCommandSettings":{
"defaultValues":[
{
"columnName":"col_string",
"defaultValue":"DefaultStringValue"
}
],
"additionalOptions":{
"MAXERRORS":"10000",
"DATEFORMAT":"'ymd'"
}
}
},
"enableSkipIncompatibleRow": true
}
}
]

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from Azure Synapse Analytics.
For more information, see the source transformation and sink transformation in mapping data flows.
Source transformation
Settings specific to Azure Synapse Analytics are available in the Source Options tab of the source
transformation.
Input: Select whether you point your source at a table (equivalent of Select * from <table-name> ) or enter a custom SQL query.
Enable staging: It is highly recommended that you use this option in production workloads with Azure Synapse Analytics sources. When you execute a data flow activity with Azure Synapse Analytics sources from a pipeline, ADF prompts you for a staging location storage account and uses that for staged data loading. It is the fastest mechanism to load data from Azure Synapse Analytics.
When you use managed identity authentication for your storage linked service, learn the needed configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with a VNet service endpoint, you must use managed identity authentication with "allow trusted Microsoft services" enabled on the storage account; refer to Impact of using VNet Service Endpoints with Azure storage.
When you use Azure Synapse serverless SQL pool as the source, enable staging is not supported.
Query: If you select Query in the input field, enter a SQL query for your source. This setting overrides any table that you've chosen in the dataset. Order By clauses aren't supported here, but you can set a full SELECT FROM statement. You can also use user-defined table functions; for example, select * from udfGetData() calls a UDF in SQL that returns a table. This query produces a source table that you can use in your data flow. Using queries is also a great way to reduce rows for testing or for lookups.
SQL Example: Select * from MyTable where customerId > 1000 and customerId < 2000

Batch size : Enter a batch size to chunk large data into reads. In data flows, ADF uses this setting to set Spark columnar caching. This is an optional field, which uses Spark defaults if left blank.
Isolation Level : The default for SQL sources in mapping data flow is read uncommitted. You can change the
isolation level here to one of these values:
Read Committed
Read Uncommitted
Repeatable Read
Serializable
None (ignore isolation level)

Sink transformation
Settings specific to Azure Synapse Analytics are available in the Settings tab of the sink transformation.
Update method: Determines what operations are allowed on your database destination. The default is to only
allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those
actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.
Table action: Determines whether to recreate or remove all rows from the destination table prior to writing.
None: No action will be done to the table.
Recreate: The table will get dropped and recreated. Required if creating a new table dynamically.
Truncate: All rows from the target table will get removed.
Enable staging: This enables loading into Azure Synapse Analytics SQL pools using the COPY command and is recommended for most Synapse sinks. The staging storage is configured in the Execute Data Flow activity.
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity
authentication with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using
VNet Service Endpoints with Azure storage.
Batch size : Controls how many rows are being written in each bucket. Larger batch sizes improve compression
and memory optimization, but risk out of memory exceptions when caching data.
Pre and Post SQL scripts : Enter multi-line SQL scripts that execute before (pre-processing) and after (post-processing) data is written to your sink database.

Error row handling


When writing to Azure Synapse Analytics, certain rows of data may fail due to constraints set by the destination.
Some common errors include:
String or binary data would be truncated in table
Cannot insert the value NULL into column
Conversion failed when converting the value to data type
By default, a data flow run will fail on the first error it gets. You can choose Continue on error , which allows your data flow to complete even if individual rows have errors. Azure Data Factory provides different options for you to handle these error rows.
Transaction Commit: Choose whether your data gets written in a single transaction or in batches. Single
transaction will provide better performance and no data written will be visible to others until the transaction
completes. Batch transactions have worse performance but can work for large datasets.
Output rejected data: If enabled, you can output the error rows into a csv file in Azure Blob Storage or an
Azure Data Lake Storage Gen2 account of your choosing. This will write the error rows with three additional
columns: the SQL operation like INSERT or UPDATE, the data flow error code, and the error message on the row.
Report success on error : If enabled, the data flow will be marked as a success even if error rows are found.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity

Data type mapping for Azure Synapse Analytics


When you copy data from or to Azure Synapse Analytics, the following mappings are used from Azure Synapse
Analytics data types to Azure Data Factory interim data types. See schema and data type mappings to learn how
Copy Activity maps the source schema and data type to the sink.

TIP
Refer to Table data types in Azure Synapse Analytics article on Azure Synapse Analytics supported data types and the
workarounds for unsupported ones.
Azure Synapse Analytics data type    Data Factory interim data type

bigint Int64

binary Byte[]

bit Boolean

char String, Char[]

date DateTime

Datetime DateTime

datetime2 DateTime

Datetimeoffset DateTimeOffset

Decimal Decimal

FILESTREAM attribute (varbinary(max)) Byte[]

Float Double

image Byte[]

int Int32

money Decimal

nchar String, Char[]

numeric Decimal

nvarchar String, Char[]

real Single

rowversion Byte[]

smalldatetime DateTime

smallint Int16

smallmoney Decimal

time TimeSpan

tinyint Byte

uniqueidentifier Guid

varbinary Byte[]

varchar String, Char[]

Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores and formats.
Copy data to and from Azure Table storage by
using Azure Data Factory
5/28/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data to and from Azure Table
storage. It builds on the Copy Activity overview article that presents a general overview of Copy Activity.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Supported capabilities
This Azure Table storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from any supported source data store to Table storage. You also can copy data from Table
storage to any supported sink data store. For a list of data stores that are supported as sources or sinks by the
copy activity, see the Supported data stores table.
Specifically, this Azure Table connector supports copying data by using account key and service shared access
signature authentications.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Table storage.

Linked service properties


Use an account key
You can create an Azure Storage linked service by using the account key. It provides the data factory with global
access to Storage. The following properties are supported.
type: The type property must be set to AzureTableStorage. Required: Yes.

connectionString: Specify the information needed to connect to Storage for the connectionString property. You can also put the account key in Azure Key Vault and pull the accountKey configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. Required: Yes.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or the Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. Required: No.

NOTE
If you were using "AzureStorage" type linked service, it is still supported as-is, while you are suggested to use this new
"AzureTableStorage" linked service type going forward.

Example:

{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store account key in Azure Key Vault


{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use shared access signature authentication


You also can create a Storage linked service by using a shared access signature. It provides the data factory with
restricted/time-bound access to all/specific resources in the storage.
A shared access signature provides delegated access to resources in your storage account. You can use it to
grant a client limited permissions to objects in your storage account for a specified time and with a specified set
of permissions. You don't have to share your account access keys. The shared access signature is a URI that
encompasses in its query parameters all the information necessary for authenticated access to a storage
resource. To access storage resources with the shared access signature, the client only needs to pass in the
shared access signature to the appropriate constructor or method. For more information about shared access
signatures, see Shared access signatures: Understand the shared access signature model.

NOTE
Data Factory now supports both ser vice shared access signatures and account shared access signatures . For
more information about shared access signatures, see Grant limited access to Azure Storage resources using shared
access signatures (SAS).

TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime
<startTime> -ExpiryTime <endTime> -FullUri

To use shared access signature authentication, the following properties are supported.

type: The type property must be set to AzureTableStorage. Required: Yes.

sasUri: Specify the SAS URI of the shared access signature URI to the table. Mark this field as a SecureString to store it securely in Data Factory. You can also put the SAS token in Azure Key Vault to leverage auto rotation and remove the token portion. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. Required: Yes.

connectVia: The integration runtime to be used to connect to the data store. You can use the Azure Integration Runtime or the Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. Required: No.

NOTE
If you were using the "AzureStorage" type linked service, it is still supported as is, but we suggest that you use the new "AzureTableStorage" linked service type going forward.

Example:

{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<account>.table.core.windows.net/<table>?sv=<storage version>&amp;st=<start time>&amp;se=<expire
time>&amp;sr=<resource>&amp;sp=<permissions>&amp;sip=<ip range>&amp;spr=<protocol>&amp;sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store account key in Azure Key Vault


{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<account>.table.core.windows.net/<table>>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set the Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active period of the pipeline.
The URI should be created at the right table level based on the need.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure Table dataset.
To copy data to and from Azure Table, set the type property of the dataset to AzureTable . The following
properties are supported.

type: The type property of the dataset must be set to AzureTable. Required: Yes.

tableName: The name of the table in the Table storage database instance that the linked service refers to. Required: Yes.

Example:
{
"name": "AzureTableDataset",
"properties":
{
"type": "AzureTable",
"typeProperties": {
"tableName": "MyTable"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Table storage linked service name>",
"type": "LinkedServiceReference"
}
}
}

Schema by Data Factory


For schema-free data stores such as Azure Table, Data Factory infers the schema in one of the following ways:
If you specify the column mapping in copy activity, Data Factory uses the source side column list to retrieve
data. In this case, if a row doesn't contain a value for a column, a null value is provided for it.
If you don't specify the column mapping in copy activity, Data Factory infers the schema by using the first
row in the data. In this case, if the first row doesn't contain the full schema (e.g. some columns have null
value), some columns are missed in the result of the copy operation.
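To make the source column list explicit instead of relying on first-row inference, you can specify a column mapping on the copy activity. The following is a hedged sketch; the column names are hypothetical.

"typeProperties": {
    "source": {
        "type": "AzureTableSource"
    },
    "sink": {
        "type": "<sink type>"
    },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "PartitionKey: PartitionKey, RowKey: RowKey, CustomerName: CustomerName"
    }
}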

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Azure Table source and sink.
Azure Table as a source type
To copy data from Azure Table, set the source type in the copy activity to AzureTableSource . The following
properties are supported in the copy activity source section.

type: The type property of the copy activity source must be set to AzureTableSource. Required: Yes.

azureTableSourceQuery: Use the custom Table storage query to read data. The source query is a direct map from the $filter query option supported by Azure Table storage; learn more about the syntax from this doc, and see the examples in the following azureTableSourceQuery examples section. Required: No.

azureTableSourceIgnoreTableNotFound: Indicates whether to allow the exception of the table not existing. Allowed values are True and False (default). Required: No.
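A minimal sketch of an Azure Table source section using these properties; the filter expression is a hypothetical example.

"source": {
    "type": "AzureTableSource",
    "azureTableSourceQuery": "PartitionKey eq 'Sales'",
    "azureTableSourceIgnoreTableNotFound": false
}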

azureTableSourceQuery examples
NOTE
An Azure Table query operation times out after 30 seconds, as enforced by the Azure Table service. Learn how to optimize the query from the Design for querying article.

In Azure Data Factory, if you want to filter the data against a datetime type column, refer to this example:

"azureTableSourceQuery": "LastModifiedTime gt datetime'2017-10-01T00:00:00' and LastModifiedTime le


datetime'2017-10-02T00:00:00'"

If you want to filter the data against a string type column, refer to this example:

"azureTableSourceQuery": "LastModifiedTime ge '201710010000_0000' and LastModifiedTime le


'201710010000_9999'"

If you use a pipeline parameter, cast the datetime value to the proper format according to the previous samples, as in the sketch below.
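For instance, a hedged sketch that builds the filter from assumed pipeline parameters (windowStart and windowEnd) via ADF string interpolation might look like this:

"azureTableSourceQuery": {
    "value": "LastModifiedTime gt datetime'@{pipeline().parameters.windowStart}' and LastModifiedTime le datetime'@{pipeline().parameters.windowEnd}'",
    "type": "Expression"
}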
Azure Table as a sink type
To copy data to Azure Table, set the sink type in the copy activity to AzureTableSink . The following properties
are supported in the copy activity sink section.

type: The type property of the copy activity sink must be set to AzureTableSink. Required: Yes.

azureTableDefaultPartitionKeyValue: The default partition key value that can be used by the sink. Required: No.

azureTablePartitionKeyName: Specify the name of the column whose values are used as partition keys. If not specified, "AzureTableDefaultPartitionKeyValue" is used as the partition key. Required: No.

azureTableRowKeyName: Specify the name of the column whose values are used as the row key. If not specified, a GUID is used for each row. Required: No.

azureTableInsertType: The mode to insert data into Azure Table. This property controls whether existing rows in the output table with matching partition and row keys have their values replaced or merged. Allowed values are merge (default) and replace. This setting applies at the row level, not the table level. Neither option deletes rows in the output table that do not exist in the input. To learn about how the merge and replace settings work, see Insert or merge entity and Insert or replace entity. Required: No.

writeBatchSize: Inserts data into Azure Table when writeBatchSize or writeBatchTimeout is hit. Allowed values are integers (number of rows). Required: No (default is 10,000).

writeBatchTimeout: Inserts data into Azure Table when writeBatchSize or writeBatchTimeout is hit. Allowed values are timespans. An example is "00:20:00" (20 minutes). Required: No (default is 90 seconds, the storage client's default timeout).

maxConcurrentConnections: The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. Required: No.

Example:

"activities":[
{
"name": "CopyToAzureTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Table output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "<column name>",
"azureTableRowKeyName": "<column name>"
}
}
}
]

azureTablePartitionKeyName
Map a source column to a destination column by using the "translator" property before you can use the
destination column as azureTablePartitionKeyName.
In the following example, source column DivisionID is mapped to the destination column DivisionID:
"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}

"DivisionID" is specified as the partition key.

"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "DivisionID"
}

Data type mapping for Azure Table


When you copy data from and to Azure Table, the following mappings are used from Azure Table data types to
Data Factory interim data types. To learn about how the copy activity maps the source schema and data type to
the sink, see Schema and data type mappings.
When you move data to and from Azure Table, the following mappings defined by Azure Table are used from Azure Table OData types to .NET types and vice versa.

Edm.Binary: byte[]. An array of bytes up to 64 KB.
Edm.Boolean: bool. A Boolean value.
Edm.DateTime: DateTime. A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins midnight, January 1, 1601 A.D. (C.E.), UTC. The range ends December 31, 9999.
Edm.Double: double. A 64-bit floating point value.
Edm.Guid: Guid. A 128-bit globally unique identifier.
Edm.Int32: Int32. A 32-bit integer.
Edm.Int64: Int64. A 64-bit integer.
Edm.String: String. A UTF-16-encoded value. String values can be up to 64 KB.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Binary format in Azure Data Factory
5/14/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Binary format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure
Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.
You can use Binary dataset in Copy activity, GetMetadata activity, or Delete activity. When using Binary dataset, ADF does not parse file content but treats it as-is.

NOTE
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Binary dataset.

type: The type property of the dataset must be set to Binary. Required: Yes.

location: Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. Required: Yes.

compression: Group of properties to configure file compression. Configure this section when you want to do compression/decompression during activity execution. Required: No.

compression -> type: The compression codec used to read/write binary files. Allowed values are bzip2, gzip, deflate, ZipDeflate, Tar, or TarGzip. Note that when using copy activity to decompress ZipDeflate/TarGzip/Tar file(s) and write to a file-based sink data store, by default files are extracted to the folder <path specified in dataset>/<folder named as source compressed file>/. Use preserveZipFileNameAsFolder / preserveCompressionFileNameAsFolder on the copy activity source to control whether to preserve the name of the compressed file(s) as folder structure. Required: No.

compression -> level: The compression ratio. Apply when the dataset is used in a Copy activity sink. Allowed values are Optimal or Fastest.
- Fastest: The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed.
- Optimal: The compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see the Compression Level topic.
Required: No.

Below is an example of Binary dataset on Azure Blob Storage:

{
"name": "BinaryDataset",
"properties": {
"type": "Binary",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "ZipDeflate"
}
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Binary source and sink.
NOTE
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.

Binary as source
The following properties are supported in the copy activity *source* section.

type: The type property of the copy activity source must be set to BinarySource. Required: Yes.

formatSettings: A group of properties. Refer to the Binary read settings table below. Required: No.

storeSettings: A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. Required: No.

Supported binary read settings under formatSettings :

type: The type of formatSettings must be set to BinaryReadSettings. Required: Yes.

compressionProperties: A group of properties on how to decompress data for a given compression codec. Required: No.

preserveZipFileNameAsFolder (under compressionProperties -> type as ZipDeflateReadSettings): Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy.
- When set to true (default), Data Factory writes unzipped files to <path specified in dataset>/<folder named as source zip file>/.
- When set to false, Data Factory writes unzipped files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source zip files to avoid racing or unexpected behavior.
Required: No.

preserveCompressionFileNameAsFolder (under compressionProperties -> type as TarGZipReadSettings or TarReadSettings): Applies when the input dataset is configured with TarGzip/Tar compression. Indicates whether to preserve the source compressed file name as folder structure during copy.
- When set to true (default), Data Factory writes decompressed files to <path specified in dataset>/<folder named as source compressed file>/.
- When set to false, Data Factory writes decompressed files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source files to avoid racing or unexpected behavior.
Required: No.

"activities": [
{
"name": "CopyFromBinary",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"deleteFilesAfterCompletion": true
},
"formatSettings": {
"type": "BinaryReadSettings",
"compressionProperties": {
"type": "ZipDeflateReadSettings",
"preserveZipFileNameAsFolder": false
}
}
},
...
}
...
}
]

Binary as sink
The following properties are supported in the copy activity *sink* section.

type: The type property of the copy activity sink must be set to BinarySink. Required: Yes.

storeSettings: A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See details in the connector article -> Copy activity properties section. Required: No.
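The article doesn't include a full sink example, so the following is a hedged sketch of a copy activity that writes Binary data to Azure Blob Storage; the dataset names and store settings are placeholders.

"activities": [
    {
        "name": "CopyToBinary",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Binary input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Binary output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "BinarySource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            "sink": {
                "type": "BinarySink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings"
                }
            }
        }
    }
]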
Next steps
Copy activity overview
GetMetadata activity
Delete activity
Copy data from Cassandra using Azure Data
Factory
5/6/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Cassandra database.
It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Cassandra connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Cassandra database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Cassandra connector supports:
Cassandra versions 2.x and 3.x .
Copying data using Basic or Anonymous authentication.

NOTE
For activity running on Self-hosted Integration Runtime, Cassandra 3.x is supported since IR version 3.7 and above.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in Cassandra driver; therefore, you don't need to manually install any driver when copying data from or to Cassandra.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Cassandra connector.

Linked service properties


The following properties are supported for Cassandra linked service:

type: The type property must be set to Cassandra. Required: Yes.

host: One or more IP addresses or host names of Cassandra servers. Specify a comma-separated list of IP addresses or host names to connect to all servers concurrently. Required: Yes.

port: The TCP port that the Cassandra server uses to listen for client connections. Required: No (default is 9042).

authenticationType: Type of authentication used to connect to the Cassandra database. Allowed values are Basic and Anonymous. Required: Yes.

username: Specify the user name for the user account. Required: Yes, if authenticationType is set to Basic.

password: Specify the password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, if authenticationType is set to Basic.

connectVia: The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. Required: No.

NOTE
Currently connection to Cassandra using TLS is not supported.

Example:
{
"name": "CassandraLinkedService",
"properties": {
"type": "Cassandra",
"typeProperties": {
"host": "<host>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Cassandra dataset.
To copy data from Cassandra, set the type property of the dataset to CassandraTable . The following properties
are supported:

type: The type property of the dataset must be set to CassandraTable. Required: Yes.

keyspace: Name of the keyspace or schema in the Cassandra database. Required: No (if "query" for "CassandraSource" is specified).

tableName: Name of the table in the Cassandra database. Required: No (if "query" for "CassandraSource" is specified).

Example:

{
"name": "CassandraDataset",
"properties": {
"type": "CassandraTable",
"typeProperties": {
"keySpace": "<keyspace name>",
"tableName": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Cassandra linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Cassandra source.
Cassandra as source
To copy data from Cassandra, set the source type in the copy activity to CassandraSource . The following
properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to CassandraSource. Required: Yes.

query: Use the custom query to read data. It can be a SQL-92 query or a CQL query; see the CQL reference. When using a SQL query, specify keyspace name.table name to represent the table you want to query. Required: No (if "tableName" and "keyspace" in the dataset are specified).

consistencyLevel: The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. See Configuring data consistency for details. Allowed values are ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, and LOCAL_ONE. Required: No (default is ONE).

Example:
"activities":[
{
"name": "CopyFromCassandra",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cassandra input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CassandraSource",
"query": "select id, firstname, lastname from mykeyspace.mytable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Data type mapping for Cassandra


When copying data from Cassandra, the following mappings are used from Cassandra data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

Cassandra data type    Data Factory interim data type

ASCII String

BIGINT Int64

BLOB Byte[]

BOOLEAN Boolean

DECIMAL Decimal

DOUBLE Double

FLOAT Single

INET String

INT Int32

TEXT String

TIMESTAMP DateTime

TIMEUUID Guid

UUID Guid

VARCHAR String

VARINT Decimal

NOTE
For collection types (map, set, list, etc.), refer to Work with Cassandra collection types using virtual table section.
User-defined types are not supported.
The length of Binary and String columns cannot be greater than 4000.

Work with collections using virtual table


Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your Cassandra database. For
collection types including map, set and list, the driver renormalizes the data into corresponding virtual tables.
Specifically, if a table contains any collection columns, the driver generates the following virtual tables:
A base table , which contains the same data as the real table except for the collection columns. The base
table uses the same name as the real table that it represents.
A virtual table for each collection column, which expands the nested data. The virtual tables that represent
collections are named using the name of the real table, a separator "vt" and the name of the column.
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See Example
section for details. You can access the content of Cassandra collections by querying and joining the virtual
tables.
Example
For example, the following "ExampleTable" is a Cassandra database table that contains an integer primary key
column named "pk_int", a text column named value, a list column, a map column, and a set column (named
"StringSet").

pk_int | Value | List | Map | StringSet
1 | "sample value 1" | ["1", "2", "3"] | {"S1": "a", "S2": "b"} | {"A", "B", "C"}
3 | "sample value 3" | ["100", "101", "102", "105"] | {"S1": "t"} | {"A", "E"}

The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the
virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual
table row corresponds to.
The first virtual table is the base table, named "ExampleTable", shown in the following table:

pk_int | Value
1 | "sample value 1"
3 | "sample value 3"

The base table contains the same data as the original database table except for the collections, which are omitted
from this table and expanded in other virtual tables.
The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns.
The columns with names that end with "_index" or "_key" indicate the position of the data within the original list
or map. The columns with names that end with "_value" contain the expanded data from the collection.
Table "ExampleTable_vt_List":

pk_int | List_index | List_value
1 | 0 | 1
1 | 1 | 2
1 | 2 | 3
3 | 0 | 100
3 | 1 | 101
3 | 2 | 102
3 | 3 | 105

Table "ExampleTable_vt_Map":

pk_int | Map_key | Map_value
1 | S1 | a
1 | S2 | b
3 | S1 | t

Table "ExampleTable_vt_StringSet":

pk_int | StringSet_value
1 | A
1 | B
1 | C
3 | A
3 | E
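As a hedged sketch of putting this to use, a copy activity source could join the base table with a virtual table in a SQL-92 query; the keyspace name and exact join syntax below are assumptions rather than something taken from this article.

"source": {
    "type": "CassandraSource",
    "query": "SELECT t.pk_int, t.Value, l.List_value FROM mykeyspace.ExampleTable t JOIN mykeyspace.ExampleTable_vt_List l ON t.pk_int = l.pk_int"
}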
Lookup activity properties
To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Common Data Model format in Azure Data Factory
3/5/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Common Data Model (CDM) metadata system makes it possible for data and its meaning to be easily
shared across applications and business processes. To learn more, see the Common Data Model overview.
In Azure Data Factory, users can transform data from CDM entities in both model.json and manifest form stored
in Azure Data Lake Store Gen2 (ADLS Gen2) using mapping data flows. You can also sink data in CDM format
using CDM entity references that will land your data in CSV or Parquet format in partitioned folders.

Mapping data flow properties


The Common Data Model is available as an inline dataset in mapping data flows as both a source and a sink.

NOTE
When writing CDM entities, you must have an existing CDM entity definition (metadata schema) already defined to use as
a reference. The ADF data flow sink will read that CDM entity file and import the schema into your sink for field mapping.

Source properties
The below table lists the properties supported by a CDM source. You can edit these properties in the Source
options tab.

Format: Format must be cdm. Required: yes. Allowed values: cdm. Data flow script property: format.

Metadata format: Where the entity reference to the data is located. If using CDM version 1.0, choose manifest. If using a CDM version before 1.0, choose model.json. Required: yes. Allowed values: 'manifest' or 'model'. Data flow script property: manifestType.

Root location: container: Container name of the CDM folder. Required: yes. Allowed values: String. Data flow script property: fileSystem.

Root location: folder path: Root folder location of the CDM folder. Required: yes. Allowed values: String. Data flow script property: folderPath.

Manifest file: Entity path: Folder path of the entity within the root folder. Required: no. Allowed values: String. Data flow script property: entityPath.

Manifest file: Manifest name: Name of the manifest file. Default value is 'default'. Required: no. Allowed values: String. Data flow script property: manifestName.

Filter by last modified: Choose to filter files based upon when they were last altered. Required: no. Allowed values: Timestamp. Data flow script properties: modifiedAfter, modifiedBefore.

Schema linked service: The linked service where the corpus is located. Required: yes, if using manifest. Allowed values: 'adlsgen2' or 'github'. Data flow script property: corpusStore.

Entity reference container: Container the corpus is in. Required: yes, if using manifest and corpus in ADLS Gen2. Allowed values: String. Data flow script property: adlsgen2_fileSystem.

Entity reference Repository: GitHub repository name. Required: yes, if using manifest and corpus in GitHub. Allowed values: String. Data flow script property: github_repository.

Entity reference Branch: GitHub repository branch. Required: yes, if using manifest and corpus in GitHub. Allowed values: String. Data flow script property: github_branch.

Corpus folder: The root location of the corpus. Required: yes, if using manifest. Allowed values: String. Data flow script property: corpusPath.

Corpus entity: Path to the entity reference. Required: yes. Allowed values: String. Data flow script property: entity.

Allow no files found: If true, an error is not thrown if no files are found. Required: no. Allowed values: true or false. Data flow script property: ignoreNoFilesFound.

When selecting "Entity Reference" both in the Source and Sink transformations, you can select from these three
options for the location of your entity reference:
Local uses the entity defined in the manifest file already being used by ADF
Custom will ask you to point to an entity manifest file that is different from the manifest file ADF is using
Standard will use an entity reference from the standard library of CDM entities maintained in GitHub.
Sink settings
Point to the CDM entity reference file that contains the definition of the entity you would like to write.

Define the partition path and format of the output files that you want ADF to use for writing your entities.
Set the output file location and the location and name for the manifest file.

Import schema
CDM is only available as an inline dataset and, by default, doesn't have an associated schema. To get column
metadata, click the Import schema button in the Projection tab. This will allow you to reference the column
names and data types specified by the corpus. To import the schema, a data flow debug session must be active
and you must have an existing CDM entity definition file to point to.
When mapping data flow columns to entity properties in the Sink transformation, click on the "Mapping" tab
and select "Import Schema". ADF will read the entity reference that you pointed to in your Sink options, allowing
you to map to the target CDM schema.

NOTE
When using model.json source type that originates from Power BI or Power Platform dataflows, you may encounter
"corpus path is null or empty" errors from the source transformation. This is likely due to formatting issues of the
partition location path in the model.json file. To fix this, follow these steps:

1. Open the model.json file in a text editor


2. Find the partitions.Location property
3. Change "blob.core.windows.net" to "dfs.core.windows.net" (see the example after this note)
4. Fix any "%2F" encoding in the URL to "/"
5. If using ADF data flows, special characters in the partition file path must be replaced with alphanumeric
values, or switch to Synapse data flows
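
For illustration, assuming a hypothetical storage account named mystorageaccount and a partition file named part-00000.csv, a partitions Location value before and after applying steps 3 and 4 would look like this:

Before: https://mystorageaccount.blob.core.windows.net/powerbi/Sales%2Fcsv%2Fpart-00000.csv
After: https://mystorageaccount.dfs.core.windows.net/powerbi/Sales/csv/part-00000.csv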
CDM source data flow script example
source(output(
ProductSizeId as integer,
ProductColor as integer,
CustomerId as string,
Note as string,
LastModifiedDate as timestamp
),
allowSchemaDrift: true,
validateSchema: false,
entity: 'Product.cdm.json/Product',
format: 'cdm',
manifestType: 'manifest',
manifestName: 'ProductManifest',
entityPath: 'Product',
corpusPath: 'Products',
corpusStore: 'adlsgen2',
adlsgen2_fileSystem: 'models',
folderPath: 'ProductData',
fileSystem: 'data') ~> CDMSource

Sink properties
The below table lists the properties supported by a CDM sink. You can edit these properties in the Settings tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Format | Format must be cdm | yes | cdm | format
Root location: container | Container name of the CDM folder | yes | String | fileSystem
Root location: folder path | Root folder location of the CDM folder | yes | String | folderPath
Manifest file: Entity path | Folder path of the entity within the root folder | no | String | entityPath
Manifest file: Manifest name | Name of the manifest file. Default value is 'default' | No | String | manifestName
Schema linked service | The linked service where the corpus is located | yes | 'adlsgen2' or 'github' | corpusStore
Entity reference container | Container the corpus is in | yes, if corpus in ADLS Gen2 | String | adlsgen2_fileSystem
Entity reference Repository | GitHub repository name | yes, if corpus in GitHub | String | github_repository
Entity reference Branch | GitHub repository branch | yes, if corpus in GitHub | String | github_branch
Corpus folder | The root location of the corpus | yes | String | corpusPath
Corpus entity | Path to entity reference | yes | String | entity
Partition path | Location where the partition will be written | no | String | partitionPath
Clear the folder | If the destination folder is cleared prior to write | no | true or false | truncate
Format type | Choose to specify parquet format | no | parquet if specified | subformat
Column delimiter | If writing to DelimitedText, how to delimit columns | yes, if writing to DelimitedText | String | columnDelimiter
First row as header | If using DelimitedText, whether the column names are added as a header | no | true or false | columnNamesAsHeader

CDM sink data flow script example


The associated data flow script is:

CDMSource sink(allowSchemaDrift: true,


validateSchema: false,
entity: 'Product.cdm.json/Product',
format: 'cdm',
entityPath: 'ProductSize',
manifestName: 'ProductSizeManifest',
corpusPath: 'Products',
partitionPath: 'adf',
folderPath: 'ProductSizeData',
fileSystem: 'cdm',
subformat: 'parquet',
corpusStore: 'adlsgen2',
adlsgen2_fileSystem: 'models',
truncate: true,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> CDMSink

Next steps
Create a source transformation in mapping data flow.
Copy data from Concur using Azure Data Factory
(Preview)
5/6/2021 • 4 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Concur. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Concur connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Concur to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.

NOTE
Partner account is currently not supported.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Concur connector.

Linked service properties


The following properties are supported for Concur linked service:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Concur | Yes
connectionProperties | A group of properties that defines how to connect to Concur. | Yes

Under connectionProperties :

authenticationType | Allowed values are OAuth_2.0_Bearer and OAuth_2.0 (legacy). The OAuth 2.0 (legacy) authentication option works with the old Concur API, which has been deprecated since February 2017. | Yes
host | The endpoint of the Concur server, e.g. implementation.concursolutions.com . | Yes
baseUrl | The base URL of your Concur's authorization URL. | Yes for OAuth_2.0_Bearer authentication
clientId | Application client ID supplied by Concur App Management. | Yes
clientSecret | The client secret corresponding to the client ID. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes for OAuth_2.0_Bearer authentication
username | The user name that you use to access the Concur service. | Yes
password | The password corresponding to the user name that you provided in the username field. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No

Example:
{
"name":"ConcurLinkedService",
"properties":{
"type":"Concur",
"typeProperties":{
"connectionProperties":{
"host":"<host e.g. implementation.concursolutions.com>",
"baseUrl":"<base URL for authorization e.g. us-impl.api.concursolutions.com>",
"authenticationType":"OAuth_2.0_Bearer",
"clientId":"<client id>",
"clientSecret":{
"type": "SecureString",
"value": "<client secret>"
},
"username":"fakeUserName",
"password":{
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
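
Where the table above mentions referencing a secret stored in Azure Key Vault, the clientSecret (and password ) value can use the same AzureKeyVaultSecret pattern that appears in other connector examples in this documentation. A minimal sketch of just that fragment, assuming an Azure Key Vault linked service already exists:

"clientSecret": {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "<Azure Key Vault linked service name>",
        "type": "LinkedServiceReference"
    },
    "secretName": "<name of the secret that holds the client secret>"
}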

Example (legacy):
Note the following is a legacy linked service model without connectionProperties and using OAuth_2.0
authentication.

{
"name": "ConcurLinkedService",
"properties": {
"type": "Concur",
"typeProperties": {
"clientId" : "<clientId>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Concur dataset.
To copy data from Concur, set the type property of the dataset to ConcurObject . There is no additional type-
specific property in this type of dataset. The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: ConcurObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "ConcurDataset",
"properties": {
"type": "ConcurObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Concur linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Concur source.
ConcurSource as source
To copy data from Concur, set the source type in the copy activity to ConcurSource . The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: ConcurSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM Opportunities where Id = xxx" . | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromConcur",
"type": "Copy",
"inputs": [
{
"referenceName": "<Concur input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ConcurSource",
"query": "SELECT * FROM Opportunities where Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Couchbase using Azure Data
Factory (Preview)
5/6/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Couchbase. It builds
on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Couchbase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Couchbase to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Couchbase connector.

Linked service properties


The following properties are supported for Couchbase linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Couchbase | Yes
connectionString | An ODBC connection string to connect to Couchbase. You can also put the credential string in Azure Key Vault and pull the credString configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": "Server=<server>; Port=<port>;AuthMech=1;CredString=[{\"user\": \"JSmith\",
\"pass\":\"access123\"}, {\"user\": \"Admin\", \"pass\":\"simba123\"}];"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store credential string in Azure Key Vault


{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": "Server=<server>; Port=<port>;AuthMech=1;",
"credString": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Couchbase dataset.
To copy data from Couchbase, set the type property of the dataset to CouchbaseTable . The following
properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: CouchbaseTable | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "CouchbaseDataset",
"properties": {
"type": "CouchbaseTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Couchbase linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Couchbase source.
CouchbaseSource as source
To copy data from Couchbase, set the source type in the copy activity to CouchbaseSource . The following
properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: CouchbaseSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromCouchbase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Couchbase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CouchbaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from DB2 by using Azure Data Factory
5/6/2021 • 7 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a DB2 database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This DB2 database connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from DB2 database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this DB2 connector supports the following IBM DB2 platforms and versions with Distributed
Relational Database Architecture (DRDA) SQL Access Manager (SQLAM) version 9, 10 and 11. It utilizes the
DDM/DRDA protocol.
IBM DB2 for z/OS 12.1
IBM DB2 for z/OS 11.1
IBM DB2 for z/OS 10.1
IBM DB2 for i 7.3
IBM DB2 for i 7.2
IBM DB2 for i 7.1
IBM DB2 for LUW 11
IBM DB2 for LUW 10.5
IBM DB2 for LUW 10.1

TIP
DB2 connector is built on top of Microsoft OLE DB Provider for DB2. To troubleshoot DB2 connector errors, refer to Data
Provider Error Codes.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in DB2 driver, therefore you don't need to manually install any driver
when copying data from DB2.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
DB2 connector.

Linked service properties


The following properties are supported for DB2 linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Db2 | Yes
connectionString | Specify the information needed to connect to the DB2 instance. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

Typical properties inside the connection string:

PROPERTY | DESCRIPTION | REQUIRED
server | Name of the DB2 server. You can specify the port number following the server name delimited by a colon, e.g. server:port . The DB2 connector utilizes the DDM/DRDA protocol, and by default uses port 50000 if not specified. The port your specific DB2 database uses might be different based on the version and your settings, e.g. for DB2 LUW the default port is 50000, for AS400 the default port is 446, or 448 when TLS is enabled. Refer to the following DB2 documents on how the port is typically configured: DB2 z/OS, DB2 iSeries, and DB2 LUW. | Yes
database | Name of the DB2 database. | Yes
authenticationType | Type of authentication used to connect to the DB2 database. Allowed value is: Basic. | Yes
username | Specify the user name to connect to the DB2 database. | Yes
password | Specify the password for the user account you specified for the username. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
packageCollection | Specify the collection under which the needed packages are auto created by ADF when querying the database. If this is not set, Data Factory uses {username} as the default value. | No
certificateCommonName | When you use Secure Sockets Layer (SSL) or Transport Layer Security (TLS) encryption, you must enter a value for Certificate common name. | No

TIP
If you receive an error message that states The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805 , the reason is that a needed package was not created for the user. By default, ADF will try to create the package under a collection named after the user you used to connect to DB2. Specify the packageCollection property to indicate where you want ADF to create the needed packages when querying the database.

Example:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"connectionString":"server=<server:port>;database=<database>;authenticationType=Basic;username=
<username>;password=<password>;packageCollection=<packagecollection>;certificateCommonName=<certname>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"connectionString": "server=<server:port>;database=<database>;authenticationType=Basic;username=
<username>;packageCollection=<packagecollection>;certificateCommonName=<certname>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using a DB2 linked service with the following payload, it is still supported as-is, but you are
encouraged to use the new one going forward.
Previous payload:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"server": "<servername:port>",
"database": "<dbname>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by DB2 dataset.
To copy data from DB2, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: Db2Table | Yes
schema | Name of the schema. | No (if "query" in activity source is specified)
table | Name of the table. | No (if "query" in activity source is specified)
tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. | No (if "query" in activity source is specified)

Example

{
"name": "DB2Dataset",
"properties":
{
"type": "Db2Table",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<DB2 linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using a RelationalTable typed dataset, it is still supported as-is, but you are encouraged to use the
new one going forward.
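
For reference, a legacy dataset of that shape looks roughly like the following. This is a minimal sketch for comparison only, assuming the same DB2 linked service as above; use the Db2Table type for new workloads.

{
    "name": "DB2LegacyDataset",
    "properties": {
        "type": "RelationalTable",
        "typeProperties": {
            "tableName": "<schema>.<table>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<DB2 linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}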

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by DB2 source.
DB2 as source
To copy data from DB2, the following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: Db2Source | Yes
query | Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"DB2ADMIN\".\"Customers\"" . | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromDB2",
"type": "Copy",
"inputs": [
{
"referenceName": "<DB2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "Db2Source",
"query": "SELECT * FROM \"DB2ADMIN\".\"Customers\""
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using a RelationalSource typed source, it is still supported as-is, but you are encouraged to use the
new one going forward.

Data type mapping for DB2


When copying data from DB2, the following mappings are used from DB2 data types to Azure Data Factory
interim data types. See Schema and data type mappings to learn about how copy activity maps the source
schema and data type to the sink.
DB2 DATABASE TYPE | DATA FACTORY INTERIM DATA TYPE
BigInt | Int64
Binary | Byte[]
Blob | Byte[]
Char | String
Clob | String
Date | Datetime
DB2DynArray | String
DbClob | String
Decimal | Decimal
DecimalFloat | Decimal
Double | Double
Float | Double
Graphic | String
Integer | Int32
LongVarBinary | Byte[]
LongVarChar | String
LongVarGraphic | String
Numeric | Decimal
Real | Single
SmallInt | Int16
Time | TimeSpan
Timestamp | DateTime
VarBinary | Byte[]
VarChar | String
VarGraphic | String
Xml | Byte[]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Microsoft
Dataverse) or Dynamics CRM by using Azure Data
Factory
6/16/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use a copy activity in Azure Data Factory to copy data from and to Microsoft
Dynamics 365 and Microsoft Dynamics CRM. It builds on the copy activity overview article that presents a
general overview of a copy activity.

Supported capabilities
This connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
You can copy data from Dynamics 365 (Microsoft Dataverse) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Microsoft Dataverse) or
Dynamics CRM. For a list of data stores that a copy activity supports as sources and sinks, see the Supported
data stores table.

NOTE
Effective November 2020, Common Data Service has been renamed to Microsoft Dataverse. This article is updated to
reflect the latest terminology.

This Dynamics connector supports Dynamics versions 7 through 9 for both online and on-premises. More
specifically:
Version 7 maps to Dynamics CRM 2015.
Version 8 maps to Dynamics CRM 2016 and the early version of Dynamics 365.
Version 9 maps to the later version of Dynamics 365.
Refer to the following table of supported authentication types and configurations for Dynamics versions and
products.

DYNAMICS VERSIONS | AUTHENTICATION TYPES | LINKED SERVICE SAMPLES
Dataverse, Dynamics 365 online, Dynamics CRM online | Azure Active Directory (Azure AD) service principal; Office 365 | Dynamics online and Azure AD service-principal or Office 365 authentication
Dynamics 365 on-premises with internet-facing deployment (IFD), Dynamics CRM 2016 on-premises with IFD, Dynamics CRM 2015 on-premises with IFD | IFD | Dynamics on-premises with IFD and IFD authentication

NOTE
With the deprecation of regional Discovery Service, Azure Data Factory has upgraded to leverage global Discovery Service
while using Office 365 Authentication.

IMPORTANT
If your tenant and user are configured in Azure Active Directory for conditional access and/or Multi-Factor Authentication is
required, you will not be able to use the Office 365 authentication type. For those situations, you must use Azure Active
Directory (Azure AD) service principal authentication.

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
This connector doesn't support other application types like Finance, Operations, and Talent.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

This Dynamics connector is built on top of Dynamics XRM tooling.

Prerequisites
To use this connector with Azure AD service-principal authentication, you must set up server-to-server (S2S)
authentication in Dataverse or Dynamics. First register the application user (Service Principal) in Azure Active
Directory. You can find out how to do this here. During application registration you will need to create that user
in Dataverse or Dynamics and grant permissions. Those permissions can either be granted directly or indirectly
by adding the application user to a team which has been granted permissions in Dataverse or Dynamics. You
can find more information on how to set up an application user to authenticate with Dataverse here.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM online
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to "Dynamics", "DynamicsCrm", or "CommonDataServiceForApps". | Yes
deploymentType | The deployment type of the Dynamics instance. The value must be "Online" for Dynamics online. | Yes
serviceUri | The service URL of your Dynamics instance, the same one you access from browser. An example is "https://<organization-name>.crm[x].dynamics.com". | Yes
authenticationType | The authentication type to connect to a Dynamics server. Valid values are "AADServicePrincipal" and "Office365". | Yes
servicePrincipalId | The client ID of the Azure AD application. | Yes when authentication is "AADServicePrincipal"
servicePrincipalCredentialType | The credential type to use for service-principal authentication. Valid values are "ServicePrincipalKey" and "ServicePrincipalCert". | Yes when authentication is "AADServicePrincipal"
servicePrincipalCredential | The service-principal credential. When you use "ServicePrincipalKey" as the credential type, servicePrincipalCredential can be a string that Azure Data Factory encrypts upon linked service deployment, or it can be a reference to a secret in Azure Key Vault. When you use "ServicePrincipalCert" as the credential, servicePrincipalCredential must be a reference to a certificate in Azure Key Vault. | Yes when authentication is "AADServicePrincipal"
username | The username to connect to Dynamics. | Yes when authentication is "Office365"
password | The password for the user account you specified as the username. Mark this field with "SecureString" to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes when authentication is "Office365"
connectVia | The integration runtime to be used to connect to the data store. If no value is specified, the property uses the default Azure integration runtime. | No

NOTE
The Dynamics connector formerly used the optional organizationName property to identify your Dynamics CRM or
Dynamics 365 online instance. While that property still works, we suggest you specify the new serviceUri property
instead to gain better performance for instance discovery.

Example: Dynamics online using Azure AD service-principal and key authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": "<service principal key>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Azure AD service-principal and certificate authentication


{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Office 365 authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "Office365",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dynamics 365 and Dynamics CRM on-premises with IFD


Compared to Dynamics online, the additional properties are hostName and port .

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to "Dynamics", "DynamicsCrm", or "CommonDataServiceForApps". | Yes.
deploymentType | The deployment type of the Dynamics instance. The value must be "OnPremisesWithIfd" for Dynamics on-premises with IFD. | Yes.
hostName | The host name of the on-premises Dynamics server. | Yes.
port | The port of the on-premises Dynamics server. | No. The default value is 443.
organizationName | The organization name of the Dynamics instance. | Yes.
authenticationType | The authentication type to connect to the Dynamics server. Specify "Ifd" for Dynamics on-premises with IFD. | Yes.
username | The username to connect to Dynamics. | Yes.
password | The password for the user account you specified for the username. You can mark this field with "SecureString" to store it securely in Data Factory. Or you can store a password in Key Vault and let the copy activity pull from there when it does data copy. Learn more from Store credentials in Key Vault. | Yes.
connectVia | The integration runtime to be used to connect to the data store. If no value is specified, the property uses the default Azure integration runtime. | No

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to "DynamicsEntity", "DynamicsCrmEntity", or "CommonDataServiceForAppsEntity". | Yes
entityName | The logical name of the entity to retrieve. | No for source if the activity source is specified as "query", and yes for sink

Example

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"schema": [],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, the copy activity source section supports the following properties:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to "DynamicsSource", "DynamicsCrmSource", or "CommonDataServiceForAppsSource". | Yes
query | FetchXML is a proprietary query language that is used in Dynamics online and on-premises. See the following example. To learn more, see Build queries with FetchXML. | No if entityName in the dataset is specified
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't
contain it.

IMPORTANT
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly
recommend the mapping to ensure a deterministic copy result.
When Data Factory imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows
from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top
rows are omitted. The same behavior applies to copy executions if there is no explicit mapping. You can review and add
more columns into the mapping, which are honored during copy runtime.

Example

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, the copy activity sink section supports the following properties:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to "DynamicsSink", "DynamicsCrmSink", or "CommonDataServiceForAppsSink". | Yes.
writeBehavior | The write behavior of the operation. The value must be "Upsert". | Yes
alternateKeyName | The alternate key name defined on your entity to do an upsert. | No.
writeBatchSize | The row count of data written to Dynamics in each batch. | No. The default value is 10.
ignoreNullValues | Whether to ignore null values from input data other than key fields during a write operation. Valid values are TRUE and FALSE. TRUE: Leave the data in the destination object unchanged when you do an upsert or update operation, and insert a defined default value when you do an insert operation. FALSE: Update the data in the destination object to a null value when you do an upsert or update operation, and insert a null value when you do an insert operation. | No. The default value is FALSE.
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No
NOTE
The default value for both the sink writeBatchSize and the copy activity parallelCopies for the Dynamics sink is 10.
Therefore, 100 records are concurrently submitted by default to Dynamics.

For Dynamics 365 online, there's a limit of two concurrent batch calls per organization. If that limit is exceeded, a
"Server Busy" exception is thrown before the first request is ever run. Keep writeBatchSize at 10 or less to
avoid such throttling of concurrent calls.
The optimal combination of writeBatchSize and parallelCopies depends on the schema of your entity.
Schema elements include the number of columns, row size, and number of plug-ins, workflows, or workflow
activities hooked up to those calls. The default setting of writeBatchSize (10) × parallelCopies (10) is the
recommendation according to the Dynamics service. This value works for most Dynamics entities, although it
might not give the best performance. You can tune the performance by adjusting the combination in your copy
activity settings.
Example

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
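
If you tune parallelCopies as described in the note above, it sits alongside the source and sink in the copy activity typeProperties . A minimal sketch of that fragment, using only the default values discussed above:

"typeProperties": {
    "parallelCopies": 10,
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10,
        "ignoreNullValues": true
    }
}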

Retrieving data from views


To retrieve data from Dynamics views, you need to get the saved query of the view, and use the query to get the
data.
There are two entities which store different types of view: "saved query" stores system view and "user query"
stores user view. To get the information of the views, refer to the following FetchXML query and replace the
"TARGETENTITY" with savedquery or userquery . Each entity type has more available attributes that you can add
to the query based on your need. Learn more about savedquery entity and userquery entity.
<fetch top="5000" >
<entity name="<TARGETENTITY>">
<attribute name="name" />
<attribute name="fetchxml" />
<attribute name="returnedtypecode" />
<attribute name="querytype" />
</entity>
</fetch>

You can also add filters to filter the views. For example, add the following filter to get a view named "My Active
Accounts" in account entity.

<filter type="and" >


<condition attribute="returnedtypecode" operator="eq" value="1" />
<condition attribute="name" operator="eq" value="My Active Accounts" />
</filter>

Data type mapping for Dynamics


When you copy data from Dynamics, the following table shows mappings from Dynamics data types to Data
Factory interim data types. To learn how a copy activity maps to a source schema and a data type maps to a sink,
see Schema and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure that is based on your source
Dynamics data type by using the following mapping table:

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK
AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | GUID | ✓ | ✓ (See guidance)
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | GUID | ✓ | ✓ (See guidance)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | GUID | ✓ | ✓ (See guidance)
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | GUID | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules , AttributeType.MultiSelectPicklist , and
AttributeType.PartyList aren't supported.

Writing data to a lookup field


To write data into a lookup field with multiple targets like Customer and Owner, follow this guidance and
example:
1. Make sure your source contains both the field value and the corresponding target entity name.
If all records map to the same target entity, ensure one of the following conditions:
Your source data has a column that stores the target entity name.
You've added an additional column in the copy activity source to define the target entity.
If different records map to different target entities, make sure your source data has a column that
stores the corresponding target entity name.
2. Map both the value and entity-reference columns from source to sink. The entity-reference column must
be mapped to a virtual column with the special naming pattern {lookup_field_name}@EntityReference . The
column doesn't actually exist in Dynamics. It's used to indicate this column is the metadata column of the
given multitarget lookup field.
For example, assume the source has these two columns:
CustomerField column of type GUID , which is the primary key value of the target entity in Dynamics.
Target column of type String , which is the logical name of the target entity.
Also assume you want to copy such data to the sink Dynamics entity field CustomerField of type Customer .
In copy-activity column mapping, map the two columns as follows:
CustomerField to CustomerField . This mapping is the normal field mapping.
Target to CustomerField@EntityReference . The sink column is a virtual column representing the entity
reference. Input such field names in a mapping, as they won't show up by importing schemas.
If all of your source records map to the same target entity and your source data doesn't contain the target entity
name, here is a shortcut: in the copy activity source, add an additional column. Name the new column by using
the pattern {lookup_field_name}@EntityReference , set the value to the target entity name, then proceed with
column mapping as usual. If your source and sink column names are identical, you can also skip explicit column
mapping because copy activity by default maps columns by name.
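
As a sketch of the mapping described above, the copy activity's column mapping (the translator section) for the CustomerField example could look like the following. The column names come from the example above; the surrounding pipeline is assumed to be the Dynamics sink copy activity shown earlier.

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "CustomerField" },
            "sink": { "name": "CustomerField" }
        },
        {
            "source": { "name": "Target" },
            "sink": { "name": "CustomerField@EntityReference" }
        }
    ]
}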

Lookup activity properties


To learn details about the properties, see Lookup activity.

Next steps
For a list of data stores the copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Delimited text format in Azure Data Factory
7/12/2021 • 9 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse delimited text files or write data into delimited text format.
Delimited text format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage,
Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP,
Google Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the delimited text dataset.

PROPERTY | DESCRIPTION | REQUIRED

type The type property of the dataset must Yes


be set to DelimitedText .

location Location settings of the file(s). Each Yes


file-based connector has its own
location type and supported
properties under location .

columnDelimiter The character(s) used to separate No


columns in a file.
The default value is comma , . When
the column delimiter is defined as
empty string, which means no
delimiter, the whole line is taken as a
single column.
Currently, column delimiter as empty
string or multi-char is only supported
for mapping data flow but not Copy
activity.

rowDelimiter The single character or "\r\n" used to No


separate rows in a file.
The default value is any of the
following values on read: ["\r\n",
"\r", "\n"] , and "\n" or "\r\n" on
write by mapping data flow and Copy
activity respectively.
When the row delimiter is set to no
delimiter (empty string), the column
delimiter must be set as no delimiter
(empty string) as well, which means to
treat the entire content as a single
value.
Currently, row delimiter as empty
string is only supported for mapping
data flow but not Copy activity.

quoteChar The single character to quote column No


values if it contains column delimiter.
The default value is double quotes
" .
When quoteChar is defined as empty
string, it means there is no quote char
and column value is not quoted, and
escapeChar is used to escape the
column delimiter and itself.

escapeChar The single character to escape quotes No


inside a quoted value.
The default value is backslash \ .
When escapeChar is defined as
empty string, the quoteChar must be
set as empty string as well, in which
case make sure all column values don't
contain delimiters.

firstRowAsHeader Specifies whether to treat/make the No


first row as a header line with names of
columns.
Allowed values are true and false
(default).
When first row as header is false, note
UI data preview and lookup activity
output auto generate column names
as Prop_{n} (starting from 0), copy
activity requires explicit mapping from
source to sink and locates columns by
ordinal (starting from 1), and mapping
data flow lists and locates columns
with name as Column_{n} (starting
from 1).

nullValue Specifies the string representation of No


null value.
The default value is empty string .

encodingName The encoding type used to read/write No


test files.
Allowed values are as follows: "UTF-8",
"UTF-16", "UTF-16BE", "UTF-32", "UTF-
32BE", "US-ASCII", "UTF-7", "BIG5",
"EUC-JP", "EUC-KR", "GB2312",
"GB18030", "JOHAB", "SHIFT-JIS",
"CP875", "CP866", "IBM00858",
"IBM037", "IBM273", "IBM437",
"IBM500", "IBM737", "IBM775",
"IBM850", "IBM852", "IBM855",
"IBM857", "IBM860", "IBM861",
"IBM863", "IBM864", "IBM865",
"IBM869", "IBM870", "IBM01140",
"IBM01141", "IBM01142",
"IBM01143", "IBM01144",
"IBM01145", "IBM01146",
"IBM01147", "IBM01148",
"IBM01149", "ISO-2022-JP", "ISO-
2022-KR", "ISO-8859-1", "ISO-8859-
2", "ISO-8859-3", "ISO-8859-4", "ISO-
8859-5", "ISO-8859-6", "ISO-8859-7",
"ISO-8859-8", "ISO-8859-9", "ISO-
8859-13", "ISO-8859-15",
"WINDOWS-874", "WINDOWS-1250",
"WINDOWS-1251", "WINDOWS-
1252", "WINDOWS-1253",
"WINDOWS-1254", "WINDOWS-
1255", "WINDOWS-1256",
"WINDOWS-1257", "WINDOWS-
1258".
Note mapping data flow doesn't
support UTF-7 encoding.

compressionCodec The compression codec used to No


read/write text files.
Allowed values are bzip2 , gzip ,
deflate , ZipDeflate , TarGzip , Tar ,
snappy , or lz4 . Default is not
compressed.
Note currently Copy activity doesn't
support "snappy" & "lz4", and
mapping data flow doesn't support
"ZipDeflate", "TarGzip" and "Tar".
Note when using copy activity to
decompress ZipDeflate /TarGzip /Tar
file(s) and write to file-based sink data
store, by default files are extracted to
the folder:
<path specified in
dataset>/<folder named as source
compressed file>/
, use preserveZipFileNameAsFolder /
preserveCompressionFileNameAsFolder
on copy activity source to control
whether to preserve the name of the
compressed file(s) as folder structure.

compressionLevel The compression ratio. No


Allowed values are Optimal or
Fastest .
- Fastest: The compression operation
should complete as quickly as possible,
even if the resulting file is not
optimally compressed.
- Optimal: The compression operation
should be optimally compressed, even
if the operation takes a longer time to
complete. For more information, see
Compression Level topic.

Below is an example of delimited text dataset on Azure Blob Storage:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"columnDelimiter": ",",
"quoteChar": "\"",
"escapeChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the delimited text source and sink.
Delimited text as source
The following properties are supported in the copy activity source section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to DelimitedTextSource . | Yes
formatSettings | A group of properties. Refer to the Delimited text read settings table below. | No
storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings . | No

Supported delimited text read settings under formatSettings :

PROPERTY | DESCRIPTION | REQUIRED
type | The type of formatSettings must be set to DelimitedTextReadSettings . | Yes
skipLineCount | Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. | No
compressionProperties | A group of properties on how to decompress data for a given compression codec. | No
preserveZipFileNameAsFolder (under compressionProperties -> type as ZipDeflateReadSettings ) | Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy. When set to true (default), Data Factory writes unzipped files to <path specified in dataset>/<folder named as source zip file>/ . When set to false, Data Factory writes unzipped files directly to <path specified in dataset> . Make sure you don't have duplicated file names in different source zip files to avoid racing or unexpected behavior. | No
preserveCompressionFileNameAsFolder (under compressionProperties -> type as TarGZipReadSettings or TarReadSettings ) | Applies when the input dataset is configured with TarGzip /Tar compression. Indicates whether to preserve the source compressed file name as folder structure during copy. When set to true (default), Data Factory writes decompressed files to <path specified in dataset>/<folder named as source compressed file>/ . When set to false, Data Factory writes decompressed files directly to <path specified in dataset> . Make sure you don't have duplicated file names in different source files to avoid racing or unexpected behavior. | No

"activities": [
{
"name": "CopyFromDelimitedText",
"type": "Copy",
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
},
"formatSettings": {
"type": "DelimitedTextReadSettings",
"skipLineCount": 3,
"compressionProperties": {
"type": "ZipDeflateReadSettings",
"preserveZipFileNameAsFolder": false
}
}
},
...
}
...
}
]

Delimited text as sink


The following properties are supported in the copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to DelimitedTextSink . | Yes
formatSettings | A group of properties. Refer to the Delimited text write settings table below. | No
storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings . | No

Supported delimited text write settings under formatSettings:

PROPERTY | DESCRIPTION | REQUIRED

type | The type of formatSettings must be set to DelimitedTextWriteSettings. | Yes

fileExtension | The file extension used to name the output files, for example, .csv or .txt. It must be specified when the fileName is not specified in the output DelimitedText dataset. When a file name is configured in the output dataset, it is used as the sink file name and the file extension setting is ignored. | Yes when file name is not specified in the output dataset

maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the maximum rows per file. | No

fileNamePrefix | Applicable when maxRowsPerFile is configured. Specify the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension>. If not specified, the file name prefix is auto-generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No
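For illustration, the following is a minimal sketch (not taken verbatim from the product samples) of a copy activity sink that uses these write settings; the store settings type assumes an Azure Blob Storage sink, and all values are placeholders.

"activities": [
    {
        "name": "CopyToDelimitedText",
        "type": "Copy",
        "typeProperties": {
            "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings"
                },
                "formatSettings": {
                    "type": "DelimitedTextWriteSettings",
                    "fileExtension": ".csv",
                    "maxRowsPerFile": 1000000,
                    "fileNamePrefix": "output"
                }
            },
            ...
        }
        ...
    }
]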

Mapping data flow properties


In mapping data flows, you can read and write to delimited text format in the following data stores: Azure Blob
Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Source properties
The below table lists the properties supported by a delimited text source. You can edit these properties in the
Source options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths

Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | no | String | partitionRootPath

List of files | Whether your source is pointing to a text file that lists files to process. | no | true or false | fileList

Multiline rows | Does the source file contain rows that span multiple lines. Multiline values must be in quotes. | no | true or false | multiLineRow

Column to store file name | Create a new column with the source file name and path. | no | String | rowUrlColumn

After completion | Delete or move the files after processing. The file path starts from the container root. | no | Delete: true or false; Move: ['<from>', '<to>'] | purgeFiles, moveFiles

Filter by last modified | Choose to filter files based upon when they were last altered. | no | Timestamp | modifiedAfter, modifiedBefore

Allow no files found | If true, an error is not thrown if no files are found. | no | true or false | ignoreNoFilesFound

NOTE
Data flow sources support for list of files is limited to 1024 entries in your file. To include more files, use wildcards in your
file list.

Source example
The below image is an example of a delimited text source configuration in mapping data flows.
The associated data flow script is:

source(
allowSchemaDrift: true,
validateSchema: false,
multiLineRow: true,
wildcardPaths:['*.csv']) ~> CSVSource

NOTE
Data flow sources support a limited set of Linux globbing that is supported by Hadoop file systems.

Sink properties
The below table lists the properties supported by a delimited text sink. You can edit these properties in the
Settings tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Clear the folder | If the destination folder is cleared prior to write. | no | true or false | truncate

File name option | The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid>. | no | Pattern: String; Per partition: String[]; Name file as column data: String; Output to single file: ['<fileName>']; Name folder as column data: String | filePattern; partitionFileNames; rowUrlColumn; partitionFileNames; rowFolderUrlColumn

Quote all | Enclose all values in quotes. | no | true or false | quoteAll

Header | Add custom headers to output files. | no | [<string array>] | header

Sink example
The below image is an example of a delimited text sink configuration in mapping data flows.
The associated data flow script is:

CSVSource sink(allowSchemaDrift: true,


validateSchema: false,
truncate: true,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> CSVSink
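As a rough sketch that is not taken from the product samples, a sink that writes a single named file and quotes all values could combine the properties above as follows; the file name output.csv is illustrative only, and single-file output also assumes the sink partitioning is set to a single partition.

CSVSource sink(allowSchemaDrift: true,
    validateSchema: false,
    truncate: true,
    quoteAll: true,
    partitionFileNames:['output.csv'],
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SingleFileCSVSink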

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Delta format in Azure Data Factory
4/22/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article highlights how to copy data to and from a delta lake stored in Azure Data Lake Store Gen2 or Azure
Blob Storage using the delta format. This connector is available as an inline dataset in mapping data flows as
both a source and a sink.

Mapping data flow properties


This connector is available as an inline dataset in mapping data flows as both a source and a sink.
Source properties
The below table lists the properties supported by a delta source. You can edit these properties in the Source
options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Format | Format must be delta. | yes | delta | format

File system | The container/file system of the delta lake. | yes | String | fileSystem

Folder path | The directory of the delta lake. | yes | String | folderPath

Compression type | The compression type of the delta table. | no | bzip2, gzip, deflate, ZipDeflate, snappy, lz4 | compressionType

Compression level | Choose whether the compression completes as quickly as possible or if the resulting file should be optimally compressed. | required if compressionType is specified | Optimal or Fastest | compressionLevel

Time travel | Choose whether to query an older snapshot of a delta table. | no | Query by timestamp: Timestamp; Query by version: Integer | timestampAsOf, versionAsOf

Allow no files found | If true, an error is not thrown if no files are found. | no | true or false | ignoreNoFilesFound

Import schema
Delta is only available as an inline dataset and, by default, doesn't have an associated schema. To get column
metadata, click the Import schema button in the Projection tab. This will allow you to reference the column
names and data types specified by the corpus. To import the schema, a data flow debug session must be active
and you must have an existing CDM entity definition file to point to.
Delta source script example

source(output(movieId as integer,
title as string,
releaseDate as date,
rated as boolean,
screenedOn as timestamp,
ticketPrice as decimal(10,2)
),
store: 'local',
format: 'delta',
versionAsOf: 0,
allowSchemaDrift: false,
folderPath: $tempPath + '/delta'
) ~> movies

Sink properties
The below table lists the properties supported by a delta sink. You can edit these properties in the Settings tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Format | Format must be delta. | yes | delta | format

File system | The container/file system of the delta lake. | yes | String | fileSystem

Folder path | The directory of the delta lake. | yes | String | folderPath

Compression type | The compression type of the delta table. | no | bzip2, gzip, deflate, ZipDeflate, snappy, lz4 | compressionType

Compression level | Choose whether the compression completes as quickly as possible or if the resulting file should be optimally compressed. | required if compressionType is specified | Optimal or Fastest | compressionLevel

Vacuum | Specify the retention threshold in hours for older versions of the table. A value of 0 or less defaults to 30 days. | yes | Integer | vacuum

Update method | Specify which update operations are allowed on the delta lake. For methods that aren't insert, a preceding alter row transformation is required to mark rows. | yes | true or false | deletable, insertable, updateable, merge

Optimized Write | Achieve higher throughput for the write operation via optimizing internal shuffle in Spark executors. As a result, you may notice fewer partitions and files that are of a larger size. | no | true or false | optimizedWrite: true

Auto Compact | After any write operation has completed, Spark will automatically execute the OPTIMIZE command to re-organize the data, resulting in more partitions if necessary, for better reading performance in the future. | no | true or false | autoCompact: true

Delta sink script example


The associated data flow script is:
moviesAltered sink(
input(movieId as integer,
title as string
),
mapColumn(
movieId,
title
),
insertable: true,
updateable: true,
deletable: true,
upsertable: false,
keys: ['movieId'],
store: 'local',
format: 'delta',
vacuum: 180,
folderPath: $tempPath + '/delta'
) ~> movieDB

Known limitations
When writing to a delta sink, there is a known limitation where the number of rows written won't be returned in
the monitoring output.

Next steps
Create a source transformation in mapping data flow.
Create a sink transformation in mapping data flow.
Create an alter row transformation to mark rows as insert, update, upsert, or delete.
Copy data from Drill using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Drill. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Drill connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Drill to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Drill connector.
Linked service properties
The following properties are supported for Drill linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Drill | Yes

connectionString | An ODBC connection string to connect to Drill. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes

connectVia | The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=
<user name>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=
<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Drill dataset.
To copy data from Drill, set the type property of the dataset to DrillTable . The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: DrillTable | Yes

schema | Name of the schema. | No (if "query" in activity source is specified)

table | Name of the table. | No (if "query" in activity source is specified)

tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. | No (if "query" in activity source is specified)

Example
{
"name": "DrillDataset",
"properties": {
"type": "DrillTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Drill linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Drill source.
DrillSource as source
To copy data from Drill, set the source type in the copy activity to DrillSource . The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: DrillSource | Yes

query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromDrill",
"type": "Copy",
"inputs": [
{
"referenceName": "<Drill input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DrillSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Lookup activity properties
To learn details about the properties, check Lookup activity.
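As a hedged sketch (the activity and dataset names are placeholders), a Lookup activity that runs a query through the Drill connector might look like the following:

{
    "name": "LookupFromDrill",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "DrillSource",
            "query": "SELECT * FROM MyTable LIMIT 1"
        },
        "dataset": {
            "referenceName": "<Drill input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}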

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Microsoft
Dataverse) or Dynamics CRM by using Azure Data
Factory
6/16/2021 • 14 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use a copy activity in Azure Data Factory to copy data from and to Microsoft
Dynamics 365 and Microsoft Dynamics CRM. It builds on the copy activity overview article that presents a
general overview of a copy activity.

Supported capabilities
This connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
You can copy data from Dynamics 365 (Microsoft Dataverse) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Microsoft Dataverse) or
Dynamics CRM. For a list of data stores that a copy activity supports as sources and sinks, see the Supported
data stores table.

NOTE
Effective November 2020, Common Data Service has been renamed to Microsoft Dataverse. This article is updated to
reflect the latest terminology.

This Dynamics connector supports Dynamics versions 7 through 9 for both online and on-premises. More
specifically:
Version 7 maps to Dynamics CRM 2015.
Version 8 maps to Dynamics CRM 2016 and the early version of Dynamics 365.
Version 9 maps to the later version of Dynamics 365.
Refer to the following table of supported authentication types and configurations for Dynamics versions and
products.

DYNAMICS VERSIONS | AUTHENTICATION TYPES | LINKED SERVICE SAMPLES

Dataverse, Dynamics 365 online, Dynamics CRM online | Azure Active Directory (Azure AD) service principal; Office 365 | Dynamics online and Azure AD service-principal or Office 365 authentication

Dynamics 365 on-premises with internet-facing deployment (IFD), Dynamics CRM 2016 on-premises with IFD, Dynamics CRM 2015 on-premises with IFD | IFD | Dynamics on-premises with IFD and IFD authentication

NOTE
With the deprecation of regional Discovery Service, Azure Data Factory has upgraded to leverage global Discovery Service
while using Office 365 Authentication.

IMPORTANT
If your tenant and user are configured in Azure Active Directory for conditional access and/or Multi-Factor Authentication is
required, you will not be able to use the Office 365 authentication type. In those situations, you must use Azure Active
Directory (Azure AD) service principal authentication.

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
This connector doesn't support other application types like Finance, Operations, and Talent.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

This Dynamics connector is built on top of Dynamics XRM tooling.

Prerequisites
To use this connector with Azure AD service-principal authentication, you must set up server-to-server (S2S)
authentication in Dataverse or Dynamics. First register the application user (Service Principal) in Azure Active
Directory. You can find out how to do this here. During application registration you will need to create that user
in Dataverse or Dynamics and grant permissions. Those permissions can either be granted directly or indirectly
by adding the application user to a team which has been granted permissions in Dataverse or Dynamics. You
can find more information on how to set up an application user to authenticate with Dataverse here.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM online
PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to "Dynamics", "DynamicsCrm", or "CommonDataServiceForApps". | Yes

deploymentType | The deployment type of the Dynamics instance. The value must be "Online" for Dynamics online. | Yes

serviceUri | The service URL of your Dynamics instance, the same one you access from browser. An example is "https://<organization-name>.crm[x].dynamics.com". | Yes

authenticationType | The authentication type to connect to a Dynamics server. Valid values are "AADServicePrincipal" and "Office365". | Yes

servicePrincipalId | The client ID of the Azure AD application. | Yes when authentication is "AADServicePrincipal"

servicePrincipalCredentialType | The credential type to use for service-principal authentication. Valid values are "ServicePrincipalKey" and "ServicePrincipalCert". | Yes when authentication is "AADServicePrincipal"

servicePrincipalCredential | The service-principal credential. When you use "ServicePrincipalKey" as the credential type, servicePrincipalCredential can be a string that Azure Data Factory encrypts upon linked service deployment, or it can be a reference to a secret in Azure Key Vault. When you use "ServicePrincipalCert" as the credential, servicePrincipalCredential must be a reference to a certificate in Azure Key Vault. | Yes when authentication is "AADServicePrincipal"

username | The username to connect to Dynamics. | Yes when authentication is "Office365"

password | The password for the user account you specified as the username. Mark this field with "SecureString" to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes when authentication is "Office365"

connectVia | The integration runtime to be used to connect to the data store. If no value is specified, the property uses the default Azure integration runtime. | No

NOTE
The Dynamics connector formerly used the optional organizationName property to identify your Dynamics CRM or
Dynamics 365 online instance. While that property still works, we suggest you specify the new serviceUri property
instead to gain better performance for instance discovery.

Example: Dynamics online using Azure AD service-principal and key authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": "<service principal key>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Azure AD service-principal and certificate authentication


{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Office 365 authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "Office365",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dynamics 365 and Dynamics CRM on-premises with IFD


Additional properties, compared to Dynamics online, are hostName and port.

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to "Dynamics", "DynamicsCrm", or "CommonDataServiceForApps". | Yes

deploymentType | The deployment type of the Dynamics instance. The value must be "OnPremisesWithIfd" for Dynamics on-premises with IFD. | Yes

hostName | The host name of the on-premises Dynamics server. | Yes

port | The port of the on-premises Dynamics server. | No. The default value is 443.

organizationName | The organization name of the Dynamics instance. | Yes

authenticationType | The authentication type to connect to the Dynamics server. Specify "Ifd" for Dynamics on-premises with IFD. | Yes

username | The username to connect to Dynamics. | Yes

password | The password for the user account you specified for the username. You can mark this field with "SecureString" to store it securely in Data Factory. Or you can store a password in Key Vault and let the copy activity pull from there when it does data copy. Learn more from Store credentials in Key Vault. | Yes

connectVia | The integration runtime to be used to connect to the data store. If no value is specified, the property uses the default Azure integration runtime. | No

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to "DynamicsEntity", "DynamicsCrmEntity", or "CommonDataServiceForAppsEntity". | Yes

entityName | The logical name of the entity to retrieve. | No for source if the activity source is specified as "query"; yes for sink

Example

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"schema": [],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, the copy activity source section supports the following properties:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to "DynamicsSource", "DynamicsCrmSource", or "CommonDataServiceForAppsSource". | Yes

query | FetchXML is a proprietary query language that is used in Dynamics online and on-premises. See the following example. To learn more, see Build queries with FetchXML. | No if entityName in the dataset is specified
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't
contain it.

IMPORTANT
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly
recommend the mapping to ensure a deterministic copy result.
When Data Factory imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows
from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top
rows are omitted. The same behavior applies to copy executions if there is no explicit mapping. You can review and add
more columns into the mapping, which are honored during copy runtime.

Example

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, the copy activity sink section supports the following properties:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to "DynamicsSink", "DynamicsCrmSink", or "CommonDataServiceForAppsSink". | Yes

writeBehavior | The write behavior of the operation. The value must be "Upsert". | Yes

alternateKeyName | The alternate key name defined on your entity to do an upsert. | No

writeBatchSize | The row count of data written to Dynamics in each batch. | No. The default value is 10.

ignoreNullValues | Whether to ignore null values from input data other than key fields during a write operation. Valid values are TRUE and FALSE. TRUE: Leave the data in the destination object unchanged when you do an upsert or update operation. Insert a defined default value when you do an insert operation. FALSE: Update the data in the destination object to a null value when you do an upsert or update operation. Insert a null value when you do an insert operation. | No. The default value is FALSE.

maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No
NOTE
The default value for both the sink writeBatchSize and the copy activity parallelCopies for the Dynamics sink is 10.
Therefore, 100 records are concurrently submitted by default to Dynamics.

For Dynamics 365 online, there's a limit of two concurrent batch calls per organization. If that limit is exceeded, a
"Server Busy" exception is thrown before the first request is ever run. Keep writeBatchSize at 10 or less to
avoid such throttling of concurrent calls.
The optimal combination of writeBatchSize and parallelCopies depends on the schema of your entity.
Schema elements include the number of columns, row size, and number of plug-ins, workflows, or workflow
activities hooked up to those calls. The default setting of writeBatchSize (10) × parallelCopies (10) is the
recommendation according to the Dynamics service. This value works for most Dynamics entities, although it
might not give the best performance. You can tune the performance by adjusting the combination in your copy
activity settings.
Example

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
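If you tune these settings, the following minimal sketch (placeholder values, not recommendations) shows where each knob lives in the copy activity:

"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10
    },
    "parallelCopies": 10
}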

Retrieving data from views


To retrieve data from Dynamics views, you need to get the saved query of the view, and use the query to get the
data.
There are two entities which store different types of view: "saved query" stores system view and "user query"
stores user view. To get the information of the views, refer to the following FetchXML query and replace the
"TARGETENTITY" with savedquery or userquery . Each entity type has more available attributes that you can add
to the query based on your need. Learn more about savedquery entity and userquery entity.
<fetch top="5000" >
<entity name="<TARGETENTITY>">
<attribute name="name" />
<attribute name="fetchxml" />
<attribute name="returnedtypecode" />
<attribute name="querytype" />
</entity>
</fetch>

You can also add filters to filter the views. For example, add the following filter to get a view named "My Active
Accounts" in account entity.

<filter type="and" >


<condition attribute="returnedtypecode" operator="eq" value="1" />
<condition attribute="name" operator="eq" value="My Active Accounts" />
</filter>
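Putting the two snippets together, a query that returns only that system view might look like the following (the view name is just the example above):

<fetch top="5000" >
  <entity name="savedquery">
    <attribute name="name" />
    <attribute name="fetchxml" />
    <attribute name="returnedtypecode" />
    <attribute name="querytype" />
    <filter type="and" >
      <condition attribute="returnedtypecode" operator="eq" value="1" />
      <condition attribute="name" operator="eq" value="My Active Accounts" />
    </filter>
  </entity>
</fetch>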

Data type mapping for Dynamics


When you copy data from Dynamics, the following table shows mappings from Dynamics data types to Data
Factory interim data types. To learn how a copy activity maps to a source schema and a data type maps to a sink,
see Schema and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure that is based on your source
Dynamics data type by using the following mapping table:

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK

AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | GUID | ✓ | ✓ (See guidance)
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | GUID | ✓ | ✓ (See guidance)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | GUID | ✓ | ✓ (See guidance)
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | GUID | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules, AttributeType.MultiSelectPicklist, and
AttributeType.PartyList aren't supported.

Writing data to a lookup field


To write data into a lookup field with multiple targets like Customer and Owner, follow this guidance and
example:
1. Make sure your source contains both the field value and the corresponding target entity name.
If all records map to the same target entity, ensure one of the following conditions:
Your source data has a column that stores the target entity name.
You've added an additional column in the copy activity source to define the target entity.
If different records map to different target entities, make sure your source data has a column that
stores the corresponding target entity name.
2. Map both the value and entity-reference columns from source to sink. The entity-reference column must
be mapped to a virtual column with the special naming pattern {lookup_field_name}@EntityReference . The
column doesn't actually exist in Dynamics. It's used to indicate this column is the metadata column of the
given multitarget lookup field.
For example, assume the source has these two columns:
CustomerField column of type GUID , which is the primary key value of the target entity in Dynamics.
Target column of type String , which is the logical name of the target entity.
Also assume you want to copy such data to the sink Dynamics entity field CustomerField of type Customer .
In copy-activity column mapping, map the two columns as follows:
CustomerField to CustomerField . This mapping is the normal field mapping.
Target to CustomerField@EntityReference . The sink column is a virtual column representing the entity
reference. Input such field names in a mapping, as they won't show up by importing schemas.
If all of your source records map to the same target entity and your source data doesn't contain the target entity
name, here is a shortcut: in the copy activity source, add an additional column. Name the new column by using
the pattern {lookup_field_name}@EntityReference , set the value to the target entity name, then proceed with
column mapping as usual. If your source and sink column names are identical, you can also skip explicit column
mapping because copy activity by default maps columns by name.
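To make the shortcut concrete, here is a rough sketch (the CustomerField column and the account entity name are illustrative only) of a copy activity source that adds the entity-reference column and a translator that maps both columns to the sink:

"typeProperties": {
    "source": {
        "type": "<source type>",
        "additionalColumns": [
            {
                "name": "CustomerField@EntityReference",
                "value": "account"
            }
        ]
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert"
    },
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            { "source": { "name": "CustomerField" }, "sink": { "name": "CustomerField" } },
            { "source": { "name": "CustomerField@EntityReference" }, "sink": { "name": "CustomerField@EntityReference" } }
        ]
    }
}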

Lookup activity properties


To learn details about the properties, see Lookup activity.

Next steps
For a list of data stores the copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Copy data from Dynamics AX by using Azure Data
Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from Dynamics AX source. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
This Dynamics AX connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Dynamics AX to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this Dynamics AX connector supports copying data from Dynamics AX using OData protocol with
Service Principal authentication.

TIP
You can also use this connector to copy data from Dynamics 365 Finance and Operations . Refer to Dynamics 365's
OData support and Authentication method.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Dynamics AX connector.

Prerequisites
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Go to Dynamics AX, and grant this service principal proper permission to access your Dynamics AX.

Linked service properties


The following properties are supported for Dynamics AX linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to DynamicsAX. | Yes

url | The Dynamics AX (or Dynamics 365 Finance and Operations) instance OData endpoint. | Yes

servicePrincipalId | Specify the application's client ID. | Yes

servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes

tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. | Yes

aadResourceId | Specify the AAD resource you are requesting for authorization. For example, if your Dynamics URL is https://sampledynamics.sandbox.operations.dynamics.com/data/, the corresponding AAD resource is usually https://sampledynamics.sandbox.operations.dynamics.com. | Yes

connectVia | The Integration Runtime to use to connect to the data store. You can choose Azure Integration Runtime or a self-hosted Integration Runtime (if your data store is located in a private network). If not specified, the default Azure Integration Runtime is used. | No

Example
{
"name": "DynamicsAXLinkedService",
"properties": {
"type": "DynamicsAX",
"typeProperties": {
"url": "<Dynamics AX instance OData endpoint>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource, e.g. https://sampledynamics.sandbox.operations.dynamics.com>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Dataset properties
This section provides a list of properties that the Dynamics AX dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from Dynamics AX, set the type property of the dataset to DynamicsAXResource . The following
properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to DynamicsAXResource. | Yes

path | The path to the Dynamics AX OData entity. | Yes

Example

{
"name": "DynamicsAXResourceDataset",
"properties": {
"type": "DynamicsAXResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Dynamics AX linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy Activity properties


This section provides a list of properties that the Dynamics AX source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Dynamics AX as source
To copy data from Dynamics AX, set the source type in Copy Activity to DynamicsAXSource . The following
properties are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the Copy Activity source must be set to DynamicsAXSource. | Yes

query | OData query options for filtering data. Example: "?$select=Name,Description&$top=5". Note: The connector copies data from the combined URL: [URL specified in linked service]/[path specified in dataset][query specified in copy activity source]. For more information, see OData URL components. | No

httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. If not specified, the default value is 00:30:00 (30 minutes). | No

Example
"activities":[
{
"name": "CopyFromDynamicsAX",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics AX input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsAXSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
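For example, to give a slow OData endpoint more time to respond, the source could also set httpRequestTimeout; this is a sketch with an arbitrary value:

"source": {
    "type": "DynamicsAXSource",
    "query": "$top=10",
    "httpRequestTimeout": "02:00:00"
}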

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Dynamics 365 (Microsoft
Dataverse) or Dynamics CRM by using Azure Data
Factory
6/16/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use a copy activity in Azure Data Factory to copy data from and to Microsoft
Dynamics 365 and Microsoft Dynamics CRM. It builds on the copy activity overview article that presents a
general overview of a copy activity.

Supported capabilities
This connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
You can copy data from Dynamics 365 (Microsoft Dataverse) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Microsoft Dataverse) or
Dynamics CRM. For a list of data stores that a copy activity supports as sources and sinks, see the Supported
data stores table.

NOTE
Effective November 2020, Common Data Service has been renamed to Microsoft Dataverse. This article is updated to
reflect the latest terminology.

This Dynamics connector supports Dynamics versions 7 through 9 for both online and on-premises. More
specifically:
Version 7 maps to Dynamics CRM 2015.
Version 8 maps to Dynamics CRM 2016 and the early version of Dynamics 365.
Version 9 maps to the later version of Dynamics 365.
Refer to the following table of supported authentication types and configurations for Dynamics versions and
products.

DY N A M IC S VERSIO N S A UT H EN T IC AT IO N T Y P ES L IN K ED SERVIC E SA M P L ES

Dataverse Azure Active Directory (Azure AD) Dynamics online and Azure AD
service principal service-principal or Office 365
Dynamics 365 online authentication
Office 365
Dynamics CRM online
DY N A M IC S VERSIO N S A UT H EN T IC AT IO N T Y P ES L IN K ED SERVIC E SA M P L ES

Dynamics 365 on-premises with IFD Dynamics on-premises with IFD and
internet-facing deployment (IFD) IFD authentication

Dynamics CRM 2016 on-premises


with IFD

Dynamics CRM 2015 on-premises


with IFD

NOTE
With the deprecation of regional Discovery Service, Azure Data Factory has upgraded to leverage global Discovery Service
while using Office 365 Authentication.

IMPORTANT
If your tenant and user is configured in Azure Active Directory for conditional access and/or Multi-Factor Authentication is
required, you will not be able to use Office 365 Authentication type. For those situations, you must use a Azure Active
Directory (Azure AD) service principal authentication.

For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing This connector doesn't support other application types like Finance, Operations,
and Talent.

TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.

This Dynamics connector is built on top of Dynamics XRM tooling.

Prerequisites
To use this connector with Azure AD service-principal authentication, you must set up server-to-server (S2S)
authentication in Dataverse or Dynamics. First register the application user (Service Principal) in Azure Active
Directory. You can find out how to do this here. During application registration you will need to create that user
in Dataverse or Dynamics and grant permissions. Those permissions can either be granted directly or indirectly
by adding the application user to a team which has been granted permissions in Dataverse or Dynamics. You
can find more information on how to set up an application user to authenticate with Dataverse here.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.

Linked service properties


The following properties are supported for the Dynamics linked service.
Dynamics 365 and Dynamics CRM online
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to Yes


"Dynamics", "DynamicsCrm", or
"CommonDataServiceForApps".

deploymentType The deployment type of the Dynamics Yes


instance. The value must be "Online"
for Dynamics online.

serviceUri The service URL of your Dynamics Yes


instance, the same one you access
from browser. An example is
"https://<organization-
name>.crm[x].dynamics.com".

authenticationType The authentication type to connect to Yes


a Dynamics server. Valid values are
"AADServicePrincipal" and "Office365".

servicePrincipalId The client ID of the Azure AD Yes when authentication is


application. "AADServicePrincipal"

servicePrincipalCredentialType The credential type to use for service- Yes when authentication is
principal authentication. Valid values "AADServicePrincipal"
are "ServicePrincipalKey" and
"ServicePrincipalCert".

servicePrincipalCredential The service-principal credential. Yes when authentication is


"AADServicePrincipal"
When you use "ServicePrincipalKey" as
the credential type,
servicePrincipalCredential can be
a string that Azure Data Factory
encrypts upon linked service
deployment. Or it can be a reference
to a secret in Azure Key Vault.

When you use "ServicePrincipalCert"


as the credential,
servicePrincipalCredential must
be a reference to a certificate in Azure
Key Vault.
P RO P ERT Y DESC RIP T IO N REQ UIRED

username The username to connect to Dynamics. Yes when authentication is "Office365"

password The password for the user account you Yes when authentication is "Office365"
specified as the username. Mark this
field with "SecureString" to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

connectVia The integration runtime to be used to No


connect to the data store. If no value is
specified, the property uses the default
Azure integration runtime.

NOTE
The Dynamics connector formerly used the optional organizationName property to identify your Dynamics CRM or
Dynamics 365 online instance. While that property still works, we suggest you specify the new ser viceUri property
instead to gain better performance for instance discovery.

Example: Dynamics online using Azure AD service-principal and key authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": "<service principal key>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Azure AD service-principal and certificate authentication


{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Dynamics online using Office 365 authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "Office365",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dynamics 365 and Dynamics CRM on-premises with IFD


Additional properties that compare to Dynamics online are hostName and por t .

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to Yes.


"Dynamics", "DynamicsCrm", or
"CommonDataServiceForApps".

deploymentType The deployment type of the Dynamics Yes.


instance. The value must be
"OnPremisesWithIfd" for Dynamics on-
premises with IFD.
P RO P ERT Y DESC RIP T IO N REQ UIRED

hostName The host name of the on-premises Yes.


Dynamics server.

port The port of the on-premises Dynamics No. The default value is 443.
server.

organizationName The organization name of the Yes.


Dynamics instance.

authenticationType The authentication type to connect to Yes.


the Dynamics server. Specify "Ifd" for
Dynamics on-premises with IFD.

username The username to connect to Dynamics. Yes.

password The password for the user account you Yes.


specified for the username. You can
mark this field with "SecureString" to
store it securely in Data Factory. Or
you can store a password in Key Vault
and let the copy activity pull from
there when it does data copy. Learn
more from Store credentials in Key
Vault.

connectVia The integration runtime to be used to No


connect to the data store. If no value is
specified, the property uses the default
Azure integration runtime.

Example: Dynamics on-premises with IFD using IFD authentication

{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, the following properties are supported:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to "DynamicsEntity",
"DynamicsCrmEntity", or
"CommonDataServiceForAppsEntity".

entityName The logical name of the entity to No for source if the activity source is
retrieve. specified as "query" and yes for sink

Example

{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"schema": [],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Dynamics source and sink types.
Dynamics as a source type
To copy data from Dynamics, the copy activity source section supports the following properties:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to
"DynamicsSource",
"DynamicsCrmSource", or
"CommonDataServiceForAppsSource".

query FetchXML is a proprietary query No if entityName in the dataset is


language that is used in Dynamics specified
online and on-premises. See the
following example. To learn more, see
Build queries with FetchXML.
NOTE
The PK column will always be copied out even if the column projection you configure in the FetchXML query doesn't
contain it.

IMPORTANT
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly
recommend the mapping to ensure a deterministic copy result.
When Data Factory imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows
from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top
rows are omitted. The same behavior applies to copy executions if there is no explicit mapping. You can review and add
more columns into the mapping, which are honored during copy runtime.

Example

"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Sample FetchXML query


<fetch>
<entity name="account">
<attribute name="accountid" />
<attribute name="name" />
<attribute name="marketingonly" />
<attribute name="modifiedon" />
<order attribute="modifiedon" descending="false" />
<filter type="and">
<condition attribute ="modifiedon" operator="between">
<value>2017-03-10 18:40:00z</value>
<value>2017-03-12 20:40:00z</value>
</condition>
</filter>
</entity>
</fetch>

Dynamics as a sink type


To copy data to Dynamics, the copy activity sink section supports the following properties:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes.


sink must be set to "DynamicsSink",
"DynamicsCrmSink", or
"CommonDataServiceForAppsSink".

writeBehavior The write behavior of the operation. Yes


The value must be "Upsert".

alternateKeyName The alternate key name defined on No.


your entity to do an upsert.

writeBatchSize The row count of data written to No. The default value is 10.
Dynamics in each batch.

ignoreNullValues Whether to ignore null values from No. The default value is FALSE .
input data other than key fields during
a write operation.

Valid values are TRUE and FALSE :


TRUE : Leave the data in the
destination object unchanged
when you do an upsert or
update operation. Insert a
defined default value when you
do an insert operation.
FALSE : Update the data in the
destination object to a null
value when you do an upsert
or update operation. Insert a
null value when you do an
insert operation.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.
NOTE
The default value for both the sink writeBatchSize and the copy activity parallelCopies for the Dynamics sink is 10.
Therefore, 100 records are concurrently submitted by default to Dynamics.

For Dynamics 365 online, there's a limit of two concurrent batch calls per organization. If that limit is exceeded, a
"Server Busy" exception is thrown before the first request is ever run. Keep writeBatchSize at 10 or less to
avoid such throttling of concurrent calls.
The optimal combination of writeBatchSize and parallelCopies depends on the schema of your entity.
Schema elements include the number of columns, row size, and number of plug-ins, workflows, or workflow
activities hooked up to those calls. The default setting of writeBatchSize (10) × parallelCopies (10) is the
recommendation according to the Dynamics service. This value works for most Dynamics entities, although it
might not give the best performance. You can tune the performance by adjusting the combination in your copy
activity settings.
Example

"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
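
The example above sets writeBatchSize on the sink. The degree of parallelism discussed in the note is configured separately through the copy activity's parallelCopies setting, which sits alongside source and sink under typeProperties. A minimal sketch of combining the two, with placeholder values you would tune for your entity:

"typeProperties": {
    "parallelCopies": 10,
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10
    }
}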

Retrieving data from views


To retrieve data from Dynamics views, you need to get the saved query of the view, and use the query to get the
data.
There are two entities that store different types of views: "savedquery" stores system views and "userquery"
stores user views. To get information about the views, refer to the following FetchXML query and replace the
"TARGETENTITY" with savedquery or userquery . Each entity type has more available attributes that you can add
to the query based on your need. Learn more about savedquery entity and userquery entity.
<fetch top="5000" >
<entity name="<TARGETENTITY>">
<attribute name="name" />
<attribute name="fetchxml" />
<attribute name="returnedtypecode" />
<attribute name="querytype" />
</entity>
</fetch>

You can also add filters to filter the views. For example, add the following filter to get a view named "My Active
Accounts" in account entity.

<filter type="and" >
<condition attribute="returnedtypecode" operator="eq" value="1" />
<condition attribute="name" operator="eq" value="My Active Accounts" />
</filter>
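
After you retrieve a view's FetchXML (the fetchxml attribute of the savedquery or userquery record), you can paste it into the copy activity source as the query. A minimal sketch, where the simplified FetchXML string is only a placeholder for the actual saved query of your view:

"source": {
    "type": "DynamicsSource",
    "query": "<fetch><entity name='account'><attribute name='accountid' /><attribute name='name' /></entity></fetch>"
}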

Data type mapping for Dynamics


When you copy data from Dynamics, the following table shows mappings from Dynamics data types to Data
Factory interim data types. To learn how a copy activity maps to a source schema and a data type maps to a sink,
see Schema and data type mappings.
Configure the corresponding Data Factory data type in a dataset structure that is based on your source
Dynamics data type by using the following mapping table:

DYNAMICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE | SUPPORTED AS SOURCE | SUPPORTED AS SINK
AttributeTypeCode.BigInt | Long | ✓ | ✓
AttributeTypeCode.Boolean | Boolean | ✓ | ✓
AttributeType.Customer | GUID | ✓ | ✓ (See guidance)
AttributeType.DateTime | Datetime | ✓ | ✓
AttributeType.Decimal | Decimal | ✓ | ✓
AttributeType.Double | Double | ✓ | ✓
AttributeType.EntityName | String | ✓ | ✓
AttributeType.Integer | Int32 | ✓ | ✓
AttributeType.Lookup | GUID | ✓ | ✓ (See guidance)
AttributeType.ManagedProperty | Boolean | ✓ |
AttributeType.Memo | String | ✓ | ✓
AttributeType.Money | Decimal | ✓ | ✓
AttributeType.Owner | GUID | ✓ | ✓ (See guidance)
AttributeType.Picklist | Int32 | ✓ | ✓
AttributeType.Uniqueidentifier | GUID | ✓ | ✓
AttributeType.String | String | ✓ | ✓
AttributeType.State | Int32 | ✓ | ✓
AttributeType.Status | Int32 | ✓ | ✓

NOTE
The Dynamics data types AttributeType.CalendarRules, AttributeType.MultiSelectPicklist, and
AttributeType.PartyList aren't supported.

Writing data to a lookup field


To write data into a lookup field with multiple targets like Customer and Owner, follow this guidance and
example:
1. Make sure your source contains both the field value and the corresponding target entity name.
If all records map to the same target entity, ensure one of the following conditions:
Your source data has a column that stores the target entity name.
You've added an additional column in the copy activity source to define the target entity.
If different records map to different target entities, make sure your source data has a column that
stores the corresponding target entity name.
2. Map both the value and entity-reference columns from source to sink. The entity-reference column must
be mapped to a virtual column with the special naming pattern {lookup_field_name}@EntityReference . The
column doesn't actually exist in Dynamics. It's used to indicate this column is the metadata column of the
given multitarget lookup field.
For example, assume the source has these two columns:
CustomerField column of type GUID , which is the primary key value of the target entity in Dynamics.
Target column of type String , which is the logical name of the target entity.
Also assume you want to copy such data to the sink Dynamics entity field CustomerField of type Customer .
In copy-activity column mapping, map the two columns as follows:
CustomerField to CustomerField . This mapping is the normal field mapping.
Target to CustomerField@EntityReference . The sink column is a virtual column representing the entity
reference. Enter such field names manually in the mapping; they won't show up when you import schemas.
If all of your source records map to the same target entity and your source data doesn't contain the target entity
name, here is a shortcut: in the copy activity source, add an additional column. Name the new column by using
the pattern {lookup_field_name}@EntityReference , set the value to the target entity name, then proceed with
column mapping as usual. If your source and sink column names are identical, you can also skip explicit column
mapping because copy activity by default maps columns by name.
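
As a hedged sketch, the column mapping for this example could be expressed in the copy activity's translator section as follows; CustomerField and Target are the hypothetical column names used above:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "CustomerField" },
            "sink": { "name": "CustomerField" }
        },
        {
            "source": { "name": "Target" },
            "sink": { "name": "CustomerField@EntityReference" }
        }
    ]
}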

Lookup activity properties


To learn details about the properties, see Lookup activity.

Next steps
For a list of data stores the copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Excel format in Azure Data Factory
5/14/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse Excel files. Azure Data Factory supports both ".xls" and ".xlsx".
Excel format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob,
Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage, and SFTP. It is supported as a source but not as a sink.

NOTE
".xls" format is not supported while using HTTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Excel dataset.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to Excel. | Yes
location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. | Yes
sheetName | The Excel worksheet name to read data from. | Specify sheetName or sheetIndex
sheetIndex | The Excel worksheet index to read data from, starting from 0. | Specify sheetName or sheetIndex
range | The cell range in the given worksheet to locate the selective data. Not specified: reads the whole worksheet as a table from the first non-empty row and column. "A3": reads a table starting from the given cell, dynamically detecting all the rows below and all the columns to the right. "A3:H5": reads this fixed range as a table. "A3:A3": reads this single cell. | No
firstRowAsHeader | Specifies whether to treat the first row in the given worksheet/range as a header line with names of columns. Allowed values are true and false (default). | No
nullValue | Specifies the string representation of null value. The default value is empty string. | No
compression | Group of properties to configure file compression. Configure this section when you want to do compression/decompression during activity execution. | No
type (under compression) | The compression codec used to read/write the files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. Default is not compressed. Note that currently Copy activity doesn't support "snappy" and "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip" and "Tar". Note that when using copy activity to decompress ZipDeflate file(s) and write to a file-based sink data store, files are extracted to the folder: <path specified in dataset>/<folder named as source zip file>/. | No
level (under compression) | The compression ratio. Allowed values are Optimal or Fastest. Fastest: the compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. Optimal: the compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see the Compression Level topic. | No

Below is an example of an Excel dataset on Azure Blob Storage:


{
"name": "ExcelDataset",
"properties": {
"type": "Excel",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder"
},
"sheetName": "MyWorksheet",
"range": "A3:H5",
"firstRowAsHeader": true
}
}
}
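
If you prefer to address the worksheet by position rather than by name, a dataset sketch using sheetIndex instead of sheetName might look like the following (the location values are placeholders):

{
    "name": "ExcelDatasetByIndex",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "sheetIndex": 0,
            "firstRowAsHeader": true
        }
    }
}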

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Excel source.
Excel as source
The following properties are supported in the copy activity source section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to ExcelSource. | Yes
storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. | No

"activities": [
{
"name": "CopyFromExcel",
"type": "Copy",
"typeProperties": {
"source": {
"type": "ExcelSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
...
}
...
}
]

Mapping data flow properties


In mapping data flows, you can read Excel format in the following data stores: Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2. You can point to Excel files either using Excel dataset or
using an inline dataset.
Source properties
The table below lists the properties supported by an Excel source. You can edit these properties in the Source
options tab. When using an inline dataset, you will see additional file settings, which are the same as the
properties described in the dataset properties section.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths
Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | no | String | partitionRootPath
List of files | Whether your source is pointing to a text file that lists files to process. | no | true or false | fileList
Column to store file name | Create a new column with the source file name and path. | no | String | rowUrlColumn
After completion | Delete or move the files after processing. File path starts from the container root. | no | Delete: true or false; Move: ['<from>', '<to>'] | purgeFiles, moveFiles
Filter by last modified | Choose to filter files based upon when they were last altered. | no | Timestamp | modifiedAfter, modifiedBefore
Allow no files found | If true, an error is not thrown if no files are found. | no | true or false | ignoreNoFilesFound

Source example
Below is an example of an Excel source configuration in mapping data flows using dataset mode. The
associated data flow script is:

source(allowSchemaDrift: true,
validateSchema: false,
wildcardPaths:['*.xls']) ~> ExcelSource

If you use an inline dataset, you set the same source options within the mapping data flow source. The
associated data flow script is:

source(allowSchemaDrift: true,
validateSchema: false,
format: 'excel',
fileSystem: 'container',
folderPath: 'path',
fileName: 'sample.xls',
sheetName: 'worksheet',
firstRowAsHeader: true) ~> ExcelSourceInlineDataset
Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Copy data to or from a file system by using Azure
Data Factory
5/6/2021 • 17 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data to and from file system. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This file system connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this file system connector supports:
Copying files from/to local machine or network file share. To use a Linux file share, install Samba on your
Linux server.
Copying files using Windows authentication.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.

NOTE
Mapped network drives are not supported when loading data from a network file share. Use the actual path instead, for
example \\server\share .

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
file system.

Linked service properties


The following properties are supported for file system linked service:

PROPERTY   DESCRIPTION   REQUIRED

type The type property must be set to: Yes


FileSer ver .

host Specifies the root path of the folder Yes


that you want to copy. Use the escape
character "" for special characters in
the string. See Sample linked service
and dataset definitions for examples.

userId Specify the ID of the user who has Yes


access to the server.

password Specify the password for the user Yes


(userId). Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Sample linked service and dataset definitions


SCENARIO   "HOST" IN LINKED SERVICE DEFINITION   "FOLDERPATH" IN DATASET DEFINITION

Local folder on Integration Runtime In JSON: D:\\ In JSON: .\\ or


machine: On UI: D:\ folder\\subfolder
On UI: .\ or folder\subfolder
Examples: D:\* or D:\folder\subfolder\*

Remote shared folder: In JSON: \\\\myserver\\share In JSON: .\\ or


On UI: \\myserver\share folder\\subfolder
Examples: \\myserver\share\* or On UI: .\ or folder\subfolder
\\myserver\share\folder\subfolder\*
NOTE
When authoring via UI, you don't need to input a double backslash ( \\ ) to escape as you do via JSON; specify a single
backslash.

Example:

{
"name": "FileLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "<host>",
"userId": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
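
If you'd rather not embed the password in the linked service definition, it can reference a secret stored in Azure Key Vault instead, as noted in the table above. A minimal sketch, assuming an Azure Key Vault linked service already exists (names are placeholders):

{
    "name": "FileLinkedService",
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "<host>",
            "userId": "<domain>\\<user>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}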

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for file system under location settings in format-based dataset:

PROPERTY   DESCRIPTION   REQUIRED

type The type property under location in Yes


dataset must be set to
FileSer verLocation .

folderPath The path to folder. If you want to use No


wildcard to filter folder, skip this setting
and specify in activity source settings.

fileName The file name under the given No


folderPath. If you want to use wildcard
to filter files, skip this setting and
specify in activity source settings.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<File system linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by file system source and sink.
File system as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for file system under storeSettings settings in format-based copy
source:

PROPERTY   DESCRIPTION   REQUIRED

type The type property under Yes


storeSettings must be set to
FileSer verReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given folder/file path


specified in the dataset. If you want to
copy all files from a folder, additionally
specify wildcardFileName as * .

OPTION 2: server side filter File server side native filter, which No
- fileFilter provides better performance than
OPTION 3 wildcard filter. Use * to
match zero or more characters and ?
to match zero or single character.
Learn more about the syntax and
notes from the Remarks under this
section.

OPTION 3: client side filter The folder path with wildcard No


- wildcardFolderPath characters to filter source folders. Such
filter happens on ADF side, ADF
enumerate the folders/files under the
given path then apply the wildcard
filter.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder
name has wildcard or this escape char
inside.
See more examples in Folder and file
filter examples.

OPTION 3: client side filter The file name with wildcard characters Yes
- wildcardFileName under the given
folderPath/wildcardFolderPath to filter
source files. Such filter happens on
ADF side, ADF enumerate the files
under the given path then apply the
wildcard filter.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual file name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

OPTION 3: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When using this option, do not specify
file name in dataset. See more
examples in File list examples.

Additional settings:
PROPERTY   DESCRIPTION   REQUIRED

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified.
The files will be selected if their last
modified time is within the time range
between modifiedDatetimeStart
and modifiedDatetimeEnd . The time
is applied to UTC time zone in the
format of "2018-12-01T05:00:00Z".
The properties can be NULL, which
means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
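
As another hedged sketch, a source reading time-partitioned folders could combine enablePartitionDiscovery and partitionRootPath with the last-modified filter described in the table above (paths and timestamps are placeholders):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "FileServerReadSettings",
        "recursive": true,
        "enablePartitionDiscovery": true,
        "partitionRootPath": "root/folder/year=2020",
        "modifiedDatetimeStart": "2020-08-27T00:00:00Z",
        "modifiedDatetimeEnd": "2020-08-28T00:00:00Z"
    }
}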

File system as sink


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for file system under storeSettings settings in format-based copy sink:

PROPERTY   DESCRIPTION   REQUIRED

type The type property under Yes


storeSettings must be set to
FileSer verWriteSettings .

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- Preser veHierarchy (default) :
Preserves the file hierarchy in the
target folder. The relative path of
source file to source folder is identical
to the relative path of target file to
target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
it's an autogenerated file name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH   FILENAME   RECURSIVE   SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

Folder* (empty, use default) false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* (empty, use default) true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

File list examples


This section describes the resulting behavior of using file list path in copy activity source.
Assuming you have the following source folder structure and want to copy the files in bold:
SAMPLE SOURCE STRUCTURE   CONTENT IN FILELISTTOCOPY.TXT   ADF CONFIGURATION

root File1.csv In dataset:


FolderA Subfolder1/File3.csv - Folder path: root/FolderA
File1.csv Subfolder1/File5.csv
File2.json In copy activity source:
Subfolder1 - File list path:
File3.csv root/Metadata/FileListToCopy.txt
File4.json
File5.csv The file list path points to a text file in
Metadata the same data store that includes a list
FileListToCopy.txt of files you want to copy, one file per
line with the relative path to the path
configured in the dataset.
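
Expressed as copy activity source JSON, the configuration in this example might look like the following sketch (the DelimitedText source type is just an assumed format for illustration):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "FileServerReadSettings",
        "fileListPath": "root/Metadata/FileListToCopy.txt"
    }
}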

recursive and copyBehavior examples


This section describes the resulting behavior of the Copy operation for different combinations of recursive and
copyBehavior values.

RECURSIVE   COPYBEHAVIOR   SOURCE FOLDER STRUCTURE   RESULTING TARGET

true preserveHierarchy Folder1 The target folder Folder1 is


File1 created with the same
File2 structure as the source:
Subfolder1
File3 Folder1
File4 File1
File5 File2
Subfolder1
File3
File4
File5.

true flattenHierarchy Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2
autogenerated name for
File3
autogenerated name for
File4
autogenerated name for
File5

true mergeFiles Folder1 The target Folder1 is


File1 created with the following
File2 structure:
Subfolder1
File3 Folder1
File4 File1 + File2 + File3 +
File5 File4 + File 5 contents are
merged into one file with
autogenerated file name

false preserveHierarchy Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 File1
File5 File2

Subfolder1 with File3, File4,


and File5 are not picked up.

false flattenHierarchy Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 autogenerated name for
File5 File1
autogenerated name for
File2

Subfolder1 with File3, File4,


and File5 are not picked up.

false mergeFiles Folder1 The target folder Folder1 is


File1 created with the following
File2 structure
Subfolder1
File3 Folder1
File4 File1 + File2 contents are
File5 merged into one file with
autogenerated file name.
autogenerated name for
File1

Subfolder1 with File3, File4,


and File5 are not picked up.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity

Delete activity properties


To learn details about the properties, check Delete activity

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. You are suggested to use the new model
mentioned in above sections going forward, and the ADF authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY   DESCRIPTION   REQUIRED

type The type property of the dataset must Yes


be set to: FileShare

folderPath Path to the folder. Wildcard filter is No


supported, allowed wildcards are: *
(matches zero or more characters) and
? (matches zero or single character);
use ^ to escape if your actual folder
name has wildcard or this escape char
inside.

Examples: rootfolder/subfolder/, see


more examples in Sample linked
service and dataset definitions and
Folder and file filter examples.

fileName Name or wildcard filter for the No


file(s) under the specified "folderPath".
If you don't specify a value for this
property, the dataset points to all files
in the folder.

For filter, allowed wildcards are: *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has wildcard or this escape char
inside.

When fileName isn't specified for an


output dataset and
preser veHierarchy isn't specified in
the activity sink, the copy activity
automatically generates the file name
with the following pattern: "Data.
[activity run ID GUID].[GUID if
FlattenHierarchy].[format if configured].
[compression if configured]", for
example "Data.0a405f8a-93ff-4c6f-
b3be-f69616f1df7a.txt.gz"; if you copy
from tabular source using table name
instead of query, the name pattern is
"[table name].[format].[compression if
configured]", for example
"MyTable.csv".

modifiedDatetimeStart Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of
files.

The properties can be NULL, which


means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.

modifiedDatetimeEnd Files filter based on the attribute: Last No


Modified. The files will be selected if
their last modified time is within the
time range between
modifiedDatetimeStart and
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of "2018-12-01T05:00:00Z".

Be aware the overall performance of


data movement will be impacted by
enabling this setting when you want to
do file filter from huge amounts of
files.

The properties can be NULL, which


means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has
datetime value but
modifiedDatetimeEnd is NULL, it
means the files whose last modified
attribute is greater than or equal with
the datetime value will be selected.
When modifiedDatetimeEnd has
datetime value but
modifiedDatetimeStart is NULL, it
means the files whose last modified
attribute is less than the datetime
value will be selected.

format If you want to copy files as-is No (only for binary copy scenario)
between file-based stores (binary
copy), skip the format section in both
input and output dataset definitions.

If you want to parse or generate files


with a specific format, the following file
format types are supported:
TextFormat , JsonFormat ,
AvroFormat , OrcFormat ,
ParquetFormat . Set the type
property under format to one of these
values. For more information, see Text
Format, Json Format, Avro Format, Orc
Format, and Parquet Format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are: GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are: Optimal and
Fastest .
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.

Example:

{
"name": "FileSystemDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<file system linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy copy activity source model


PROPERTY   DESCRIPTION   REQUIRED

type The type property of the copy activity Yes


source must be set to:
FileSystemSource

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note when
recursive is set to true and sink is file-
based store, empty folder/sub-folder
will not be copied/created at sink.
Allowed values are: true (default),
false

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<file system input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Legacy copy activity sink model


PROPERTY   DESCRIPTION   REQUIRED

type The type property of the copy activity Yes


sink must be set to: FileSystemSink

copyBehavior Defines the copy behavior when the No


source is files from file-based data
store.

Allowed values are:


- Preser veHierarchy (default) :
preserves the file hierarchy in the
target folder. The relative path of
source file to source folder is identical
to the relative path of target file to
target folder.
- FlattenHierarchy : all files from the
source folder are in the first level of
target folder. The target files have
autogenerated name.
- MergeFiles : merges all files from
the source folder to one file. No record
deduplication is performed during the
merge. If the File Name is specified, the
merged file name would be the
specified name; otherwise, would be
autogenerated file name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<file system output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from FTP server by using Azure Data
Factory
5/6/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from FTP server. To learn about Azure Data Factory, read the introductory
article.

Supported capabilities
This FTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this FTP connector supports:
Copying files using Basic or Anonymous authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.
The FTP connector supports FTP servers running in passive mode. Active mode is not supported.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
FTP.

Linked service properties


The following properties are supported for FTP linked service:

PROPERTY   DESCRIPTION   REQUIRED

type The type property must be set to: Yes


FtpSer ver .

host Specify the name or IP address of the Yes


FTP server.

port Specify the port on which the FTP No


server is listening.
Allowed values are: integer, default
value is 21 .

enableSsl Specify whether to use FTP over an No


SSL/TLS channel.
Allowed values are: true (default),
false .

enableServerCertificateValidation Specify whether to enable server No


TLS/SSL certificate validation when you
are using FTP over SSL/TLS channel.
Allowed values are: true (default),
false .

authenticationType Specify the authentication type. Yes


Allowed values are: Basic,
Anonymous

userName Specify the user who has access to the No


FTP server.

password Specify the password for the user No


(userName). Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

NOTE
The FTP connector supports accessing FTP server with either no encryption or explicit SSL/TLS encryption; it doesn’t
support implicit SSL/TLS encryption.

Example 1: using Anonymous authentication


{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Basic authentication

{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for FTP under location settings in format-based dataset:
PROPERTY   DESCRIPTION   REQUIRED

type The type property under location in Yes


dataset must be set to
FtpSer verLocation .

folderPath The path to folder. If you want to use No


wildcard to filter folder, skip this setting
and specify in activity source settings.

fileName The file name under the given No


folderPath. If you want to use wildcard
to filter files, skip this setting and
specify in activity source settings.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FtpServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by FTP source.
FTP as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for FTP under storeSettings settings in format-based copy source:
PROPERTY   DESCRIPTION   REQUIRED

type The type property under Yes


storeSettings must be set to
FtpReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given folder/file path


specified in the dataset. If you want to
copy all files from a folder, additionally
specify wildcardFileName as * .

OPTION 2: wildcard The folder path with wildcard No


- wildcardFolderPath characters to filter source folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual folder
name has wildcard or this escape char
inside.
See more examples in Folder and file
filter examples.

OPTION 2: wildcard The file name with wildcard characters Yes


- wildcardFileName under the given
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual file name
has wildcard or this escape char inside.
See more examples in Folder and file
filter examples.

OPTION 3: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When using this option, do not specify
file name in dataset. See more
examples in File list examples.

Additional settings:

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

useBinaryTransfer Specify whether to use the binary No


transfer mode. The values are true for
binary mode (default), and false for
ASCII.

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

When copying data from FTP, ADF currently tries to get the file length first, then divides the file into multiple parts
and reads them in parallel. If your FTP server doesn't support getting the file length or seeking to read from a certain
offset, you may encounter a failure.
Example:

"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "FtpReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

FOLDERPATH   FILENAME   RECURSIVE   SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)

Folder* (empty, use default) false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* (empty, use default) true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv false FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

Folder* *.csv true FolderA


File1.csv
File2.json
Subfolder1
File3.csv
File4.json
File5.csv
AnotherFolderB
File6.csv

File list examples


This section describes the resulting behavior of using file list path in copy activity source.
Assuming you have the following source folder structure and want to copy the files in bold:

SAMPLE SOURCE STRUCTURE   CONTENT IN FILELISTTOCOPY.TXT   ADF CONFIGURATION

root File1.csv In dataset:


FolderA Subfolder1/File3.csv - Folder path: root/FolderA
File1.csv Subfolder1/File5.csv
File2.json In copy activity source:
Subfolder1 - File list path:
File3.csv root/Metadata/FileListToCopy.txt
File4.json
File5.csv The file list path points to a text file in
Metadata the same data store that includes a list
FileListToCopy.txt of files you want to copy, one file per
line with the relative path to the path
configured in the dataset.

Lookup activity properties


To learn details about the properties, check Lookup activity.
GetMetadata activity properties
To learn details about the properties, check GetMetadata activity

Delete activity properties


To learn details about the properties, check Delete activity

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. You are suggested to use the new model
mentioned in above sections going forward, and the ADF authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY   DESCRIPTION   REQUIRED

type The type property of the dataset must Yes


be set to: FileShare

folderPath Path to the folder. Wildcard filter is Yes


supported, allowed wildcards are: *
(matches zero or more characters) and
? (matches zero or single character);
use ^ to escape if your actual folder
name has wildcard or this escape char
inside.

Examples: rootfolder/subfolder/, see


more examples in Folder and file filter
examples.

fileName Name or wildcard filter for the No


file(s) under the specified "folderPath".
If you don't specify a value for this
property, the dataset points to all files
in the folder.

For filter, allowed wildcards are: *


(matches zero or more characters) and
? (matches zero or single character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual file
name has wildcard or this escape char
inside.

format If you want to copy files as-is No (only for binary copy scenario)
between file-based stores (binary
copy), skip the format section in both
input and output dataset definitions.

If you want to parse files with a specific


format, the following file format types
are supported: TextFormat ,
JsonFormat , AvroFormat ,
OrcFormat , ParquetFormat . Set the
type property under format to one of
these values. For more information,
see Text Format, Json Format, Avro
Format, Orc Format, and Parquet
Format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are: GZip , Deflate ,
BZip2 , and ZipDeflate .
Supported levels are: Optimal and
Fastest .

useBinaryTransfer Specify whether to use the binary No


transfer mode. The values are true for
binary mode (default), and false for
ASCII.

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.

Example:
{
"name": "FTPDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "myfile.csv.gz",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy copy activity source model


PROPERTY   DESCRIPTION   REQUIRED

type The type property of the copy activity Yes


source must be set to:
FileSystemSource

recursive Indicates whether the data is read No


recursively from the sub folders or only
from the specified folder. Note when
recursive is set to true and sink is file-
based store, empty folder/sub-folder
will not be copied/created at sink.
Allowed values are: true (default),
false

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<FTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Use GitHub to read Common Data Model entity
references
4/22/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The GitHub connector in Azure Data Factory is only used to receive the entity reference schema for the Common
Data Model format in mapping data flow.

Linked service properties


The following properties are supported for the GitHub linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to GitHub. | Yes
userName | GitHub username. | Yes
password | GitHub password. | Yes
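
This article doesn't include a JSON sample; a minimal linked service sketch based on the table above, following the SecureString convention used by the other connectors in this documentation (values are placeholders), might look like:

{
    "name": "GitHubLinkedService",
    "properties": {
        "type": "GitHub",
        "typeProperties": {
            "userName": "<GitHub username>",
            "password": {
                "type": "SecureString",
                "value": "<GitHub password>"
            }
        }
    }
}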

Next Steps
Create a source dataset in mapping data flow.
Copy data from Google AdWords using Azure Data
Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Google AdWords. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Google AdWords connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Google AdWords to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google AdWords connector.

Linked service properties


The following properties are supported for Google AdWords linked service:

PROPERTY   DESCRIPTION   REQUIRED

type The type property must be set to: Yes


GoogleAdWords

clientCustomerID The Client customer ID of the Yes


AdWords account that you want to
fetch report data for.

developerToken The developer token associated with Yes


the manager account that you use to
grant access to the AdWords API. You
can choose to mark this field as a
SecureString to store it securely in ADF,
or store password in Azure Key Vault
and let ADF copy activity pull from
there when performing data copy -
learn more from Store credentials in
Key Vault.

authenticationType The OAuth 2.0 authentication Yes


mechanism used for authentication.
ServiceAuthentication can only be
used on self-hosted IR.
Allowed values are:
Ser viceAuthentication ,
UserAuthentication

refreshToken The refresh token obtained from No


Google for authorizing access to
AdWords for UserAuthentication. You
can choose to mark this field as a
SecureString to store it securely in ADF,
or store password in Azure Key Vault
and let ADF copy activity pull from
there when performing data copy -
learn more from Store credentials in
Key Vault.

clientId The client ID of the Google application No


used to acquire the refresh token. You
can choose to mark this field as a
SecureString to store it securely in ADF,
or store password in Azure Key Vault
and let ADF copy activity pull from
there when performing data copy -
learn more from Store credentials in
Key Vault.

clientSecret The client secret of the google No


application used to acquire the refresh
token. You can choose to mark this
field as a SecureString to store it
securely in ADF, or store password in
Azure Key Vault and let ADF copy
activity pull from there when
performing data copy - learn more
from Store credentials in Key Vault.

email The service account email ID that is No


used for ServiceAuthentication and can
only be used on self-hosted IR.

keyFilePath The full path to the .p12 key file that is No


used to authenticate the service
account email address and can only be
used on self-hosted IR.

trustedCertPath The full path of the .pem file containing No


trusted CA certificates for verifying the
server when connecting over TLS. This
property can only be set when using
TLS on self-hosted IR. The default value
is the cacerts.pem file installed with the
IR.

useSystemTrustStore Specifies whether to use a CA No


certificate from the system trust store
or from a specified PEM file. The
default value is false.

Example:

{
"name": "GoogleAdWordsLinkedService",
"properties": {
"type": "GoogleAdWords",
"typeProperties": {
"clientCustomerID" : "<clientCustomerID>",
"developerToken": {
"type": "SecureString",
"value": "<developerToken>"
},
"authenticationType" : "ServiceAuthentication",
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
},
"clientId": {
"type": "SecureString",
"value": "<clientId>"
},
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"email" : "<email>",
"keyFilePath" : "<keyFilePath>",
"trustedCertPath" : "<trustedCertPath>",
"useSystemTrustStore" : true,
}
}
}
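The example above lists all the authentication-related properties together for reference. As a sketch derived from the property table (placeholder values only, not a verified configuration), a linked service that uses UserAuthentication would typically carry only the client credentials and refresh token:

{
    "name": "GoogleAdWordsLinkedService",
    "properties": {
        "type": "GoogleAdWords",
        "typeProperties": {
            "clientCustomerID": "<clientCustomerID>",
            "developerToken": {
                "type": "SecureString",
                "value": "<developerToken>"
            },
            "authenticationType": "UserAuthentication",
            "refreshToken": {
                "type": "SecureString",
                "value": "<refreshToken>"
            },
            "clientId": {
                "type": "SecureString",
                "value": "<clientId>"
            },
            "clientSecret": {
                "type": "SecureString",
                "value": "<clientSecret>"
            }
        }
    }
}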

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Google AdWords dataset.
To copy data from Google AdWords, set the type property of the dataset to GoogleAdWordsObject . The
following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED
type The type property of the dataset must Yes


be set to: GoogleAdWordsObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "GoogleAdWordsDataset",
"properties": {
"type": "GoogleAdWordsObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<GoogleAdWords linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Google AdWords source.
Google AdWords as source
To copy data from Google AdWords, set the source type in the copy activity to GoogleAdWordsSource . The
following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
GoogleAdWordsSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromGoogleAdWords",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleAdWords input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleAdWordsSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.
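As an illustrative sketch only (the Lookup activity article is the authoritative reference), a Lookup activity that reuses the Google AdWords source type and dataset shown above might look like the following; the activity name and the firstRowOnly setting are assumptions:

{
    "name": "LookupGoogleAdWords",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "GoogleAdWordsSource",
            "query": "SELECT * FROM MyTable"
        },
        "dataset": {
            "referenceName": "<GoogleAdWords input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}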

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google BigQuery by using Azure
Data Factory
5/6/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from Google BigQuery. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
This Google BigQuery connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Google BigQuery to any supported sink data store. For a list of data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a
driver to use this connector.

NOTE
This Google BigQuery connector is built on top of the BigQuery APIs. Be aware that BigQuery limits the maximum rate of
incoming requests and enforces appropriate quotas on a per-project basis; refer to Quotas & Limits - API requests. Make
sure you do not trigger too many concurrent requests to the account.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Google BigQuery connector.

Linked service properties


The following properties are supported for the Google BigQuery linked service.
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


GoogleBigQuery.

project The project ID of the default BigQuery Yes


project to query against.

additionalProjects A comma-separated list of project IDs No


of public BigQuery projects to access.

requestGoogleDriveScope Whether to request access to Google No


Drive. Allowing Google Drive access
enables support for federated tables
that combine BigQuery data with data
from Google Drive. The default value is
false .

authenticationType The OAuth 2.0 authentication Yes


mechanism used for authentication.
ServiceAuthentication can be used
only on Self-hosted Integration
Runtime.
Allowed values are
UserAuthentication and
ServiceAuthentication. Refer to
sections below this table on more
properties and JSON samples for those
authentication types respectively.

Using user authentication


Set "authenticationType" property to UserAuthentication , and specify the following properties along with
generic properties described in the previous section:

PROPERTY    DESCRIPTION    REQUIRED

clientId ID of the application used to generate No


the refresh token.

clientSecret Secret of the application used to No


generate the refresh token. Mark this
field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

refreshToken The refresh token obtained from No


Google used to authorize access to
BigQuery. Learn how to get one from
Obtaining OAuth 2.0 access tokens
and this community blog. Mark this
field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project ID>",
"additionalProjects" : "<additional project IDs>",
"requestGoogleDriveScope" : true,
"authenticationType" : "UserAuthentication",
"clientId": "<id of the application used to generate the refresh token>",
"clientSecret": {
"type": "SecureString",
"value":"<secret of the application used to generate the refresh token>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refresh token>"
}
}
}
}

Using service authentication


Set "authenticationType" property to Ser viceAuthentication , and specify the following properties along with
generic properties described in the previous section. This authentication type can be used only on Self-hosted
Integration Runtime.

PROPERTY    DESCRIPTION    REQUIRED

email The service account email ID that is No


used for ServiceAuthentication. It can
be used only on Self-hosted
Integration Runtime.

keyFilePath The full path to the .p12 key file that is No


used to authenticate the service
account email address.

trustedCertPath The full path of the .pem file that No


contains trusted CA certificates used
to verify the server when you connect
over TLS. This property can be set only
when you use TLS on Self-hosted
Integration Runtime. The default value
is the cacerts.pem file installed with the
integration runtime.

useSystemTrustStore Specifies whether to use a CA No


certificate from the system trust store
or from a specified .pem file. The
default value is false .

Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project id>",
"requestGoogleDriveScope" : true,
"authenticationType" : "ServiceAuthentication",
"email": "<email>",
"keyFilePath": "<.p12 key path on the IR machine>"
},
"connectVia": {
"referenceName": "<name of Self-hosted Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Google BigQuery dataset.
To copy data from Google BigQuery, set the type property of the dataset to GoogleBigQueryObject. The
following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: GoogleBigQueryObject

dataset Name of the Google BigQuery dataset. No (if "query" in activity source is
specified)

table Name of the table. No (if "query" in activity source is


specified)

tableName Name of the table. This property is No (if "query" in activity source is
supported for backward compatibility. specified)
For new workload, use dataset and
table .

Example

{
"name": "GoogleBigQueryDataset",
"properties": {
"type": "GoogleBigQueryObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<GoogleBigQuery linked service name>",
"type": "LinkedServiceReference"
}
}
}
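The sample above leaves typeProperties empty. When you point the dataset at a specific table instead of supplying a query in the activity source, a sketch based on the dataset and table properties from the table above (placeholder values only) looks like this:

{
    "name": "GoogleBigQueryDataset",
    "properties": {
        "type": "GoogleBigQueryObject",
        "typeProperties": {
            "dataset": "<BigQuery dataset name>",
            "table": "<BigQuery table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<GoogleBigQuery linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}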

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Google BigQuery source type.
GoogleBigQuerySource as a source type
To copy data from Google BigQuery, set the source type in the copy activity to GoogleBigQuerySource. The
following properties are supported in the copy activity source section.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to
GoogleBigQuerySource.

query Use the custom SQL query to read No (if "tableName" in dataset is
data. An example is specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromGoogleBigQuery",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleBigQuery input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleBigQuerySource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Google Cloud Storage by using
Azure Data Factory
5/6/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from Google Cloud Storage (GCS). To learn about Azure Data Factory, read
the introductory article.

Supported capabilities
This Google Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Google Cloud Storage connector supports copying files as is or parsing files with the supported
file formats and compression codecs. It takes advantage of GCS's S3-compatible interoperability.

Prerequisites
The following setup is required on your Google Cloud Storage account:
1. Enable interoperability for your Google Cloud Storage account
2. Set the default project that contains the data you want to copy from the target GCS bucket.
3. Create a service account and define the right levels of permissions by using Cloud IAM on GCP.
4. Generate the access keys for this service account.
Required permissions
To copy data from Google Cloud Storage, make sure you've been granted the following permissions for object
operations: storage.objects.get and storage.objects.list .
If you use Data Factory UI to author, additional storage.buckets.list permission is required for operations like
testing connection to linked service and browsing from root. If you don't want to grant this permission, you can
choose "Test connection to file path" or "Browse from specified path" options from the UI.
For the full list of Google Cloud Storage roles and associated permissions, see IAM roles for Cloud Storage on
the Google Cloud site.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google Cloud Storage.

Linked service properties


The following properties are supported for Google Cloud Storage linked services:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


GoogleCloudStorage .

accessKeyId ID of the secret access key. To find the Yes


access key and secret, see
Prerequisites.

secretAccessKey The secret access key itself. Mark this Yes


field as SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

serviceUrl Specify the custom GCS endpoint as Yes


https://storage.googleapis.com .

connectVia The integration runtime to be used to No


connect to the data store. You can use
the Azure integration runtime or the
self-hosted integration runtime (if your
data store is in a private network). If
this property isn't specified, the service
uses the default Azure integration
runtime.

Here's an example:

{
"name": "GoogleCloudStorageLinkedService",
"properties": {
"type": "GoogleCloudStorage",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://storage.googleapis.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
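As the property table notes, secretAccessKey can also reference a secret stored in Azure Key Vault. The following sketch assumes the same AzureKeyVaultSecret reference shape used elsewhere in these connector articles (linked service and secret names are placeholders):

{
    "name": "GoogleCloudStorageLinkedService",
    "properties": {
        "type": "GoogleCloudStorage",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            },
            "serviceUrl": "https://storage.googleapis.com"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}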

Dataset properties
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Google Cloud Storage under location settings in a format-based
dataset:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under location Yes


in the dataset must be set to
GoogleCloudStorageLocation .

bucketName The GCS bucket name. Yes

folderPath The path to folder under the given No


bucket. If you want to use a wildcard
to filter the folder, skip this setting and
specify that in activity source settings.

fileName The file name under the given bucket No


and folder path. If you want to use a
wildcard to filter the files, skip this
setting and specify that in activity
source settings.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Google Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "GoogleCloudStorageLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties that the Google Cloud Storage source supports.
Google Cloud Storage as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Google Cloud Storage under storeSettings settings in a format-
based copy source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
GoogleCloudStorageReadSettings .

Locate the files to copy:

OPTION 1: static path Copy from the given bucket or


folder/file path specified in the dataset.
If you want to copy all files from a
bucket or folder, additionally specify
wildcardFileName as * .

OPTION 2: GCS prefix Prefix for the GCS key name under the No
- prefix given bucket configured in the dataset
to filter source GCS files. GCS keys
whose names start with
bucket_in_dataset/this_prefix are
selected. It utilizes GCS's service-side
filter, which provides better
performance than a wildcard filter.

OPTION 3: wildcard The folder path with wildcard No


- wildcardFolderPath characters under the given bucket
configured in a dataset to filter source
folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your folder name has a
wildcard or this escape character
inside.
See more examples in Folder and file
filter examples.

OPTION 3: wildcard The file name with wildcard characters Yes


- wildcardFileName under the given bucket and folder
path (or wildcard folder path) to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your file name has a
wildcard or this escape character
inside. See more examples in Folder
and file filter examples.

OPTION 3: a list of files Indicates to copy a given file set. Point No


- fileListPath to a text file that includes a list of files
you want to copy, one file per line,
which is the relative path to the path
configured in the dataset.
When you're using this option, do not
specify the file name in the dataset.
See more examples in File list
examples.

Additional settings:

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. Note that
when recursive is set to true and the
sink is a file-based store, an empty
folder or subfolder isn't copied or
created at the sink.
Allowed values are true (default) and
false .
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files are filtered based on the attribute: No


last modified.
The files will be selected if their last
modified time is within the time range
between modifiedDatetimeStart
and modifiedDatetimeEnd . The time
is applied to the UTC time zone in the
format of "2018-12-01T05:00:00Z".
The properties can be NULL , which
means no file attribute filter will be
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL , the
files whose last modified attribute is
greater than or equal to the datetime
value will be selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL , the files whose last modified
attribute is less than the datetime
value will be selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "GoogleCloudStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
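Where the source files share a common key prefix, the prefix option described in the table above can replace the wildcard settings and push filtering to the service side. The following is a sketch of the same copy source using prefix; only the source fragment is shown, and the prefix value is illustrative:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings":{
        "type": "DelimitedTextReadSettings",
        "skipLineCount": 10
    },
    "storeSettings":{
        "type": "GoogleCloudStorageReadSettings",
        "prefix": "folder/subfolder/2020-"
    }
}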

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

All four examples below assume the following source structure; the files that each filter retrieves are listed in the last column.

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

BUCKET    KEY              RECURSIVE    FILES RETRIEVED
bucket    Folder*/*        false        FolderA/File1.csv, FolderA/File2.json
bucket    Folder*/*        true         FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv
bucket    Folder*/*.csv    false        FolderA/File1.csv
bucket    Folder*/*.csv    true         FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv

File list examples


This section describes the resulting behavior of using a file list path in the Copy activity source.
Assume that you have the following source folder structure and want to copy the files listed in FileListToCopy.txt:

Sample source structure:

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:

File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Data Factory configuration:

In the dataset:
- Bucket: bucket
- Folder path: FolderA

In the copy activity source:
- File list path: bucket/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.

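Matching the configuration above, the following is a sketch of the copy activity source that uses the file list instead of a folder or wildcard path; only the storeSettings fragment is shown, and the dataset keeps the bucket and folder path configured above:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings":{
        "type": "GoogleCloudStorageReadSettings",
        "fileListPath": "bucket/Metadata/FileListToCopy.txt"
    }
}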
Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity.

Delete activity properties


To learn details about the properties, check Delete activity.

Legacy models
If you were using an Amazon S3 connector to copy data from Google Cloud Storage, it's still supported as is for
backward compatibility. We suggest that you use the new model mentioned earlier. The Data Factory authoring
UI has switched to generating the new model.

Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Greenplum using Azure Data
Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Greenplum. It builds
on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Greenplum connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Greenplum to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Greenplum connector.
Linked service properties
The following properties are supported for Greenplum linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Greenplum

connectionString An ODBC connection string to connect Yes


to Greenplum.
You can also put password in Azure
Key Vault and pull the pwd
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key
Vault article with more details.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Greenplum dataset.
To copy data from Greenplum, set the type property of the dataset to GreenplumTable . The following
properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: GreenplumTable

schema Name of the schema. No (if "query" in activity source is


specified)

table Name of the table. No (if "query" in activity source is


specified)

tableName Name of the table with schema. This No (if "query" in activity source is
property is supported for backward specified)
compatibility. Use schema and
table for new workload.

Example
{
"name": "GreenplumDataset",
"properties": {
"type": "GreenplumTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Greenplum linked service name>",
"type": "LinkedServiceReference"
}
}
}
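The sample above leaves typeProperties empty. When you reference a specific table rather than a query, a sketch using the schema and table properties from the table above (placeholder names only) looks like this:

{
    "name": "GreenplumDataset",
    "properties": {
        "type": "GreenplumTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Greenplum linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}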

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Greenplum source.
GreenplumSource as source
To copy data from Greenplum, set the source type in the copy activity to GreenplumSource . The following
properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
GreenplumSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromGreenplum",
"type": "Copy",
"inputs": [
{
"referenceName": "<Greenplum input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GreenplumSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HBase using Azure Data Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HBase. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This HBase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from HBase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HBase connector.
Linked service properties
The following properties are supported for HBase linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


HBase

host The IP address or host name of the Yes


HBase server. (i.e.
[clustername].azurehdinsight.net
, 192.168.222.160 )

port The TCP port that the HBase instance No


uses to listen for client connections.
The default value is 9090. If you
connect to Azure HDInsights, specify
port as 443.

httpPath The partial URL corresponding to the No


HBase server, e.g. /hbaserest0 when
using HDInsights cluster.

authenticationType The authentication mechanism to use Yes


to connect to the HBase server.
Allowed values are: Anonymous ,
Basic

username The user name used to connect to the No


HBase instance.

password The password corresponding to the No


user name. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

enableSsl Specifies whether the connections to No


the server are encrypted using TLS.
The default value is false.

trustedCertPath The full path of the .pem file containing No


trusted CA certificates for verifying the
server when connecting over TLS. This
property can only be set when using
TLS on self-hosted IR. The default value
is the cacerts.pem file installed with the
IR.

allowHostNameCNMismatch Specifies whether to require a CA- No


issued TLS/SSL certificate name to
match the host name of the server
when connecting over TLS. The default
value is false.

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

NOTE
If your cluster doesn't support sticky sessions (for example, HDInsight), explicitly add the node index at the end of the HTTP path setting: for example, specify /hbaserest0 instead of /hbaserest .

Example for HDInsight HBase:

{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<cluster name>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbaserest0",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example for generic HBase:


{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<host e.g. 192.168.222.160>",
"port" : "<port>",
"httpPath" : "<e.g. /gateway/sandbox/hbase/version>",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true,
"trustedCertPath" : "<trustedCertPath>",
"allowHostNameCNMismatch" : true,
"allowSelfSignedServerCert" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
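The password property can also reference a secret stored in Azure Key Vault instead of an inline SecureString. The following is a sketch assuming the same AzureKeyVaultSecret reference shape used elsewhere in these connector articles (names are placeholders):

{
    "name": "HBaseLinkedService",
    "properties": {
        "type": "HBase",
        "typeProperties": {
            "host" : "<cluster name>.azurehdinsight.net",
            "port" : "443",
            "httpPath" : "/hbaserest0",
            "authenticationType" : "Basic",
            "username" : "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            },
            "enableSsl" : true
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}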

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HBase dataset.
To copy data from HBase, set the type property of the dataset to HBaseObject . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: HBaseObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "HBaseDataset",
"properties": {
"type": "HBaseObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<HBase linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HBase source.
HBaseSource as source
To copy data from HBase, set the source type in the copy activity to HBaseSource . The following properties are
supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: HBaseSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromHBase",
"type": "Copy",
"inputs": [
{
"referenceName": "<HBase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HBaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from the HDFS server by using Azure
Data Factory
5/6/2021 • 19 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from the Hadoop Distributed File System (HDFS) server. To learn about
Azure Data Factory, read the introductory article.

Supported capabilities
The HDFS connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
Delete activity
Specifically, the HDFS connector supports:
Copying files by using Windows (Kerberos) or Anonymous authentication.
Copying files by using the webhdfs protocol or built-in DistCp support.
Copying files as is or by parsing or generating files with the supported file formats and compression codecs.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

NOTE
Make sure that the integration runtime can access all the [name node server]:[name node port] and [data node servers]:
[data node port] of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HDFS.

Linked service properties


The following properties are supported for the HDFS linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Hdfs. Yes

url The URL to the HDFS Yes

authenticationType The allowed values are Anonymous or Yes


Windows.

To set up your on-premises


environment, see the Use Kerberos
authentication for the HDFS connector
section.

userName The username for Windows Yes (for Windows authentication)


authentication. For Kerberos
authentication, specify
<username>@<domain>.com .

password The password for Windows Yes (for Windows Authentication)


authentication. Mark this field as a
SecureString to store it securely in your
data factory, or reference a secret
stored in an Azure key vault.

connectVia The integration runtime to be used to No


connect to the data store. To learn
more, see the Prerequisites section. If
the integration runtime isn't specified,
the service uses the default Azure
Integration Runtime.

Example: using Anonymous authentication


{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Anonymous",
"userName": "hadoop"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: using Windows authentication

{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Windows",
"userName": "<username>@<domain>.com (for Kerberos auth)",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets in Azure Data
Factory.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HDFS under location settings in the format-based dataset:
PROPERTY    DESCRIPTION    REQUIRED

type The type property under location in Yes


the dataset must be set to
HdfsLocation.

folderPath The path to the folder. If you want to No


use a wildcard to filter the folder, skip
this setting and specify the path in
activity source settings.

fileName The file name under the specified No


folderPath. If you want to use a
wildcard to filter files, skip this setting
and specify the file name in activity
source settings.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HdfsLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties that are available for defining activities, see Pipelines and activities in
Azure Data Factory. This section provides a list of properties that are supported by the HDFS source.
HDFS as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HDFS under storeSettings settings in the format-based Copy
source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
HdfsReadSettings .

Locate the files to copy

OPTION 1: static path Copy from the folder or file path that's
specified in the dataset. If you want to
copy all files from a folder, additionally
specify wildcardFileName as * .

OPTION 2: wildcard The folder path with wildcard No


- wildcardFolderPath characters to filter source folders.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character). Use
^ to escape if your actual folder
name has a wildcard or this escape
character inside.
For more examples, see Folder and file
filter examples.

OPTION 2: wildcard The file name with wildcard characters Yes


- wildcardFileName under the specified
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are: * (matches
zero or more characters) and ?
(matches zero or single character); use
^ to escape if your actual file name
has a wildcard or this escape character
inside. For more examples, see Folder
and file filter examples.

OPTION 3: a list of files Indicates to copy a specified file set. No


- fileListPath Point to a text file that includes a list of
files you want to copy (one file per line,
with the relative path to the path
configured in the dataset).
When you use this option, do not
specify file name in the dataset. For
more examples, see File list examples.

Additional settings

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink
is a file-based store, an empty folder or
subfolder isn't copied or created at the
sink.
Allowed values are true (default) and
false.
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files are filtered based on the attribute No


Last Modified.
The files are selected if their last
modified time is within the range of
modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of 2018-12-01T05:00:00Z.
The properties can be NULL, which
means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above.

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

DistCp settings

distcpSettings The property group to use when you No


use HDFS DistCp.

resourceManagerEndpoint The YARN (Yet Another Resource Yes, if using DistCp


Negotiator) endpoint

tempScriptPath A folder path that's used to store the Yes, if using DistCp
temp DistCp command script. The
script file is generated by Data Factory
and will be removed after the Copy job
is finished.

distcpOptions Additional options provided to DistCp No


command.

Example:
"activities":[
{
"name": "CopyFromHDFS",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "HdfsReadSettings",
"recursive": true,
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
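When the HDFS folder layout encodes partitions in the path (for example, root/folder/year=2020/month=08/day=27 as described in the table above), partition discovery can surface those path segments as source columns. The following is a sketch of the storeSettings fragment only; the paths are illustrative:

"storeSettings":{
    "type": "HdfsReadSettings",
    "recursive": true,
    "enablePartitionDiscovery": true,
    "partitionRootPath": "root/folder/year=2020"
}

With the dataset folder path set to root/folder/year=2020/month=08/day=27, this adds month and day columns with values "08" and "27", as the table above describes.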

Folder and file filter examples


This section describes the resulting behavior if you use a wildcard filter with the folder path and file name.

All four examples below assume the following source structure; the files that each filter retrieves are listed in the last column.

FolderA
    File1.csv
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv

FOLDERPATH    FILENAME                RECURSIVE    FILES RETRIEVED
Folder*       (empty, use default)    false        FolderA/File1.csv, FolderA/File2.json
Folder*       (empty, use default)    true         FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv
Folder*       *.csv                   false        FolderA/File1.csv
Folder*       *.csv                   true         FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv

File list examples


This section describes the behavior that results from using a file list path in the Copy activity source. It assumes that you have the following source folder structure and want to copy the files listed in FileListToCopy.txt:

Sample source structure:

root
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:

File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Azure Data Factory configuration:

In the dataset:
- Folder path: root/FolderA

In the Copy activity source:
- File list path: root/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy (one file per line, with the relative path to the path configured in the dataset).

Use DistCp to copy data from HDFS


DistCp is a Hadoop native command-line tool for doing a distributed copy in a Hadoop cluster. When you run a
command in DistCp, it first lists all the files to be copied and then creates several Map jobs in the Hadoop cluster.
Each Map job does a binary copy from the source to the sink.
The Copy activity supports using DistCp to copy files as is into Azure Blob storage (including staged copy) or an
Azure data lake store. In this case, DistCp can take advantage of your cluster's power instead of running on the
self-hosted integration runtime. Using DistCp provides better copy throughput, especially if your cluster is very
powerful. Based on the configuration in your data factory, the Copy activity automatically constructs a DistCp
command, submits it to your Hadoop cluster, and monitors the copy status.
Prerequisites
To use DistCp to copy files as is from HDFS to Azure Blob storage (including staged copy) or the Azure data lake
store, make sure that your Hadoop cluster meets the following requirements:
The MapReduce and YARN services are enabled.
YARN version is 2.5 or later.
The HDFS server is integrated with your target data store: Azure Blob storage or Azure Data Lake
Store (ADLS Gen1) :
Azure Blob FileSystem is natively supported since Hadoop 2.7. You need only to specify the JAR path
in the Hadoop environment configuration.
Azure Data Lake Store FileSystem is packaged starting from Hadoop 3.0.0-alpha1. If your Hadoop
cluster version is earlier than that version, you need to manually import Azure Data Lake Store-related
JAR packages (azure-datalake-store.jar) into the cluster from here, and specify the JAR file path in the
Hadoop environment configuration.
Prepare a temp folder in HDFS. This temp folder is used to store a DistCp shell script, so it will occupy KB-
level space.
Make sure that the user account that's provided in the HDFS linked service has permission to:
Submit an application in YARN.
Create a subfolder and read/write files under the temp folder.
Configurations
For DistCp-related configurations and examples, go to the HDFS as source section.

Use Kerberos authentication for the HDFS connector


There are two options for setting up the on-premises environment to use Kerberos authentication for the HDFS
connector. You can choose the one that better fits your situation.
Option 1: Join a self-hosted integration runtime machine in the Kerberos realm
Option 2: Enable mutual trust between the Windows domain and the Kerberos realm
For either option, make sure you turn on webhdfs for Hadoop cluster:
1. Create the HTTP principal and keytab for webhdfs.

IMPORTANT
The HTTP Kerberos principal must start with "HTTP/" according to Kerberos HTTP SPNEGO specification. Learn
more from here.

Kadmin> addprinc -randkey HTTP/<namenode hostname>@<REALM.COM>


Kadmin> ktadd -k /etc/security/keytab/spnego.service.keytab HTTP/<namenode hostname>@<REALM.COM>

2. HDFS configuration options: add the following three properties in hdfs-site.xml .


<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/_HOST@<REALM.COM></value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytab/spnego.service.keytab</value>
</property>

Option 1: Join a self-hosted integration runtime machine in the Kerberos realm


Requirements
The self-hosted integration runtime machine needs to join the Kerberos realm and can’t join any Windows
domain.
How to configure
On the KDC ser ver :
Create a principal for Azure Data Factory to use, and specify the password.

IMPORTANT
The username should not contain the hostname.

Kadmin> addprinc <username>@<REALM.COM>

On the self-hosted integration runtime machine:


1. Run the Ksetup utility to configure the Kerberos Key Distribution Center (KDC) server and realm.
The machine must be configured as a member of a workgroup, because a Kerberos realm is different
from a Windows domain. You can achieve this configuration by setting the Kerberos realm and adding a
KDC server by running the following commands. Replace REALM.COM with your own realm name.

C:> Ksetup /setdomain REALM.COM


C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>

After you run these commands, restart the machine.


2. Verify the configuration with the Ksetup command. The output should be like:

C:> Ksetup
default realm = REALM.COM (external)
REALM.com:
kdc = <your_kdc_server_address>

In your data factory:


Configure the HDFS connector by using Windows authentication together with your Kerberos principal name
and password to connect to the HDFS data source. For configuration details, check the HDFS linked service
properties section.
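Tying Option 1 back to the linked service definition, the following is a sketch of an HDFS linked service that uses the Kerberos principal created above with Windows authentication; it mirrors the Windows authentication example earlier in this article, and the values are placeholders:

{
    "name": "HDFSLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "url" : "http://<namenode>:50070/webhdfs/v1/",
            "authenticationType": "Windows",
            "userName": "<username>@<REALM.COM>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of self-hosted Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}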
Option 2: Enable mutual trust between the Windows domain and the Kerberos realm
Requirements
The self-hosted integration runtime machine must join a Windows domain.
You need permission to update the domain controller's settings.
How to configure

NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own realm name and domain controller.

On the KDC ser ver :


1. Edit the KDC configuration in the krb5.conf file to let KDC trust the Windows domain by referring to the
following configuration template. By default, the configuration is located at /etc/krb5.conf.

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
default_realm = REALM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
REALM.COM = {
kdc = node.REALM.COM
admin_server = node.REALM.COM
}
AD.COM = {
kdc = windc.ad.com
admin_server = windc.ad.com
}

[domain_realm]
.REALM.COM = REALM.COM
REALM.COM = REALM.COM
.ad.com = AD.COM
ad.com = AD.COM

[capaths]
AD.COM = {
REALM.COM = .
}

After you configure the file, restart the KDC service.


2. Prepare a principal named krbtgt/REALM.COM@AD.COM in the KDC server with the following command:

Kadmin> addprinc krbtgt/REALM.COM@AD.COM

3. In the hadoop.security.auth_to_local HDFS service configuration file, add
RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//.
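
As a sketch only (assuming your distribution keeps this setting in core-site.xml; the exact file can vary), the rule is appended to the existing hadoop.security.auth_to_local value:

<property>
    <!-- Sketch: maps AD.COM principals to local short names; the DEFAULT rule preserves the standard behavior for other realms. -->
    <name>hadoop.security.auth_to_local</name>
    <value>
        RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//
        DEFAULT
    </value>
</property>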

On the domain controller:


1. Run the following Ksetup commands to add a realm entry:
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM

2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the
principal krbtgt/REALM.COM@AD.COM.

C:> netdom trust REALM.COM /Domain: AD.COM /add /realm /password:[password]

3. Select the encryption algorithm that's used in Kerberos.


a. Select Server Manager > Group Policy Management > Domain > Group Policy Objects >
Default or Active Domain Policy, and then select Edit.
b. On the Group Policy Management Editor pane, select Computer Configuration > Policies >
Windows Settings > Security Settings > Local Policies > Security Options, and then configure
Network security: Configure encryption types allowed for Kerberos.
c. Select the encryption algorithm you want to use when you connect to the KDC server. You can select all
the options.

d. Use the Ksetup command to specify the encryption algorithm to be used on the specified realm.

C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96

4. Create the mapping between the domain account and the Kerberos principal, so that you can use the
Kerberos principal in the Windows domain.
a. Select Administrative tools > Active Directory Users and Computers.
b. Configure advanced features by selecting View > Advanced Features .
c. On the Advanced Features pane, right-click the account to which you want to create mappings and,
on the Name Mappings pane, select the Kerberos Names tab.
d. Add a principal from the realm.
On the self-hosted integration runtime machine:
Run the following Ksetup commands to add a realm entry.

C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>


C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM

In your data factory:


Configure the HDFS connector by using Windows authentication together with either your domain account
or Kerberos principal to connect to the HDFS data source. For configuration details, see the HDFS linked
service properties section.

Lookup activity properties


For information about Lookup activity properties, see Lookup activity in Azure Data Factory.

Delete activity properties


For information about Delete activity properties, see Delete activity in Azure Data Factory.

Legacy models
NOTE
The following models are still supported as is for backward compatibility. We recommend that you use the previously
discussed new model, because the Azure Data Factory authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to FileShare

folderPath The path to the folder. A wildcard filter Yes


is supported. Allowed wildcards are *
(matches zero or more characters) and
? (matches zero or a single
character); use ^ to escape if your
actual file name has a wildcard or this
escape character inside.

Examples: rootfolder/subfolder/, see


more examples in Folder and file filter
examples.

fileName The name or wildcard filter for the files No


under the specified "folderPath". If you
don't specify a value for this property,
the dataset points to all files in the
folder.

For filter, allowed wildcards are *


(matches zero or more characters) and
? (matches zero or a single
character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual folder
name has a wildcard or this escape
character inside.

modifiedDatetimeStart Files are filtered based on the attribute No


Last Modified. The files are selected if
their last modified time is within the
range of modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format 2018-12-01T05:00:00Z.

Be aware that the overall performance


of data movement will be affected by
enabling this setting when you want to
apply a file filter to large numbers of
files.

The properties can be NULL, which


means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.

modifiedDatetimeEnd Files are filtered based on the attribute No


Last Modified. The files are selected if
their last modified time is within the
range of modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format 2018-12-01T05:00:00Z.

Be aware that the overall performance


of data movement will be affected by
enabling this setting when you want to
apply a file filter to large numbers of
files.

The properties can be NULL, which


means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both the input and
output dataset definitions.

If you want to parse files with a specific


format, the following file format types
are supported: TextFormat,
JsonFormat, AvroFormat, OrcFormat,
ParquetFormat. Set the type property
under format to one of these values.
For more information, see the Text
format, JSON format, Avro format,
ORC format, and Parquet format
sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are: Gzip, Deflate,
Bzip2, and ZipDeflate.
Supported levels are: Optimal and
Fastest.
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a specified name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.

Example:

{
"name": "HDFSDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy Copy activity source model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the Copy activity Yes


source must be set to HdfsSource.

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink is a
file-based store, an empty folder or
subfolder will not be copied or created
at the sink.
Allowed values are true (default) and
false.

distcpSettings The property group when you're using No


HDFS DistCp.

resourceManagerEndpoint The YARN Resource Manager endpoint Yes, if using DistCp



tempScriptPath A folder path that's used to store the Yes, if using DistCp
temp DistCp command script. The
script file is generated by Data Factory
and will be removed after the Copy job
is finished.

distcpOptions Additional options are provided to No


DistCp command.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example: HDFS source in Copy activity using DistCp

"source": {
"type": "HdfsSource",
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}

Next steps
For a list of data stores that are supported as sources and sinks by the Copy activity in Azure Data Factory, see
supported data stores.
Copy and transform data from Hive using Azure
Data Factory
5/6/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Hive. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Hive connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Hive to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Hive connector.
Linked service properties
The following properties are supported for Hive linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Hive

host IP address or host name of the Hive Yes


server, separated by ';' for multiple
hosts (only when
serviceDiscoveryMode is enabled).

port The TCP port that the Hive server uses Yes
to listen for client connections. If you
connect to Azure HDInsights, specify
port as 443.

serverType The type of Hive server. No


Allowed values are: HiveServer1, HiveServer2, HiveThriftServer

thriftTransportProtocol The transport protocol to use in the No


Thrift layer.
Allowed values are: Binary, SASL,
HTTP

authenticationType The authentication method used to Yes


access the Hive server.
Allowed values are: Anonymous, Username, UsernameAndPassword, WindowsAzureHDInsightService.
Kerberos authentication is not
supported now.

serviceDiscoveryMode true to indicate using the ZooKeeper No


service, or false if not.

zooKeeperNameSpace The namespace on ZooKeeper under No


which Hive Server 2 nodes are added.

useNativeQuery Specifies whether the driver uses No


native HiveQL queries, or converts
them into an equivalent form in
HiveQL.

username The user name that you use to access No


Hive Server.

password The password corresponding to the No


user. Mark this field as a SecureString
to store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

httpPath The partial URL corresponding to the No


Hive server.

enableSsl Specifies whether the connections to No


the server are encrypted using TLS.
The default value is false.

trustedCertPath The full path of the .pem file containing No


trusted CA certificates for verifying the
server when connecting over TLS. This
property can only be set when using
TLS on self-hosted IR. The default value
is the cacerts.pem file installed with the
IR.

useSystemTrustStore Specifies whether to use a CA No


certificate from the system trust store
or from a specified PEM file. The
default value is false.

allowHostNameCNMismatch Specifies whether to require a CA- No


issued TLS/SSL certificate name to
match the host name of the server
when connecting over TLS. The default
value is false.

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

storageReference A reference to the linked service of the No


storage account used for staging data
in mapping data flow. This is required
only when using the Hive linked
service in mapping data flow

Example:
{
"name": "HiveLinkedService",
"properties": {
"type": "Hive",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Hive dataset.
To copy data from Hive, set the type property of the dataset to HiveObject . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: HiveObject

schema Name of the schema. No (if "query" in activity source is


specified)

table Name of the table. No (if "query" in activity source is


specified)

tableName Name of the table including schema No (if "query" in activity source is
part. This property is supported for specified)
backward compatibility. For new
workload, use schema and table .

Example

{
"name": "HiveDataset",
"properties": {
"type": "HiveObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Hive linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Hive source.
HiveSource as source
To copy data from Hive, set the source type in the copy activity to HiveSource . The following properties are
supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: HiveSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromHive",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hive input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HiveSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Mapping data flow properties


The Hive connector is supported as an inline dataset source in mapping data flows. Read using a query or
directly from a Hive table in HDInsight. Hive data is staged in a storage account as Parquet files before it's
transformed as part of a data flow.
Source properties
The table below lists the properties supported by a Hive source. You can edit these properties in the Source
options tab.
NAME    DESCRIPTION    REQUIRED    ALLOWED VALUES    DATA FLOW SCRIPT PROPERTY

Store Store must be hive yes hive store

Format Whether you are yes table or query format


reading from a table
or query

Schema name If reading from a yes, if format is String schemaName


table, the schema of table
the source table

Table name If reading from a yes, if format is String tableName


table, the table name table

Query If format is query , yes, if format is String query


the source query on query
the Hive linked
service

Staged Hive table will always yes true staged


be staged.

Storage Container Storage container yes String storageContainer


used to stage data
before reading from
Hive or writing to
Hive. The hive cluster
must have access to
this container.

Staging database The schema/database no true or false stagingDatabaseName


where the user
account specified in
the linked service has
access to. It is used
to create external
tables during staging
and dropped
afterwards

Pre SQL Scripts SQL code to run on no String preSQLs


the Hive table before
reading the data

Source example
Below is an example of a Hive source configuration in the authoring UI; these settings translate into the following data flow script:

source(
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false,
format: 'table',
store: 'hive',
schemaName: 'default',
tableName: 'hivesampletable',
staged: true,
storageContainer: 'khive',
storageFolderPath: '',
stagingDatabaseName: 'default') ~> hivesource

Known limitations
Complex types such as arrays, maps, structs, and unions are not supported for read.
Hive connector only supports Hive tables in Azure HDInsight of version 4.0 or greater (Apache Hive 3.1.0)

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an HTTP endpoint by using Azure
Data Factory
5/6/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from an HTTP endpoint. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences among this HTTP connector, the REST connector, and the Web table connector are:
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic and retrieves data from any HTTP endpoint, for example, to download a file. Before the REST
connector became available, you might have used the HTTP connector to copy data from a RESTful API,
which is supported but less functional than the REST connector.
The Web table connector extracts table content from an HTML webpage.

Supported capabilities
This HTTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an HTTP source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use this HTTP connector to:
Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or ClientCertificate.
Copy the HTTP response as-is or parse it by using supported file formats and compression codecs.

TIP
To test an HTTP request for data retrieval before you configure the HTTP connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
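
For example, a quick check from the command line might look like the following sketch; the URL, credentials, and header are placeholders that depend on your API's requirements.

# Hypothetical endpoint and credentials; adjust the method, headers, and body to match your API specification.
curl -u <username>:<password> -H "Accept: application/json" "https://<HTTP endpoint>/<relative URL>"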

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the HTTP connector.

Linked service properties


The following properties are supported for the HTTP linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


HttpServer.

url The base URL to the web server. Yes

enableServerCertificateValidation Specify whether to enable server No


TLS/SSL certificate validation when you (the default is true )
connect to an HTTP endpoint. If your
HTTPS server uses a self-signed
certificate, set this property to false .

authenticationType Specifies the authentication type. Yes


Allowed values are Anonymous,
Basic, Digest, Windows, and
ClientCertificate. User-based OAuth
isn't supported. You can additionally
configure authentication headers in
authHeader property. See the
sections that follow this table for more
properties and JSON samples for these
authentication types.

authHeaders Additional HTTP request headers for No


authentication.
For example, to use API key
authentication, you can select
authentication type as “Anonymous”
and specify API key in the header.

connectVia The Integration Runtime to use to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, the default Azure Integration
Runtime is used.

Using Basic, Digest, or Windows authentication


Set the authenticationType property to Basic, Digest, or Windows. In addition to the generic properties that
are described in the preceding section, specify the following properties:

PROPERTY    DESCRIPTION    REQUIRED

userName The user name to use to access the Yes


HTTP endpoint.

password The password for the user (the Yes


userName value). Mark this field as a
SecureString type to store it securely
in Data Factory. You can also reference
a secret stored in Azure Key Vault.

Example

{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<HTTP endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Using ClientCertificate authentication


To use ClientCertificate authentication, set the authenticationType property to ClientCertificate. In addition
to the generic properties that are described in the preceding section, specify the following properties:

PROPERTY    DESCRIPTION    REQUIRED

embeddedCertData Base64-encoded certificate data. Specify either embeddedCertData or certThumbprint.

certThumbprint The thumbprint of the certificate that's Specify either embeddedCertData or
installed on your self-hosted certThumbprint.
Integration Runtime machine's cert
store. Applies only when the self-
hosted type of Integration Runtime is
specified in the connectVia property.

password The password that's associated with No


the certificate. Mark this field as a
SecureString type to store it securely
in Data Factory. You can also reference
a secret stored in Azure Key Vault.

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, grant read permissions to the self-hosted Integration Runtime:
1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local
Computer.
2. Expand Certificates > Personal, and then select Certificates.
3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
4. On the Security tab, add the user account under which the Integration Runtime Host Service
(DIAHostService) is running, with read access to the certificate.
Example 1: Using certThumbprint

{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"certThumbprint": "<thumbprint of certificate>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Using embeddedCertData


{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"embeddedCertData": "<Base64-encoded cert data>",
"password": {
"type": "SecureString",
"value": "password of cert"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Using authentication headers


In addition, you can configure request headers for authentication along with the built-in authentication types.
Example: Using API key authentication

{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"url": "<HTTP endpoint>",
"authenticationType": "Anonymous",
"authHeader": {
"x-api-key": {
"type": "SecureString",
"value": "<API key>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HTTP under location settings in format-based dataset:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under location in Yes


dataset must be set to
HttpServerLocation.

relativeUrl A relative URL to the resource that No


contains the data. The HTTP connector
copies data from the combined URL:
[URL specified in linked
service][relative URL specified
in dataset]
.

NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HttpServerLocation",
"relativeUrl": "<relative url>"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy Activity properties


This section provides a list of properties that the HTTP source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
HTTP as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HTTP under storeSettings settings in format-based copy source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
HttpReadSettings .

requestMethod The HTTP method. No


Allowed values are Get (default) and
Post .

additionalHeaders Additional HTTP request headers. No

requestBody The body for the HTTP request. No

httpRequestTimeout The timeout (the TimeSpan value) for No


the HTTP request to get a response.
This value is the timeout to get a
response, not the timeout to read
response data. The default value is
00:01:40 .

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "HttpReadSettings",
"requestMethod": "Post",
"additionalHeaders": "<header key: header value>\n<header key: header value>\n",
"requestBody": "<body for POST HTTP request>"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We recommend that you use the new model
described in the sections above going forward, because the ADF authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to HttpFile .

relativeUrl A relative URL to the resource that No


contains the data. When this property
isn't specified, only the URL that's
specified in the linked service definition
is used.

requestMethod The HTTP method. Allowed values are No


Get (default) and Post .

additionalHeaders Additional HTTP request headers. No

requestBody The body for the HTTP request. No

format If you want to retrieve data from the No


HTTP endpoint as-is without parsing it,
and then copy the data to a file-based
store, skip the format section in both
the input and output dataset
definitions.

If you want to parse the HTTP


response content during copy, the
following file format types are
supported: TextFormat , JsonFormat ,
AvroFormat , OrcFormat , and
ParquetFormat . Under format , set
the type property to one of these
values. For more information, see
JSON format, Text format, Avro format,
Orc format, and Parquet format.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.

Supported types: GZip , Deflate ,


BZip2 , and ZipDeflate .
Supported levels: Optimal and
Fastest .

NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.

Example 1: Using the Get method (default)

{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
}
}
}
Example 2: Using the Post method

{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"requestMethod": "Post",
"requestBody": "<body for POST HTTP request>"
}
}
}

Legacy copy activity source model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to HttpSource .

httpRequestTimeout The timeout (the TimeSpan value) for No


the HTTP request to get a response.
This value is the timeout to get a
response, not the timeout to read
response data. The default value is
00:01:40 .

Example

"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<HTTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HttpSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from HubSpot using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from HubSpot. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This HubSpot connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from HubSpot to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HubSpot connector.

Linked service properties


The following properties are supported for HubSpot linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Hubspot

clientId The client ID associated with your Yes


HubSpot application. Learn how to
create an app in HubSpot from here.

clientSecret The client secret associated with your Yes


HubSpot application. Mark this field as
a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

accessToken The access token obtained when Yes


initially authenticating your OAuth
integration. Learn how to get access
token with your client ID and secret
from here. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

refreshToken The refresh token obtained when Yes


initially authenticating your OAuth
integration. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Example:

{
"name": "HubSpotLinkedService",
"properties": {
"type": "Hubspot",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HubSpot dataset.
To copy data from HubSpot, set the type property of the dataset to HubspotObject . The following properties
are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: HubspotObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "HubSpotDataset",
"properties": {
"type": "HubspotObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<HubSpot linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by HubSpot source.
HubspotSource as source
To copy data from HubSpot, set the source type in the copy activity to HubspotSource . The following
properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
HubspotSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Companies where
Company_Id = xxx"
.

Example:
"activities":[
{
"name": "CopyFromHubspot",
"type": "Copy",
"inputs": [
{
"referenceName": "<HubSpot input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HubspotSource",
"query": "SELECT * FROM Companies where Company_Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Impala by using Azure Data
Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from Impala. It builds on the
Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
This Impala connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Impala to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a
driver to use this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Impala connector.
Linked service properties
The following properties are supported for Impala linked service.

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


Impala .

host The IP address or host name of the Yes


Impala server (that is,
192.168.222.160).

port The TCP port that the Impala server No


uses to listen for client connections.
The default value is 21050.

authenticationType The authentication type to use. Yes


Allowed values are Anonymous ,
SASLUsername , and
UsernameAndPassword .

username The user name used to access the No


Impala server. The default value is
anonymous when you use
SASLUsername.

password The password that corresponds to the No


user name when you use
UsernameAndPassword. Mark this field
as a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

enableSsl Specifies whether the connections to No


the server are encrypted by using TLS.
The default value is false .

trustedCertPath The full path of the .pem file that No


contains trusted CA certificates used
to verify the server when you connect
over TLS. This property can be set only
when you use TLS on Self-hosted
Integration Runtime. The default value
is the cacerts.pem file installed with the
integration runtime.

useSystemTrustStore Specifies whether to use a CA No


certificate from the system trust store
or from a specified PEM file. The
default value is false .

allowHostNameCNMismatch Specifies whether to require a CA- No


issued TLS/SSL certificate name to
match the host name of the server
when you connect over TLS. The
default value is false .

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false .

connectVia The integration runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "ImpalaLinkedService",
"properties": {
"type": "Impala",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"authenticationType" : "UsernameAndPassword",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Impala dataset.
To copy data from Impala, set the type property of the dataset to ImpalaObject . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: ImpalaObject

schema Name of the schema. No (if "query" in activity source is


specified)

table Name of the table. No (if "query" in activity source is


specified)

tableName Name of the table with schema. This No (if "query" in activity source is
property is supported for backward specified)
compatibility. Use schema and
table for new workload.
Example

{
"name": "ImpalaDataset",
"properties": {
"type": "ImpalaObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Impala linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Impala source type.
Impala as a source type
To copy data from Impala, set the source type in the copy activity to ImpalaSource . The following properties
are supported in the copy activity source section.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to ImpalaSource .

query Use the custom SQL query to read No (if "tableName" in dataset is
data. An example is specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromImpala",
"type": "Copy",
"inputs": [
{
"referenceName": "<Impala input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ImpalaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to IBM Informix using Azure
Data Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an IBM Informix data
store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Informix connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Informix source to any supported sink data store, or copy from any supported source
data store to Informix sink. For a list of data stores that are supported as sources/sinks by the copy activity, see
the Supported data stores table.

Prerequisites
To use this Informix connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the Informix ODBC driver for the data store on the Integration Runtime machine. For driver installation
and setup, refer Informix ODBC Driver Guide article in IBM Knowledge Center for details, or contact IBM
support team for driver installation guidance.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Informix connector.

Linked service properties


The following properties are supported for Informix linked service:
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Informix

connectionString The ODBC connection string excluding Yes


the credential portion. You can specify
the connection string or use the
system DSN (Data Source Name) you
set up on the Integration Runtime
machine (you still need to specify the
credential portion in the linked service
accordingly).
You can also put a password in Azure Key
Vault and pull the password configuration
out of the connection string. Refer to
Store credentials in Azure Key Vault for
more details (a sketch follows the example
below).

authenticationType Type of authentication used to connect Yes


to the Informix data store.
Allowed values are: Basic and
Anonymous .

userName Specify user name if you are using No


Basic authentication.

password Specify password for the user account No


you specified for the userName. Mark
this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

credential The access credential portion of the No


connection string specified in driver-
specific property-value format. Mark
this field as a SecureString.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:
{
"name": "InformixLinkedService",
"properties": {
"type": "Informix",
"typeProperties": {
"connectionString": "<Informix connection string or DSN>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
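
If you store the password in Azure Key Vault as described above, the credential portion of the linked service can reference the secret instead of embedding it. A minimal sketch, assuming a Key Vault linked service named AzureKeyVaultLinkedService and a secret named InformixPassword (both hypothetical names):

{
    "name": "InformixLinkedService",
    "properties": {
        "type": "Informix",
        "typeProperties": {
            "connectionString": "<Informix connection string or DSN>",
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "AzureKeyVaultSecretReference",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "InformixPassword"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}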

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Informix dataset.
To copy data from Informix, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: InformixTable

tableName Name of the table in the Informix. No for source (if "query" in activity
source is specified);
Yes for sink

Example

{
"name": "InformixDataset",
"properties": {
"type": "InformixTable",
"linkedServiceName": {
"referenceName": "<Informix linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Informix source.
Informix as source
To copy data from Informix, the following properties are supported in the copy activity source section:
PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
InformixSource

query Use the custom query to read data. No (if "tableName" in dataset is
For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromInformix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Informix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "InformixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Informix as sink
To copy data to Informix, the following properties are supported in the copy activity sink section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


sink must be set to: InformixSink

writeBatchTimeout Wait time for the batch insert No


operation to complete before it times
out.
Allowed values are: timespan. Example:
"00:30:00" (30 minutes).

writeBatchSize Inserts data into the SQL table when No (default is 0 - auto detected)
the buffer size reaches writeBatchSize.
Allowed values are: integer (number of
rows).

preCopyScript Specify a SQL query for Copy Activity No


to execute before writing data into
data store in each run. You can use this
property to clean up the pre-loaded
data.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data store
during the activity run.
Specify a value only when you want to
limit concurrent connections.

Example:

"activities":[
{
"name": "CopyToInformix",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Informix output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "InformixSink"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Jira using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Jira. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Jira connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Jira to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Jira connector.

Linked service properties


The following properties are supported for Jira linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Jira Yes

host The IP address or host name of the Jira Yes


service. (for example, jira.example.com)

port The TCP port that the Jira server uses No


to listen for client connections. The
default value is 443 if connecting
through HTTPS, or 8080 if connecting
through HTTP.

username The user name that you use to access Yes


Jira Service.

password The password corresponding to the Yes


user name that you provided in the
username field. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Example:

{
"name": "JiraLinkedService",
"properties": {
"type": "Jira",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Jira dataset.
To copy data from Jira, set the type property of the dataset to JiraObject . The following properties are
supported:
PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: JiraObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "JiraDataset",
"properties": {
"type": "JiraObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Jira linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Jira source.
JiraSource as source
To copy data from Jira, set the source type in the copy activity to JiraSource . The following properties are
supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: JiraSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromJira",
"type": "Copy",
"inputs": [
{
"referenceName": "<Jira input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "JiraSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
JSON format in Azure Data Factory
5/14/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse JSON files or write data in JSON format.
JSON format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure
Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the JSON dataset.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to Json .

location Location settings of the file(s). Each Yes


file-based connector has its own
location type and supported
properties under location . See
details in connector article ->
Dataset properties section.

encodingName The encoding type used to read/write No


text files.
Allowed values are as follows: "UTF-8",
"UTF-16", "UTF-16BE", "UTF-32", "UTF-
32BE", "US-ASCII", "UTF-7", "BIG5",
"EUC-JP", "EUC-KR", "GB2312",
"GB18030", "JOHAB", "SHIFT-JIS",
"CP875", "CP866", "IBM00858",
"IBM037", "IBM273", "IBM437",
"IBM500", "IBM737", "IBM775",
"IBM850", "IBM852", "IBM855",
"IBM857", "IBM860", "IBM861",
"IBM863", "IBM864", "IBM865",
"IBM869", "IBM870", "IBM01140",
"IBM01141", "IBM01142",
"IBM01143", "IBM01144",
"IBM01145", "IBM01146",
"IBM01147", "IBM01148",
"IBM01149", "ISO-2022-JP", "ISO-
2022-KR", "ISO-8859-1", "ISO-8859-
2", "ISO-8859-3", "ISO-8859-4", "ISO-
8859-5", "ISO-8859-6", "ISO-8859-7",
"ISO-8859-8", "ISO-8859-9", "ISO-
8859-13", "ISO-8859-15",
"WINDOWS-874", "WINDOWS-1250",
"WINDOWS-1251", "WINDOWS-
1252", "WINDOWS-1253",
"WINDOWS-1254", "WINDOWS-
1255", "WINDOWS-1256",
"WINDOWS-1257", "WINDOWS-
1258".

compression Group of properties to configure file No


compression. Configure this section
when you want to do
compression/decompression during
activity execution.

type The compression codec used to No.


(under compression ) read/write JSON files.
Allowed values are bzip2 , gzip ,
deflate , ZipDeflate , TarGzip , Tar ,
snappy , or lz4 . Default is not
compressed.
Note currently Copy activity doesn't
support "snappy" & "lz4", and
mapping data flow doesn't support
"ZipDeflate", "TarGzip" and "Tar".
Note when using copy activity to
decompress ZipDeflate /TarGzip /Tar
file(s) and write to file-based sink data
store, by default files are extracted to
the folder:
<path specified in
dataset>/<folder named as source
compressed file>/
, use preserveZipFileNameAsFolder /
preserveCompressionFileNameAsFolder
on copy activity source to control
whether to preserve the name of the
compressed file(s) as folder structure.

level The compression ratio. No


(under compression ) Allowed values are Optimal or
Fastest .
- Fastest: The compression operation
should complete as quickly as possible,
even if the resulting file is not
optimally compressed.
- Optimal: The compression operation
should be optimally compressed, even
if the operation takes a longer time to
complete. For more information, see
Compression Level topic.

Below is an example of JSON dataset on Azure Blob Storage:

{
"name": "JSONDataset",
"properties": {
"type": "Json",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "gzip"
}
}
}
}
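
The optional encodingName and compression level properties from the table above can be combined in the same dataset. This is an illustrative variant only; the container and folder names are placeholders.

{
    "name": "JSONDataset",
    "properties": {
        "type": "Json",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "encodingName": "UTF-8",
            "compression": {
                "type": "gzip",
                "level": "Optimal"
            }
        }
    }
}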

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the JSON source and sink.
Learn about how to extract data from JSON files and map to sink data store/format or vice versa from schema
mapping.
JSON as source
The following properties are supported in the copy activity source section.

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to JSONSource .

formatSettings A group of properties. Refer to JSON No


read settings table below.

storeSettings A group of properties on how to read No


data from a data store. Each file-based
connector has its own supported read
settings under storeSettings . See
details in connector article ->
Copy activity properties section.

Supported JSON read settings under formatSettings :

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type of formatSettings must be set Yes


to JsonReadSettings .

compressionProperties A group of properties on how to No


decompress data for a given
compression codec.

preserveZipFileNameAsFolder Applies when input dataset is No


(under compressionProperties -> configured with ZipDeflate
type as ZipDeflateReadSettings ) compression. Indicates whether to
preserve the source zip file name as
folder structure during copy.
- When set to true (default) , Data
Factory writes unzipped files to
<path specified in
dataset>/<folder named as source
zip file>/
.
- When set to false , Data Factory
writes unzipped files directly to
<path specified in dataset> .
Make sure you don't have duplicated
file names in different source zip files
to avoid racing or unexpected
behavior.

preserveCompressionFileNameAsFolder Applies when input dataset is No


configured with TarGzip /Tar
(under compressionProperties -> compression. Indicates whether to
type as TarGZipReadSettings or preserve the source compressed file
TarReadSettings ) name as folder structure during copy.
- When set to true (default) , Data
Factory writes decompressed files to
<path specified in
dataset>/<folder named as source
compressed file>/
.
- When set to false , Data Factory
writes decompressed files directly to
<path specified in dataset> .
Make sure you don't have duplicated
file names in different source files to
avoid racing or unexpected behavior.

JSON as sink
The following properties are supported in the copy activity sink section.
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to JSONSink .

formatSettings A group of properties. Refer to JSON No


write settings table below.

storeSettings A group of properties on how to write No


data to a data store. Each file-based
connector has its own supported write
settings under storeSettings . See
details in connector article ->
Copy activity properties section.

Supported JSON write settings under formatSettings :

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type of formatSettings must be set Yes


to JsonWriteSettings .

filePattern Indicate the pattern of data stored in No


each JSON file. Allowed values are:
setOfObjects (JSON Lines) and
arrayOfObjects . The default value is
setOfObjects . See JSON file patterns
section for details about these
patterns.
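
As a sketch of how these write settings fit together, a copy activity sink that writes one JSON array per file could look like the following. The storeSettings type is again an assumption for a Blob-based sink; take the actual value from the corresponding connector article.

"sink": {
    "type": "JSONSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}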

JSON file patterns


When copying data from JSON files, copy activity can automatically detect and parse the following patterns of
JSON files. When writing data to JSON files, you can configure the file pattern on copy activity sink.
Type I: setOfObjects
Each file contains a single object, JSON Lines, or concatenated objects.
single object JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}

JSON Lines (default for sink)


{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":
"567834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":
"789037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":
"345626404","switch1":"Germany","switch2":"UK"}

concatenated JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}

Type II: arrayOfObjects


Each file contains an array of objects.
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]

Mapping data flow properties


In mapping data flows, you can read and write to JSON format in the following data stores: Azure Blob Storage,
Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Source properties
The below table lists the properties supported by a JSON source. You can edit these properties in the Source
options tab.

Wild card paths: All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. (Required: no. Allowed values: String[]. Data flow script property: wildcardPaths)
Partition root path: For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. (Required: no. Allowed values: String. Data flow script property: partitionRootPath)
List of files: Whether your source is pointing to a text file that lists files to process. (Required: no. Allowed values: true or false. Data flow script property: fileList)
Column to store file name: Create a new column with the source file name and path. (Required: no. Allowed values: String. Data flow script property: rowUrlColumn)
After completion: Delete or move the files after processing. File path starts from the container root. (Required: no. Allowed values: Delete: true or false; Move: ['<from>', '<to>']. Data flow script properties: purgeFiles, moveFiles)
Filter by last modified: Choose to filter files based upon when they were last altered. (Required: no. Allowed values: Timestamp. Data flow script properties: modifiedAfter, modifiedBefore)
Single document: Mapping data flows read one JSON document from each file. (Required: no. Allowed values: true or false. Data flow script property: singleDocument)
Unquoted column names: If Unquoted column names is selected, mapping data flows reads JSON columns that aren't surrounded by quotes. (Required: no. Allowed values: true or false. Data flow script property: unquotedColumnNames)
Has comments: Select Has comments if the JSON data has C or C++ style commenting. (Required: no. Allowed values: true or false. Data flow script property: asComments)
Single quoted: Reads JSON fields and values that use single quotes instead of double quotes. (Required: no. Allowed values: true or false. Data flow script property: singleQuoted)
Backslash escaped: Select Backslash escaped if backslashes are used to escape characters in the JSON data. (Required: no. Allowed values: true or false. Data flow script property: backslashEscape)
Allow no files found: If true, an error is not thrown if no files are found. (Required: no. Allowed values: true or false. Data flow script property: ignoreNoFilesFound)

Source format options


Using a JSON dataset as a source in your data flow allows you to set five additional settings. These settings can
be found under the JSON settings accordion in the Source Options tab. For Document Form setting, you
can select one of Single document , Document per line and Array of documents types.
Default
By default, JSON data is read in the following format.

{ "json": "record 1" }


{ "json": "record 2" }
{ "json": "record 3" }

Single document
If Single document is selected, mapping data flows read one JSON document from each file.

File1.json
{
"json": "record 1"
}
File2.json
{
"json": "record 2"
}
File3.json
{
"json": "record 3"
}

If Document per line is selected, mapping data flows read one JSON document from each line in a file.
File1.json
{"json": "record 1" }

File2.json
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","s
witch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","s
witch1":"US","switch2":"UK"}

File3.json
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","s
witch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","s
witch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","s
witch1":"Germany","switch2":"UK"}

If Array of documents is selected, mapping data flows read one array of documents from a file.

File.json
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]

NOTE
If data flows throw an error stating "corrupt_record" when previewing your JSON data, it is likely that your data
contains a single document in your JSON file. Setting "single document" should clear that error.

Unquoted column names


If Unquoted column names is selected, mapping data flows reads JSON columns that aren't surrounded by
quotes.
{ json: "record 1" }
{ json: "record 2" }
{ json: "record 3" }

Has comments
Select Has comments if the JSON data has C or C++ style commenting.

{ "json": /** comment **/ "record 1" }


{ "json": "record 2" }
{ /** comment **/ "json": "record 3" }

Single quoted
Select Single quoted if the JSON fields and values use single quotes instead of double quotes.

{ 'json': 'record 1' }


{ 'json': 'record 2' }
{ 'json': 'record 3' }

Backslash escaped
Select Backslash escaped if backslashes are used to escape characters in the JSON data.

{ "json": "record 1" }


{ "json": "\} \" \' \\ \n \\n record 2" }
{ "json": "record 3" }

Sink Properties
The below table lists the properties supported by a JSON sink. You can edit these properties in the Settings tab.

Clear the folder: Whether the destination folder is cleared prior to the write. (Required: no. Allowed values: true or false. Data flow script property: truncate)
File name option: The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid>. (Required: no. Allowed values and data flow script properties: Pattern: String (filePattern); Per partition: String[] (partitionFileNames); As data in column: String (rowUrlColumn); Output to single file: ['<fileName>'] (partitionFileNames))

Creating JSON structures in a derived column


You can add a complex column to your data flow via the derived column expression builder. In the derived
column transformation, add a new column and open the expression builder by clicking on the blue box. To make
a column complex, you can enter the JSON structure manually or use the UX to add subcolumns interactively.
Using the expression builder UX
In the output schema side pane, hover over a column and click the plus icon. Select Add subcolumn to make
the column a complex type.
You can add additional columns and subcolumns in the same way. For each non-complex field, an expression
can be added in the expression editor to the right.

Entering the JSON structure manually


To manually add a JSON structure, add a new column and enter the expression in the editor. The expression
follows the following general format:

@(
field1=0,
field2=@(
field1=0
)
)

If this expression were entered for a column named "complexColumn", then it would be written to the sink as
the following JSON:

{
"complexColumn": {
"field1": 0,
"field2": {
"field1": 0
}
}
}

Sample manual script for complete hierarchical definition


@(
title=Title,
firstName=FirstName,
middleName=MiddleName,
lastName=LastName,
suffix=Suffix,
contactDetails=@(
email=EmailAddress,
phone=Phone
),
address=@(
line1=AddressLine1,
line2=AddressLine2,
city=City,
state=StateProvince,
country=CountryRegion,
postCode=PostalCode
),
ids=[
toString(CustomerID), toString(AddressID), rowguid
]
)
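
Assuming this derived column is mapped to a JSON sink, its value would take roughly the following shape; the values shown are placeholders standing in for the source columns referenced in the script above.

{
    "title": "<Title>",
    "firstName": "<FirstName>",
    "middleName": "<MiddleName>",
    "lastName": "<LastName>",
    "suffix": "<Suffix>",
    "contactDetails": {
        "email": "<EmailAddress>",
        "phone": "<Phone>"
    },
    "address": {
        "line1": "<AddressLine1>",
        "line2": "<AddressLine2>",
        "city": "<City>",
        "state": "<StateProvince>",
        "country": "<CountryRegion>",
        "postCode": "<PostalCode>"
    },
    "ids": [ "<CustomerID>", "<AddressID>", "<rowguid>" ]
}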

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Magento using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Magento. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Magento connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Magento to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Magento connector.

Linked service properties


The following properties are supported for Magento linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED



type The type property must be set to: Yes


Magento

host The URL of the Magento instance. Yes


(that is, 192.168.222.110/magento3)

accessToken The access token from Magento. Mark Yes


this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Example:

{
"name": "MagentoLinkedService",
"properties": {
"type": "Magento",
"typeProperties": {
"host" : "192.168.222.110/magento3",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Magento dataset.
To copy data from Magento, set the type property of the dataset to MagentoObject . The following properties
are supported:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to: MagentoObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "MagentoDataset",
"properties": {
"type": "MagentoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Magento linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Magento source.
Magento as source
To copy data from Magento, set the source type in the copy activity to MagentoSource . The following
properties are supported in the copy activity source section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MagentoSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Customers" .

Example:
"activities":[
{
"name": "CopyFromMagento",
"type": "Copy",
"inputs": [
{
"referenceName": "<Magento input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MagentoSource",
"query": "SELECT * FROM Customers where Id > XXX"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MariaDB using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from MariaDB. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This MariaDB connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from MariaDB to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
This connector currently supports MariaDB of version 10.0 to 10.2.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MariaDB connector.
Linked service properties
The following properties are supported for MariaDB linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to: Yes


MariaDB

connectionString An ODBC connection string to connect Yes


to MariaDB.
You can also put password in Azure
Key Vault and pull the pwd
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key
Vault article with more details.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault


{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MariaDB dataset.
To copy data from MariaDB, set the type property of the dataset to MariaDBTable . There is no additional type-
specific property in this type of dataset.
Example

{
"name": "MariaDBDataset",
"properties": {
"type": "MariaDBTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<MariaDB linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MariaDB source.
MariaDB as source
To copy data from MariaDB, set the source type in the copy activity to MariaDBSource . The following
properties are supported in the copy activity source section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MariaDBSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Marketo using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Marketo. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Marketo connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Marketo to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Currently, Marketo instances that are integrated with an external CRM are not supported.

NOTE
This Marketo connector is built on top of the Marketo REST API. Be aware that Marketo has a concurrent request limit
on the service side. If you hit errors saying "Error while attempting to use REST API: Max rate limit '100' exceeded with in '20'
secs (606)" or "Error while attempting to use REST API: Concurrent access limit '10' reached (615)", consider reducing the
number of concurrent copy activity runs to reduce the number of requests to the service.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Marketo connector.

Linked service properties


The following properties are supported for Marketo linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to: Yes


Marketo

endpoint The endpoint of the Marketo server. Yes


(i.e. 123-ABC-321.mktorest.com)

clientId The client Id of your Marketo service. Yes

clientSecret The client secret of your Marketo Yes


service. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Example:

{
"name": "MarketoLinkedService",
"properties": {
"type": "Marketo",
"typeProperties": {
"endpoint" : "123-ABC-321.mktorest.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Marketo dataset.
To copy data from Marketo, set the type property of the dataset to MarketoObject . The following properties
are supported:
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to: MarketoObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "MarketoDataset",
"properties": {
"type": "MarketoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Marketo linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Marketo source.
Marketo as source
To copy data from Marketo, set the source type in the copy activity to MarketoSource . The following properties
are supported in the copy activity source section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MarketoSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Activitiy_Types" .

Example:
"activities":[
{
"name": "CopyFromMarketo",
"type": "Copy",
"inputs": [
{
"referenceName": "<Marketo input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MarketoSource",
"query": "SELECT top 1000 * FROM Activitiy_Types"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Microsoft Access using
Azure Data Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Microsoft Access
data store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Microsoft Access connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Microsoft Access source to any supported sink data store, or copy from any supported
source data store to Microsoft Access sink. For a list of data stores that are supported as sources/sinks by the
copy activity, see the Supported data stores table.

Prerequisites
To use this Microsoft Access connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the Microsoft Access ODBC driver for the data store on the Integration Runtime machine.

NOTE
Microsoft Access 2016 version of ODBC driver doesn't work with this connector. Use driver version 2013 or 2010 instead.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Microsoft Access connector.

Linked service properties


The following properties are supported for Microsoft Access linked service:
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to: Yes


MicrosoftAccess

connectionString The ODBC connection string excluding Yes


the credential portion. You can specify
the connection string or use the
system DSN (Data Source Name) you
set up on the Integration Runtime
machine (you need still specify the
credential portion in linked service
accordingly).
You can also put a password in Azure
Key Vault and pull the password
configuration out of the connection
string. Refer to Store credentials in Azure
Key Vault with more details.

authenticationType Type of authentication used to connect Yes


to the Microsoft Access data store.
Allowed values are: Basic and
Anonymous .

userName Specify user name if you are using No


Basic authentication.

password Specify password for the user account No


you specified for the userName. Mark
this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

credential The access credential portion of the No


connection string specified in driver-
specific property-value format. Mark
this field as a SecureString.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:
{
"name": "MicrosoftAccessLinkedService",
"properties": {
"type": "MicrosoftAccess",
"typeProperties": {
"connectionString": "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=<path to your DB file
e.g. C:\\mydatabase.accdb>;",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
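
If you prefer the system DSN option mentioned in the connectionString description, the connection string can reference the DSN configured on the self-hosted integration runtime machine instead of the full driver definition. A minimal sketch, assuming a DSN name of your choosing:

{
    "name": "MicrosoftAccessLinkedService",
    "properties": {
        "type": "MicrosoftAccess",
        "typeProperties": {
            "connectionString": "DSN=<name of DSN on the Integration Runtime machine>;",
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}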

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Microsoft Access dataset.
To copy data from Microsoft Access, the following properties are supported:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to: MicrosoftAccessTable

tableName Name of the table in the Microsoft No for source (if "query" in activity
Access. source is specified);
Yes for sink

Example

{
"name": "MicrosoftAccessDataset",
"properties": {
"type": "MicrosoftAccessTable",
"linkedServiceName": {
"referenceName": "<Microsoft Access linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Microsoft Access source.
Microsoft Access as source
To copy data from Microsoft Access, the following properties are supported in the copy activity source section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MicrosoftAccessSource

query Use the custom query to read data. No (if "tableName" in dataset is
For example: specified)
"SELECT * FROM MyTable" .

Example:

"activities":[
{
"name": "CopyFromMicrosoftAccess",
"type": "Copy",
"inputs": [
{
"referenceName": "<Microsoft Access input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MicrosoftAccessSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Microsoft Access as sink


To copy data to Microsoft Access, the following properties are supported in the copy activity sink section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


sink must be set to:
MicrosoftAccessSink

writeBatchTimeout Wait time for the batch insert No


operation to complete before it times
out.
Allowed values are: timespan. Example:
“00:30:00” (30 minutes).

writeBatchSize Inserts data into the SQL table when No (default is 0 - auto detected)
the buffer size reaches writeBatchSize.
Allowed values are: integer (number of
rows).

preCopyScript Specify a SQL query for Copy Activity No


to execute before writing data into
data store in each run. You can use this
property to clean up the pre-loaded
data.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:

"activities":[
{
"name": "CopyToMicrosoftAccess",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Microsoft Access output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MicrosoftAccessSink"
}
}
}
]
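
The optional sink properties described in the table above can be added under the sink definition. The following is an illustrative sketch only; the preCopyScript statement and batch values are assumptions, not recommended settings.

"sink": {
    "type": "MicrosoftAccessSink",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:30:00",
    "preCopyScript": "DELETE FROM MyTable"
}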

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to MongoDB by using Azure
Data Factory
6/8/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to a MongoDB
database. It builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
ADF has released this new version of the MongoDB connector, which provides better native MongoDB support. If you are using
the previous MongoDB connector in your solution, it is supported as-is for backward compatibility; refer to the MongoDB
connector (legacy) article.

Supported capabilities
You can copy data from MongoDB database to any supported sink data store, or copy data from any supported
source data store to MongoDB database. For a list of data stores that are supported as sources/sinks by the copy
activity, see the Supported data stores table.
Specifically, this MongoDB connector supports versions up to 4.2 .

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.

Linked service properties


The following properties are supported for MongoDB linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to: Yes


MongoDbV2

connectionString Specify the MongoDB connection Yes


string e.g.
mongodb://[username:password@]host[:port]
[/[database][?options]]
. Refer to MongoDB manual on
connection string for more details.

You can also put a connection string in


Azure Key Vault. Refer to Store
credentials in Azure Key Vault with
more details.

database Name of the database that you want Yes


to access.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDbV2",
"typeProperties": {
"connectionString": "mongodb://[username:password@]host[:port][/[database][?options]]",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
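
To keep the whole connection string out of the linked service definition, it can be referenced from Azure Key Vault as mentioned above. A hedged sketch of that variant, assuming an existing Azure Key Vault linked service:

{
    "name": "MongoDBLinkedService",
    "properties": {
        "type": "MongoDbV2",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            },
            "database": "myDatabase"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}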

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:

P RO P ERT Y DESC RIP T IO N REQ UIRED



type The type property of the dataset must Yes


be set to: MongoDbV2Collection

collectionName Name of the collection in MongoDB Yes


database.

Example:

{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbV2Collection",
"typeProperties": {
"collectionName": "<Collection name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MongoDB source and sink.
MongoDB as source
The following properties are supported in the copy activity source section:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MongoDbV2Source

filter Specifies selection filter using query No


operators. To return all documents in a
collection, omit this parameter or pass
an empty document ({}).

cursorMethods.project Specifies the fields to return in the No


documents for projection. To return all
fields in the matching documents, omit
this parameter.

cursorMethods.sort Specifies the order in which the query No


returns matching documents. Refer to
cursor.sort().

cursorMethods.limit Specifies the maximum number of No


documents the server returns. Refer to
cursor.limit().

cursorMethods.skip Specifies the number of documents to No


skip and from where MongoDB begins
to return results. Refer to cursor.skip().

batchSize Specifies the number of documents to No


return in each batch of the response (the default is 100 )
from the MongoDB instance. In most
cases, modifying the batch size will not
affect the user or the application.
Cosmos DB limits each batch to no
more than 40 MB in size, which is the
sum of the sizes of the batchSize
documents, so decrease this value
if your documents are large.

TIP
ADF supports consuming BSON documents in Strict mode . Make sure your filter query is in Strict mode instead of Shell
mode. More details can be found in the MongoDB manual.

Example:

"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbV2Source",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

MongoDB as sink
The following properties are supported in the Copy Activity sink section:
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the Copy Yes


Activity sink must be set to
MongoDbV2Sink .

writeBehavior Describes how to write data to No


MongoDB. Allowed values: insert and (the default is insert)
upsert.

The behavior of upsert is to replace


the document if a document with the
same _id already exists; otherwise,
insert the document.

Note: Data Factory automatically


generates an _id for a document if
an _id isn't specified either in the
original document or by column
mapping. This means that you must
ensure that, for upsert to work as
expected, your document has an ID.

writeBatchSize The writeBatchSize property controls No


the size of documents to write in each (the default is 10,000 )
batch. You can try increasing the value
for writeBatchSize to improve
performance, or decreasing the value
if your documents are large.

writeBatchTimeout The wait time for the batch insert No


operation to finish before it times out. (the default is 00:30:00 - 30 minutes)
The allowed value is timespan.

TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.

Example
"activities":[
{
"name": "CopyToMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MongoDbV2Sink",
"writeBehavior": "upsert"
}
}
}
]

Import and export JSON documents


You can use this MongoDB connector to easily:
Copy documents between two MongoDB collections as-is.
Import JSON documents from various sources to MongoDB, including from Azure Cosmos DB, Azure Blob
storage, Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from a MongoDB collection to various file-based stores.
To achieve such schema-agnostic copy, skip the "structure" (also called schema) section in dataset and schema
mapping in copy activity.
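
For example, a schema-agnostic copy between two MongoDB collections only needs the source and sink types; the dataset names below are placeholders, and no schema or column mapping is defined, per the guidance above.

"activities":[
    {
        "name": "CopyBetweenMongoDbCollections",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<MongoDB source dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<MongoDB sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "MongoDbV2Source"
            },
            "sink": {
                "type": "MongoDbV2Sink"
            }
        }
    }
]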

Schema mapping
To copy data from MongoDB to tabular sink or reversed, refer to schema mapping.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data
Factory (legacy)
5/6/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database.
It builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
ADF has released a new MongoDB connector that provides better native MongoDB support compared to this ODBC-based
implementation; refer to the MongoDB connector article for details. This legacy MongoDB connector is kept supported as-is
for backward compatibility, but for any new workload, please use the new connector.

Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports:
MongoDB versions 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6 .
Copying data using Basic or Anonymous authentication.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in MongoDB driver, therefore you don't need to manually install any
driver when copying data from MongoDB.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.

Linked service properties


The following properties are supported for MongoDB linked service:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property must be set to: Yes


MongoDb

server IP address or host name of the Yes


MongoDB server.

port TCP port that the MongoDB server No (default is 27017)


uses to listen for client connections.

databaseName Name of the MongoDB database that Yes


you want to access.

authenticationType Type of authentication used to connect Yes


to the MongoDB database.
Allowed values are: Basic, and
Anonymous .

username User account to access MongoDB. Yes (if basic authentication is used).

password Password for the user. Mark this field Yes (if basic authentication is used).
as a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

authSource Name of the MongoDB database that No. For basic authentication, default is
you want to use to check your to use the admin account and the
credentials for authentication. database specified using
databaseName property.

enableSsl Specifies whether the connections to No


the server are encrypted using TLS.
The default value is false.

allowSelfSignedServerCert Specifies whether to allow self-signed No


certificates from the server. The default
value is false.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDb",
"typeProperties": {
"server": "<server name>",
"databaseName": "<database name>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:

P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the dataset must Yes


be set to: MongoDbCollection

collectionName Name of the collection in MongoDB Yes


database.

Example:

{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MongoDB source.
MongoDB as source
The following properties are supported in the copy activity source section:
P RO P ERT Y DESC RIP T IO N REQ UIRED

type The type property of the copy activity Yes


source must be set to:
MongoDbSource

query Use the custom SQL-92 query to read No (if "collectionName" in dataset is
data. For example: select * from specified)
MyTable.

Example:

"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

TIP
When specifying the SQL query, pay attention to the DateTime format. For example:
SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02' or to use a
parameter:
SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-
MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd
HH:mm:ss')}'

Schema by Data Factory


The Azure Data Factory service infers the schema of a MongoDB collection by using the latest 100 documents in
the collection. If these 100 documents do not contain the full schema, some columns may be ignored during the
copy operation.

Data type mapping for MongoDB


When copying data from MongoDB, the following mappings are used from MongoDB data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

MONGODB DATA TYPE    DATA FACTORY INTERIM DATA TYPE

Binary Byte[]

Boolean Boolean

Date DateTime

NumberDouble Double

NumberInt Int32

NumberLong Int64

ObjectID String

String String

UUID Guid

Object Re-normalized into flattened columns with "_" as the nested


separator

NOTE
To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section.
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression,
Symbol, Timestamp, Undefined.

Support for complex types using virtual tables


Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your MongoDB database. For
complex types such as arrays or objects with different types across the documents, the driver re-normalizes data
into corresponding virtual tables. Specifically, if a table contains such columns, the driver generates the
following virtual tables:
A base table, which contains the same data as the real table except for the complex type columns. The base
table uses the same name as the real table that it represents.
A virtual table for each complex type column, which expands the nested data. The virtual tables are named
using the name of the real table, a separator "_", and the name of the array or object.
Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. You can
access the content of MongoDB arrays by querying and joining the virtual tables.
Example
For example, ExampleTable here is a MongoDB table that has one column with an array of Objects in each cell –
Invoices, and one column with an array of Scalar types – Ratings.
_ID: 1111 | CUSTOMERNAME: ABC | INVOICES: [{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price:"1235", discount:"0.2"}] | SERVICELEVEL: Silver | RATINGS: [5,6]

_ID: 2222 | CUSTOMERNAME: XYZ | INVOICES: [{invoice_id:"135", item:"fridge", price:"12543", discount:"0.0"}] | SERVICELEVEL: Gold | RATINGS: [1,2]

The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base
table named “ExampleTable", shown in the example. The base table contains all the data of the original table, but
the data from the arrays has been omitted and is expanded in the virtual tables.

_ID    CUSTOMERNAME    SERVICELEVEL

1111 ABC Silver

2222 XYZ Gold

The following tables show the virtual tables that represent the original arrays in the example. These tables
contain the following:
A reference back to the original primary key column corresponding to the row of the original array (via the
_id column)
An indication of the position of the data within the original array
The expanded data for each element within the array
Table “ExampleTable_Invoices":

_ID    EXAMPLETABLE_INVOICES_DIM1_IDX    INVOICE_ID    ITEM    PRICE    DISCOUNT

1111 0 123 toaster 456 0.2

1111 1 124 oven 1235 0.2

2222 0 135 fridge 12543 0.0

Table “ExampleTable_Ratings":

_ID    EXAMPLETABLE_RATINGS_DIM1_IDX    EXAMPLETABLE_RATINGS

1111 0 5

1111 1 6

2222 0 1
_ID EXA M P L ETA B L E_RAT IN GS_DIM 1_IDX EXA M P L ETA B L E_RAT IN GS

2222 1 2

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to MongoDB Atlas using Azure
Data Factory
6/8/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to a MongoDB
Atlas database. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
You can copy data from a MongoDB Atlas database to any supported sink data store, or copy data from any
supported source data store to a MongoDB Atlas database. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB Atlas connector supports versions up to 4.2.

Prerequisites
If you use Azure Integration Runtime for copy, make sure you add the effective region's Azure Integration
Runtime IPs to the MongoDB Atlas IP Access List.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB Atlas connector.

Linked service properties


The following properties are supported for MongoDB Atlas linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: MongoDbAtlas | Yes
connectionString | Specify the MongoDB Atlas connection string, e.g. mongodb+srv://<username>:<password>@<clustername>.<randomString>.<hostName>/<dbname>?<otherProperties>. You can also put a connection string in Azure Key Vault. Refer to Store credentials in Azure Key Vault for more details. | Yes
database | Name of the database that you want to access. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "MongoDbAtlasLinkedService",
"properties": {
"type": "MongoDbAtlas",
"typeProperties": {
"connectionString": "mongodb+srv://<username>:<password>@<clustername>.<randomString>.
<hostName>/<dbname>?<otherProperties>",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
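
If you keep the whole connection string in Azure Key Vault (as the table above allows), the linked service can reference the secret instead of embedding the value. The following is a minimal sketch of that variant; the Key Vault linked service name and secret name are placeholders:

{
    "name": "MongoDbAtlasLinkedService",
    "properties": {
        "type": "MongoDbAtlas",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret that holds the connection string>"
            },
            "database": "myDatabase"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}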

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB Atlas dataset:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: MongoDbAtlasCollection | Yes
collectionName | Name of the collection in MongoDB Atlas database. | Yes

Example:
{
"name": "MongoDbAtlasDataset",
"properties": {
"type": "MongoDbAtlasCollection",
"typeProperties": {
"collectionName": "<Collection name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<MongoDB Atlas linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MongoDB Atlas source and sink.
MongoDB Atlas as source
The following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: MongoDbAtlasSource | Yes
filter | Specifies selection filter using query operators. To return all documents in a collection, omit this parameter or pass an empty document ({}). | No
cursorMethods.project | Specifies the fields to return in the documents for projection. To return all fields in the matching documents, omit this parameter. | No
cursorMethods.sort | Specifies the order in which the query returns matching documents. Refer to cursor.sort(). | No
cursorMethods.limit | Specifies the maximum number of documents the server returns. Refer to cursor.limit(). | No
cursorMethods.skip | Specifies the number of documents to skip and from where MongoDB Atlas begins to return results. Refer to cursor.skip(). | No
batchSize | Specifies the number of documents to return in each batch of the response from the MongoDB Atlas instance. In most cases, modifying the batch size will not affect the user or the application. Each batch cannot exceed 40 MB in size (the sum of the sizes of the batchSize documents), so decrease this value if your documents are large. | No (the default is 100)

TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell
mode. More details can be found in the MongoDB manual.

Example:

"activities":[
{
"name": "CopyFromMongoDbAtlas",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB Atlas input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbAtlasSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
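
If your documents are large, you can lower the batch size described above. The following is a minimal sketch of just the source section; the value 50 is illustrative:

"source": {
    "type": "MongoDbAtlasSource",
    "batchSize": 50
}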

MongoDB Atlas as sink


The following properties are supported in the Copy Activity sink section:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity sink must be set to MongoDbAtlasSink. | Yes
writeBehavior | Describes how to write data to MongoDB Atlas. Allowed values: insert and upsert. The behavior of upsert is to replace the document if a document with the same _id already exists; otherwise, insert the document. Note: Data Factory automatically generates an _id for a document if an _id isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID. | No (the default is insert)
writeBatchSize | The writeBatchSize property controls the size of documents to write in each batch. You can try increasing the value for writeBatchSize to improve performance, and decreasing the value if your documents are large. | No (the default is 10,000)
writeBatchTimeout | The wait time for the batch insert operation to finish before it times out. The allowed value is timespan. | No (the default is 00:30:00 - 30 minutes)

TIP
To import JSON documents as-is, refer to the Import and export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.

Example
"activities":[
{
"name": "CopyToMongoDBAtlas",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MongoDbAtlasSink",
"writeBehavior": "upsert"
}
}
}
]

Import and Export JSON documents


You can use this MongoDB Atlas connector to easily:
Copy documents between two MongoDB Atlas collections as-is.
Import JSON documents from various sources to MongoDB Atlas, including from Azure Cosmos DB, Azure
Blob storage, Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
Export JSON documents from a MongoDB Atlas collection to various file-based stores.
To achieve such schema-agnostic copy, skip the "structure" (also called schema) section in dataset and schema
mapping in copy activity.

Schema mapping
To copy data from MongoDB Atlas to tabular sink or reversed, refer to schema mapping.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MySQL using Azure Data Factory
5/6/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MySQL database. It
builds on the copy activity overview article that presents a general overview of copy activity.

NOTE
To copy data from or to Azure Database for MySQL service, use the specialized Azure Database for MySQL connector.

Supported capabilities
This MySQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from a MySQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MySQL connector supports MySQL versions 5.6, 5.7, and 8.0.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in MySQL driver starting from version 3.7, so you don't need to
manually install any driver.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MySQL connector.

Linked service properties


The following properties are supported for MySQL linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: MySql | Yes
connectionString | Specify the information needed to connect to the MySQL instance. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

A typical connection string is Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password> .


More properties you can set per your case:

PROPERTY | DESCRIPTION | OPTIONS | REQUIRED
SSLMode | This option specifies whether the driver uses TLS encryption and verification when connecting to MySQL. E.g., SSLMode=<0/1/2/3/4>. | DISABLED (0) / PREFERRED (1) (Default) / REQUIRED (2) / VERIFY_CA (3) / VERIFY_IDENTITY (4) | No
SSLCert | The full path and name of a .pem file containing the SSL certificate used for proving the identity of the client. To specify a private key for encrypting this certificate before sending it to the server, use the SSLKey property. | - | Yes, if using two-way SSL verification.
SSLKey | The full path and name of a file containing the private key used for encrypting the client-side certificate during two-way SSL verification. | - | Yes, if using two-way SSL verification.
UseSystemTrustStore | This option specifies whether to use a CA certificate from the system trust store, or from a specified PEM file. E.g. UseSystemTrustStore=<0/1>; | Enabled (1) / Disabled (0) (Default) | No

Example:

{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
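
For example, to require TLS encryption (SSLMode=2) and take the CA certificate from the system trust store (UseSystemTrustStore=1), you can append the options from the table above to the connection string. This is a sketch with placeholder values; confirm the exact property names and values your driver version accepts:

{
    "name": "MySQLLinkedService",
    "properties": {
        "type": "MySql",
        "typeProperties": {
            "connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password>;SSLMode=2;UseSystemTrustStore=1"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}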

Example: store password in Azure Key Vault

{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using a MySQL linked service with the following payload, it is still supported as-is, but we suggest that
you use the new one going forward.
Previous payload:
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MySQL dataset.
To copy data from MySQL, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: MySqlTable | Yes
tableName | Name of the table in the MySQL database. | No (if "query" in activity source is specified)

Example

{
"name": "MySQLDataset",
"properties":
{
"type": "MySqlTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<MySQL linked service name>",
"type": "LinkedServiceReference"
}
}
}
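
If you want the copy to read an entire table instead of supplying a query in the activity source, set tableName in the dataset. A minimal sketch; the table name is a placeholder:

{
    "name": "MySQLDataset",
    "properties":
    {
        "type": "MySqlTable",
        "typeProperties": {
            "tableName": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<MySQL linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}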

If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by MySQL source.
MySQL as source
To copy data from MySQL, the following properties are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: MySqlSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MySqlSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.

Data type mapping for MySQL


When copying data from MySQL, the following mappings are used from MySQL data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

MYSQL DATA TYPE    DATA FACTORY INTERIM DATA TYPE

bigint Int64

bigint unsigned Decimal

bit(1) Boolean

bit(M), M>1 Byte[]

blob Byte[]

bool Int16

char String

date Datetime

datetime Datetime

decimal Decimal, String

double Double

double precision Double

enum String

float Single

int Int32

int unsigned Int64

integer Int32

integer unsigned Int64

long varbinary Byte[]

long varchar String

longblob Byte[]

longtext String

mediumblob Byte[]

mediumint Int32

mediumint unsigned Int64

mediumtext String

numeric Decimal

real Double

set String

smallint Int16

smallint unsigned Int32

text String

time TimeSpan

timestamp Datetime

tinyblob Byte[]

tinyint Int16

tinyint unsigned Int16

tinytext String

varchar String

year Int

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Netezza by using Azure Data
Factory
5/6/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from Netezza. The article builds
on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

TIP
For a data migration scenario from Netezza to Azure, learn more from Use Azure Data Factory to migrate data from an
on-premises Netezza server to Azure.

Supported capabilities
This Netezza connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Netezza to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Netezza connector supports parallel copying from source. See the Parallel copy from Netezza section for details.
Azure Data Factory provides a built-in driver to enable connectivity. You don't need to manually install any driver
to use this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
You can create a pipeline that uses a copy activity by using the .NET SDK, the Python SDK, Azure PowerShell, the
REST API, or an Azure Resource Manager template. See the Copy Activity tutorial for step-by-step instructions
on how to create a pipeline that has a copy activity.
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the Netezza connector.
Linked service properties
The following properties are supported for the Netezza linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Netezza. | Yes
connectionString | An ODBC connection string to connect to Netezza. You can also put the password in Azure Key Vault and pull the pwd configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article for more details. | Yes
connectVia | The Integration Runtime to use to connect to the data store. Learn more from the Prerequisites section. If not specified, the default Azure Integration Runtime is used. | No

A typical connection string is Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password> .


The following table describes more properties that you can set:

PROPERTY | DESCRIPTION | REQUIRED
SecurityLevel | The level of security that the driver uses for the connection to the data store. The driver supports SSL connections with one-way authentication using SSL version 3. Example: SecurityLevel=preferredSecured. Supported values are: Only unsecured (onlyUnSecured): the driver doesn't use SSL; Preferred unsecured (preferredUnSecured) (default): if the server provides a choice, the driver doesn't use SSL; Preferred secured (preferredSecured): if the server provides a choice, the driver uses SSL; Only secured (onlySecured): the driver doesn't connect unless an SSL connection is available. | No
CaCertFile | The full path to the SSL certificate that's used by the server. Example: CaCertFile=<cert path>; | Yes, if SSL is enabled

Example
{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
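
For example, to accept only SSL connections and validate the server certificate, you can append the properties described above to the connection string. This is a sketch; the certificate path is a placeholder and the accepted values depend on your driver version:

{
    "name": "NetezzaLinkedService",
    "properties": {
        "type": "Netezza",
        "typeProperties": {
            "connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=<password>;SecurityLevel=onlySecured;CaCertFile=<cert path>;"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}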

Example: store password in Azure Key Vault

{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the Netezza dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets.
To copy data from Netezza, set the type property of the dataset to NetezzaTable . The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: NetezzaTable | Yes
schema | Name of the schema. | No (if "query" in activity source is specified)
table | Name of the table. | No (if "query" in activity source is specified)
tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. | No (if "query" in activity source is specified)

Example

{
"name": "NetezzaDataset",
"properties": {
"type": "NetezzaTable",
"linkedServiceName": {
"referenceName": "<Netezza linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
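
If you point the copy at a specific table instead of supplying a query in the activity source, the typeProperties can name the schema and table. A minimal sketch with placeholder names:

{
    "name": "NetezzaDataset",
    "properties": {
        "type": "NetezzaTable",
        "linkedServiceName": {
            "referenceName": "<Netezza linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        }
    }
}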

Copy Activity properties


This section provides a list of properties that the Netezza source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
Netezza as source

TIP
To load data from Netezza efficiently by using data partitioning, learn more from Parallel copy from Netezza section.

To copy data from Netezza, set the source type in Copy Activity to NetezzaSource . The following properties
are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity source must be set to NetezzaSource. | Yes
query | Use the custom SQL query to read data. Example: "SELECT * FROM MyTable" | No (if "tableName" in dataset is specified)
partitionOptions | Specifies the data partitioning options used to load data from Netezza. Allowed values are: None (default), DataSlice, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from a Netezza database is controlled by the parallelCopies setting on the copy activity. | No
partitionSettings | Specify the group of the settings for data partitioning. Apply when the partition option isn't None. | No
partitionColumnName | Specify the name of the source column in integer type that will be used by range partitioning for parallel copy. If not specified, the primary key of the table is autodetected and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionColumnName in the WHERE clause. See the example in the Parallel copy from Netezza section. | No
partitionUpperBound | The maximum value of the partition column to copy data out. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionUpbound in the WHERE clause. For an example, see the Parallel copy from Netezza section. | No
partitionLowerBound | The minimum value of the partition column to copy data out. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionLowbound in the WHERE clause. For an example, see the Parallel copy from Netezza section. | No

Example:
"activities":[
{
"name": "CopyFromNetezza",
"type": "Copy",
"inputs": [
{
"referenceName": "<Netezza input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "NetezzaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Parallel copy from Netezza


The Data Factory Netezza connector provides built-in data partitioning to copy data from Netezza in parallel. You
can find data partitioning options on the Source table of the copy activity.

When you enable partitioned copy, Data Factory runs parallel queries against your Netezza source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Netezza database.
We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data
from your Netezza database. The following are suggested configurations for different scenarios. When copying
data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name),
in which case the performance is better than writing to a single file.

Scenario: Full load from a large table.
Suggested settings: Partition option: Data Slice. During execution, Data Factory automatically partitions the data based on Netezza's built-in data slices, and copies data by partitions.

Scenario: Load a large amount of data by using a custom query.
Suggested settings: Partition option: Data Slice. Query: SELECT * FROM <TABLENAME> WHERE mod(datasliceid, ?AdfPartitionCount) = ?AdfDataSliceCondition AND <your_additional_where_clause>. During execution, Data Factory replaces ?AdfPartitionCount (with the parallel copy number set on the copy activity) and ?AdfDataSliceCondition with the data slice partition logic, and sends them to Netezza.

Scenario: Load a large amount of data by using a custom query, having an integer column with evenly distributed values for range partitioning.
Suggested settings: Partition option: Dynamic range partition. Query: SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>. Partition column: Specify the column used to partition data; you can partition against a column with an integer data type. Partition upper bound and partition lower bound: Specify if you want to filter against the partition column to retrieve data only between the lower and upper range. During execution, Data Factory replaces ?AdfRangePartitionColumnName, ?AdfRangePartitionUpbound, and ?AdfRangePartitionLowbound with the actual column name and value ranges for each partition, and sends them to Netezza. For example, if your partition column "ID" is set with the lower bound as 1 and the upper bound as 80, with parallel copy set as 4, Data Factory retrieves data by 4 partitions. Their IDs are between [1,20], [21,40], [41,60], and [61,80], respectively.

Example: query with data slice partition

"source": {
"type": "NetezzaSource",
"query":"SELECT * FROM <TABLENAME> WHERE mod(datasliceid, ?AdfPartitionCount) = ?AdfDataSliceCondition
AND <your_additional_where_clause>",
"partitionOption": "DataSlice"
}

Example: query with dynamic range partition


"source": {
"type": "NetezzaSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?
AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<dynamic_range_partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column>",
"partitionLowerBound": "<lower_value_of_partition_column>"
}
}
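
The degree of parallelism itself is configured on the copy activity rather than in the source. The following is a minimal sketch of the activity typeProperties that combines a partition option with parallelCopies; the value 4 and the sink type are placeholders:

"typeProperties": {
    "source": {
        "type": "NetezzaSource",
        "partitionOption": "DataSlice"
    },
    "sink": {
        "type": "<sink type>"
    },
    "parallelCopies": 4
}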

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from an OData source by using Azure
Data Factory
5/6/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from an OData source. The
article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
This OData connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an OData source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this OData connector supports:
OData version 3.0 and 4.0.
Copying data by using one of the following authentications: Anonymous, Basic, Windows, and AAD service principal.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to an OData connector.

Linked service properties


The following properties are supported for an OData linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to OData. | Yes
url | The root URL of the OData service. | Yes
authenticationType | The type of authentication used to connect to the OData source. Allowed values are Anonymous, Basic, Windows, and AadServicePrincipal. User-based OAuth isn't supported. You can additionally configure authentication headers in the authHeaders property. | Yes
authHeaders | Additional HTTP request headers for authentication. For example, to use API key authentication, you can select the authentication type as "Anonymous" and specify the API key in the header. | No
userName | Specify userName if you use Basic or Windows authentication. | No
password | Specify password for the user account you specified for userName. Mark this field as a SecureString type to store it securely in Data Factory. You also can reference a secret stored in Azure Key Vault. | No
servicePrincipalId | Specify the Azure Active Directory application's client ID. | No
aadServicePrincipalCredentialType | Specify the credential type to use for service principal authentication. Allowed values are: ServicePrincipalKey or ServicePrincipalCert. | No
servicePrincipalKey | Specify the Azure Active Directory application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
servicePrincipalEmbeddedCert | Specify the base64 encoded certificate of your application registered in Azure Active Directory. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
servicePrincipalEmbeddedCertPassword | Specify the password of your certificate if your certificate is secured with a password. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. | No
aadResourceId | Specify the AAD resource you are requesting for authorization. | No
azureCloudType | For service principal authentication, specify the type of Azure cloud environment to which your AAD application is registered. Allowed values are AzurePublic, AzureChina, AzureUsGovernment, and AzureGermany. By default, the data factory's cloud environment is used. | No
connectVia | The Integration Runtime to use to connect to the data store. Learn more from the Prerequisites section. If not specified, the default Azure Integration Runtime is used. | No

Example 1: Using Anonymous authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "https://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Using Basic authentication


{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Basic",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: Using Windows authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Windows",
"userName": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 4: Using service principal key authentication


{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Example 5: Using service principal cert authentication

{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalEmbeddedCert": {
"type": "SecureString",
"value": "<base64 encoded string of (.pfx) certificate data>"
},
"servicePrincipalEmbeddedCertPassword": {
"type": "SecureString",
"value": "<password of your certificate>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource e.g. https://tenant.sharepoint.com>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Example 6: Using API key authentication


{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Anonymous",
"authHeader": {
"APIKey": {
"type": "SecureString",
"value": "<API key>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the OData dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from OData, set the type property of the dataset to ODataResource . The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to ODataResource. | Yes
path | The path to the OData resource. | Yes

Example

{
"name": "ODataDataset",
"properties":
{
"type": "ODataResource",
"schema": [],
"linkedServiceName": {
"referenceName": "<OData linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"path": "Products"
}
}
}

Copy Activity properties


This section provides a list of properties that the OData source supports.
For a full list of sections and properties that are available for defining activities, see Pipelines.
OData as source
To copy data from OData, the following properties are supported in the Copy Activity source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the Copy Activity source must be set to ODataSource. | Yes
query | OData query options for filtering data. Example: "$select=Name,Description&$top=5". Note: The OData connector copies data from the combined URL: [URL specified in linked service]/[path specified in dataset]?[query specified in copy activity source]. For more information, see OData URL components. | No
httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. If not specified, the default value is 00:30:00 (30 minutes). | No

Example

"activities":[
{
"name": "CopyFromOData",
"type": "Copy",
"inputs": [
{
"referenceName": "<OData input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ODataSource",
"query": "$select=Name,Description&$top=5"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.

Data type mapping for OData


When you copy data from OData, the following mappings are used between OData data types and Azure Data
Factory interim data types. To learn how Copy Activity maps the source schema and data type to the sink, see
Schema and data type mappings.

ODATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE

Edm.Binary Byte[]

Edm.Boolean Bool

Edm.Byte Byte[]

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid Guid

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

NOTE
OData complex data types (such as Object ) aren't supported.

Copy data from Project Online


To copy data from Project Online, you can use the OData connector and an access token obtained from tools like
Postman.
Caution

The access token expires in 1 hour by default; you need to get a new access token when it expires.
1. Use Postman to get the access token:
a. Navigate to Authorization tab on the Postman Website.
b. In the Type box, select OAuth 2.0 , and in the Add authorization data to box, select Request
Headers .
c. Fill the following information in the Configure New Token page to get a new access token:
Grant type : Select Authorization Code .
Callback URL : Enter https://www.localhost.com/ .
Auth URL : Enter
https://login.microsoftonline.com/common/oauth2/authorize?resource=https://<your tenant
name>.sharepoint.com
. Replace <your tenant name> with your own tenant name.
Access Token URL : Enter https://login.microsoftonline.com/common/oauth2/token .
Client ID : Enter your AAD service principal ID.
Client Secret : Enter your service principal secret.
Client Authentication : Select Send as Basic Auth header .
d. You will be asked to log in with your username and password.
e. Once you get your access token, please copy and save it for the next step.

2. Create the OData linked service:


Service URL: Enter https://<your tenant name>.sharepoint.com/sites/pwa/_api/Projectdata. Replace
<your tenant name> with your own tenant name.
Authentication type: Select Anonymous.
Auth headers:
Property name: Choose Authorization.
Value: Enter the access token copied from step 1.
Test the linked service.
3. Create the OData dataset:
a. Create the dataset with the OData linked service created in step 2.
b. Preview data.
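
Putting steps 1 and 2 together, the resulting linked service JSON might look like the following sketch; the name is illustrative and the header value is the access token you copied, shown here as a placeholder:

{
    "name": "ProjectOnlineODataLinkedService",
    "properties": {
        "type": "OData",
        "typeProperties": {
            "url": "https://<your tenant name>.sharepoint.com/sites/pwa/_api/Projectdata",
            "authenticationType": "Anonymous",
            "authHeader": {
                "Authorization": {
                    "type": "SecureString",
                    "value": "<access token copied from step 1>"
                }
            }
        }
    }
}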

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to ODBC data stores using
Azure Data Factory
5/10/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data
store. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This ODBC connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an ODBC source to any supported sink data store, or copy from any supported source data
store to an ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data store using
Basic or Anonymous authentication. A 64-bit ODBC driver is required. For the ODBC sink, ADF supports the ODBC
version 2.0 standard.

Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the 64-bit ODBC driver for the data store on the Integration Runtime machine.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.

Linked service properties


The following properties are supported for ODBC linked service:
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Odbc | Yes
connectionString | The connection string excluding the credential portion. You can specify the connection string with a pattern like Driver={SQL Server};Server=Server.database.windows.net;Database=TestDatabase;, or use the system DSN (Data Source Name) you set up on the Integration Runtime machine with DSN=<name of the DSN on IR machine>; (you still need to specify the credential portion in the linked service accordingly). You can also put a password in Azure Key Vault and pull the password configuration out of the connection string. Refer to Store credentials in Azure Key Vault for more details. | Yes
authenticationType | Type of authentication used to connect to the ODBC data store. Allowed values are: Basic and Anonymous. | Yes
userName | Specify the user name if you are using Basic authentication. | No
password | Specify the password for the user account you specified for the userName. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
credential | The access credential portion of the connection string specified in driver-specific property-value format. Example: "RefreshToken=<secret refresh token>;". Mark this field as a SecureString. | No
connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required, as mentioned in Prerequisites. | Yes

Example 1: using Basic authentication


{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": "<connection string>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: using Anonymous authentication

{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": "<connection string>",
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
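
If you prefer to reference a system DSN that you set up on the self-hosted integration runtime machine (as described in the table above), the connection string can be reduced to the DSN reference while the credentials stay in the linked service. A sketch with placeholder values:

{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": "DSN=<name of the DSN on IR machine>;",
            "authenticationType": "Basic",
            "userName": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}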

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ODBC dataset.
To copy data from/to ODBC-compatible data store, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: OdbcTable | Yes
tableName | Name of the table in the ODBC data store. | No for source (if "query" in activity source is specified); Yes for sink

Example
{
"name": "ODBCDataset",
"properties": {
"type": "OdbcTable",
"schema": [],
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}

If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ODBC source.
ODBC as source
To copy data from ODBC-compatible data store, the following properties are supported in the copy activity
source section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: OdbcSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OdbcSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.
ODBC as sink
To copy data to ODBC-compatible data store, set the sink type in the copy activity to OdbcSink . The following
properties are supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to: OdbcSink | Yes
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are: timespan. Example: "00:30:00" (30 minutes). | No
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). | No (default is 0 - auto detected)
preCopyScript | Specify a SQL query for Copy Activity to execute before writing data into the data store in each run. You can use this property to clean up the pre-loaded data. | No
NOTE
For "writeBatchSize", if it's not set (auto-detected), copy activity first detects whether the driver supports batch operations,
and set it to 10000 if it does, or set it to 1 if it doesn't. If you explicitly set the value other than 0, copy activity honors the
value and fails at runtime if the driver doesn't support batch operations.

Example:

"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
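
If you need to clear previously loaded rows before each run, you can combine the sink with the preCopyScript property described above. A minimal sketch of just the sink section; the table name is a placeholder:

"sink": {
    "type": "OdbcSink",
    "preCopyScript": "DELETE FROM <table name>",
    "writeBatchSize": 100000
}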

Lookup activity properties


To learn details about the properties, check Lookup activity.

Troubleshoot connectivity issues


To troubleshoot connection issues, use the Diagnostics tab of Integration Runtime Configuration Manager.
1. Launch Integration Runtime Configuration Manager.
2. Switch to the Diagnostics tab.
3. Under the "Test Connection" section, select the type of data store (linked service).
4. Specify the connection string that is used to connect to the data store, choose the authentication, and
enter user name, password, and/or credentials.
5. Click Test connection to test the connection to the data store.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Office 365 into Azure using Azure
Data Factory
6/8/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory integrates with Microsoft Graph data connect, allowing you to bring the rich organizational
data in your Office 365 tenant into Azure in a scalable way and build analytics applications and extract insights
based on these valuable data assets. Integration with Privileged Access Management provides secured access
control for the valuable curated data in Office 365. Please refer to this link for an overview on Microsoft Graph
data connect and refer to this link for licensing information.
This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Office 365. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
The ADF Office 365 connector and Microsoft Graph data connect enable at-scale ingestion of different types of
datasets from Exchange email-enabled mailboxes, including address book contacts, calendar events, email
messages, user information, mailbox settings, and so on. Refer here to see the complete list of datasets available.
For now, within a single copy activity you can only copy data from Office 365 into Azure Blob Storage ,
Azure Data Lake Storage Gen1 , and Azure Data Lake Storage Gen2 in JSON format (type
setOfObjects). If you want to load Office 365 into other types of data stores or in other formats, you can chain
the first copy activity with a subsequent copy activity to further load data into any of the supported ADF
destination stores (refer to "supported as a sink" column in the "Supported data stores and formats" table).

IMPORTANT
The Azure subscription containing the data factory and the sink data store must be under the same Azure Active
Directory (Azure AD) tenant as Office 365 tenant.
Ensure the Azure Integration Runtime region used for copy activity as well as the destination is in the same region
where the Office 365 tenant users' mailbox is located. Refer here to understand how the Azure IR location is
determined. Refer to table here for the list of supported Office regions and corresponding Azure regions.
Service Principal authentication is the only authentication mechanism supported for Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2 as destination stores.

Prerequisites
To copy data from Office 365 into Azure, you need to complete the following prerequisite steps:
Your Office 365 tenant admin must complete on-boarding actions as described here.
Create and configure an Azure AD web application in Azure Active Directory. For instructions, see Create an
Azure AD application.
Make note of the following values, which you will use to define the linked service for Office 365:
Tenant ID. For instructions, see Get tenant ID.
Application ID and Application key. For instructions, see Get application ID and authentication key.
Add the user identity who will be making the data access request as the owner of the Azure AD web
application (from the Azure AD web application > Settings > Owners > Add owner).
The user identity must be in the Office 365 organization you are getting data from and must not be a
Guest user.

Approving new data access requests


If this is the first time you are requesting data for this context (a combination of which data table is being accessed,
which destination account is the data being loaded into, and which user identity is making the data access
request), you will see the copy activity status as "In Progress", and only when you click into "Details" link under
Actions will you see the status as "RequestingConsent". A member of the data access approver group needs to
approve the request in the Privileged Access Management before the data extraction can proceed.
Refer here on how the approver can approve the data access request, and refer here for an explanation on the
overall integration with Privileged Access Management, including how to set up the data access approver group.

Policy validation
If ADF is created as part of a managed app and Azure policies assignments are made on resources within the
management resource group, then for every copy activity run, ADF will check to make sure the policy
assignments are enforced. Refer here for a list of supported policies.

Getting started
TIP
For a walkthrough of using Office 365 connector, see Load data from Office 365 article.

You can create a pipeline with the copy activity by using one of the following tools or SDKs. Select a link to go to
a tutorial with step-by-step instructions to create a pipeline with a copy activity.
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template.
The following sections provide details about properties that are used to define Data Factory entities specific to
Office 365 connector.

Linked service properties


The following properties are supported for Office 365 linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: Office365 | Yes
office365TenantId | Azure tenant ID to which the Office 365 account belongs. | Yes
servicePrincipalTenantId | Specify the tenant information under which your Azure AD web application resides. | Yes
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No

NOTE
The difference between office365TenantId and servicePrincipalTenantId and the corresponding value to provide:
If you are an enterprise developer developing an application against Office 365 data for your own organization's usage,
then you should supply the same tenant ID for both properties, which is your organization's AAD tenant ID.
If you are an ISV developer developing an application for your customers, then office365TenantId will be your
customer's (application installer) AAD tenant ID and servicePrincipalTenantId will be your company's AAD tenant ID.

Example:

{
"name": "Office365LinkedService",
"properties": {
"type": "Office365",
"typeProperties": {
"office365TenantId": "<Office 365 tenant id>",
"servicePrincipalTenantId": "<AAD app service principal tenant id>",
"servicePrincipalId": "<AAD app service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<AAD app service principal key>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Office 365 dataset.
To copy data from Office 365, the following properties are supported:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to: Office365Table | Yes |
| tableName | Name of the dataset to extract from Office 365. Refer here for the list of Office 365 datasets available for extraction. | Yes |

If you were setting dateFilterColumn , startTime , endTime , and userScopeFilterUri in the dataset, they are still
supported as-is, but you are encouraged to use the new model in the activity source going forward.
Example

{
"name": "DS_May2019_O365_Message",
"properties": {
"type": "Office365Table",
"linkedServiceName": {
"referenceName": "<Office 365 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [],
"typeProperties": {
"tableName": "BasicDataSet_v0.Event_v1"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Office 365 source.
Office 365 as source
To copy data from Office 365, the following properties are supported in the copy activity source section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to: Office365Source | Yes |
| allowedGroups | Group selection predicate. Use this property to select up to 10 user groups for whom the data will be retrieved. If no groups are specified, then data will be returned for the entire organization. | No |
| userScopeFilterUri | When the allowedGroups property is not specified, you can use a predicate expression that is applied on the entire tenant to filter the specific rows to extract from Office 365. The predicate format should match the query format of Microsoft Graph APIs, e.g. https://graph.microsoft.com/v1.0/users?$filter=Department eq 'Finance'. | No |
| dateFilterColumn | Name of the DateTime filter column. Use this property to limit the time range for which Office 365 data is extracted. | Yes if the dataset has one or more DateTime columns. Refer here for the list of datasets that require this DateTime filter. |
| startTime | Start DateTime value to filter on. | Yes if dateFilterColumn is specified |
| endTime | End DateTime value to filter on. | Yes if dateFilterColumn is specified |
| outputColumns | Array of the columns to copy to sink. | No |

Example:

"activities": [
{
"name": "CopyFromO365ToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Office 365 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "Office365Source",
"dateFilterColumn": "CreatedDateTime",
"startTime": "2019-04-28T16:00:00.000Z",
"endTime": "2019-05-05T16:00:00.000Z",
"userScopeFilterUri": "https://graph.microsoft.com/v1.0/users?$filter=Department eq
'Finance'",
"outputColumns": [
{
"name": "Id"
},
{
"name": "CreatedDateTime"
},
{
"name": "LastModifiedDateTime"
},
{
"name": "ChangeKey"
},
{
"name": "Categories"
},
{
"name": "OriginalStartTimeZone"
},
{
"name": "OriginalEndTimeZone"
},
{
"name": "ResponseStatus"
},
{
"name": "iCalUId"
},
{
"name": "ReminderMinutesBeforeStart"
},
{
"name": "IsReminderOn"
},
{
"name": "HasAttachments"
},
{
"name": "Subject"
},
{
"name": "Body"
},
{
"name": "Importance"
},
{
"name": "Sensitivity"
},
{
"name": "Start"
},
{
"name": "End"
},
{
"name": "Location"
},
{
"name": "IsAllDay"
},
{
"name": "IsCancelled"
},
{
"name": "IsOrganizer"
},
{
"name": "Recurrence"
},
{
"name": "ResponseRequested"
},
{
"name": "ShowAs"
},
{
"name": "Type"
},
{
"name": "Attendees"
},
{
"name": "Organizer"
},
{
"name": "WebLink"
},
{
"name": "Attachments"
},
{
"name": "BodyPreview"
},
{
"name": "Locations"
},
{
"name": "OnlineMeetingUrl"
},
{
"name": "OriginalStart"
},
{
"name": "SeriesMasterId"
}
]
},
"sink": {
"type": "BlobSink"
}
}
}
]
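
If you want to scope extraction to specific user groups instead of using a user-scope filter, you can use the allowedGroups property described above. The following is a minimal, hedged sketch (not from the official sample); the group identifiers are placeholders you would replace with your own Azure AD group IDs:

"source": {
    "type": "Office365Source",
    "allowedGroups": [
        "<group id 1>",
        "<group id 2>"
    ],
    "dateFilterColumn": "CreatedDateTime",
    "startTime": "2019-04-28T16:00:00.000Z",
    "endTime": "2019-05-05T16:00:00.000Z"
}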

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Oracle by using Azure Data
Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the copy activity in Azure Data Factory to copy data from and to an Oracle
database. It builds on the copy activity overview.

Supported capabilities
This Oracle connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an Oracle database to any supported sink data store. You also can copy data from any
supported source data store to an Oracle database. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.
Specifically, this Oracle connector supports:
The following versions of an Oracle database:
Oracle 19c R1 (19.1) and higher
Oracle 18c R1 (18.1) and higher
Oracle 12c R1 (12.1) and higher
Oracle 11g R1 (11.1) and higher
Oracle 10g R1 (10.1) and higher
Oracle 9i R2 (9.2) and higher
Oracle 8i R3 (8.1.7) and higher
Oracle Database Cloud Exadata Service
Parallel copying from an Oracle source. See the Parallel copy from Oracle section for details.

NOTE
Oracle proxy server isn't supported.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver
when you copy data from and to Oracle.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Oracle connector.

Linked service properties


The Oracle linked service supports the following properties:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to Oracle . | Yes |
| connectionString | Specifies the information needed to connect to the Oracle Database instance. You can also put a password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and Store credentials in Azure Key Vault for more details. See the supported connection types after this table. | Yes |
| connectVia | The integration runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, the default Azure Integration Runtime is used. | No |

Supported connection type: You can use Oracle SID or Oracle Service Name to identify your database:
- If you use SID: Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;
- If you use Service Name: Host=<host>;Port=<port>;ServiceName=<servicename>;User Id=<username>;Password=<password>;

For advanced Oracle native connection options, you can choose to add an entry in the TNSNAMES.ORA file on the
Oracle server, and in the ADF Oracle linked service, choose the Oracle Service Name connection type and configure
the corresponding service name.

TIP
If you get an error, "ORA-01025: UPI parameter out of range", and your Oracle version is 8i, add WireProtocolMode=1
to your connection string. Then try again.

If you have multiple Oracle instances for a failover scenario, you can create an Oracle linked service and fill in the
primary host, port, user name, password, and so on, and then add a new "Additional connection properties" entry with
the property name AlternateServers and a value of
(HostName=<secondary host>:PortNumber=<secondary port>:ServiceName=<secondary service name>) - don't omit
the parentheses, and note that the colon ( : ) is the separator. As an example, the following value of alternate
servers defines two alternate database servers for connection failover:
(HostName=AccountingOracleServer:PortNumber=1521:SID=Accounting,HostName=255.201.11.24:PortNumber=1522:ServiceName=ABackup.NA.MyCompany)
More connection properties you can set in connection string per your case:

| Property | Description | Allowed values |
|:--- |:--- |:--- |
| ArraySize | The number of bytes the connector can fetch in a single network round trip. For example, ArraySize=10485760. Larger values increase throughput by reducing the number of times to fetch data across the network. Smaller values increase response time, as there is less of a delay waiting for the server to transmit data. | An integer from 1 to 4294967296 (4 GB). Default value is 60000. The value 1 does not define the number of bytes, but indicates allocating space for exactly one row of data. |

To enable encryption on an Oracle connection, you have two options:

To use Triple-DES Encryption (3DES) and Advanced Encryption Standard (AES), on the Oracle
server side, go to Oracle Advanced Security (OAS) and configure the encryption settings. For details, see
this Oracle documentation. The Azure Data Factory Oracle connector
automatically negotiates the encryption method to use the one you configure in OAS when establishing a
connection to Oracle.
To use TLS :
1. Get the TLS/SSL certificate info. Get the Distinguished Encoding Rules (DER)-encoded certificate
information of your TLS/SSL cert, and save the output (----- Begin Certificate … End Certificate ----
-) as a text file.

openssl x509 -inform DER -in [Full Path to the DER Certificate including the name of the DER
Certificate] -text

Example: Extract cert info from DERcert.cer, and then save the output to cert.txt.

openssl x509 -inform DER -in DERcert.cer -text


Output:
-----BEGIN CERTIFICATE-----
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXX
-----END CERTIFICATE-----

2. Build the keystore or truststore . The following command creates the truststore file, with or
without a password, in PKCS-12 format.

openssl pkcs12 -in [Path to the file created in the previous step] -out [Path and name of
TrustStore] -passout pass:[Keystore PWD] -nokeys -export

Example: Create a PKCS12 truststore file, named MyTrustStoreFile, with a password.

openssl pkcs12 -in cert.txt -out MyTrustStoreFile -passout pass:ThePWD -nokeys -export

3. Place the truststore file on the self-hosted IR machine. For example, place the file at
C:\MyTrustStoreFile.
4. In Azure Data Factory, configure the Oracle connection string with EncryptionMethod=1 and the
corresponding TrustStore / TrustStorePassword value. For example,
Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=
<password>;EncryptionMethod=1;TrustStore=C:\\MyTrustStoreFile;TrustStorePassword=
<trust_store_password>
.
Example:

{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
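
If you connect with an Oracle Service Name rather than a SID, the same linked service can be sketched as follows; the placeholders mirror the Service Name connection string format shown in the table above:

{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": "Host=<host>;Port=<port>;ServiceName=<servicename>;User Id=<username>;Password=<password>;"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}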

Example: store password in Azure Key Vault

{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties supported by the Oracle dataset. For a full list of sections and
properties available for defining datasets, see Datasets.
To copy data from and to Oracle, set the type property of the dataset to OracleTable . The following properties
are supported.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to OracleTable . | Yes |
| schema | Name of the schema. | No for source, Yes for sink |
| table | Name of the table/view. | No for source, Yes for sink |
| tableName | Name of the table/view with schema. This property is supported for backward compatibility. For new workloads, use schema and table . | No for source, Yes for sink |
Example:

{
"name": "OracleDataset",
"properties":
{
"type": "OracleTable",
"schema": [],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
},
"linkedServiceName": {
"referenceName": "<Oracle linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


This section provides a list of properties supported by the Oracle source and sink. For a full list of sections and
properties available for defining activities, see Pipelines.
Oracle as source

TIP
To load data from Oracle efficiently by using data partitioning, learn more from Parallel copy from Oracle.

To copy data from Oracle, set the source type in the copy activity to OracleSource . The following properties are
supported in the copy activity source section.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to OracleSource . | Yes |
| oracleReaderQuery | Use the custom SQL query to read data. An example is "SELECT * FROM MyTable" . When you enable partitioned load, you need to hook any corresponding built-in partition parameters in your query. For examples, see the Parallel copy from Oracle section. | No |
| partitionOptions | Specifies the data partitioning options used to load data from Oracle. Allowed values are: None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from an Oracle database is controlled by the parallelCopies setting on the copy activity. | No |
| partitionSettings | Specify the group of the settings for data partitioning. Apply when the partition option isn't None. | No |
| partitionNames | The list of physical partitions that need to be copied. Apply when the partition option is PhysicalPartitionsOfTable. If you use a query to retrieve the source data, hook ?AdfTabularPartitionName in the WHERE clause. For an example, see the Parallel copy from Oracle section. | No |
| partitionColumnName | Specify the name of the source column in integer type that will be used by range partitioning for parallel copy. If not specified, the primary key of the table is auto-detected and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionColumnName in the WHERE clause. For an example, see the Parallel copy from Oracle section. | No |
| partitionUpperBound | The maximum value of the partition column to copy data out. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionUpbound in the WHERE clause. For an example, see the Parallel copy from Oracle section. | No |
| partitionLowerBound | The minimum value of the partition column to copy data out. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfRangePartitionLowbound in the WHERE clause. For an example, see the Parallel copy from Oracle section. | No |

Example: copy data by using a basic query without partition


"activities":[
{
"name": "CopyFromOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleSource",
"oracleReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Oracle as sink
To copy data to Oracle, set the sink type in the copy activity to OracleSink . The following properties are
supported in the copy activity sink section.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity sink must be set to OracleSink . | Yes |
| writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize . Allowed values are Integer (number of rows). | No (default is 10,000) |
| writeBatchTimeout | The wait time for the batch insert operation to complete before it times out. Allowed values are Timespan. An example is 00:30:00 (30 minutes). | No |
| preCopyScript | Specify a SQL query for the copy activity to run before writing data into Oracle in each run. You can use this property to clean up the preloaded data. | No |
| maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No |

Example:
"activities":[
{
"name": "CopyToOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Oracle output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OracleSink"
}
}
}
]
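
To illustrate the optional sink settings from the table above, here is a hedged variant of the sink section; the TRUNCATE statement, batch size, and timeout values are illustrative only and should be tuned for your workload:

"sink": {
    "type": "OracleSink",
    "writeBatchSize": 50000,
    "writeBatchTimeout": "00:30:00",
    "preCopyScript": "TRUNCATE TABLE MYSCHEMA.MYTABLE"
}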

Parallel copy from Oracle


The Data Factory Oracle connector provides built-in data partitioning to copy data from Oracle in parallel. You
can find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, Data Factory runs parallel queries against your Oracle source to load data by
partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Oracle database.
We recommend enabling parallel copy with data partitioning, especially when you load a large amount of data
from your Oracle database; a minimal sketch of the relevant copy activity settings follows this paragraph. The table
below lists suggested configurations for different scenarios. When copying data into a file-based data store, it's
recommended to write to a folder as multiple files (specify only the folder name), in which case the performance is
better than writing to a single file.
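
As a minimal sketch (assuming the physical-partition option and a parallelism of 4; adjust both to your scenario), the relevant copy activity settings look roughly like this:

"typeProperties": {
    "parallelCopies": 4,
    "source": {
        "type": "OracleSource",
        "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
        "type": "<sink type>"
    }
}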

| Scenario | Suggested settings |
|:--- |:--- |
| Full load from large table, with physical partitions. | Partition option: Physical partitions of table. During execution, Data Factory automatically detects the physical partitions and copies data by partitions. |
| Full load from large table, without physical partitions, while with an integer column for data partitioning. | Partition options: Dynamic range partition. Partition column: Specify the column used to partition data. If not specified, the primary key column is used. |
| Load a large amount of data by using a custom query, with physical partitions. | Partition option: Physical partitions of table. Query: SELECT * FROM <TABLENAME> PARTITION("?AdfTabularPartitionName") WHERE <your_additional_where_clause>. Partition name: Specify the partition name(s) to copy data from. If not specified, Data Factory automatically detects the physical partitions on the table you specified in the Oracle dataset. During execution, Data Factory replaces ?AdfTabularPartitionName with the actual partition name and sends it to Oracle. |
| Load a large amount of data by using a custom query, without physical partitions, while with an integer column for data partitioning. | Partition options: Dynamic range partition. Query: SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. You can partition against a column with integer data type. Partition upper bound and partition lower bound: Specify if you want to filter against the partition column to retrieve data only between the lower and upper range. During execution, Data Factory replaces ?AdfRangePartitionColumnName, ?AdfRangePartitionUpbound, and ?AdfRangePartitionLowbound with the actual column name and value ranges for each partition and sends them to Oracle. For example, if your partition column "ID" is set with the lower bound as 1 and the upper bound as 80, with parallel copy set as 4, Data Factory retrieves data by 4 partitions. Their IDs are between [1,20], [21, 40], [41, 60], and [61, 80], respectively. |

TIP
When copying data from a non-partitioned table, you can use the "Dynamic range" partition option to partition against an
integer column. If your source data doesn't have such a column, you can use the ORA_HASH function in the source
query to generate one and use it as the partition column.

Example: query with physical partition

"source": {
"type": "OracleSource",
"query":"SELECT * FROM <TABLENAME> PARTITION(\"?AdfTabularPartitionName\") WHERE
<your_additional_where_clause>",
"partitionOption": "PhysicalPartitionsOfTable",
"partitionSettings": {
"partitionNames": [
"<partitionA_name>",
"<partitionB_name>"
]
}
}

Example: query with dynamic range partition


"source": {
"type": "OracleSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?
AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column>",
"partitionLowerBound": "<lower_value_of_partition_column>"
}
}
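
Building on the tip above, the following hedged sketch shows one way to manufacture an integer partition column with ORA_HASH when the table has none; the inline view, the alias BUCKET_ID, and the bucket count 63 are illustrative assumptions, not connector requirements:

"source": {
    "type": "OracleSource",
    "query": "SELECT * FROM (SELECT t.*, ORA_HASH(ROWID, 63) AS BUCKET_ID FROM MYTABLE t) WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "BUCKET_ID",
        "partitionUpperBound": "63",
        "partitionLowerBound": "0"
    }
}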

Data type mapping for Oracle


When you copy data from and to Oracle, the following mappings apply. To learn about how the copy activity
maps the source schema and data type to the sink, see Schema and data type mappings.

| Oracle data type | Data Factory interim data type |
|:--- |:--- |
| BFILE | Byte[] |
| BLOB | Byte[] (only supported on Oracle 10g and higher) |
| CHAR | String |
| CLOB | String |
| DATE | DateTime |
| FLOAT | Decimal, String (if precision > 28) |
| INTEGER | Decimal, String (if precision > 28) |
| LONG | String |
| LONG RAW | Byte[] |
| NCHAR | String |
| NCLOB | String |
| NUMBER (p,s) | Decimal, String (if p > 28) |
| NUMBER without precision and scale | Double |
| NVARCHAR2 | String |
| RAW | Byte[] |
| ROWID | String |
| TIMESTAMP | DateTime |
| TIMESTAMP WITH LOCAL TIME ZONE | String |
| TIMESTAMP WITH TIME ZONE | String |
| UNSIGNED INTEGER | Number |
| VARCHAR2 | String |
| XML | String |

NOTE
The data types INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND aren't supported.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Oracle Cloud Storage by using
Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from Oracle Cloud Storage. To learn about Azure Data Factory, read the
introductory article.

Supported capabilities
This Oracle Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Oracle Cloud Storage connector supports copying files as is or parsing files with the supported
file formats and compression codecs. It takes advantage of Oracle Cloud Storage's S3-compatible
interoperability.

Prerequisites
To copy data from Oracle Cloud Storage, please refer here for the prerequisites and required permission.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Cloud Storage.

Linked service properties


The following properties are supported for Oracle Cloud Storage linked services:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to OracleCloudStorage . | Yes |
| accessKeyId | ID of the secret access key. To find the access key and secret, see Prerequisites. | Yes |
| secretAccessKey | The secret access key itself. Mark this field as SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| serviceUrl | Specify the custom endpoint as https://<namespace>.compat.objectstorage.<region identifier>.oraclecloud.com . Refer here for more details. | Yes |
| connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. | No |

Here's an example:

{
"name": "OracleCloudStorageLinkedService",
"properties": {
"type": "OracleCloudStorage",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://<namespace>.compat.objectstorage.<region identifier>.oraclecloud.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Oracle Cloud Storage under location settings in a format-based
dataset:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property under location in the dataset must be set to OracleCloudStorageLocation . | Yes |
| bucketName | The Oracle Cloud Storage bucket name. | Yes |
| folderPath | The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify that in activity source settings. | No |
| fileName | The file name under the given bucket and folder path. If you want to use a wildcard to filter the files, skip this setting and specify that in activity source settings. | No |

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Oracle Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "OracleCloudStorageLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties that the Oracle Cloud Storage source supports.
Oracle Cloud Storage as a source type
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Oracle Cloud Storage under storeSettings settings in a format-
based copy source:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property under storeSettings must be set to OracleCloudStorageReadSettings . | Yes |

Locate the files to copy:

| Property | Description | Required |
|:--- |:--- |:--- |
| OPTION 1: static path | Copy from the given bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket or folder, additionally specify wildcardFileName as * . | |
| OPTION 2: Oracle Cloud Storage prefix - prefix | Prefix for the Oracle Cloud Storage key name under the given bucket configured in the dataset to filter source Oracle Cloud Storage files. Oracle Cloud Storage keys whose names start with bucket_in_dataset/this_prefix are selected. It utilizes Oracle Cloud Storage's service-side filter, which provides better performance than a wildcard filter. | No |
| OPTION 3: wildcard - wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in a dataset to filter source folders. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | No |
| OPTION 3: wildcard - wildcardFileName | The file name with wildcard characters under the given bucket and folder path (or wildcard folder path) to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. | Yes |
| OPTION 3: a list of files - fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When you're using this option, do not specify the file name in the dataset. See more examples in File list examples. | No |

Additional settings:

| Property | Description | Required |
|:--- |:--- |:--- |
| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false . This property doesn't apply when you configure fileListPath . | No |
| deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain on the source store. This property is only valid in the binary files copy scenario. The default value: false. | No |
| modifiedDatetimeStart | Files are filtered based on the attribute: last modified. The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd . The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z". The properties can be NULL , which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL , the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL , the files whose last modified attribute is less than the datetime value will be selected. This property doesn't apply when you configure fileListPath . | No |
| modifiedDatetimeEnd | Same as above. | No |
| enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true . | No |
| partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default: when you use a file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset; when you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard. For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27": if you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns month and day with values "08" and "27" respectively, in addition to the columns inside the files; if the partition root path is not specified, no extra column will be generated. | No |
| maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No |

Example:

"activities":[
{
"name": "CopyFromOracleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "OracleCloudStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
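
For prefix-based filtering (Option 2 in the table above), the source store settings can be sketched as follows; the prefix value is a placeholder, and the surrounding source type depends on your file format:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "OracleCloudStorageReadSettings",
        "prefix": "myfolder/2021"
    }
}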

Folder and file filter examples


This section describes the resulting behavior of the folder path and file name with wildcard filters.

Assume the bucket contains the following objects (files under AnotherFolderB are not retrieved by the filters below, because "AnotherFolderB" does not match the folder wildcard Folder*):

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

| Bucket | Key | Recursive | Files retrieved |
|:--- |:--- |:--- |:--- |
| bucket | Folder*/* | false | FolderA/File1.csv, FolderA/File2.json |
| bucket | Folder*/* | true | FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv |
| bucket | Folder*/*.csv | false | FolderA/File1.csv |
| bucket | Folder*/*.csv | true | FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv |

File list examples


This section describes the resulting behavior of using a file list path in the Copy activity source.
Assume that you have the following source folder structure and want to copy the files listed in FileListToCopy.txt:

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:

File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Data Factory configuration:
- In the dataset: Bucket: bucket ; Folder path: FolderA .
- In the copy activity source: File list path: bucket/Metadata/FileListToCopy.txt .

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity.

Delete activity properties


To learn details about the properties, check Delete activity.

Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Oracle Eloqua using Azure Data
Factory (Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Eloqua. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Oracle Eloqua connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Eloqua to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Eloqua connector.

Linked service properties


The following properties are supported for Oracle Eloqua linked service:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to: Eloqua | Yes |
| endpoint | The endpoint of the Eloqua server. Eloqua supports multiple data centers; to determine your endpoint, login to https://login.eloqua.com with your credential, then copy the base URL portion from the redirected URL with the pattern of xxx.xxx.eloqua.com . | Yes |
| username | The site name and user name of your Eloqua account in the form: SiteName\Username e.g. Eloqua\Alice . | Yes |
| password | The password corresponding to the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No |

Example:

{
"name": "EloquaLinkedService",
"properties": {
"type": "Eloqua",
"typeProperties": {
"endpoint" : "<base URL e.g. xxx.xxx.eloqua.com>",
"username" : "<site name>\\<user name e.g. Eloqua\\Alice>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Eloqua dataset.
To copy data from Oracle Eloqua, set the type property of the dataset to EloquaObject . The following
properties are supported:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to: EloquaObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "EloquaDataset",
"properties": {
"type": "EloquaObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Eloqua linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Eloqua source.
Eloqua as source
To copy data from Oracle Eloqua, set the source type in the copy activity to EloquaSource . The following
properties are supported in the copy activity source section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to: EloquaSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM Accounts" . | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromEloqua",
"type": "Copy",
"inputs": [
{
"referenceName": "<Eloqua input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "EloquaSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of supported data stored by Azure Data Factory, see supported data stores.
Copy data from Oracle Responsys using Azure Data
Factory (Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Responsys. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Oracle Responsys connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Responsys to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Responsys connector.

Linked service properties


The following properties are supported for Oracle Responsys linked service:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to: Responsys | Yes |
| endpoint | The endpoint of the Responsys server. | Yes |
| clientId | The client ID associated with the Responsys application. | Yes |
| clientSecret | The client secret associated with the Responsys application. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let the ADF copy activity pull it from there when performing a data copy - learn more from Store credentials in Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No |

Example:

{
"name": "OracleResponsysLinkedService",
"properties": {
"type": "Responsys",
"typeProperties": {
"endpoint" : "<endpoint>",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Responsys dataset.
To copy data from Oracle Responsys, set the type property of the dataset to ResponsysObject . The following
properties are supported:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to: ResponsysObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "OracleResponsysDataset",
"properties": {
"type": "ResponsysObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Oracle Responsys linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Responsys source.
Oracle Responsys as source
To copy data from Oracle Responsys, set the source type in the copy activity to ResponsysSource . The
following properties are supported in the copy activity source section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to: ResponsysSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromOracleResponsys",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle Responsys input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ResponsysSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Oracle Service Cloud using Azure
Data Factory (Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Oracle Service Cloud.
It builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Oracle Service Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Service Cloud to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Service Cloud connector.

Linked service properties


The following properties are supported for Oracle Service Cloud linked service:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to: OracleServiceCloud | Yes |
| host | The URL of the Oracle Service Cloud instance. | Yes |
| username | The user name that you use to access the Oracle Service Cloud server. | Yes |
| password | The password corresponding to the user name that you provided in the username key. You can choose to mark this field as a SecureString to store it securely in ADF, or store the password in Azure Key Vault and let the ADF copy activity pull it from there when performing a data copy - learn more from Store credentials in Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No |

Example:

{
"name": "OracleServiceCloudLinkedService",
"properties": {
"type": "OracleServiceCloud",
"typeProperties": {
"host" : "<host>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Service Cloud dataset.
To copy data from Oracle Service Cloud, set the type property of the dataset to OracleSer viceCloudObject .
The following properties are supported:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to: OracleServiceCloudObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "OracleServiceCloudDataset",
"properties": {
"type": "OracleServiceCloudObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<OracleServiceCloud linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Oracle Service Cloud source.
Oracle Service Cloud as source
To copy data from Oracle Service Cloud, set the source type in the copy activity to
OracleServiceCloudSource . The following properties are supported in the copy activity source section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to: OracleServiceCloudSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromOracleServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<OracleServiceCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleServiceCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
ORC format in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse ORC files or write data into ORC format .
ORC format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob,
Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the ORC dataset.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to Orc. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location . See details in the connector article -> Dataset properties section. | Yes |
| compressionCodec | The compression codec to use when writing to ORC files. When reading from ORC files, Data Factory automatically determines the compression codec based on the file metadata. Supported types are none , zlib , snappy (default), and lzo . Note that currently the Copy activity doesn't support LZO when reading/writing ORC files. | No |

Below is an example of ORC dataset on Azure Blob Storage:


{
    "name": "OrcDataset",
    "properties": {
        "type": "Orc",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            }
        }
    }
}

Note the following points:


Complex data types (e.g. MAP, LIST, STRUCT) are currently supported only in Data Flows, not in Copy Activity.
To use complex types in data flows, do not import the file schema in the dataset, leaving schema blank in the
dataset. Then, in the Source transformation, import the projection.
White space in column name is not supported.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the ORC source and sink.
ORC as source
The following properties are supported in the copy activity *source* section.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to OrcSource . | Yes |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings . See details in the connector article -> Copy activity properties section. | No |

ORC as sink
The following properties are supported in the copy activity *sink* section.

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity sink must be set to OrcSink . | Yes |
| formatSettings | A group of properties. Refer to the ORC write settings table below. | No |
| storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings . See details in the connector article -> Copy activity properties section. | No |

Supported ORC write settings under formatSettings :

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type of formatSettings must be set to OrcWriteSettings . | Yes |
| maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the max rows per file. | No |
| fileNamePrefix | Applicable when maxRowsPerFile is configured. Specify the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension> . If not specified, the file name prefix will be auto-generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No |
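
A hedged sketch of a copy activity sink that uses these write settings (the row limit and prefix values are illustrative and should be tuned for your workload):

"sink": {
    "type": "OrcSink",
    "formatSettings": {
        "type": "OrcWriteSettings",
        "maxRowsPerFile": 1000000,
        "fileNamePrefix": "orcOutput"
    }
}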

Mapping data flow properties


In mapping data flows, you can read and write to ORC format in the following data stores: Azure Blob Storage,
Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
You can point to ORC files either using ORC dataset or using an inline dataset.
Source properties
The below table lists the properties supported by an ORC source. You can edit these properties in the Source
options tab.
When using inline dataset, you will see additional file settings, which are the same as the properties described in
dataset properties section.

| Name | Description | Required | Allowed values | Data flow script property |
|:--- |:--- |:--- |:--- |:--- |
| Format | Format must be orc | yes | orc | format |
| Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths |
| Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns | no | String | partitionRootPath |
| List of files | Whether your source is pointing to a text file that lists files to process | no | true or false | fileList |
| Column to store file name | Create a new column with the source file name and path | no | String | rowUrlColumn |
| After completion | Delete or move the files after processing. File path starts from the container root | no | Delete: true or false. Move: [<from>, <to>] | purgeFiles, moveFiles |
| Filter by last modified | Choose to filter files based upon when they were last altered | no | Timestamp | modifiedAfter, modifiedBefore |
| Allow no files found | If true, an error is not thrown if no files are found | no | true or false | ignoreNoFilesFound |

Source example
The associated data flow script of an ORC source configuration is:

source(allowSchemaDrift: true,
validateSchema: false,
rowUrlColumn: 'fileName',
format: 'orc') ~> OrcSource

Sink properties
The below table lists the properties supported by an ORC sink. You can edit these properties in the Settings
tab.
When using inline dataset, you will see additional file settings, which are the same as the properties described in
dataset properties section.
Format: Format must be orc. Required: yes. Allowed values: orc. Data flow script property: format.

Clear the folder: If the destination folder is cleared prior to write. Required: no. Allowed values: true or false. Data flow script property: truncate.

File name option: The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid>. Required: no. Allowed values and script properties: Pattern: String -> filePattern; Per partition: String[] -> partitionFileNames; As data in column: String -> rowUrlColumn; Output to single file: ['<fileName>'] -> partitionFileNames.

Sink example
The associated data flow script of an ORC sink configuration is:

OrcSource sink(
format: 'orc',
filePattern:'output[n].orc',
truncate: true,
allowSchemaDrift: true,
validateSchema: false,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> OrcSink

Using Self-hosted Integration Runtime


IMPORTANT
For copies empowered by the Self-hosted Integration Runtime (for example, between on-premises and cloud data stores), if you are not
copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK and the
Microsoft Visual C++ 2010 Redistributable Package on your IR machine. See the following paragraph for more details.

For copies running on the Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if that is
not found, it then checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires a 64-bit JRE. You can find it from here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
To install the Visual C++ 2010 Redistributable Package: The Visual C++ 2010 Redistributable Package is not
installed with self-hosted IR installations. You can find it from here.
TIP
If you copy data to/from ORC format using the Self-hosted Integration Runtime and hit an error saying "An error occurred when
invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM, then rerun the pipeline.

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM starts with Xms amount of memory and can use at most Xmx amount of memory.
By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.

Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Parquet format in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse the Parquet files or write the data into Parquet format .
Parquet format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure
Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Parquet dataset.

type: The type property of the dataset must be set to Parquet. Required: Yes.

location: Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in connector article -> Dataset properties section. Required: Yes.

compressionCodec: The compression codec to use when writing to Parquet files. When reading from Parquet files, Data Factory automatically determines the compression codec based on the file metadata. Supported types are "none", "gzip", "snappy" (default), and "lzo". Note that the Copy activity currently doesn't support LZO when reading/writing Parquet files. Required: No.

NOTE
White space in column names is not supported for Parquet files.

Below is an example of Parquet dataset on Azure Blob Storage:


{
"name": "ParquetDataset",
"properties": {
"type": "Parquet",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compressionCodec": "snappy"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Parquet source and sink.
Parquet as source
The following properties are supported in the copy activity *source* section.

type: The type property of the copy activity source must be set to ParquetSource. Required: Yes.

storeSettings: A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in connector article -> Copy activity properties section. Required: No.

Parquet as sink
The following properties are supported in the copy activity *sink* section.

type: The type property of the copy activity sink must be set to ParquetSink. Required: Yes.

formatSettings: A group of properties. Refer to the Parquet write settings table below. Required: No.

storeSettings: A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See details in connector article -> Copy activity properties section. Required: No.
Supported Parquet write settings under formatSettings :

type: The type of formatSettings must be set to ParquetWriteSettings. Required: Yes.

maxRowsPerFile: When writing data into a folder, you can choose to write to multiple files and specify the maximum rows per file. Required: No.

fileNamePrefix: Applicable when maxRowsPerFile is configured. Specify the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension>. If not specified, the file name prefix is auto-generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. Required: No.
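For illustration, a Copy activity sink that uses these write settings might look like the following sketch (not an official sample; the storeSettings type AzureBlobStorageWriteSettings assumes the sink store is Azure Blob Storage, and the prefix value is a placeholder):

"sink": {
    "type": "ParquetSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "ParquetWriteSettings",
        "maxRowsPerFile": 1000000,
        "fileNamePrefix": "<file name prefix>"
    }
}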

Mapping data flow properties


In mapping data flows, you can read and write to parquet format in the following data stores: Azure Blob
Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Source properties
The below table lists the properties supported by a parquet source. You can edit these properties in the Source
options tab.

Format: Format must be parquet. Required: yes. Allowed values: parquet. Data flow script property: format.

Wild card paths: All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. Required: no. Allowed values: String[]. Data flow script property: wildcardPaths.

Partition root path: For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. Required: no. Allowed values: String. Data flow script property: partitionRootPath.

List of files: Whether your source is pointing to a text file that lists files to process. Required: no. Allowed values: true or false. Data flow script property: fileList.

Column to store file name: Create a new column with the source file name and path. Required: no. Allowed values: String. Data flow script property: rowUrlColumn.

After completion: Delete or move the files after processing. File path starts from the container root. Required: no. Allowed values: Delete: true or false; Move: [<from>, <to>]. Data flow script properties: purgeFiles, moveFiles.

Filter by last modified: Choose to filter files based upon when they were last altered. Required: no. Allowed values: Timestamp. Data flow script properties: modifiedAfter, modifiedBefore.

Allow no files found: If true, an error is not thrown if no files are found. Required: no. Allowed values: true or false. Data flow script property: ignoreNoFilesFound.

Source example
The associated data flow script of a Parquet source configuration is:

source(allowSchemaDrift: true,
validateSchema: false,
rowUrlColumn: 'fileName',
format: 'parquet') ~> ParquetSource

Sink properties
The below table lists the properties supported by a parquet sink. You can edit these properties in the Settings
tab.

Format: Format must be parquet. Required: yes. Allowed values: parquet. Data flow script property: format.

Clear the folder: If the destination folder is cleared prior to write. Required: no. Allowed values: true or false. Data flow script property: truncate.

File name option: The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid>. Required: no. Allowed values and script properties: Pattern: String -> filePattern; Per partition: String[] -> partitionFileNames; As data in column: String -> rowUrlColumn; Output to single file: ['<fileName>'] -> partitionFileNames.

Sink example
The associated data flow script of a Parquet sink configuration is:

ParquetSource sink(
format: 'parquet',
filePattern:'output[n].parquet',
truncate: true,
allowSchemaDrift: true,
validateSchema: false,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> ParquetSink

Data type support


Parquet complex data types (e.g. MAP, LIST, STRUCT) are currently supported only in Data Flows, not in Copy
Activity. To use complex types in data flows, do not import the file schema in the dataset; leave the schema blank
in the dataset. Then, in the Source transformation, import the projection.

Using Self-hosted Integration Runtime


IMPORTANT
For copies empowered by the Self-hosted Integration Runtime (for example, between on-premises and cloud data stores), if you are not
copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK and the
Microsoft Visual C++ 2010 Redistributable Package on your IR machine. See the following paragraph for more details.

For copies running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE; if that is not found, it then checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires a 64-bit JRE. You can find it from here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
To install the Visual C++ 2010 Redistributable Package: The Visual C++ 2010 Redistributable Package is not
installed with self-hosted IR installations. You can find it from here.

TIP
If you copy data to/from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM, then rerun the pipeline.

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM starts with Xms amount of memory and can use at most Xmx amount of memory.
By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from PayPal using Azure Data Factory
(Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from PayPal. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This PayPal connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from PayPal to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PayPal connector.

Linked service properties


The following properties are supported for PayPal linked service:

type: The type property must be set to: PayPal. Required: Yes.

host: The URL of the PayPal instance (that is, api.sandbox.paypal.com). Required: Yes.

clientId: The client ID associated with your PayPal application. Required: Yes.

clientSecret: The client secret associated with your PayPal application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

useEncryptedEndpoints: Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. Required: No.

useHostVerification: Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. Required: No.

usePeerVerification: Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. Required: No.

Example:

{
"name": "PayPalLinkedService",
"properties": {
"type": "PayPal",
"typeProperties": {
"host" : "api.sandbox.paypal.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}
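If you prefer not to embed the client secret in the linked service definition, the clientSecret field can reference a secret stored in Azure Key Vault instead. The following is a sketch that reuses the Key Vault reference pattern shown for other connectors in this documentation; the Key Vault linked service and secret names are placeholders:

{
    "name": "PayPalLinkedService",
    "properties": {
        "type": "PayPal",
        "typeProperties": {
            "host" : "api.sandbox.paypal.com",
            "clientId" : "<clientId>",
            "clientSecret": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        }
    }
}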

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PayPal dataset.
To copy data from PayPal, set the type property of the dataset to PayPalObject . The following properties are
supported:
type: The type property of the dataset must be set to: PayPalObject. Required: Yes.

tableName: Name of the table. Required: No (if "query" in activity source is specified).

Example

{
"name": "PayPalDataset",
"properties": {
"type": "PayPalObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<PayPal linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by PayPal source.
PayPal as source
To copy data from PayPal, set the source type in the copy activity to PayPalSource . The following properties are
supported in the copy activity source section:

type: The type property of the copy activity source must be set to: PayPalSource. Required: Yes.

query: Use the custom SQL query to read data. For example: "SELECT * FROM Payment_Experience". Required: No (if "tableName" in dataset is specified).

Example:
"activities":[
{
"name": "CopyFromPayPal",
"type": "Copy",
"inputs": [
{
"referenceName": "<PayPal input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PayPalSource",
"query": "SELECT * FROM Payment_Experience"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Phoenix using Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Phoenix. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Phoenix connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Phoenix to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Phoenix connector.
Linked service properties
The following properties are supported for Phoenix linked service:

type: The type property must be set to: Phoenix. Required: Yes.

host: The IP address or host name of the Phoenix server (that is, 192.168.222.160). Required: Yes.

port: The TCP port that the Phoenix server uses to listen for client connections. The default value is 8765. If you connect to Azure HDInsights, specify port as 443. Required: No.

httpPath: The partial URL corresponding to the Phoenix server (that is, /gateway/sandbox/phoenix/version). Specify /hbasephoenix0 if using HDInsights cluster. Required: No.

authenticationType: The authentication mechanism used to connect to the Phoenix server. Allowed values are: Anonymous, UsernameAndPassword, WindowsAzureHDInsightService. Required: Yes.

username: The user name used to connect to the Phoenix server. Required: No.

password: The password corresponding to the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: No.

enableSsl: Specifies whether the connections to the server are encrypted using TLS. The default value is false. Required: No.

trustedCertPath: The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over TLS. This property can only be set when using TLS on self-hosted IR. The default value is the cacerts.pem file installed with the IR. Required: No.

useSystemTrustStore: Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. Required: No.

allowHostNameCNMismatch: Specifies whether to require a CA-issued TLS/SSL certificate name to match the host name of the server when connecting over TLS. The default value is false. Required: No.

allowSelfSignedServerCert: Specifies whether to allow self-signed certificates from the server. The default value is false. Required: No.

connectVia: The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. Required: No.

NOTE
If your cluster doesn't support sticky sessions, e.g. HDInsight, explicitly add the node index at the end of the HTTP path setting,
e.g. specify /hbasephoenix0 instead of /hbasephoenix .

Example:

{
"name": "PhoenixLinkedService",
"properties": {
"type": "Phoenix",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbasephoenix0",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
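As an additional sketch (illustrative only, not an official sample; the host, port, and credentials are placeholders), a Phoenix linked service that enables TLS and relies on the system trust store could combine the TLS-related properties above as follows:

{
    "name": "PhoenixLinkedService",
    "properties": {
        "type": "Phoenix",
        "typeProperties": {
            "host" : "<host>",
            "port" : "<port>",
            "authenticationType" : "UsernameAndPassword",
            "username" : "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "enableSsl": true,
            "useSystemTrustStore": true
        }
    }
}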

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Phoenix dataset.
To copy data from Phoenix, set the type property of the dataset to PhoenixObject . The following properties are
supported:

type: The type property of the dataset must be set to: PhoenixObject. Required: Yes.

schema: Name of the schema. Required: No (if "query" in activity source is specified).

table: Name of the table. Required: No (if "query" in activity source is specified).

tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).

Example

{
"name": "PhoenixDataset",
"properties": {
"type": "PhoenixObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Phoenix linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Phoenix source.
Phoenix as source
To copy data from Phoenix, set the source type in the copy activity to PhoenixSource . The following properties
are supported in the copy activity source section:

type: The type property of the copy activity source must be set to: PhoenixSource. Required: Yes.

query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).

Example:
"activities":[
{
"name": "CopyFromPhoenix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Phoenix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PhoenixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from PostgreSQL by using Azure Data
Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a PostgreSQL
database. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This PostgreSQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from PostgreSQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this PostgreSQL connector supports PostgreSQL version 7.4 and above .

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in PostgreSQL driver starting from version 3.7, therefore you don't
need to manually install any driver.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PostgreSQL connector.

Linked service properties


The following properties are supported for PostgreSQL linked service:

type: The type property must be set to: PostgreSql. Required: Yes.

connectionString: An ODBC connection string to connect to Azure Database for PostgreSQL. You can also put the password in Azure Key Vault and pull the password configuration out of the connection string. Refer to the following samples and the Store credentials in Azure Key Vault article with more details. Required: Yes.

connectVia: The Integration Runtime to be used to connect to the data store. Learn more from the Prerequisites section. If not specified, it uses the default Azure Integration Runtime. Required: No.

A typical connection string is Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>. More properties you can set per your case:

EncryptionMethod (EM): The method the driver uses to encrypt data sent between the driver and the database server. E.g., EncryptionMethod=<0/1/6>;. Options: 0 (No Encryption) (Default) / 1 (SSL) / 6 (RequestSSL). Required: No.

ValidateServerCertificate (VSC): Determines whether the driver validates the certificate that is sent by the database server when SSL encryption is enabled (Encryption Method=1). E.g., ValidateServerCertificate=<0/1>;. Options: 0 (Disabled) (Default) / 1 (Enabled). Required: No.
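For example, a connection string that enables SSL encryption and server certificate validation could look like the following sketch (all values are placeholders):

Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;ValidateServerCertificate=1;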

Example:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=
<Password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using the PostgreSQL linked service with the following payload, it is still supported as-is, but you are
encouraged to use the new one going forward.
Previous payload:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PostgreSQL dataset.
To copy data from PostgreSQL, the following properties are supported:

type: The type property of the dataset must be set to: PostgreSqlTable. Required: Yes.

schema: Name of the schema. Required: No (if "query" in activity source is specified).

table: Name of the table. Required: No (if "query" in activity source is specified).

tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).

Example

{
"name": "PostgreSQLDataset",
"properties":
{
"type": "PostgreSqlTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<PostgreSQL linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using a RelationalTable typed dataset, it is still supported as-is, but you are encouraged to use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by PostgreSQL source.
PostgreSQL as source
To copy data from PostgreSQL, the following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to: PostgreSqlSource. Required: Yes.

query: Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"MySchema\".\"MyTable\"". Required: No (if "tableName" in dataset is specified).

NOTE
Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query.

Example:

"activities":[
{
"name": "CopyFromPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<PostgreSQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PostgreSqlSource",
"query": "SELECT * FROM \"MySchema\".\"MyTable\""
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using a RelationalSource typed source, it is still supported as-is, but you are encouraged to use the
new one going forward.
Lookup activity properties
To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Presto using Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Presto. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Presto connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Presto to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Presto connector.

Linked service properties


The following properties are supported for Presto linked service:

type: The type property must be set to: Presto. Required: Yes.

host: The IP address or host name of the Presto server (e.g. 192.168.222.160). Required: Yes.

serverVersion: The version of the Presto server (e.g. 0.148-t). Required: Yes.

catalog: The catalog context for all requests against the server. Required: Yes.

port: The TCP port that the Presto server uses to listen for client connections. The default value is 8080. Required: No.

authenticationType: The authentication mechanism used to connect to the Presto server. Allowed values are: Anonymous, LDAP. Required: Yes.

username: The user name used to connect to the Presto server. Required: No.

password: The password corresponding to the user name. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: No.

enableSsl: Specifies whether the connections to the server are encrypted using TLS. The default value is false. Required: No.

trustedCertPath: The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over TLS. This property can only be set when using TLS on self-hosted IR. The default value is the cacerts.pem file installed with the IR. Required: No.

useSystemTrustStore: Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. Required: No.

allowHostNameCNMismatch: Specifies whether to require a CA-issued TLS/SSL certificate name to match the host name of the server when connecting over TLS. The default value is false. Required: No.

allowSelfSignedServerCert: Specifies whether to allow self-signed certificates from the server. The default value is false. Required: No.

timeZoneID: The local time zone used by the connection. Valid values for this option are specified in the IANA Time Zone Database. The default value is the system time zone. Required: No.

Example:
{
"name": "PrestoLinkedService",
"properties": {
"type": "Presto",
"typeProperties": {
"host" : "<host>",
"serverVersion" : "0.148-t",
"catalog" : "<catalog>",
"port" : "<port>",
"authenticationType" : "LDAP",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"timeZoneID" : "Europe/Berlin"
}
}
}
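As an additional sketch (illustrative only; all values are placeholders), a Presto linked service that uses anonymous authentication over TLS with the system trust store could combine the TLS-related properties above as follows:

{
    "name": "PrestoLinkedService",
    "properties": {
        "type": "Presto",
        "typeProperties": {
            "host" : "<host>",
            "serverVersion" : "<server version>",
            "catalog" : "<catalog>",
            "authenticationType" : "Anonymous",
            "enableSsl": true,
            "useSystemTrustStore": true
        }
    }
}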

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Presto dataset.
To copy data from Presto, set the type property of the dataset to PrestoObject . The following properties are
supported:

type: The type property of the dataset must be set to: PrestoObject. Required: Yes.

schema: Name of the schema. Required: No (if "query" in activity source is specified).

table: Name of the table. Required: No (if "query" in activity source is specified).

tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).

Example

{
"name": "PrestoDataset",
"properties": {
"type": "PrestoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Presto linked service name>",
"type": "LinkedServiceReference"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Presto source.
Presto as source
To copy data from Presto, set the source type in the copy activity to PrestoSource . The following properties are
supported in the copy activity source section:

type: The type property of the copy activity source must be set to: PrestoSource. Required: Yes.

query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).

Example:

"activities":[
{
"name": "CopyFromPresto",
"type": "Copy",
"inputs": [
{
"referenceName": "<Presto input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PrestoSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from QuickBooks Online using Azure
Data Factory (Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from QuickBooks Online. It
builds on the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This QuickBooks connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from QuickBooks Online to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
This connector supports QuickBooks OAuth 2.0 authentication.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
QuickBooks connector.

Linked service properties


The following properties are supported for QuickBooks linked service:

type: The type property must be set to: QuickBooks. Required: Yes.

connectionProperties: A group of properties that defines how to connect to QuickBooks. Required: Yes.

Under connectionProperties:

endpoint: The endpoint of the QuickBooks Online server (that is, quickbooks.api.intuit.com). Required: Yes.

companyId: The company ID of the QuickBooks company to authorize. For info about how to find the company ID, see How do I find my Company ID. Required: Yes.

consumerKey: The client ID of your QuickBooks Online application for OAuth 2.0 authentication. Learn more from here. Required: Yes.

consumerSecret: The client secret of your QuickBooks Online application for OAuth 2.0 authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

refreshToken: The OAuth 2.0 refresh token associated with the QuickBooks application. Learn more from here. Note that the refresh token expires after 180 days, so customers need to regularly update it. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

useEncryptedEndpoints: Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. Required: No.

Example:
{
"name": "QuickBooksLinkedService",
"properties": {
"type": "QuickBooks",
"typeProperties": {
"connectionProperties":{
"endpoint":"quickbooks.api.intuit.com",
"companyId":"<company id>",
"consumerKey":"<consumer key>",
"consumerSecret":{
"type": "SecureString",
"value": "<clientSecret>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true
}
}
}
}
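Because the refresh token expires after 180 days and must be rotated regularly, you may prefer to reference it from Azure Key Vault rather than embedding it in the linked service. The following fragment is a sketch only, reusing the Key Vault reference pattern shown for other connectors in this documentation; the linked service and secret names are placeholders:

"refreshToken": {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "<Azure Key Vault linked service name>",
        "type": "LinkedServiceReference"
    },
    "secretName": "<secret that holds the refresh token>"
}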

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by QuickBooks dataset.
To copy data from QuickBooks Online, set the type property of the dataset to QuickBooksObject . The
following properties are supported:

type: The type property of the dataset must be set to: QuickBooksObject. Required: Yes.

tableName: Name of the table. Required: No (if "query" in activity source is specified).

Example

{
"name": "QuickBooksDataset",
"properties": {
"type": "QuickBooksObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<QuickBooks linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by QuickBooks source.
QuickBooks as source
To copy data from QuickBooks Online, set the source type in the copy activity to QuickBooksSource . The
following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to: QuickBooksSource. Required: Yes.

query: Use the custom SQL query to read data. For example: "SELECT * FROM "Bill" WHERE Id = '123'". Required: No (if "tableName" in dataset is specified).

Example:

"activities":[
{
"name": "CopyFromQuickBooks",
"type": "Copy",
"inputs": [
{
"referenceName": "<QuickBooks input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "QuickBooksSource",
"query": "SELECT * FROM \"Bill\" WHERE Id = '123' "
},
"sink": {
"type": "<sink type>"
}
}
}
]

Copy data from Quickbooks Desktop


The Copy Activity in Azure Data Factory cannot copy data directly from Quickbooks Desktop. To copy data from
Quickbooks Desktop, export your Quickbooks data to a comma-separated-values (CSV) file and then upload the
file to Azure Blob Storage. From there, you can use Data Factory to copy the data to the sink of your choice.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to a REST endpoint by using
Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to a REST endpoint.
The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.
The differences among this REST connector, the HTTP connector, and the Web table connector are:
REST connector specifically supports copying data from RESTful APIs.
HTTP connector is generic for retrieving data from any HTTP endpoint, for example, to download a file. Before
this REST connector existed, you may have used the HTTP connector to copy data from a RESTful API, which is
supported but less functional compared to the REST connector.
Web table connector extracts table content from an HTML webpage.

Supported capabilities
You can copy data from a REST source to any supported sink data store. You also can copy data from any
supported source data store to a REST sink. For a list of data stores that Copy Activity supports as sources and
sinks, see Supported data stores and formats.
Specifically, this generic REST connector supports:
Copying data from a REST endpoint by using the GET or POST methods and copying data to a REST
endpoint by using the POST , PUT or PATCH methods.
Copying data by using one of the following authentications: Anonymous , Basic , AAD ser vice principal ,
and managed identities for Azure resources .
Pagination in the REST APIs.
For REST as source, copying the REST JSON response as-is or parsing it by using schema mapping. Only
response payload in JSON is supported.

TIP
To test a request for data retrieval before you configure the REST connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the REST connector.

Linked service properties


The following properties are supported for the REST linked service:

type: The type property must be set to RestService. Required: Yes.

url: The base URL of the REST service. Required: Yes.

enableServerCertificateValidation: Whether to validate the server-side TLS/SSL certificate when connecting to the endpoint. Required: No (the default is true).

authenticationType: Type of authentication used to connect to the REST service. Allowed values are Anonymous, Basic, AadServicePrincipal, and ManagedServiceIdentity. User-based OAuth isn't supported. You can additionally configure authentication headers in the authHeaders property. Refer to the corresponding sections below for more properties and examples. Required: Yes.

authHeaders: Additional HTTP request headers for authentication. For example, to use API key authentication, you can select the authentication type "Anonymous" and specify the API key in the header. Required: No.

connectVia: The Integration Runtime to use to connect to the data store. Learn more from the Prerequisites section. If not specified, this property uses the default Azure Integration Runtime. Required: No.
Use basic authentication
Set the authenticationType property to Basic . In addition to the generic properties that are described in the
preceding section, specify the following properties:

userName: The user name to use to access the REST endpoint. Required: Yes.

password: The password for the user (the userName value). Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. Required: Yes.

Example

{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<REST endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use AAD service principal authentication


Set the authenticationType property to AadSer vicePrincipal . In addition to the generic properties that are
described in the preceding section, specify the following properties:

servicePrincipalId: Specify the Azure Active Directory application's client ID. Required: Yes.

servicePrincipalKey: Specify the Azure Active Directory application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes.

aadResourceId: Specify the AAD resource you are requesting for authorization, for example, https://management.core.windows.net. Required: Yes.

azureCloudType: For service principal authentication, specify the type of Azure cloud environment to which your AAD application is registered. Allowed values are AzurePublic, AzureChina, AzureUsGovernment, and AzureGermany. By default, the data factory's cloud environment is used. Required: No.

Example

{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use managed identities for Azure resources authentication


Set the authenticationType property to ManagedSer viceIdentity . In addition to the generic properties that
are described in the preceding section, specify the following properties:

aadResourceId: Specify the AAD resource you are requesting for authorization, for example, https://management.core.windows.net. Required: Yes.

Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "ManagedServiceIdentity",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Using authentication headers


In addition, you can configure request headers for authentication along with the built-in authentication types.
Example: Using API key authentication

{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint>",
"authenticationType": "Anonymous",
"authHeader": {
"x-api-key": {
"type": "SecureString",
"value": "<API key>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties that the REST dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from REST, the following properties are supported:

type: The type property of the dataset must be set to RestResource. Required: Yes.

relativeUrl: A relative URL to the resource that contains the data. When this property isn't specified, only the URL that's specified in the linked service definition is used. The HTTP connector copies data from the combined URL: [URL specified in linked service]/[relative URL specified in dataset]. Required: No.

If you were setting requestMethod, additionalHeaders, requestBody, and paginationRules in the dataset, they are still
supported as-is, but you are encouraged to use the new model in the activity going forward.
Example:

{
"name": "RESTDataset",
"properties": {
"type": "RestResource",
"typeProperties": {
"relativeUrl": "<relative url>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<REST linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy Activity properties


This section provides a list of properties supported by the REST source and sink.
For a full list of sections and properties that are available for defining activities, see Pipelines.
REST as source
The following properties are supported in the copy activity source section:

type: The type property of the copy activity source must be set to RestSource. Required: Yes.

requestMethod: The HTTP method. Allowed values are GET (default) and POST. Required: No.

additionalHeaders: Additional HTTP request headers. Required: No.

requestBody: The body for the HTTP request. Required: No.

paginationRules: The pagination rules to compose next page requests. Refer to the pagination support section for details. Required: No.

httpRequestTimeout: The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. Required: No.

requestInterval: The time to wait before sending the request for the next page. The default value is 00:00:01. Required: No.

NOTE
The REST connector ignores any "Accept" header specified in additionalHeaders. Because the REST connector only supports responses
in JSON, it will auto-generate a header of Accept: application/json.

Example 1: Using the Get method with pagination

"activities":[
{
"name": "CopyFromREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<REST input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RestSource",
"additionalHeaders": {
"x-user-defined": "helloworld"
},
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
},
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example 2: Using the Post method


"activities":[
{
"name": "CopyFromREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<REST input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RestSource",
"requestMethod": "Post",
"requestBody": "<body for POST REST request>",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]

REST as sink
The following properties are supported in the copy activity sink section:

type: The type property of the copy activity sink must be set to RestSink. Required: Yes.

requestMethod: The HTTP method. Allowed values are POST (default), PUT, and PATCH. Required: No.

additionalHeaders: Additional HTTP request headers. Required: No.

httpRequestTimeout: The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to write the data. The default value is 00:01:40. Required: No.

requestInterval: The interval time between different requests in milliseconds. The request interval value should be a number between [10, 60000]. Required: No.

httpCompressionType: HTTP compression type to use while sending data with Optimal Compression Level. Allowed values are none and gzip. Required: No.

writeBatchSize: Number of records to write to the REST sink per batch. The default value is 10000. Required: No.

The REST connector as sink works with REST APIs that accept JSON. The data will be sent in JSON with the
following pattern. As needed, you can use the copy activity schema mapping to reshape the source data to
conform to the payload expected by the REST API.

[
{ <data object> },
{ <data object> },
...
]

Example:

"activities":[
{
"name": "CopyToREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<REST output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "RestSink",
"requestMethod": "POST",
"httpRequestTimeout": "00:01:40",
"requestInterval": 10,
"writeBatchSize": 10000,
"httpCompressionType": "none",
},
}
}
]

Pagination support
When copying data from REST APIs, the REST API normally limits the response payload size of a single request
to a reasonable number; to return a large amount of data, it splits the result into multiple pages and
requires callers to send consecutive requests to get the next page of the result. Usually, the request for one page is
dynamic and composed from the information returned in the response of the previous page.
This generic REST connector supports the following pagination patterns:
Next request’s absolute or relative URL = property value in current response body
Next request’s absolute or relative URL = header value in current response headers
Next request’s query parameter = property value in current response body
Next request’s query parameter = header value in current response headers
Next request’s header = property value in current response body
Next request’s header = header value in current response headers
Pagination rules are defined as a dictionary in dataset, which contains one or more case-sensitive key-value
pairs. The configuration will be used to generate the request starting from the second page. The connector will
stop iterating when it gets HTTP status code 204 (No Content), or any of the JSONPath expressions in
"paginationRules" returns null.
Supported keys in pagination rules:

KEY | DESCRIPTION
AbsoluteUrl | Indicates the URL to issue the next request. It can be either an absolute URL or a relative URL.
QueryParameters.request_query_parameter OR QueryParameters['request_query_parameter'] | "request_query_parameter" is user-defined, and references one query parameter name in the next HTTP request URL.
Headers.request_header OR Headers['request_header'] | "request_header" is user-defined, and references one header name in the next HTTP request.

Supported values in pagination rules:

VALUE | DESCRIPTION
Headers.response_header OR Headers['response_header'] | "response_header" is user-defined, and references one header name in the current HTTP response, the value of which will be used to issue the next request.
A JSONPath expression starting with "$" (representing the root of the response body) | The response body should contain only one JSON object. The JSONPath expression should return a single primitive value, which will be used to issue the next request.

Example:
The Facebook Graph API returns a response in the following structure, in which case the next page's URL is represented in paging.next:
{
"data": [
{
"created_time": "2017-12-12T14:12:20+0000",
"name": "album1",
"id": "1809938745705498_1809939942372045"
},
{
"created_time": "2017-12-12T14:14:03+0000",
"name": "album2",
"id": "1809938745705498_1809941802371859"
},
{
"created_time": "2017-12-12T14:14:11+0000",
"name": "album3",
"id": "1809938745705498_1809941879038518"
}
],
"paging": {
"cursors": {
"after": "MTAxNTExOTQ1MjAwNzI5NDE=",
"before": "NDMyNzQyODI3OTQw"
},
"previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
"next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
}
}

The corresponding REST copy activity source configuration, especially the paginationRules, is as follows:

"typeProperties": {
"source": {
"type": "RestSource",
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
},
...
},
"sink": {
"type": "<sink type>"
}
}
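
As another illustration, if an API returned the next page's cursor as a property in the response body and expected it back as a query parameter, the pagination rule could use the QueryParameters key instead of AbsoluteUrl. This is a hypothetical sketch; the parameter name cursor and the JSONPath $.metadata.nextCursor are placeholders for whatever your API actually returns.

"typeProperties": {
    "source": {
        "type": "RestSource",
        "paginationRules": {
            "QueryParameters.cursor": "$.metadata.nextCursor"
        }
    },
    "sink": {
        "type": "<sink type>"
    }
}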

Use OAuth
This section describes how to use a solution template to copy data from REST connector into Azure Data Lake
Storage in JSON format using OAuth.
About the solution template
The template contains two activities:
Web activity retrieves the bearer token and then passes it to the subsequent Copy activity as authorization.
Copy activity copies data from REST to Azure Data Lake Storage.
The template defines two parameters:
SinkContainer is the root folder path where the data is copied to in your Azure Data Lake Storage.
SinkDirectory is the directory path under the root where the data is copied to in your Azure Data Lake Storage.
How to use this solution template
1. Go to the Copy from REST or HTTP using OAuth template. Create a new connection for Source
Connection.

Below are key steps for new linked service (REST) settings:
a. Under Base URL , specify the url parameter for your own source REST service.
b. For Authentication type , choose Anonymous.
2. Create a new connection for Destination Connection.

3. Select Use this template .


4. You would see the pipeline created as shown in the following example:

5. Select the Web activity. In Settings, specify the corresponding URL, Method, Headers, and Body to retrieve the OAuth bearer token from the login API of the service that you want to copy data from. The placeholder in the template showcases a sample of Azure Active Directory (AAD) OAuth. Note that AAD authentication is natively supported by the REST connector; this is just an example of an OAuth flow.

PROPERTY | DESCRIPTION
URL | Specify the URL to retrieve the OAuth bearer token from. For example, in the sample here it's https://login.microsoftonline.com/microsoft.onmicrosoft.com/oauth2/token.
Method | The HTTP method. Allowed values are Post and Get.
Headers | Header is user-defined, and references one header name in the HTTP request.
Body | The body for the HTTP request.
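
As a reference only, a Web activity configured for the AAD client-credentials flow that the template placeholder illustrates might look like the following sketch. The activity name GetBearerToken and all bracketed values are placeholders; adjust them to your own service's login API.

{
    "name": "GetBearerToken",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://login.microsoftonline.com/<tenant>/oauth2/token",
        "method": "POST",
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded"
        },
        "body": "grant_type=client_credentials&client_id=<client id>&client_secret=<client secret>&resource=<resource URI>"
    }
}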

6. In the Copy data activity, select the Source tab. The bearer token (access_token) retrieved in the previous step is passed to the Copy data activity as Authorization under Additional headers. Confirm the settings for the following properties before starting a pipeline run.

PROPERTY | DESCRIPTION
Request method | The HTTP method. Allowed values are Get (default) and Post.
Additional headers | Additional HTTP request headers.
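
For example, assuming the Web activity from step 5 is named GetBearerToken (a hypothetical name), the additional header on the copy source could reference its output with an expression like the following sketch.

"additionalHeaders": {
    "Authorization": "Bearer @{activity('GetBearerToken').output.access_token}"
}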


7. Select Debug , enter the Parameters , and then select Finish .
8. When the pipeline run completes successfully, you see a result similar to the following example:

9. Select the "Output" icon of WebActivity in the Actions column to see the access_token returned by the service.
10. Select the "Input" icon of CopyActivity in the Actions column to see that the access_token retrieved by WebActivity is passed to CopyActivity for authentication.

CAUTION

To avoid the token being logged in plain text, enable "Secure output" in the Web activity and "Secure input" in the Copy activity.

Export JSON response as-is


You can use this REST connector to export the REST API JSON response as-is to various file-based stores. To achieve such a schema-agnostic copy, skip the "structure" (also called schema) section in the dataset and the schema mapping in the copy activity.
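
As an illustrative sketch (not part of the template above), a schema-agnostic sink dataset for Azure Data Lake Storage Gen2 in JSON format could look like the following, with schema left empty and no mapping defined in the copy activity; the container and folder names are placeholders.

{
    "name": "RestResponseAsJson",
    "properties": {
        "type": "Json",
        "linkedServiceName": {
            "referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "<container name>",
                "folderPath": "<folder path>"
            }
        },
        "schema": []
    }
}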

Schema mapping
To copy data from REST endpoint to tabular sink, refer to schema mapping.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Salesforce by using Azure
Data Factory
5/26/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Salesforce. It builds
on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
This Salesforce connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce to any supported sink data store. You also can copy data from any supported
source data store to Salesforce. For a list of data stores that are supported as sources or sinks by the Copy
activity, see the Supported data stores table.
Specifically, this Salesforce connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API. By default, when copying data from
Salesforce, the connector uses v45 and automatically chooses between REST and Bulk APIs based on the data
size – when the result set is large, Bulk API is used for better performance; when writing data to Salesforce, the
connector uses v40 of Bulk API. You can also explicitly set the API version used to read/write data via
apiVersion property in linked service.

Prerequisites
API permission must be enabled in Salesforce.

Salesforce request limits


Salesforce has limits for both total API requests and concurrent API requests. Note the following points:
If the number of concurrent requests exceeds the limit, throttling occurs and you see random failures.
If the total number of requests exceeds the limit, the Salesforce account is blocked for 24 hours.
You might also receive the "REQUEST_LIMIT_EXCEEDED" error message in both scenarios. For more information,
see the "API request limits" section in Salesforce developer limits.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Salesforce connector.

Linked service properties


The following properties are supported for the Salesforce linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to Salesforce. | Yes
environmentUrl | Specify the URL of the Salesforce instance. Default is "https://login.salesforce.com". To copy data from sandbox, specify "https://test.salesforce.com". To copy data from custom domain, specify, for example, "https://[domain].my.salesforce.com". | No
username | Specify a user name for the user account. | Yes
password | Specify a password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
securityToken | Specify a security token for the user account. To learn about security tokens in general, see Security and the API. The security token can be skipped only if you add the Integration Runtime's IP to the trusted IP address list on Salesforce. When using Azure IR, refer to Azure Integration Runtime IP addresses. For instructions on how to get and reset a security token, see Get a security token. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
apiVersion | Specify the Salesforce REST/Bulk API version to use, e.g. 48.0. By default, the connector uses v45 to copy data from Salesforce, and uses v40 to copy data to Salesforce. | No
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No

Example: Store credentials in Data Factory

{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"securityToken": {
"type": "SecureString",
"value": "<security token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Store credentials in Key Vault


{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of password in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
},
"securityToken": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of security token in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
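
For example, to point the linked service at a sandbox environment and pin a specific API version, you could combine environmentUrl and apiVersion as in the following sketch; all values are placeholders.

{
    "name": "SalesforceSandboxLinkedService",
    "properties": {
        "type": "Salesforce",
        "typeProperties": {
            "environmentUrl": "https://test.salesforce.com",
            "apiVersion": "48.0",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "securityToken": {
                "type": "SecureString",
                "value": "<security token>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}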

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce dataset.
To copy data from and to Salesforce, set the type property of the dataset to SalesforceObject . The following
properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SalesforceObject. | Yes
objectApiName | The Salesforce object name to retrieve data from. | No for source, Yes for sink

IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:
{
"name": "SalesforceDataset",
"properties": {
"type": "SalesforceObject",
"typeProperties": {
"objectApiName": "MyTable__c"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Salesforce linked service name>",
"type": "LinkedServiceReference"
}
}
}

NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalTable" type dataset, it
keeps working while you see a suggestion to switch to the new "SalesforceObject" type.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to RelationalTable. | Yes
tableName | Name of the table in Salesforce. | No (if "query" in the activity source is specified)

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Salesforce source and sink.
Salesforce as a source type
To copy data from Salesforce, set the source type in the copy activity to SalesforceSource . The following
properties are supported in the copy activity source section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to SalesforceSource. | Yes
query | Use the custom query to read data. You can use a Salesforce Object Query Language (SOQL) query or SQL-92 query. See more tips in the query tips section. If query is not specified, all the data of the Salesforce object specified in "objectApiName" in the dataset will be retrieved. | No (if "objectApiName" in the dataset is specified)
readBehavior | Indicates whether to query the existing records, or query all records including the deleted ones. If not specified, the default behavior is the former. Allowed values: query (default), queryAll. | No

IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:

"activities":[
{
"name": "CopyFromSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]

NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalSource" type copy, the
source keeps working while you see a suggestion to switch to the new "SalesforceSource" type.

Salesforce as a sink type


To copy data to Salesforce, set the sink type in the copy activity to SalesforceSink . The following properties are
supported in the copy activity sink section.
PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to SalesforceSink. | Yes
writeBehavior | The write behavior for the operation. Allowed values are Insert and Upsert. | No (default is Insert)
externalIdFieldName | The name of the external ID field for the upsert operation. The specified field must be defined as "External ID Field" in the Salesforce object. It can't have NULL values in the corresponding input data. | Yes for "Upsert"
writeBatchSize | The row count of data written to Salesforce in each batch. | No (default is 5,000)
ignoreNullValues | Indicates whether to ignore NULL values from input data during a write operation. Allowed values are true and false. True: Leave the data in the destination object unchanged when you do an upsert or update operation. Insert a defined default value when you do an insert operation. False: Update the data in the destination object to NULL when you do an upsert or update operation. Insert a NULL value when you do an insert operation. | No (default is false)
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example: Salesforce sink in a copy activity


"activities":[
{
"name": "CopyToSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Salesforce output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SalesforceSink",
"writeBehavior": "Upsert",
"externalIdFieldName": "CustomerId__c",
"writeBatchSize": 10000,
"ignoreNullValues": true
}
}
}
]

Query tips
Retrieve data from a Salesforce report
You can retrieve data from Salesforce reports by specifying a query as {call "<report name>"} . An example is
"query": "{call \"TestReport\"}" .

Retrieve deleted records from the Salesforce Recycle Bin


To query the soft deleted records from the Salesforce Recycle Bin, you can specify readBehavior as queryAll .
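For example, a copy activity source that also returns soft-deleted rows might be configured as in the following sketch; the SOQL query is illustrative only.

"source": {
    "type": "SalesforceSource",
    "query": "SELECT Id, Name FROM Account WHERE IsDeleted=True",
    "readBehavior": "queryAll"
}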
Difference between SOQL and SQL query syntax
When copying data from Salesforce, you can use either a SOQL query or a SQL query. Note that these two have different syntax and functionality support; do not mix them. You are suggested to use the SOQL query, which is natively supported by Salesforce. The following table lists the main differences:

SYNTAX | SOQL MODE | SQL MODE
Column selection | Need to enumerate the fields to be copied in the query, e.g. SELECT field1, field2 FROM objectname | SELECT * is supported in addition to column selection.
Quotation marks | Field/object names cannot be quoted. | Field/object names can be quoted, e.g. SELECT "id" FROM "Account"
Datetime format | Refer to details here and samples in the next section. | Refer to details here and samples in the next section.
Boolean values | Represented as False and True, e.g. SELECT … WHERE IsDeleted=True. | Represented as 0 or 1, e.g. SELECT … WHERE IsDeleted=1.
Column renaming | Not supported. | Supported, e.g. SELECT a AS b FROM ….
Relationship | Supported, e.g. Account_vod__r.nvs_Country__c. | Not supported.

Retrieve data by using a where clause on the DateTime column


When you specify the SOQL or SQL query, pay attention to the DateTime format difference. For example:
SOQL sample :
SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >=
@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-ddTHH:mm:ssZ')} AND LastModifiedDate <
@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-ddTHH:mm:ssZ')}
SQL sample :
SELECT * FROM Account WHERE LastModifiedDate >=
{ts'@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}'} AND LastModifiedDate <
{ts'@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'}

Error of MALFORMED_QUERY: Truncated


If you hit the error "MALFORMED_QUERY: Truncated", it's normally because you have a JunctionIdList type column in the data, and Salesforce has a limitation on supporting such data with a large number of rows. To mitigate, try to exclude the JunctionIdList column or limit the number of rows to copy (you can partition the work into multiple copy activity runs, as in the sketch below).
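
For example, one way to partition the copy is by a date range, assuming the object has a suitable DateTime field such as CreatedDate (the object, field names, and ranges below are placeholders); each query then feeds a separate copy activity run.

SELECT Id, Name FROM MyObject__c WHERE CreatedDate >= 2021-01-01T00:00:00Z AND CreatedDate < 2021-02-01T00:00:00Z
SELECT Id, Name FROM MyObject__c WHERE CreatedDate >= 2021-02-01T00:00:00Z AND CreatedDate < 2021-03-01T00:00:00Z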

Data type mapping for Salesforce


When you copy data from Salesforce, the following mappings are used from Salesforce data types to Data
Factory interim data types. To learn about how the copy activity maps the source schema and data type to the
sink, see Schema and data type mappings.

SALESFORCE DATA TYPE | DATA FACTORY INTERIM DATA TYPE
Auto Number | String
Checkbox | Boolean
Currency | Decimal
Date | DateTime
Date/Time | DateTime
Email | String
ID | String
Lookup Relationship | String
Multi-Select Picklist | String
Number | Decimal
Percent | Decimal
Phone | String
Picklist | String
Text | String
Text Area | String
Text Area (Long) | String
Text Area (Rich) | String
Text (Encrypted) | String
URL | String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to Salesforce Service Cloud by
using Azure Data Factory
5/6/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Salesforce Service
Cloud. It builds on the Copy Activity overview article that presents a general overview of the copy activity.

Supported capabilities
This Salesforce Service Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce Service Cloud to any supported sink data store. You also can copy data from
any supported source data store to Salesforce Service Cloud. For a list of data stores that are supported as
sources or sinks by the Copy activity, see the Supported data stores table.
Specifically, this Salesforce Service Cloud connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API. By default, when copying data from
Salesforce, the connector uses v45 and automatically chooses between REST and Bulk APIs based on the data
size – when the result set is large, Bulk API is used for better performance; when writing data to Salesforce, the
connector uses v40 of Bulk API. You can also explicitly set the API version used to read/write data via
apiVersion property in linked service.

Prerequisites
API permission must be enabled in Salesforce. For more information, see Enable API access in Salesforce by
permission set

Salesforce request limits


Salesforce has limits for both total API requests and concurrent API requests. Note the following points:
If the number of concurrent requests exceeds the limit, throttling occurs and you see random failures.
If the total number of requests exceeds the limit, the Salesforce account is blocked for 24 hours.
You might also receive the "REQUEST_LIMIT_EXCEEDED" error message in both scenarios. For more information,
see the "API request limits" section in Salesforce developer limits.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Salesforce Service Cloud connector.

Linked service properties


The following properties are supported for the Salesforce linked service.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SalesforceServiceCloud. | Yes
environmentUrl | Specify the URL of the Salesforce Service Cloud instance. Default is "https://login.salesforce.com". To copy data from sandbox, specify "https://test.salesforce.com". To copy data from custom domain, specify, for example, "https://[domain].my.salesforce.com". | No
username | Specify a user name for the user account. | Yes
password | Specify a password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
securityToken | Specify a security token for the user account. To learn about security tokens in general, see Security and the API. The security token can be skipped only if you add the Integration Runtime's IP to the trusted IP address list on Salesforce. When using Azure IR, refer to Azure Integration Runtime IP addresses. For instructions on how to get and reset a security token, see Get a security token. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
apiVersion | Specify the Salesforce REST/Bulk API version to use, e.g. 48.0. By default, the connector uses v45 to copy data from Salesforce, and uses v40 to copy data to Salesforce. | No
connectVia | The integration runtime to be used to connect to the data store. If not specified, it uses the default Azure Integration Runtime. | No

Example: Store credentials in Data Factory

{
"name": "SalesforceServiceCloudLinkedService",
"properties": {
"type": "SalesforceServiceCloud",
"typeProperties": {
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"securityToken": {
"type": "SecureString",
"value": "<security token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: Store credentials in Key Vault


{
"name": "SalesforceServiceCloudLinkedService",
"properties": {
"type": "SalesforceServiceCloud",
"typeProperties": {
"username": "<username>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of password in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
},
"securityToken": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name of security token in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce Service Cloud dataset.
To copy data from and to Salesforce Service Cloud, the following properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SalesforceServiceCloudObject. | Yes
objectApiName | The Salesforce object name to retrieve data from. | No for source, Yes for sink

IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:
{
"name": "SalesforceServiceCloudDataset",
"properties": {
"type": "SalesforceServiceCloudObject",
"typeProperties": {
"objectApiName": "MyTable__c"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Salesforce Service Cloud linked service name>",
"type": "LinkedServiceReference"
}
}
}

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to RelationalTable. | Yes
tableName | Name of the table in Salesforce Service Cloud. | No (if "query" in the activity source is specified)

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Salesforce Service Cloud source and sink.
Salesforce Service Cloud as a source type
To copy data from Salesforce Service Cloud, the following properties are supported in the copy activity source
section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to SalesforceServiceCloudSource. | Yes
query | Use the custom query to read data. You can use a Salesforce Object Query Language (SOQL) query or SQL-92 query. See more tips in the query tips section. If query is not specified, all the data of the Salesforce Service Cloud object specified in "objectApiName" in the dataset will be retrieved. | No (if "objectApiName" in the dataset is specified)
readBehavior | Indicates whether to query the existing records, or query all records including the deleted ones. If not specified, the default behavior is the former. Allowed values: query (default), queryAll. | No
IMPORTANT
The "__c" part of API Name is needed for any custom object.

Example:

"activities":[
{
"name": "CopyFromSalesforceServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce Service Cloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceServiceCloudSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Salesforce Service Cloud as a sink type


To copy data to Salesforce Service Cloud, the following properties are supported in the copy activity sink
section.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity sink must be set to SalesforceServiceCloudSink. | Yes
writeBehavior | The write behavior for the operation. Allowed values are Insert and Upsert. | No (default is Insert)
externalIdFieldName | The name of the external ID field for the upsert operation. The specified field must be defined as "External ID Field" in the Salesforce Service Cloud object. It can't have NULL values in the corresponding input data. | Yes for "Upsert"
writeBatchSize | The row count of data written to Salesforce Service Cloud in each batch. | No (default is 5,000)
ignoreNullValues | Indicates whether to ignore NULL values from input data during a write operation. Allowed values are true and false. True: Leave the data in the destination object unchanged when you do an upsert or update operation. Insert a defined default value when you do an insert operation. False: Update the data in the destination object to NULL when you do an upsert or update operation. Insert a NULL value when you do an insert operation. | No (default is false)
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example:
"activities":[
{
"name": "CopyToSalesforceServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Salesforce Service Cloud output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SalesforceServiceCloudSink",
"writeBehavior": "Upsert",
"externalIdFieldName": "CustomerId__c",
"writeBatchSize": 10000,
"ignoreNullValues": true
}
}
}
]

Query tips
Retrieve data from a Salesforce Service Cloud report
You can retrieve data from Salesforce Service Cloud reports by specifying a query as {call "<report name>"} .
An example is "query": "{call \"TestReport\"}" .
Retrieve deleted records from the Salesforce Service Cloud Recycle Bin
To query the soft deleted records from the Salesforce Service Cloud Recycle Bin, you can specify readBehavior
as queryAll .
Difference between SOQL and SQL query syntax
When copying data from Salesforce Service Cloud, you can use either a SOQL query or a SQL query. Note that these two have different syntax and functionality support; do not mix them. You are suggested to use the SOQL query, which is natively supported by Salesforce Service Cloud. The following table lists the main differences:

SYNTAX | SOQL MODE | SQL MODE
Column selection | Need to enumerate the fields to be copied in the query, e.g. SELECT field1, field2 FROM objectname | SELECT * is supported in addition to column selection.
Quotation marks | Field/object names cannot be quoted. | Field/object names can be quoted, e.g. SELECT "id" FROM "Account"
Datetime format | Refer to details here and samples in the next section. | Refer to details here and samples in the next section.
Boolean values | Represented as False and True, e.g. SELECT … WHERE IsDeleted=True. | Represented as 0 or 1, e.g. SELECT … WHERE IsDeleted=1.
Column renaming | Not supported. | Supported, e.g. SELECT a AS b FROM ….
Relationship | Supported, e.g. Account_vod__r.nvs_Country__c. | Not supported.

Retrieve data by using a where clause on the DateTime column


When you specify the SOQL or SQL query, pay attention to the DateTime format difference. For example:
SOQL sample :
SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >=
@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-ddTHH:mm:ssZ')} AND LastModifiedDate <
@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-ddTHH:mm:ssZ')}
SQL sample :
SELECT * FROM Account WHERE LastModifiedDate >=
{ts'@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}'} AND LastModifiedDate <
{ts'@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'}

Error of MALFORMED_QUERY: Truncated


If you hit the error "MALFORMED_QUERY: Truncated", it's normally because you have a JunctionIdList type column in the data, and Salesforce has a limitation on supporting such data with a large number of rows. To mitigate, try to exclude the JunctionIdList column or limit the number of rows to copy (you can partition the work into multiple copy activity runs).

Data type mapping for Salesforce Service Cloud


When you copy data from Salesforce Service Cloud, the following mappings are used from Salesforce Service
Cloud data types to Data Factory interim data types. To learn about how the copy activity maps the source
schema and data type to the sink, see Schema and data type mappings.

SALESFORCE SERVICE CLOUD DATA TYPE | DATA FACTORY INTERIM DATA TYPE
Auto Number | String
Checkbox | Boolean
Currency | Decimal
Date | DateTime
Date/Time | DateTime
Email | String
ID | String
Lookup Relationship | String
Multi-Select Picklist | String
Number | Decimal
Percent | Decimal
Phone | String
Picklist | String
Text | String
Text Area | String
Text Area (Long) | String
Text Area (Rich) | String
Text (Encrypted) | String
URL | String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Salesforce Marketing Cloud using
Azure Data Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Salesforce Marketing
Cloud. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Salesforce Marketing Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce Marketing Cloud to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
The Salesforce Marketing Cloud connector supports OAuth 2 authentication, and it supports both legacy and
enhanced package types. The connector is built on top of the Salesforce Marketing Cloud REST API.

NOTE
This connector doesn't support retrieving custom objects or custom data extensions.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Salesforce Marketing Cloud connector.

Linked service properties


The following properties are supported for Salesforce Marketing Cloud linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SalesforceMarketingCloud | Yes
connectionProperties | A group of properties that defines how to connect to Salesforce Marketing Cloud. | Yes

Under connectionProperties:

PROPERTY | DESCRIPTION | REQUIRED
authenticationType | Specifies the authentication method to use. Allowed values are Enhanced sts OAuth 2.0 or OAuth_2.0. The Salesforce Marketing Cloud legacy package only supports OAuth_2.0, while the enhanced package needs Enhanced sts OAuth 2.0. Since August 1, 2019, Salesforce Marketing Cloud has removed the ability to create legacy packages. All new packages are enhanced packages. | Yes
host | For the enhanced package, the host should be your subdomain, which is represented by a 28-character string starting with the letters "mc", e.g. mc563885gzs27c5t9-63k636ttgm. For the legacy package, specify www.exacttargetapis.com. | Yes
clientId | The client ID associated with the Salesforce Marketing Cloud application. | Yes
clientSecret | The client secret associated with the Salesforce Marketing Cloud application. You can choose to mark this field as a SecureString to store it securely in ADF, or store the secret in Azure Key Vault and let the ADF copy activity pull from there when performing the data copy - learn more from Store credentials in Key Vault. | Yes
useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No
useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No
usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No

Example: using enhanced STS OAuth 2 authentication for enhanced package


{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"connectionProperties": {
"host": "<subdomain e.g. mc563885gzs27c5t9-63k636ttgm>",
"authenticationType": "Enhanced sts OAuth 2.0",
"clientId": "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints": true,
"useHostVerification": true,
"usePeerVerification": true
}
}
}
}

Example: using OAuth 2 authentication for legacy package

{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"connectionProperties": {
"host": "www.exacttargetapis.com",
"authenticationType": "OAuth_2.0",
"clientId": "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints": true,
"useHostVerification": true,
"usePeerVerification": true
}
}
}
}

If you were using the Salesforce Marketing Cloud linked service with the following payload, it is still supported as-is, but you are suggested to use the new one going forward, which adds enhanced package support.
{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"clientId": "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints": true,
"useHostVerification": true,
"usePeerVerification": true
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Salesforce Marketing Cloud dataset.
To copy data from Salesforce Marketing Cloud, set the type property of the dataset to
SalesforceMarketingCloudObject . The following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the dataset must be set to: SalesforceMarketingCloudObject | Yes
tableName | Name of the table. | No (if "query" in activity source is specified)

Example

{
"name": "SalesforceMarketingCloudDataset",
"properties": {
"type": "SalesforceMarketingCloudObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<SalesforceMarketingCloud linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Salesforce Marketing Cloud source.
Salesforce Marketing Cloud as source
To copy data from Salesforce Marketing Cloud, set the source type in the copy activity to
SalesforceMarketingCloudSource . The following properties are supported in the copy activity source
section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to: SalesforceMarketingCloudSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:

"activities":[
{
"name": "CopyFromSalesforceMarketingCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<SalesforceMarketingCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceMarketingCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse via Open
Hub using Azure Data Factory
5/11/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW) via Open Hub. It builds on the copy activity overview article that presents a general overview
of copy activity.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory whitepaper, with a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP Business Warehouse via Open Hub connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Business Warehouse via Open Hub to any supported sink data store. For a list of
data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse Open Hub connector supports:
SAP Business Warehouse version 7.01 or higher (in a recent SAP Support Package Stack released after the year 2015). SAP BW/4HANA is not supported by this connector.
Copying data via Open Hub Destination local table, which underneath can be DSO, InfoCube, MultiProvider,
DataSource, etc.
Copying data using basic authentication.
Connecting to an SAP application server or SAP message server.
Retrieving data via RFC.

SAP BW Open Hub Integration


SAP BW Open Hub Service is an efficient way to extract data from SAP BW. The following diagram shows one of
the typical flows customers have in their SAP system, in which case data flows from SAP ECC -> PSA -> DSO ->
Cube.
SAP BW Open Hub Destination (OHD) defines the target to which the SAP data is relayed. Any objects supported by SAP Data Transfer Process (DTP) can be used as open hub data sources, for example, DSO, InfoCube, DataSource, etc. The Open Hub Destination type - where the relayed data is stored - can be database tables (local or remote) and flat files. This SAP BW Open Hub connector supports copying data from an OHD local table in BW. If you are using other types, you can directly connect to the database or file system using other connectors.
Delta extraction flow
The ADF SAP BW Open Hub connector offers two optional properties, excludeLastRequest and baseRequestId, which can be used to handle delta load from Open Hub.
excludeLastRequest: Whether to exclude the records of the last request. The default value is true.
baseRequestId: The ID of the request for delta loading. Once it is set, only data with a requestId larger than the value of this property will be retrieved.
Overall, the extraction from SAP InfoProviders to Azure Data Factory (ADF) consists of two steps:
1. SAP BW Data Transfer Process (DTP) This step copies the data from an SAP BW InfoProvider to an
SAP BW Open Hub table
2. ADF data copy In this step, the Open Hub table is read by the ADF Connector

In the first step, a DTP is executed. Each execution creates a new SAP request ID. The request ID is stored in the
Open Hub table and is then used by the ADF connector to identify the delta. The two steps run asynchronously:
the DTP is triggered by SAP, and the ADF data copy is triggered through ADF.
By default, ADF does not read the latest delta from the Open Hub table (the option "exclude last request" is true). As a result, the data in ADF is not 100% up to date with the data in the Open Hub table (the last delta is missing). In return, this procedure ensures that no rows get lost because of the asynchronous extraction. It works fine even when ADF is reading the Open Hub table while the DTP is still writing into the same table.
You typically store the maximum copied request ID from the last ADF run in a staging data store (such as Azure Blob in the above diagram). Therefore, the same request is not read a second time by ADF in the subsequent run. Meanwhile, note that the data is not automatically deleted from the Open Hub table.
For proper delta handling, it is not allowed to have request IDs from different DTPs in the same Open Hub table. Therefore, you must not create more than one DTP for each Open Hub Destination (OHD). When you need both full and delta extraction from the same InfoProvider, you should create two OHDs for the same InfoProvider. A minimal source configuration using these delta properties is sketched below.
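
The following sketch shows a copy activity source that uses the two delta properties; the baseRequestId value is a placeholder for the maximum request ID recorded from the previous run.

"source": {
    "type": "SapOpenHubSource",
    "excludeLastRequest": true,
    "baseRequestId": "<max request ID copied in the previous run>"
}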

Prerequisites
To use this SAP Business Warehouse Open Hub connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.13 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install
Assemblies to GAC option as shown in the following image.

The SAP user being used in the Data Factory BW connector needs to have the following permissions:
Authorization for RFC and SAP BW.
Permissions to the "Execute" activity of authorization object "S_SDSAUTH".
Create the SAP Open Hub Destination with type Database Table and the "Technical Key" option checked. It is also recommended to leave Deleting Data from Table unchecked, although it is not required. Use the DTP (directly execute or integrate into an existing process chain) to land data from the source object (such as a cube) you have chosen to the open hub destination table.

Getting started
TIP
For a walkthrough of using SAP BW Open Hub connector, see Load data from SAP Business Warehouse (BW) by using
Azure Data Factory.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse Open Hub connector.

Linked service properties


The following properties are supported for SAP Business Warehouse Open Hub linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: SapOpenHub | Yes
server | Name of the server on which the SAP BW instance resides. | Yes
systemNumber | System number of the SAP BW system. Allowed value: two-digit decimal number represented as a string. | Yes
messageServer | The host name of the SAP message server. Use to connect to an SAP message server. | No
messageServerService | The service name or port number of the message server. Use to connect to an SAP message server. | No
systemId | The ID of the SAP system where the table is located. Use to connect to an SAP message server. | No
logonGroup | The logon group for the SAP system. Use to connect to an SAP message server. | No
clientId | Client ID of the client in the SAP BW system. Allowed value: three-digit decimal number represented as a string. | Yes
language | Language that the SAP system uses. | No (default value is EN)
userName | Name of the user who has access to the SAP server. | Yes
password | Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes
connectVia | The Integration Runtime to be used to connect to the data store. A Self-hosted Integration Runtime is required as mentioned in Prerequisites. | Yes

Example:

{
"name": "SapBwOpenHubLinkedService",
"properties": {
"type": "SapOpenHub",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
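
To connect through an SAP message server instead of directly to an application server, you would use the message server properties from the table above; a sketch follows (all values are placeholders).

{
    "name": "SapBwOpenHubLinkedService",
    "properties": {
        "type": "SapOpenHub",
        "typeProperties": {
            "messageServer": "<message server host name>",
            "messageServerService": "<service name or port number>",
            "systemId": "<SAP system ID>",
            "logonGroup": "<logon group>",
            "clientId": "<client id>",
            "userName": "<SAP user>",
            "password": {
                "type": "SecureString",
                "value": "<Password for SAP user>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}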

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP BW Open Hub dataset.
To copy data from and to SAP BW Open Hub, set the type property of the dataset to SapOpenHubTable . The
following properties are supported.

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to SapOpenHubTable. | Yes
openHubDestinationName | The name of the Open Hub Destination to copy data from. | Yes

If you were setting excludeLastRequest and baseRequestId in the dataset, they are still supported as-is, but you are suggested to use the new model in the activity source going forward.
Example:
{
"name": "SAPBWOpenHubDataset",
"properties": {
"type": "SapOpenHubTable",
"typeProperties": {
"openHubDestinationName": "<open hub destination name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP BW Open Hub linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP BW Open Hub source.
SAP BW Open Hub as source
To copy data from SAP BW Open Hub, the following properties are supported in the copy activity source
section:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the copy activity source must be set to SapOpenHubSource. | Yes
excludeLastRequest | Whether to exclude the records of the last request. | No (default is true)
baseRequestId | The ID of the request for delta loading. Once it is set, only data with a requestId larger than the value of this property will be retrieved. | No
customRfcReadTableFunctionModule | A custom RFC function module that can be used to read data from an SAP table. You can use a custom RFC function module to define how the data is retrieved from your SAP system and returned to Data Factory. The custom function module must have an interface implemented (import, export, tables) that's similar to /SAPDS/RFC_READ_TABLE2, which is the default interface used by Data Factory. | No

TIP
If your Open Hub table only contains the data generated by a single request ID, for example, you always do a full load and overwrite the existing data in the table, or you only run the DTP once for a test, remember to uncheck the "excludeLastRequest" option in order to copy the data out.

To speed up the data loading, you can set parallelCopies on the copy activity to load data from SAP BW Open Hub in parallel. For example, if you set parallelCopies to four, Data Factory concurrently executes four RFC calls, and each RFC call retrieves a portion of data from your SAP BW Open Hub table, partitioned by the DTP request ID and package ID. This applies when the number of unique DTP request ID + package ID combinations is bigger than the value of parallelCopies. When copying data into a file-based data store, it's also recommended to write to a folder as multiple files (only specify the folder name), in which case the performance is better than writing to a single file.
Example:

"activities":[
{
"name": "CopyFromSAPBWOpenHub",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW Open Hub input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapOpenHubSource",
"excludeLastRequest": true
},
"sink": {
"type": "<sink type>"
},
"parallelCopies": 4
}
}
]

Data type mapping for SAP BW Open Hub


When copying data from SAP BW Open Hub, the following mappings are used from SAP BW data types to
Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity
maps the source schema and data type to the sink.

SAP ABAP TYPE | DATA FACTORY INTERIM DATA TYPE
C (String) | String
I (Integer) | Int32
F (Float) | Double
D (Date) | String
T (Time) | String
P (BCD Packed, Currency, Decimal, Qty) | Decimal
N (Numc) | String
X (Binary and Raw) | String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Troubleshooting tips
Symptoms: If you are running SAP BW on HANA and observe that only a subset of the data is copied over using the ADF copy activity (1 million rows), the possible cause is that you enabled the "SAP HANA Execution" option in your DTP, in which case ADF can only retrieve the first batch of data.
Resolution: Disable the "SAP HANA Execution" option in the DTP, reprocess the data, and then try executing the copy activity again.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse by using
Azure Data Factory
7/7/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article shows how to use Azure Data Factory to copy data from SAP Business Warehouse (BW) via Open
Hub to Azure Data Lake Storage Gen2. You can use a similar process to copy data to other supported sink data
stores.

TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction
flow, see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.

Prerequisites
Azure Data Factor y : If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD) with destination type "Database Table" : To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions :
Authorization for Remote Function Calls (RFC) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0 . Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Open on the Open Azure Data Factor y Studio tile to open
the Data Factory UI in a separate tab.
1. On the home page, select Ingest to open the Copy Data tool.
2. On the Proper ties page, specify a Task name , and then select Next .
3. On the Source data store page, select +Create new connection . Select SAP BW Open Hub from
the connector gallery, and then select Continue . To filter the connectors, you can type SAP in the search
box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New , and then select Self-hosted . Enter a Name , and then
select Next . Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Ser ver name , System number , Client ID, Language (if other than EN ),
User name , and Password .
c. Select Test connection to validate the settings, and then select Finish .
d. A new connection is created. Select Next .
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next .
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP)
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this
article. Select Validate to double-check what data will be returned. Then select Next .

7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue .
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next .
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name.
Then select Next .
10. On the File format setting page, select Next to use the default settings.

11. On the Settings page, expand Performance settings . Enter a value for Degree of copy parallelism
such as 5 to load from SAP BW in parallel. Then select Next .
12. On the Summary page, review the settings. Then select Next .
13. On the Deployment page, select Monitor to monitor the pipeline.

14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back
to the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.

17. To view the maximum Request ID , go back to the activity-monitoring view and select Output under
Actions .

Incremental copy from SAP BW Open Hub


TIP
See SAP BW Open Hub connector delta extraction flow to learn how the SAP BW Open Hub connector in Data Factory
copies incremental data from SAP BW. This article can also help you understand basic connector configuration.
Now, let's continue to configure incremental copy from SAP BW Open Hub.
Incremental copy uses a "high-watermark" mechanism that's based on the request ID . That ID is automatically
generated in SAP BW Open Hub Destination by the DTP. The following diagram shows this workflow:

On the data factory home page, select Pipeline templates in the Discover more section to use the built-in
template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage : In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub : This is the source to copy data from. Refer to the previous full-copy
walkthrough for detailed configuration.
Azure Data Lake Storage Gen2 : This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.

3. This template generates a pipeline with the following three activities and chains them to run on
success: Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName : Specify the Open Hub table name to copy data from.
Data_Destination_Container : Specify the destination Azure Data Lake Storage Gen2 container
to copy data to. If the container doesn't exist, the Data Factory copy activity creates one during
execution.
Data_Destination_Directory : Specify the folder path under the Azure Data Lake Storage Gen2
container to copy data to. If the path doesn't exist, the Data Factory copy activity creates a path
during execution.
HighWatermarkBlobContainer : Specify the container to store the high-watermark value.
HighWatermarkBlobDirectory : Specify the folder path under the container to store the high-
watermark value.
HighWatermarkBlobName : Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobContainer+HighWatermarkBlobDirectory+HighWatermarkBlobName, such as
container/path/requestIdCache.txt. Create a blob with content 0.

LogicAppURL : In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST
URL .

a. Go to the Azure portal and create a new Logic Apps resource. Select +Blank Logic App to go
to Logic Apps Designer .
b. Create a trigger of When an HTTP request is received . Specify the HTTP request body as
follows:

{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}

c. Add a Create blob action. For Folder path and Blob name , use the same values that you
configured previously in HighWatermarkBlobContainer+HighWatermarkBlobDirectory and
HighWatermarkBlobName.
d. Select Save . Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to
validate the configuration. Or, select Publish to publish all the changes, and then select Add trigger to
execute a run.
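For reference, the Web activity that this template chains after the copy activity posts the new high-watermark value to the logic app URL you just copied. A minimal sketch of that request body is shown below; it assumes the copy activity is named CopyFromSAPBWOpenHub and that its output exposes a sapOpenHubMaxRequestId property (both names are placeholders, so check the generated pipeline for the actual ones):

{
    "sapOpenHubMaxRequestId": "@{activity('CopyFromSAPBWOpenHub').output.sapOpenHubMaxRequestId}"
}

The logic app's Create blob action then writes this value to the high-watermark blob, so the next pipeline run only copies requests with a larger request ID.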

SAP BW Open Hub Destination configurations


This section introduces configuration of the SAP BW side to use the SAP BW Open Hub connector in Data
Factory to copy data.
Configure delta extraction in SAP BW
If you need both historical copy and incremental copy or only incremental copy, configure delta extraction in
SAP BW.
1. Create the Open Hub Destination. You can create the OHD in SAP Transaction RSA1, which automatically
creates the required transformation and data-transfer process. Use the following settings:
ObjectType : You can use any object type. Here, we use InfoCube as an example.
Destination Type : Select Database Table .
Key of the Table : Select Technical Key .
Extraction : Select Keep Data and Insert Records into Table .

You might increase the number of parallel running SAP work processes for the DTP:

2. Schedule the DTP in process chains.


A delta DTP for a cube only works if the necessary rows haven't been compressed. Make sure that BW
cube compression isn't running before the DTP to the Open Hub table. The easiest way to do this is to
integrate the DTP into your existing process chains. In the following example, the DTP (to the OHD) is
inserted into the process chain between the Adjust (aggregate rollup) and Collapse (cube compression)
steps.

Configure full extraction in SAP BW


In addition to delta extraction, you might want a full extraction of the same SAP BW InfoProvider. This usually
applies if you want to do full copy but not incremental, or you want to resync delta extraction.
You can't have more than one DTP for the same OHD. So, you must create an additional OHD before delta
extraction.

For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records . Otherwise, data will be
extracted many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full . You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request . Otherwise, nothing will
be extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy
activity until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways
to avoid this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched .
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched , you can use the following option to run the delta DTP manually:
No Data Transfer; Delta Status in Source: Fetched

Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Copy data from SAP Business Warehouse using
Azure Data Factory
5/11/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP Business
Warehouse (BW). It builds on the copy activity overview article that presents a general overview of copy activity.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, which gives a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP Business Warehouse connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Business Warehouse to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse connector supports:
SAP Business Warehouse version 7.x .
Copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries.
Copying data using basic authentication.

Prerequisites
To use this SAP Business Warehouse connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP NetWeaver library on the Integration Runtime machine. You can get the SAP NetWeaver
library from your SAP administrator, or directly from the SAP Software Download Center. Search for the SAP
Note #1025361 to get the download location for the most recent version. Make sure that you pick the 64-
bit SAP NetWeaver library which matches your Integration Runtime installation. Then install all files included
in the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the
SAP Client Tools installation.
TIP
To troubleshoot connectivity issue to SAP BW, make sure:
All dependency libraries extracted from the NetWeaver RFC SDK are in place in the %windir%\system32 folder. Usually
it has icudt34.dll, icuin34.dll, icuuc34.dll, libicudecnumber.dll, librfc32.dll, libsapucum.dll, sapcrypto.dll, sapcryto_old.dll,
sapnwrfc.dll.
The ports needed to connect to the SAP server are enabled on the Self-hosted IR machine, which are usually ports
3300 and 3201.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse connector.

Linked service properties


The following properties are supported for SAP Business Warehouse (BW) linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


SapBw

server Name of the server on which the SAP Yes


BW instance resides.

systemNumber System number of the SAP BW system. Yes


Allowed value: two-digit decimal
number represented as a string.

clientId Client ID of the client in the SAP BW Yes


system.
Allowed value: three-digit decimal
number represented as a string.

userName Name of the user who has access to Yes


the SAP server.

password Password for the user. Mark this field Yes


as a SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:

{
"name": "SapBwLinkedService",
"properties": {
"type": "SapBw",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
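If you'd rather not keep the password inline, the password property can reference a secret in Azure Key Vault instead, using the standard Key Vault reference shape shown in the following sketch (the Key Vault linked service name and secret name are placeholders):

{
    "name": "SapBwLinkedService",
    "properties": {
        "type": "SapBw",
        "typeProperties": {
            "server": "<server name>",
            "systemNumber": "<system number>",
            "clientId": "<client id>",
            "userName": "<SAP user>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}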

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP BW dataset.
To copy data from SAP BW, set the type property of the dataset to SapBwCube . There are no type-specific
properties supported for the SAP BW dataset.
Example:

{
"name": "SAPBWDataset",
"properties": {
"type": "SapBwCube",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP BW linked service name>",
"type": "LinkedServiceReference"
}
}
}

If you were using the RelationalTable typed dataset, it is still supported as-is, but we recommend that you use the
new SapBwCube type going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP BW source.
SAP BW as source
To copy data from SAP BW, the following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: SapBwSource

query Specifies the MDX query to read data Yes


from the SAP BW instance.

Example:

"activities":[
{
"name": "CopyFromSAPBW",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapBwSource",
"query": "<MDX query for SAP BW>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using the RelationalSource typed source, it is still supported as-is, but we recommend that you use the
new SapBwSource type going forward.
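As an illustration, a source section with a concrete MDX query might look like the following sketch. The dimension and cube names are placeholders; substitute the technical names from your own InfoCube or BEx query.

"source": {
    "type": "SapBwSource",
    "query": "SELECT NON EMPTY [Measures].MEMBERS ON COLUMNS, NON EMPTY [<dimension>].MEMBERS ON ROWS FROM [<InfoCube or query cube name>]"
}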

Data type mapping for SAP BW


When copying data from SAP BW, the following mappings are used from SAP BW data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

SAP BW DATA TYPE    DATA FACTORY INTERIM DATA TYPE

ACCP Int

CHAR String

CLNT String

CURR Decimal

CUKY String

DEC Decimal

FLTP Double

INT1 Byte

INT2 Int16

INT4 Int

LANG String

LCHR String

LRAW Byte[]

PREC Int16

QUAN Decimal

RAW Byte[]

RAWSTRING Byte[]

STRING String

UNIT String

DATS String

NUMC String

TIMS String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Cloud for Customer (C4C)
using Azure Data Factory
5/11/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from/to SAP Cloud for
Customer (C4C). It builds on the copy activity overview article that presents a general overview of copy activity.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, which gives a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP Cloud for Customer connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Cloud for Customer to any supported sink data store, or copy data from any
supported source data store to SAP Cloud for Customer. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this connector enables Azure Data Factory to copy data from/to SAP Cloud for Customer including
the SAP Cloud for Sales, SAP Cloud for Service, and SAP Cloud for Social Engagement solutions.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Cloud for Customer connector.

Linked service properties


The following properties are supported for SAP Cloud for Customer linked service:
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


SapCloudForCustomer .

url The URL of the SAP C4C OData Yes


service.

username Specify the user name to connect to Yes


the SAP C4C.

password Specify the password for the user Yes


account you specified for the
username. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

connectVia The Integration Runtime to be used to No


connect to the data store. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "SAPC4CLinkedService",
"properties": {
"type": "SapCloudForCustomer",
"typeProperties": {
"url": "https://<tenantname>.crm.ondemand.com/sap/c4c/odata/v1/c4codata/" ,
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP Cloud for Customer dataset.
To copy data from SAP Cloud for Customer, set the type property of the dataset to
SapCloudForCustomerResource . The following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to:
SapCloudForCustomerResource

path Specify path of the SAP C4C OData Yes


entity.

Example:

{
"name": "SAPC4CDataset",
"properties": {
"type": "SapCloudForCustomerResource",
"typeProperties": {
"path": "<path e.g. LeadCollection>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP C4C linked service>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP Cloud for Customer source.
SAP C4C as source
To copy data from SAP Cloud for Customer, set the source type in the copy activity to
SapCloudForCustomerSource . The following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


SapCloudForCustomerSource

query Specify the custom OData query to No


read data.

httpRequestTimeout The timeout (the TimeSpan value) for No


the HTTP request to get a response.
This value is the timeout to get a
response, not the timeout to read
response data. If not specified, the
default value is 00:30:00 (30
minutes).

Sample query to get data for a specific day:


"query": "$filter=CreatedOn ge datetimeoffset'2017-07-31T10:02:06.4202620Z' and CreatedOn le
datetimeoffset'2017-08-01T10:02:06.4202620Z'"

Example:
"activities":[
{
"name": "CopyFromSAPC4C",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP C4C input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapCloudForCustomerSource",
"query": "<custom query e.g. $top=10>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
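If the OData service needs more time to respond, the source section can also set httpRequestTimeout alongside the query. A sketch with illustrative values:

"source": {
    "type": "SapCloudForCustomerSource",
    "query": "$top=10",
    "httpRequestTimeout": "01:00:00"
}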

SAP C4C as sink


To copy data to SAP Cloud for Customer, set the sink type in the copy activity to SapCloudForCustomerSink .
The following properties are supported in the copy activity sink section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


SapCloudForCustomerSink

writeBehavior The write behavior of the operation. No. Default “Insert”.


Could be “Insert”, “Update”.

writeBatchSize The batch size of the write operation. No. Default 10.
The batch size that gives the best
performance may differ for different
tables or servers.

maxConcurrentConnections The upper limit of concurrent connecti No


ons established to the data store durin
g the activity run. Specify a value only
when you want to limit concurrent con
nections.

Example:
"activities":[
{
"name": "CopyToSapC4c",
"type": "Copy",
"inputs": [{
"type": "DatasetReference",
"referenceName": "<dataset type>"
}],
"outputs": [{
"type": "DatasetReference",
"referenceName": "SapC4cDataset"
}],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SapCloudForCustomerSink",
"writeBehavior": "Insert",
"writeBatchSize": 30
},
"parallelCopies": 10,
"dataIntegrationUnits": 4,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "ErrorLogBlobLinkedService",
"type": "LinkedServiceReference"
},
"path": "incompatiblerows"
}
}
}
]

Data type mapping for SAP Cloud for Customer


When copying data from SAP Cloud for Customer, the following mappings are used from SAP Cloud for
Customer data types to Azure Data Factory interim data types. See Schema and data type mappings to learn
about how copy activity maps the source schema and data type to the sink.

SAP C4C ODATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE

Edm.Binary Byte[]

Edm.Boolean Bool

Edm.Byte Byte[]

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid Guid

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP ECC by using Azure Data
Factory
5/11/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the copy activity in Azure Data Factory to copy data from SAP Enterprise Central
Component (ECC). For more information, see Copy activity overview.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, which gives a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP ECC connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP ECC to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP ECC connector supports:
Copying data from SAP ECC on SAP NetWeaver version 7.0 and later.
Copying data from any objects exposed by SAP ECC OData services, such as:
SAP tables or views.
Business Application Programming Interface [BAPI] objects.
Data extractors.
Data or intermediate documents (IDOCs) sent to SAP Process Integration (PI) that can be received as
OData via relative adapters.
Copying data by using basic authentication.
The version 7.0 or later refers to the SAP NetWeaver version, not the SAP ECC version. For example, SAP ECC 6.0
EHP 7 in general has NetWeaver version >= 7.4. If you are unsure about your environment, here are the
steps to confirm the version from your SAP system:
1. Use SAP GUI to connect to the SAP System.
2. Go to System -> Status .
3. Check the release of SAP_BASIS and ensure that it is equal to or greater than 701.
TIP
To copy data from SAP ECC via an SAP table or view, use the SAP table connector, which is faster and more scalable.

Prerequisites
To use this SAP ECC connector, you need to expose the SAP ECC entities via OData services through SAP
Gateway. More specifically:
Set up SAP Gateway . For servers with SAP NetWeaver versions later than 7.4, SAP Gateway is already
installed. For earlier versions, you must install the embedded SAP Gateway or the SAP Gateway hub
system before exposing SAP ECC data through OData services. To set up SAP Gateway, see the installation
guide.
Activate and configure the SAP OData ser vice . You can activate the OData service through TCODE
SICF in seconds. You can also configure which objects need to be exposed. For more information, see the
step-by-step guidance.
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define the Data Factory entities specific
to the SAP ECC connector.

Linked service properties


The following properties are supported for the SAP ECC linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


SapEcc .

url The URL of the SAP ECC OData Yes


service.

username The username used to connect to SAP No


ECC.

password The plaintext password used to No


connect to SAP ECC.

connectVia The integration runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If you don't
specify a runtime, the default Azure
integration runtime is used.

Example

{
"name": "SapECCLinkedService",
"properties": {
"type": "SapEcc",
"typeProperties": {
"url": "<SAP ECC OData URL, e.g.,
http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}

Dataset properties
For a full list of the sections and properties available for defining datasets, see Datasets. The following section
provides a list of the properties supported by the SAP ECC dataset.
To copy data from SAP ECC, set the type property of the dataset to SapEccResource .
The following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

path Path of the SAP ECC OData entity. Yes

Example
{
"name": "SapEccDataset",
"properties": {
"type": "SapEccResource",
"typeProperties": {
"path": "<entity path, e.g., dd04tentitySet>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP ECC linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of the sections and properties available for defining activities, see Pipelines. The following section
provides a list of the properties supported by the SAP ECC source.
SAP ECC as a source
To copy data from SAP ECC, set the type property in the source section of the copy activity to SapEccSource .
The following properties are supported in the copy activity's source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy Yes


activity's source section must be set
to SapEccSource .

query The OData query options to filter the No


data. For example:

"$select=Name,Description&$top=10"

The SAP ECC connector copies data


from the combined URL:

<URL specified in the linked


service>/<path specified in the
dataset>?<query specified in the
copy activity's source section>

For more information, see OData URL


components.

sapDataColumnDelimiter The single character that is used as No


delimiter passed to SAP RFC to split
the output data.

httpRequestTimeout The timeout (the TimeSpan value) for No


the HTTP request to get a response.
This value is the timeout to get a
response, not the timeout to read
response data. If not specified, the
default value is 00:30:00 (30
minutes).

Example
"activities":[
{
"name": "CopyFromSAPECC",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP ECC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapEccSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
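Putting the earlier samples together: with the linked service URL http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/, the dataset path dd04tentitySet, and the source query $top=10, the request that the connector issues would look roughly like the following (shown only to illustrate how the three parts are concatenated):

http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/dd04tentitySet?$top=10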

Data type mappings for SAP ECC


When you're copying data from SAP ECC, the following mappings are used from OData data types for SAP ECC
data to Azure Data Factory interim data types. To learn how the copy activity maps the source schema and data
type to the sink, see Schema and data type mappings.

ODATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE

Edm.Binary String

Edm.Boolean Bool

Edm.Byte String

Edm.DateTime DateTime

Edm.Decimal Decimal

Edm.Double Double

Edm.Single Single

Edm.Guid String

Edm.Int16 Int16

Edm.Int32 Int32

Edm.Int64 Int64

Edm.SByte Int16

Edm.String String

Edm.Time TimeSpan

Edm.DateTimeOffset DateTimeOffset

NOTE
Complex data types aren't currently supported.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of the data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
Supported data stores.
Copy data from SAP HANA using Azure Data
Factory
5/11/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from an SAP HANA
database. It builds on the copy activity overview article that presents a general overview of copy activity.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, which gives a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP HANA connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP HANA database to any supported sink data store. For a list of data stores supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP HANA connector supports:
Copying data from any version of SAP HANA database.
Copying data from HANA information models (such as Analytic and Calculation views) and
Row/Column tables .
Copying data using Basic or Windows authentication.
Parallel copying from a SAP HANA source. See the Parallel copy from SAP HANA section for details.

TIP
To copy data into an SAP HANA data store, use the generic ODBC connector. See the SAP HANA sink section for details. Note that the
linked services for the SAP HANA connector and the ODBC connector have different types and therefore can't be reused.

Prerequisites
To use this SAP HANA connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP HANA ODBC driver on the Integration Runtime machine. You can download the SAP HANA
ODBC driver from the SAP Software Download Center. Search with the keyword SAP HANA CLIENT for
Windows .

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP HANA connector.

Linked service properties


The following properties are supported for SAP HANA linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


SapHana

connectionString Specify information that's needed to Yes


connect to the SAP HANA by using
either basic authentication or
Windows authentication . Refer to
the following samples.
In the connection string, server/port is
mandatory (the default port is 30015),
and username and password are
mandatory when using basic
authentication. For additional
advanced settings, refer to SAP HANA
ODBC Connection Properties
You can also put password in Azure
Key Vault and pull the password
configuration out of the connection
string. Refer to Store credentials in
Azure Key Vault article with more
details.

userName Specify user name when using No


Windows authentication. Example:
[email protected]

password Specify password for the user account. No


Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example: use basic authentication


{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"connectionString": "SERVERNODE=<server>:<port (optional)>;UID=<userName>;PWD=<Password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: use Windows authentication

{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"connectionString": "SERVERNODE=<server>:<port (optional)>;",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

If you were using SAP HANA linked service with the following payload, it is still supported as-is, while you are
suggested to use the new one going forward.
Example:

{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "<server>:<port (optional)>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP HANA dataset.
To copy data from SAP HANA, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: SapHanaTable

schema Name of the schema in the SAP HANA No (if "query" in activity source is
database. specified)

table Name of the table in the SAP HANA No (if "query" in activity source is
database. specified)

Example:

{
"name": "SAPHANADataset",
"properties": {
"type": "SapHanaTable",
"typeProperties": {
"schema": "<schema name>",
"table": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP HANA linked service name>",
"type": "LinkedServiceReference"
}
}
}

If you were using RelationalTable typed dataset, it is still supported as-is, while you are suggested to use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by SAP HANA source.
SAP HANA as source

TIP
To ingest data from SAP HANA efficiently by using data partitioning, learn more from Parallel copy from SAP HANA
section.

To copy data from SAP HANA, the following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to:
SapHanaSource

query Specifies the SQL query to read data Yes


from the SAP HANA instance.

partitionOptions Specifies the data partitioning options False


used to ingest data from SAP HANA.
Learn more from Parallel copy from
SAP HANA section.
Allowed values
are: None (default), PhysicalPartitionsOfTable ,
SapHanaDynamicRange .
Learn more from Parallel copy from
SAP HANA section.
PhysicalPartitionsOfTable can
only be used when copying data from
a table, not with a query.
When a partition option is enabled
(that is, not None ), the degree of
parallelism to concurrently load data
from SAP HANA is controlled by the
parallelCopies setting on the copy
activity.

partitionSettings Specify the group of the settings for False


data partitioning.
Apply when partition option is
SapHanaDynamicRange .

partitionColumnName Specify the name of the source column Yes when using
that will be used by partition for SapHanaDynamicRange partition.
parallel copy. If not specified, the index
or the primary key of the table is auto-
detected and used as the partition
column.
Apply when the partition option is
SapHanaDynamicRange . If you use a
query to retrieve the source data,
hook
?AdfHanaDynamicRangePartitionCondition
in the WHERE clause. See the example in
Parallel copy from SAP HANA section.

packetSize Specifies the network packet size (in No.


Kilobytes) to split data to multiple Default value is 2048 (2MB).
blocks. If you have large amount of
data to copy, increasing packet size can
increase reading speed from SAP
HANA in most cases. Performance
testing is recommended when
adjusting the packet size.

Example:
"activities":[
{
"name": "CopyFromSAPHANA",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP HANA input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapHanaSource",
"query": "<SQL query for SAP HANA>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using the RelationalSource typed copy source, it is still supported as-is, but we recommend that you use
the new SapHanaSource type going forward.

Parallel copy from SAP HANA


The Data Factory SAP HANA connector provides built-in data partitioning to copy data from SAP HANA in
parallel. You can find the data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, Data Factory runs parallel queries against your SAP HANA source to retrieve
data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For
example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on
your specified partition option and settings, and each query retrieves a portion of data from your SAP HANA.
We suggest that you enable parallel copy with data partitioning, especially when you ingest a large amount of
data from your SAP HANA. The following are suggested configurations for different scenarios. When copying
data into file-based data store, it's recommended to write to a folder as multiple files (only specify folder name),
in which case the performance is better than writing to a single file.

SCENARIO    SUGGESTED SETTINGS



Full load from large table. Partition option : Physical partitions of table.

During execution, Data Factory automatically detects the


physical partition type of the specified SAP HANA table, and
choose the corresponding partition strategy:
- Range Partitioning : Get the partition column and
partition ranges defined for the table, then copy the data by
range.
- Hash Partitioning : Use hash partition key as partition
column, then partition and copy the data based on ADF
calculated ranges.
- Round-Robin Partitioning or No Partition : Use
primary key as partition column, then partition and copy the
data based on ADF calculated ranges.

Load a large amount of data by using a custom query. Partition option : Dynamic range partition.
Query :
SELECT * FROM <TABLENAME> WHERE ?
AdfHanaDynamicRangePartitionCondition AND
<your_additional_where_clause>
.
Partition column : Specify the column used to apply
dynamic range partition.

During execution, Data Factory first calculates the value


ranges of the specified partition column by evenly
distributing the rows into a number of buckets according to
the number of distinct partition column values and the ADF
parallel copy setting. It then replaces
?AdfHanaDynamicRangePartitionCondition with a filter on
the partition column value range for each partition, and
sends the resulting queries to SAP HANA.

If you want to use multiple columns as partition column, you


can concatenate the values of each column as one column in
the query and specify it as partition column in ADF, like
SELECT * FROM (SELECT *, CONCAT(<KeyColumn1>,
<KeyColumn2>) AS PARTITIONCOLUMN FROM <TABLENAME>)
WHERE ?AdfHanaDynamicRangePartitionCondition
.

Example: query with physical partitions of a table

"source": {
"type": "SapHanaSource",
"partitionOption": "PhysicalPartitionsOfTable"
}

Example: query with dynamic range partition

"source": {
"type": "SapHanaSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfHanaDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "SapHanaDynamicRange",
"partitionSettings": {
"partitionColumnName": "<Partition_column_name>"
}
}
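For completeness, the following sketch shows how the dynamic range partition settings combine with the parallelCopies setting in a full copy activity. The dataset, table, and column names are placeholders; with parallelCopies set to 4, Data Factory runs up to four partition queries against SAP HANA at the same time.

"activities":[
    {
        "name": "ParallelCopyFromSAPHANA",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<SAP HANA input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "SapHanaSource",
                "query": "SELECT * FROM <TABLENAME> WHERE ?AdfHanaDynamicRangePartitionCondition",
                "partitionOption": "SapHanaDynamicRange",
                "partitionSettings": {
                    "partitionColumnName": "<partition column name>"
                }
            },
            "sink": {
                "type": "<sink type>"
            },
            "parallelCopies": 4
        }
    }
]
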
Data type mapping for SAP HANA
When copying data from SAP HANA, the following mappings are used from SAP HANA data types to Azure
Data Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.

SAP HANA DATA TYPE    DATA FACTORY INTERIM DATA TYPE

ALPHANUM String

BIGINT Int64

BINARY Byte[]

BINTEXT String

BLOB Byte[]

BOOL Byte

CLOB String

DATE DateTime

DECIMAL Decimal

DOUBLE Double

FLOAT Double

INTEGER Int32

NCLOB String

NVARCHAR String

REAL Single

SECONDDATE DateTime

SHORTTEXT String

SMALLDECIMAL Decimal

SMALLINT Int16

STGEOMETRYTYPE Byte[]

STPOINTTYPE Byte[]

TEXT String

TIME TimeSpan

TINYINT Byte

VARCHAR String

TIMESTAMP DateTime

VARBINARY Byte[]

SAP HANA sink


Currently, the SAP HANA connector isn't supported as a sink, but you can use the generic ODBC connector with the
SAP HANA driver to write data into SAP HANA.
Follow the Prerequisites to set up Self-hosted Integration Runtime and install SAP HANA ODBC driver first.
Create an ODBC linked service to connect to your SAP HANA data store as shown in the following example, then
create dataset and copy activity sink with ODBC type accordingly. Learn more from ODBC connector article.

{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
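Building on that linked service, a rough sketch of the corresponding ODBC dataset and copy activity sink follows. The OdbcTable and OdbcSink type names, the table name, and the batch size are illustrative; see the ODBC connector article for the authoritative property list.

{
    "name": "SAPHANAViaODBCDataset",
    "properties": {
        "type": "OdbcTable",
        "typeProperties": {
            "tableName": "<SAP HANA table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "SAPHANAViaODBCLinkedService",
            "type": "LinkedServiceReference"
        }
    }
}

"sink": {
    "type": "OdbcSink",
    "writeBatchSize": 1000
}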

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an SAP table by using Azure Data
Factory
5/27/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the copy activity in Azure Data Factory to copy data from an SAP table. For more
information, see Copy activity overview.

TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, which gives a detailed introduction, comparison, and guidance for each SAP connector.

Supported capabilities
This SAP table connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an SAP table to any supported sink data store. For a list of the data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP table connector supports:
Copying data from an SAP table in:
SAP ERP Central Component (SAP ECC) version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
SAP Business Warehouse (SAP BW) version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
SAP S/4HANA.
Other products in SAP Business Suite version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
Copying data from an SAP transparent table, a pooled table, a clustered table, or a view.
Copying data by using basic authentication or Secure Network Communications (SNC), if SNC is
configured.
Connecting to an SAP application server or SAP message server.
Retrieving data via default or custom RFC.
The version 7.01 or later refers to the SAP NetWeaver version, not the SAP ECC version. For example, SAP ECC 6.0
EHP 7 in general has NetWeaver version >= 7.4. If you are unsure about your environment, here are the
steps to confirm the version from your SAP system:
1. Use SAP GUI to connect to the SAP System.
2. Go to System -> Status .
3. Check the release of SAP_BASIS and ensure that it is equal to or greater than 701.
Prerequisites
To use this SAP table connector, you need to:
Set up a self-hosted integration runtime (version 3.17 or later). For more information, see Create and
configure a self-hosted integration runtime.
Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the self-
hosted integration runtime machine. During installation, make sure you select the Install Assemblies to
GAC option in the Optional setup steps window.

The SAP user who's being used in the Data Factory SAP table connector must have the following
permissions:
Authorization for using Remote Function Call (RFC) destinations.
Permissions to the Execute activity of the S_SDSAUTH authorization object. You can refer to SAP Note
460089 for the majority of the authorization objects. Certain RFCs are required by the underlying NCo
connector, for example RFC_FUNCTION_SEARCH.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define the Data Factory entities specific
to the SAP table connector.

Linked service properties


The following properties are supported for the SAP table linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


SapTable .

server The name of the server on which the No


SAP instance is located.
Use to connect to an SAP application
server.

systemNumber The system number of the SAP system. No


Use to connect to an SAP application
server.
Allowed value: A two-digit decimal
number represented as a string.

messageServer The host name of the SAP message No


server.
Use to connect to an SAP message
server.

messageServerService The service name or port number of No


the message server.
Use to connect to an SAP message
server.

systemId The ID of the SAP system where the No


table is located.
Use to connect to an SAP message
server.

logonGroup The logon group for the SAP system. No


Use to connect to an SAP message
server.

clientId The ID of the client in the SAP system. Yes


Allowed value: A three-digit decimal
number represented as a string.

language The language that the SAP system No


uses.
Default value is EN .

userName The name of the user who has access Yes


to the SAP server.

password The password for the user. Mark this Yes


field with the SecureString type to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

sncMode The SNC activation indicator to access No


the SAP server where the table is
located.
Use if you want to use SNC to connect
to the SAP server.
Allowed values are 0 (off, the default)
or 1 (on).

sncMyName The initiator's SNC name to access the No


SAP server where the table is located.
Applies when sncMode is on.

sncPartnerName The communication partner's SNC No


name to access the SAP server where
the table is located.
Applies when sncMode is on.

sncLibraryPath The external security product's library No


to access the SAP server where the
table is located.
Applies when sncMode is on.

sncQop The SNC Quality of Protection level to No


apply.
Applies when sncMode is On.
Allowed values are 1
(Authentication), 2 (Integrity), 3
(Privacy), 8 (Default), 9 (Maximum).

connectVia The integration runtime to be used to Yes


connect to the data store. A self-
hosted integration runtime is required,
as mentioned earlier in Prerequisites.

Example 1: Connect to an SAP application server


{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client ID>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Connect to an SAP message server

{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"messageServer": "<message server name>",
"messageServerService": "<service name or port>",
"systemId": "<system ID>",
"logonGroup": "<logon group>",
"clientId": "<client ID>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 3: Connect by using SNC


{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client ID>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
},
"sncMode": 1,
"sncMyName": "<SNC myname>",
"sncPartnerName": "<SNC partner name>",
"sncLibraryPath": "<SNC library path>",
"sncQop": "8"
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of the sections and properties for defining datasets, see Datasets. The following section provides a
list of the properties supported by the SAP table dataset.
To copy data from an SAP table, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


SapTableResource .

tableName The name of the SAP table to copy Yes


data from.

Example

{
"name": "SAPTableDataset",
"properties": {
"type": "SapTableResource",
"typeProperties": {
"tableName": "<SAP table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP table linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of the sections and properties for defining activities, see Pipelines. The following section provides a
list of the properties supported by the SAP table source.
SAP table as source
To copy data from an SAP table, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


SapTableSource .

rowCount The number of rows to be retrieved. No

rfcTableFields The fields (columns) to copy from the No


SAP table. For example,
column0, column1 .

rfcTableOptions The options to filter the rows in an SAP No


table. For example,
COLUMN0 EQ 'SOMEVALUE' . See also
the SAP query operator table later in
this article.

customRfcReadTableFunctionModule A custom RFC function module that No


can be used to read data from an SAP
table.
You can use a custom RFC function
module to define how the data is
retrieved from your SAP system and
returned to Data Factory. The custom
function module must have an
interface implemented (import, export,
tables) that's similar to
/SAPDS/RFC_READ_TABLE2 , which is
the default interface used by Data
Factory.

partitionOption The partition mechanism to read from No


an SAP table. The supported options
include:
None
PartitionOnInt (normal
integer or integer values with
zero padding on the left, such
as 0000012345 )
PartitionOnCalendarYear (4
digits in the format "YYYY")
PartitionOnCalendarMonth
(6 digits in the format
"YYYYMM")
PartitionOnCalendarDate (8
digits in the format
"YYYYMMDD")
PartitionOntime (6 digits in
the format "HHMMSS", such as
235959 )

partitionColumnName The name of the column used to No


partition the data.

partitionUpperBound The maximum value of the column No


specified in partitionColumnName
that will be used to continue with
partitioning.

partitionLowerBound The minimum value of the column No


specified in partitionColumnName
that will be used to continue with
partitioning. (Note:
partitionLowerBound cannot be "0"
when partition option is
PartitionOnInt )

maxPartitionsNumber The maximum number of partitions to No


split the data into.

sapDataColumnDelimiter The single character that is used as No


delimiter passed to SAP RFC to split
the output data.

TIP
If your SAP table has a large volume of data, such as several billion rows, use partitionOption and partitionSettings
to split the data into smaller partitions. In this case, the data is read per partition, and each data partition is retrieved
from your SAP server via a single RFC call.

Taking partitionOption as PartitionOnInt as an example, the number of rows in each partition is calculated with this
formula: (total rows falling between partitionUpperBound and partitionLowerBound ) / maxPartitionsNumber .

To load data partitions in parallel to speed up copy, the parallel degree is controlled by the parallelCopies setting on
the copy activity. For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four
queries based on your specified partition option and settings, and each query retrieves a portion of data from your SAP
table. We strongly recommend making maxPartitionsNumber a multiple of the value of the parallelCopies property.
When copying data into file-based data store, it's also recommanded to write to a folder as multiple files (only specify
folder name), in which case the performance is better than writing to a single file.
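For illustration with hypothetical numbers: if partitionLowerBound is 1, partitionUpperBound is 10,000,000, and maxPartitionsNumber is 500, each partition covers roughly (10,000,000 - 1) / 500 ≈ 20,000 rows. With parallelCopies set to 4, Data Factory reads 4 of those 500 partitions at a time, each through its own RFC call.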

TIP
BASXML is enabled by default for the SAP table connector on the Azure Data Factory side.

In rfcTableOptions , you can use the following common SAP query operators to filter the rows:

EQ - Equal to
NE - Not equal to
LT - Less than
LE - Less than or equal to
GT - Greater than
GE - Greater than or equal to
IN - As in TABCLASS IN ('TRANSP', 'INTTAB')
LIKE - As in LIKE 'Emma%'
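For instance, a copy activity source fragment that selects two columns and filters rows might look like the following sketch; the column names, filter value, and surrounding activity are placeholders:

"source": {
    "type": "SapTableSource",
    "rfcTableFields": "COLUMN0, COLUMN1",
    "rfcTableOptions": "COLUMN0 EQ 'SOMEVALUE'"
}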

Example

"activities":[
{
"name": "CopyFromSAPTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapTableSource",
"partitionOption": "PartitionOnInt",
"partitionSettings": {
"partitionColumnName": "<partition column name>",
"partitionUpperBound": "2000",
"partitionLowerBound": "1",
"maxPartitionsNumber": 500
}
},
"sink": {
"type": "<sink type>"
},
"parallelCopies": 4
}
}
]

Join SAP tables


Currently, the SAP table connector supports only a single table with the default function module. To get the joined
data of multiple tables, you can use the customRfcReadTableFunctionModule property in the SAP table
connector by following the steps below:
Write a custom function module that can take a query as OPTIONS and apply your own logic to retrieve
the data.
For "Custom function module", enter the name of your custom function module.
For "RFC table options", specify the table join statement to feed into your function module as OPTIONS,
such as "<TABLE1> INNER JOIN <TABLE2> ON COLUMN0".

Below is an example:
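A minimal JSON sketch of the copy activity source for this scenario might look like the following; the function module name, table names, and join column are placeholders for your own objects:

"source": {
    "type": "SapTableSource",
    "customRfcReadTableFunctionModule": "<your custom RFC function module>",
    "rfcTableOptions": "<TABLE1> INNER JOIN <TABLE2> ON COLUMN0"
}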

TIP
You can also consider having the joined data aggregated in a VIEW, which is supported by the SAP table connector. You can
also extract the related tables individually into Azure (for example, Azure Storage or Azure SQL Database), and then use a data flow
to perform further joins or filters.

Create custom function module


For SAP tables, we currently support the customRfcReadTableFunctionModule property in the copy source, which
allows you to apply your own logic to retrieve and process the data.
As quick guidance, here are some requirements to get started with the "Custom function module":
Definition:
Export data into one of the tables below:

The following illustrates how the SAP table connector works with a custom function module:
1. Build a connection with the SAP server via SAP NCO.
2. Invoke the custom function module with the parameters set as follows:
QUERY_TABLE: the table name you set in the ADF SAP table dataset;
Delimiter: the delimiter you set in the ADF SAP table source;
ROWCOUNT/Option/Fields: the row count, options, and fields you set in the ADF SAP table source.
3. Get the result and parse the data in the following ways:
a. Parse the value in the Fields table to get the schemas.
b. Check the values of the output table to see which table contains these values.
c. Get the values in the OUT_TABLE, parse the data, and then write it to the sink.

Data type mappings for an SAP table


When you're copying data from an SAP table, the following mappings are used from the SAP table data types to
the Azure Data Factory interim data types. To learn how the copy activity maps the source schema and data type
to the sink, see Schema and data type mappings.

SAP ABAP TYPE - DATA FACTORY INTERIM DATA TYPE
C (String) - String
I (Integer) - Int32
F (Float) - Double
D (Date) - String
T (Time) - String
P (BCD Packed, Currency, Decimal, Qty) - Decimal
N (Numeric) - String
X (Binary and Raw) - String

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of the data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
Supported data stores.
Copy data from ServiceNow using Azure Data
Factory
6/15/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from ServiceNow. It builds
on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This ServiceNow connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from ServiceNow to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ServiceNow connector.

Linked service properties


The following properties are supported for ServiceNow linked service:

type - The type property must be set to: ServiceNow. Required: Yes.
endpoint - The endpoint of the ServiceNow server (http://<instance>.service-now.com). Required: Yes.
authenticationType - The authentication type to use. Allowed values are: Basic, OAuth2. Required: Yes.
username - The user name used to connect to the ServiceNow server for Basic and OAuth2 authentication. Required: Yes.
password - The password corresponding to the user name for Basic and OAuth2 authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.
clientId - The client ID for OAuth2 authentication. Required: No.
clientSecret - The client secret for OAuth2 authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: No.
useEncryptedEndpoints - Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. Required: No.
useHostVerification - Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. Required: No.
usePeerVerification - Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. Required: No.

Example:

{
"name": "ServiceNowLinkedService",
"properties": {
"type": "ServiceNow",
"typeProperties": {
"endpoint" : "http://<instance>.service-now.com",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
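For OAuth2 authentication, a linked service definition might look like the following sketch; it adds the clientId and clientSecret properties described above, and every bracketed value is a placeholder:

{
    "name": "ServiceNowLinkedService",
    "properties": {
        "type": "ServiceNow",
        "typeProperties": {
            "endpoint": "http://<instance>.service-now.com",
            "authenticationType": "OAuth2",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "clientId": "<client ID>",
            "clientSecret": {
                "type": "SecureString",
                "value": "<client secret>"
            }
        }
    }
}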

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ServiceNow dataset.
To copy data from ServiceNow, set the type property of the dataset to Ser viceNowObject . The following
properties are supported:

type - The type property of the dataset must be set to: ServiceNowObject. Required: Yes.
tableName - Name of the table. Required: No (if "query" in the activity source is specified).

Example

{
"name": "ServiceNowDataset",
"properties": {
"type": "ServiceNowObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<ServiceNow linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by ServiceNow source.
ServiceNow as source
To copy data from ServiceNow, set the source type in the copy activity to Ser viceNowSource . The following
properties are supported in the copy activity source section:

type - The type property of the copy activity source must be set to: ServiceNowSource. Required: Yes.
query - Use the custom SQL query to read data. For example: "SELECT * FROM Actual.alm_asset". Required: No (if "tableName" in the dataset is specified).

Note the following when specifying the schema and column for ServiceNow in the query, and refer to
Performance tips for the copy performance implications.
Schema: specify the schema as Actual or Display in the ServiceNow query. You can think of it as the
sysparm_display_value parameter being set to true or false when calling the ServiceNow REST APIs.
Column: the column name for an actual value under the Actual schema is [column name]_value, while for a display
value under the Display schema it is [column name]_display_value. Note that the column name needs to map to the
schema being used in the query.
Sample query: SELECT col_value FROM Actual.alm_asset or SELECT col_display_value FROM Display.alm_asset

Example:

"activities":[
{
"name": "CopyFromServiceNow",
"type": "Copy",
"inputs": [
{
"referenceName": "<ServiceNow input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ServiceNowSource",
"query": "SELECT * FROM Actual.alm_asset"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Performance tips
Schema to use
ServiceNow has two different schemas: "Actual", which returns actual data, and "Display", which returns the display values of the data.
If you have a filter in your query, use the "Actual" schema, which has better copy performance. When querying
against the "Actual" schema, ServiceNow natively supports the filter when fetching the data and returns only the filtered
result set, whereas when querying the "Display" schema, ADF retrieves all the data and applies the filter internally.
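For example, a filtered source query against the Actual schema might look like the following sketch; the filter column shown here is purely illustrative and depends on your table:

"source": {
    "type": "ServiceNowSource",
    "query": "SELECT * FROM Actual.alm_asset WHERE install_status = '1'"
}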
Index
A ServiceNow table index can help improve query performance; refer to Create a table index.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to the SFTP server by using
Azure Data Factory
5/6/2021 • 18 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to copy data from and to the secure FTP (SFTP) server. To learn about Azure Data
Factory, read the introductory article.

Supported capabilities
The SFTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, the SFTP connector supports:
Copying files from and to the SFTP server by using Basic , SSH public key or multi-factor authentication.
Copying files as is or by parsing or generating files with the supported file formats and compression codecs.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SFTP.

Linked service properties


The following properties are supported for the SFTP linked service:

type - The type property must be set to Sftp. Required: Yes.
host - The name or IP address of the SFTP server. Required: Yes.
port - The port on which the SFTP server is listening. The allowed value is an integer, and the default value is 22. Required: No.
skipHostKeyValidation - Specify whether to skip host key validation. Allowed values are true and false (default). Required: No.
hostKeyFingerprint - Specify the fingerprint of the host key. Required: Yes, if "skipHostKeyValidation" is set to false.
authenticationType - Specify the authentication type. Allowed values are Basic, SshPublicKey and MultiFactor. For more properties, see the Use basic authentication section. For JSON examples, see the Use SSH public key authentication section. Required: Yes.
connectVia - The integration runtime to be used to connect to the data store. To learn more, see the Prerequisites section. If the integration runtime isn't specified, the service uses the default Azure Integration Runtime. Required: No.

Use basic authentication


To use basic authentication, set the authenticationType property to Basic, and specify the following properties in
addition to the SFTP connector generic properties that were introduced in the preceding section:

userName - The user who has access to the SFTP server. Required: Yes.
password - The password for the user (userName). Mark this field as a SecureString to store it securely in your data factory, or reference a secret stored in an Azure key vault. Required: Yes.

Example:
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use SSH public key authentication


To use SSH public key authentication, set the "authenticationType" property to SshPublicKey, and specify the
following properties in addition to the SFTP connector generic ones introduced in the preceding section:

userName - The user who has access to the SFTP server. Required: Yes.
privateKeyPath - Specify the absolute path to the private key file that the integration runtime can access. This applies only when the self-hosted type of integration runtime is specified in "connectVia". Required: Specify either privateKeyPath or privateKeyContent.
privateKeyContent - Base64 encoded SSH private key content. The SSH private key should be in OpenSSH format. Mark this field as a SecureString to store it securely in your data factory, or reference a secret stored in an Azure key vault. Required: Specify either privateKeyPath or privateKeyContent.
passPhrase - Specify the pass phrase or password to decrypt the private key if the key file or the key content is protected by a pass phrase. Mark this field as a SecureString to store it securely in your data factory, or reference a secret stored in an Azure key vault. Required: Yes, if the private key file or the key content is protected by a pass phrase.

NOTE
The SFTP connector supports an RSA/DSA OpenSSH key. Make sure that your key file content starts with "-----BEGIN
[RSA/DSA] PRIVATE KEY-----". If the private key file is a PPK-format file, use the PuTTY tool to convert from PPK to
OpenSSH format.
Example 1: SshPublicKey authentication using private key filePath

{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: SshPublicKey authentication using private key content

{
"name": "SftpLinkedService",
"type": "Linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "<username>",
"privateKeyContent": {
"type": "SecureString",
"value": "<base64 string of the private key content>"
},
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Use multi-factor authentication


To use multi-factor authentication which is a combination of basic and SSH public key authentications, specify
the user name, password and the private key info described in above sections.
Example: multi-factor authentication
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<host>",
"port": 22,
"authenticationType": "MultiFactor",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"privateKeyContent": {
"type": "SecureString",
"value": "<base64 encoded private key content>"
},
"passPhrase": {
"type": "SecureString",
"value": "<passphrase for private key>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for SFTP under location settings in the format-based dataset:

type - The type property under location in the dataset must be set to SftpLocation. Required: Yes.
folderPath - The path to the folder. If you want to use a wildcard to filter the folder, skip this setting and specify the path in the activity source settings. Required: No.
fileName - The file name under the specified folderPath. If you want to use a wildcard to filter files, skip this setting and specify the file name in the activity source settings. Required: No.

Example:

{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "SftpLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}

Copy activity properties


For a full list of sections and properties that are available for defining activities, see the Pipelines article. This
section provides a list of properties that are supported by the SFTP source.
SFTP as source
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for SFTP under the storeSettings settings in the format-based Copy
source:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
SftpReadSettings.

Locate the files to copy

OPTION 1: static path Copy from the folder/file path that's


specified in the dataset. If you want to
copy all files from a folder, additionally
specify wildcardFileName as * .

OPTION 2: wildcard The folder path with wildcard No


- wildcardFolderPath characters to filter source folders.
Allowed wildcards are * (matches
zero or more characters) and ?
(matches zero or a single character);
use ^ to escape if your actual folder
name has a wildcard or this escape
char inside.
For more examples, see Folder and file
filter examples.

OPTION 2: wildcard The file name with wildcard characters Yes


- wildcardFileName under the specified
folderPath/wildcardFolderPath to filter
source files.
Allowed wildcards are * (matches
zero or more characters) and ?
(matches zero or a single character);
use ^ to escape if your actual file
name has wildcard or this escape char
inside. For more examples, see Folder
and file filter examples.

OPTION 3: a list of files Indicates to copy a specified file set. No


- fileListPath Point to a text file that includes a list of
files you want to copy (one file per line,
with the relative path to the path
configured in the dataset).
When you use this option, don't
specify the file name in the dataset. For
more examples, see File list examples.

Additional settings

recursive Indicates whether the data is read No


recursively from the subfolders or only
from the specified folder. When
recursive is set to true and the sink is a
file-based store, an empty folder or
subfolder isn't copied or created at the
sink.
Allowed values are true (default) and
false.
This property doesn't apply when you
configure fileListPath .

deleteFilesAfterCompletion Indicates whether the binary files will No


be deleted from source store after
successfully moving to the destination
store. The file deletion is per file, so
when copy activity fails, you will see
some files have already been copied to
the destination and deleted from
source, while others are still remaining
on source store.
This property is only valid in binary
files copy scenario. The default value:
false.

modifiedDatetimeStart Files are filtered based on the attribute No


Last Modified.
The files are selected if their last
modified time is within the range of
modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to the UTC time zone in the
format of 2018-12-01T05:00:00Z.
The properties can be NULL, which
means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.
This property doesn't apply when you
configure fileListPath .

modifiedDatetimeEnd Same as above. No

enablePartitionDiscovery For files that are partitioned, specify No


whether to parse the partitions from
the file path and add them as
additional source columns.
Allowed values are false (default) and
true .

partitionRootPath When partition discovery is enabled, No


specify the absolute root path in order
to read partitioned folders as data
columns.

If it is not specified, by default,


- When you use file path in dataset or
list of files on source, partition root
path is the path configured in dataset.
- When you use wildcard folder filter,
partition root path is the sub-path
before the first wildcard.

For example, assuming you configure


the path in dataset as
"root/folder/year=2020/month=08/da
y=27":
- If you specify partition root path as
"root/folder/year=2020", copy activity
will generate two more columns
month and day with value "08" and
"27" respectively, in addition to the
columns inside the files.
- If partition root path is not specified,
no extra column will be generated.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "SftpReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

SFTP as a sink
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for SFTP under storeSettings settings in a format-based Copy sink:

PROPERTY    DESCRIPTION    REQUIRED

type The type property under Yes


storeSettings must be set to
SftpWriteSettings.

copyBehavior Defines the copy behavior when the No


source is files from a file-based data
store.

Allowed values are:


- PreserveHierarchy (default):
Preserves the file hierarchy in the
target folder. The relative path of the
source file to the source folder is
identical to the relative path of the
target file to the target folder.
- FlattenHierarchy : All files from the
source folder are in the first level of the
target folder. The target files have
autogenerated names.
- MergeFiles : Merges all files from
the source folder to one file. If the file
name is specified, the merged file
name is the specified name. Otherwise,
it's an autogenerated file name.

maxConcurrentConnections The upper limit of concurrent No


connections established to the data
store during the activity run. Specify a
value only when you want to limit
concurrent connections.

useTempFileRename Indicate whether to upload to No. Default value is true.


temporary files and rename them, or
directly write to the target folder or file
location. By default, Azure Data
Factory first writes to temporary files
and then renames them when the
upload is finished. This sequence helps
to (1) avoid conflicts that might result
in a corrupted file if you have other
processes writing to the same file, and
(2) ensure that the original version of
the file exists during the transfer. If
your SFTP server doesn't support a
rename operation, disable this option
and make sure that you don't have a
concurrent write to the target file. For
more information, see the
troubleshooting tip at the end of this
table.

operationTimeout The wait time before each write No


request to SFTP server times out.
Default value is 60 min (01:00:00).

TIP
If you receive the error "UserErrorSftpPathNotFound," "UserErrorSftpPermissionDenied," or "SftpOperationFail" when
you're writing data into SFTP, and the SFTP user you use does have the proper permissions, check whether your
SFTP server's file rename operation is working. If it isn't, disable the Upload with temp file (
useTempFileRename ) option and try again. To learn more about this property, see the preceding table. If you use a self-
hosted integration runtime for the Copy activity, be sure to use version 4.6 or later.
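A hedged sketch of a copy sink that disables the temporary-file rename behavior follows; the property names come from the table above, and the surrounding activity is omitted:

"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "SftpWriteSettings",
        "copyBehavior": "PreserveHierarchy",
        "useTempFileRename": false
    }
}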
Example:

"activities":[
{
"name": "CopyToSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BinarySink",
"storeSettings":{
"type": "SftpWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]

Folder and file filter examples


This section describes the behavior that results from using wildcard filters with folder paths and file names.

In each of the following examples, the source folder structure is FolderA (containing File1.csv, File2.json, and Subfolder1, which contains File3.csv, File4.json, and File5.csv) plus a sibling folder AnotherFolderB (containing File6.csv). Files not listed as retrieved are skipped.

folderPath: Folder* | fileName: (empty, use default) | recursive: false
Retrieved files: FolderA/File1.csv, FolderA/File2.json.

folderPath: Folder* | fileName: (empty, use default) | recursive: true
Retrieved files: FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, FolderA/Subfolder1/File5.csv.

folderPath: Folder* | fileName: *.csv | recursive: false
Retrieved files: FolderA/File1.csv.

folderPath: Folder* | fileName: *.csv | recursive: true
Retrieved files: FolderA/File1.csv, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File5.csv.

File list examples


This section describes the behavior that results from using a file list path in the Copy activity source. It assumes
that you have the following source folder structure and want to copy File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv:

Sample source structure:
    root
        FolderA
            File1.csv
            File2.json
            Subfolder1
                File3.csv
                File4.json
                File5.csv
        Metadata
            FileListToCopy.txt

Content of FileListToCopy.txt:
    File1.csv
    Subfolder1/File3.csv
    Subfolder1/File5.csv

Azure Data Factory configuration:
    In the dataset, set Folder path to root/FolderA.
    In the Copy activity source, set File list path to root/Metadata/FileListToCopy.txt.

The file list path points to a text file in the same data store that includes a list of files you want to copy (one file per
line, with the relative path to the path configured in the dataset).
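Assuming a delimited text dataset, a sketch of the matching copy source settings might look like this; the fileListPath value mirrors the example above:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "SftpReadSettings",
        "fileListPath": "root/Metadata/FileListToCopy.txt"
    }
}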

Lookup activity properties


For information about Lookup activity properties, see Lookup activity in Azure Data Factory.

GetMetadata activity properties


For information about GetMetadata activity properties, see GetMetadata activity in Azure Data Factory.

Delete activity properties


For information about Delete activity properties, see Delete activity in Azure Data Factory.

Legacy models
NOTE
The following models are still supported as is for backward compatibility. We recommend that you use the previously
discussed new model, because the Azure Data Factory authoring UI has switched to generating the new model.

Legacy dataset model


PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to FileShare.

folderPath The path to the folder. A wildcard filter Yes


is supported. Allowed wildcards are *
(matches zero or more characters) and
? (matches zero or a single
character); use ^ to escape if your
actual file name has a wildcard or this
escape char inside.

Examples: rootfolder/subfolder/, see


more examples in Folder and file filter
examples.

fileName Name or wildcard filter for the files No


under the specified "folderPath". If you
don't specify a value for this property,
the dataset points to all files in the
folder.

For filter, the allowed wildcards are *


(matches zero or more characters) and
? (matches zero or a single
character).
- Example 1: "fileName": "*.csv"
- Example 2:
"fileName": "???20180427.txt"
Use ^ to escape if your actual folder
name has wildcard or this escape char
inside.

modifiedDatetimeStart Files are filtered based on the attribute No


Last Modified. The files are selected if
their last modified time is within the
range of modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of 2018-12-01T05:00:00Z.

The overall performance of data


movement will be affected by enabling
this setting when you want to do file
filter from large numbers of files.

The properties can be NULL, which


means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.

modifiedDatetimeEnd Files are filtered based on the attribute No


Last Modified. The files are selected if
their last modified time is within the
range of modifiedDatetimeStart to
modifiedDatetimeEnd . The time is
applied to UTC time zone in the format
of 2018-12-01T05:00:00Z.

The overall performance of data


movement will be affected by enabling
this setting when you want to do file
filter from large numbers of files.

The properties can be NULL, which


means that no file attribute filter is
applied to the dataset. When
modifiedDatetimeStart has a
datetime value but
modifiedDatetimeEnd is NULL, it
means that the files whose last
modified attribute is greater than or
equal to the datetime value are
selected. When
modifiedDatetimeEnd has a datetime
value but modifiedDatetimeStart is
NULL, it means that the files whose
last modified attribute is less than the
datetime value are selected.

format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both input and
output dataset definitions.

If you want to parse files with a specific


format, the following file format types
are supported: TextFormat,
JsonFormat, AvroFormat, OrcFormat,
and ParquetFormat. Set the type
property under format to one of these
values. For more information, see Text
format, Json format, Avro format, Orc
format, and Parquet format sections.

compression Specify the type and level of No


compression for the data. For more
information, see Supported file formats
and compression codecs.
Supported types are GZip, Deflate,
BZip2, and ZipDeflate.
Supported levels are Optimal and
Fastest.

TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a specified name, specify folderPath with the folder part and fileName with the file name.
To copy a subset of files under a folder, specify folderPath with the folder part and fileName with the wildcard filter.

NOTE
If you were using fileFilter property for the file filter, it is still supported as is, but we recommend that you use the new
filter capability added to fileName from now on.

Example:
{
"name": "SFTPDataset",
"type": "Datasets",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

Legacy Copy activity source model


type - The type property of the Copy activity source must be set to FileSystemSource. Required: Yes.
recursive - Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, empty folders and subfolders won't be copied or created at the sink. Allowed values are true (default) and false. Required: No.
maxConcurrentConnections - The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. Required: No.

Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<SFTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]

Next steps
For a list of data stores that are supported as sources and sinks by the Copy activity in Azure Data Factory, see
supported data stores.
Copy data from SharePoint Online List by using
Azure Data Factory
6/8/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use Copy Activity in Azure Data Factory to copy data from SharePoint Online List.
The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities
This SharePoint Online List connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SharePoint Online List to any supported sink data store. For a list of data stores that
Copy Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this SharePoint List Online connector uses service principal authentication and retrieves data via
OData protocol.

TIP
This connector supports copying data from SharePoint Online List but not file. Learn how to copy file from Copy file from
SharePoint Online section.

Prerequisites
The SharePoint List Online connector uses service principal authentication to connect to SharePoint. Follow
these steps to set it up:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant SharePoint Online site permission to your registered application:

NOTE
This operation requires SharePoint Online site owner permission. You can find the owner by going to the site
home page -> click the "X members" in the right corner -> check who has the "Owner" role.

a. Open SharePoint Online site link e.g. https://[your_site_url]/_layouts/15/appinv.aspx (replace the


site URL).
b. Search for the application ID you registered, fill in the empty fields, and click "Create".
App Domain: localhost.com
Redirect URL: https://www.localhost.com
Permission Request XML:

<AppPermissionRequests AllowAppOnlyPolicy="true">
<AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="Read"/>
</AppPermissionRequests>

c. Click "Trust It" for this app.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to SharePoint Online List connector.

Linked service properties


The following properties are supported for an SharePoint Online List linked service:

type - The type property must be set to: SharePointOnlineList. Required: Yes.
siteUrl - The SharePoint Online site URL, e.g. https://contoso.sharepoint.com/sites/siteName. Required: Yes.
servicePrincipalId - The Application (client) ID of the application registered in Azure Active Directory. Required: Yes.
servicePrincipalKey - The application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.
tenantId - The tenant ID under which your application resides. Required: Yes.
connectVia - The Integration Runtime to use to connect to the data store. Learn more from Prerequisites, earlier in this article. If not specified, the default Azure Integration Runtime is used. Required: No.

Example:

{
"name": "SharePointOnlineList",
"properties": {
"type": "SharePointOnlineList",
"typeProperties": {
"siteUrl": "<site URL>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenantId": "<tenant ID>"
}
}
}

Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following section provides a list of the properties supported by the SharePoint Online List dataset.

type - The type property of the dataset must be set to SharePointOnlineListResource. Required: Yes.
listName - The name of the SharePoint Online List. Required: Yes.

Example
{
"name": "SharePointOnlineListDataset",
"properties":
{
"type": "SharePointOnlineListResource",
"linkedServiceName": {
"referenceName": "<SharePoint Online List linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"listName":"<name of the list>"
}
}
}

Copy Activity properties


For a full list of sections and properties that are available for defining activities, see Pipelines. The following
section provides a list of the properties supported by the SharePoint Online List source.
SharePoint Online List as source
To copy data from SharePoint Online List, the following properties are supported in the Copy Activity source
section:

type - The type property of the Copy Activity source must be set to SharePointOnlineListSource. Required: Yes.
query - Custom OData query options for filtering data. Example: "$top=10&$select=Title,Number". Required: No.
httpRequestTimeout - The timeout (in seconds) for the HTTP request to get a response. Default is 300 (5 minutes). Required: No.

Example
"activities":[
{
"name": "CopyFromSharePointOnlineList",
"type": "Copy",
"inputs": [
{
"referenceName": "<SharePoint Online List input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source":{
"type":"SharePointOnlineListSource",
"query":"<ODataquerye.g.$top=10&$select=Title,Number>"
},
"sink": {
"type": "<sink type>"
}
}
}
]

NOTE
In Azure Data Factory, you can't select more than one choice data type for a SharePoint Online List source.

Data type mapping for SharePoint Online List


When you copy data from SharePoint Online List, the following mappings are used between SharePoint Online
List data types and Azure Data Factory interim data types.

SHAREPOINT ONLINE DATA TYPE - ODATA DATA TYPE - AZURE DATA FACTORY INTERIM DATA TYPE
Single line of text - Edm.String - String
Multiple lines of text - Edm.String - String
Choice (menu to choose from) - Edm.String - String
Number (1, 1.0, 100) - Edm.Double - Double
Currency ($, ¥, €) - Edm.Double - Double
Date and Time - Edm.DateTime - DateTime
Lookup (information already on this site) - Edm.Int32 - Int32
Yes/No (check box) - Edm.Boolean - Boolean
Person or Group - Edm.Int32 - Int32
Hyperlink or Picture - Edm.String - String
Calculated (calculation based on other columns) - Edm.String / Edm.Double / Edm.DateTime / Edm.Boolean - String / Double / DateTime / Boolean
Attachment - Not supported
Task Outcome - Not supported
External Data - Not supported
Managed Metadata - Not supported

Copy file from SharePoint Online


You can copy files from SharePoint Online by using a Web activity to authenticate and grab an access token from
SPO, and then passing the token to a subsequent Copy activity that copies the data with the HTTP connector as source.

1. Follow the Prerequisites section to create AAD application and grant permission to SharePoint Online.
2. Create a Web Activity to get the access token from SharePoint Online:
URL : https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2 . Replace the tenant ID.
Method : POST
Headers :
Content-Type: application/x-www-form-urlencoded
Body :
grant_type=client_credentials&client_id=[Client-ID]@[Tenant-ID]&client_secret=[Client-
Secret]&resource=00000003-0000-0ff1-ce00-000000000000/[Tenant-Name].sharepoint.com@[Tenant-ID]
. Replace the client ID (application ID), client secret (application key), tenant ID, and tenant name (of the
SharePoint tenant).
Caution

Set the Secure Output option to true in Web activity to prevent the token value from being logged in
plain text. Any further activities that consume this value should have their Secure Input option set to true.
3. Chain with a Copy activity with HTTP connector as source to copy SharePoint Online file content:
HTTP linked service:
Base URL :
https://[site-url]/_api/web/GetFileByServerRelativeUrl('[relative-path-to-file]')/$value .
Replace the site URL and relative path to file. Sample relative path to file as
/sites/site2/Shared Documents/TestBook.xlsx .
Authentication type: Anonymous (to use the Bearer token configured in copy activity source
later)
Dataset: choose the format you want. To copy file as-is, select "Binary" type.
Copy activity source:
Request method : GET
Additional header: use the following expression, which uses the Bearer token generated by the upstream Web activity as the authorization header (replace the Web activity name):
@{concat('Authorization: Bearer ', activity('<Web-activity-name>').output.access_token)}
Configure the copy activity sink as usual.
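For illustration, a minimal sketch of such a Web activity follows; the activity name is arbitrary and the bracketed values are placeholders you replace with your own tenant and app registration details:

{
    "name": "GetSharePointAccessToken",
    "type": "WebActivity",
    "policy": {
        "secureOutput": true
    },
    "typeProperties": {
        "url": "https://accounts.accesscontrol.windows.net/<tenant-ID>/tokens/OAuth/2",
        "method": "POST",
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded"
        },
        "body": "grant_type=client_credentials&client_id=<client-ID>@<tenant-ID>&client_secret=<client-secret>&resource=00000003-0000-0ff1-ce00-000000000000/<tenant-name>.sharepoint.com@<tenant-ID>"
    }
}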

NOTE
Even if an Azure AD application has FullControl permissions on SharePoint Online, you can't copy files from document
libraries with IRM enabled.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from Shopify using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Shopify. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Shopify connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Shopify to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Shopify connector.

Linked service properties


The following properties are supported for Shopify linked service:

type - The type property must be set to: Shopify. Required: Yes.
host - The endpoint of the Shopify server (that is, mystore.myshopify.com). Required: Yes.
accessToken - The API access token that can be used to access Shopify's data. The token does not expire if it is in offline mode. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes.
useEncryptedEndpoints - Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. Required: No.
useHostVerification - Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. Required: No.
usePeerVerification - Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. Required: No.

Example:

{
"name": "ShopifyLinkedService",
"properties": {
"type": "Shopify",
"typeProperties": {
"host" : "mystore.myshopify.com",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Shopify dataset.
To copy data from Shopify, set the type property of the dataset to ShopifyObject . The following properties are
supported:

type - The type property of the dataset must be set to: ShopifyObject. Required: Yes.
tableName - Name of the table. Required: No (if "query" in the activity source is specified).

Example

{
"name": "ShopifyDataset",
"properties": {
"type": "ShopifyObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Shopify linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Shopify source.
Shopify as source
To copy data from Shopify, set the source type in the copy activity to ShopifySource . The following properties
are supported in the copy activity source section:

type - The type property of the copy activity source must be set to: ShopifySource. Required: Yes.
query - Use the custom SQL query to read data. For example: "SELECT * FROM "Products" WHERE Product_Id = '123'". Required: No (if "tableName" in the dataset is specified).

Example:
"activities":[
{
"name": "CopyFromShopify",
"type": "Copy",
"inputs": [
{
"referenceName": "<Shopify input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ShopifySource",
"query": "SELECT * FROM \"Products\" WHERE Product_Id = '123'"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Snowflake by using
Azure Data Factory
5/26/2021 • 13 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Snowflake, and
use Data Flow to transform data in Snowflake. For more information about Data Factory, see the introductory
article.

Supported capabilities
This Snowflake connector is supported for the following activities:
Copy activity with a supported source/sink matrix table
Mapping data flow
Lookup activity
For the Copy activity, this Snowflake connector supports the following functions:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best
performance.
Copy data to Snowflake that takes advantage of Snowflake's COPY into [table] command to achieve the best
performance. It supports Snowflake on Azure.
If a proxy is required to connect to Snowflake from a self-hosted Integration Runtime, you must configure the
environment variables for HTTP_PROXY and HTTPS_PROXY on the Integration Runtime host.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to a Snowflake
connector.

Linked service properties


The following properties are supported for a Snowflake-linked service.

type - The type property must be set to Snowflake. Required: Yes.
connectionString - Specifies the information needed to connect to the Snowflake instance. You can choose to put the password or the entire connection string in Azure Key Vault. Refer to the examples below the table, as well as the Store credentials in Azure Key Vault article, for more details.
Some typical settings:
- Account name: The full account name of your Snowflake account (including additional segments that identify the region and cloud platform), e.g. xy12345.east-us-2.azure.
- User name: The login name of the user for the connection.
- Password: The password for the user.
- Database: The default database to use once connected. It should be an existing database for which the specified role has privileges.
- Warehouse: The virtual warehouse to use once connected. It should be an existing warehouse for which the specified role has privileges.
- Role: The default access control role to use in the Snowflake session. The specified role should be an existing role that has already been assigned to the specified user. The default role is PUBLIC.
Required: Yes.
connectVia - The integration runtime that is used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, it uses the default Azure integration runtime. Required: No.

Example:

{
"name": "SnowflakeLinkedService",
"properties": {
"type": "Snowflake",
"typeProperties": {
"connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=
<username>&password=<password>&db=<database>&warehouse=<warehouse>&role=<myRole>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Password in Azure Key Vault:

{
"name": "SnowflakeLinkedService",
"properties": {
"type": "Snowflake",
"typeProperties": {
"connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=<username>&db=
<database>&warehouse=<warehouse>&role=<myRole>",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for the Snowflake dataset.

type - The type property of the dataset must be set to SnowflakeTable. Required: Yes.
schema - Name of the schema. Note that the schema name is case-sensitive in ADF. Required: No for source, yes for sink.
table - Name of the table/view. Note that the table name is case-sensitive in ADF. Required: No for source, yes for sink.

Example:

{
"name": "SnowflakeDataset",
"properties": {
"type": "SnowflakeTable",
"typeProperties": {
"schema": "<Schema name for your Snowflake database>",
"table": "<Table name for your Snowflake database>"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Snowflake source and sink.
Snowflake as the source
Snowflake connector utilizes Snowflake’s COPY into [location] command to achieve the best performance.
If sink data store and format are natively supported by the Snowflake COPY command, you can use the Copy
activity to directly copy from Snowflake to sink. For details, see Direct copy from Snowflake. Otherwise, use
built-in Staged copy from Snowflake.
To copy data from Snowflake, the following properties are supported in the Copy activity source section.

type - The type property of the Copy activity source must be set to SnowflakeSource. Required: Yes.
query - Specifies the SQL query to read data from Snowflake. If the names of the schema, table and columns contain lower case, quote the object identifier in the query, e.g. select * from "schema"."myTable". Executing a stored procedure is not supported. Required: No.
exportSettings - Advanced settings used to retrieve data from Snowflake. You can configure the ones supported by the COPY into command that Data Factory will pass through when you invoke the statement. Required: No.

Under exportSettings:

type - The type of export command, set to SnowflakeExportCopyCommand. Required: Yes.
additionalCopyOptions - Additional copy options, provided as a dictionary of key-value pairs. Examples: MAX_FILE_SIZE, OVERWRITE. For more information, see Snowflake Copy Options. Required: No.
additionalFormatOptions - Additional file format options that are provided to the COPY command as a dictionary of key-value pairs. Examples: DATE_FORMAT, TIME_FORMAT, TIMESTAMP_FORMAT. For more information, see Snowflake Format Type Options. Required: No.

Direct copy from Snowflake


If your sink data store and format meet the criteria described in this section, you can use the Copy activity to
directly copy from Snowflake to the sink. Data Factory checks the settings and fails the Copy activity run if the
following criteria are not met:
The sink linked service is Azure Blob storage with shared access signature authentication. If you
want to directly copy data to Azure Data Lake Storage Gen2 in the following supported format, you can
create an Azure Blob linked service with SAS authentication against your ADLS Gen2 account, to avoid
using staged copy from Snowflake.
The sink data format is of Parquet , delimited text , or JSON with the following configurations:
For Parquet format, the compression codec is None , Snappy , or Lzo .
For delimited text format:
rowDelimiter is \r\n , or any single character.
compression can be no compression , gzip , bzip2 , or deflate .
encodingName is left as default or set to utf-8 .
quoteChar is double quote , single quote , or empty string (no quote char).
For JSON format, direct copy only supports the case that source Snowflake table or query result only
has single column and the data type of this column is VARIANT , OBJECT , or ARRAY .
compression can be no compression , gzip , bzip2 , or deflate .
encodingName is left as default or set to utf-8 .
filePattern in copy activity sink is left as default or set to setOfObjects .
In copy activity source, additionalColumns is not specified.
Column mapping is not specified.
Example:
"activities":[
{
"name": "CopyFromSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Snowflake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SnowflakeSource",
"sqlReaderQuery": "SELECT * FROM MYTABLE",
"exportSettings": {
"type": "SnowflakeExportCopyCommand",
"additionalCopyOptions": {
"MAX_FILE_SIZE": "64000000",
"OVERWRITE": true
},
"additionalFormatOptions": {
"DATE_FORMAT": "'MM/DD/YYYY'"
}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

Staged copy from Snowflake


When your sink data store or format is not natively compatible with the Snowflake COPY command, as
mentioned in the last section, enable the built-in staged copy using an interim Azure Blob storage instance. The
staged copy feature also provides you better throughput. Data Factory exports data from Snowflake into staging
storage, then copies the data to sink, and finally cleans up your temporary data from the staging storage. See
Staged copy for details about copying data by using staging.
To use this feature, create an Azure Blob storage linked service that refers to the Azure storage account as the
interim staging. Then specify the enableStaging and stagingSettings properties in the Copy activity.

NOTE
The staging Azure Blob storage linked service must use shared access signature authentication, as required by the
Snowflake COPY command.

Example:
"activities":[
{
"name": "CopyFromSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Snowflake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SnowflakeSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]

Snowflake as sink
Snowflake connector utilizes Snowflake’s COPY into [table] command to achieve the best performance. It
supports writing data to Snowflake on Azure.
If source data store and format are natively supported by Snowflake COPY command, you can use the Copy
activity to directly copy from source to Snowflake. For details, see Direct copy to Snowflake. Otherwise, use
built-in Staged copy to Snowflake.
To copy data to Snowflake, the following properties are supported in the Copy activity sink section.

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the Copy activity sink, set to SnowflakeSink. | Yes
preCopyScript | Specify a SQL query for the Copy activity to run before writing data into Snowflake in each run. Use this property to clean up the preloaded data. | No
importSettings | Advanced settings used to write data into Snowflake. You can configure the ones supported by the COPY into command that Data Factory will pass through when you invoke the statement. | No

Under importSettings:

PROPERTY | DESCRIPTION | REQUIRED

type | The type of import command, set to SnowflakeImportCopyCommand. | Yes
additionalCopyOptions | Additional copy options, provided as a dictionary of key-value pairs. Examples: ON_ERROR, FORCE, LOAD_UNCERTAIN_FILES. For more information, see Snowflake Copy Options. | No
additionalFormatOptions | Additional file format options provided to the COPY command, provided as a dictionary of key-value pairs. Examples: DATE_FORMAT, TIME_FORMAT, TIMESTAMP_FORMAT. For more information, see Snowflake Format Type Options. | No

Direct copy to Snowflake


If your source data store and format meet the criteria described in this section, you can use the Copy activity to
directly copy from the source to Snowflake. Azure Data Factory checks the settings and fails the Copy activity run if
the following criteria are not met:
The source linked service is Azure Blob storage with shared access signature authentication. If
you want to directly copy data from Azure Data Lake Storage Gen2 in the following supported format,
you can create an Azure Blob linked service with SAS authentication against your ADLS Gen2 account, to
avoid using staged copy to Snowflake.
The source data format is Parquet, delimited text, or JSON with the following configurations:
For Parquet format, the compression codec is None, or Snappy.
For delimited text format:
rowDelimiter is \r\n, or any single character. If the row delimiter is not "\r\n", firstRowAsHeader
needs to be false, and skipLineCount is not specified.
compression can be no compression , gzip , bzip2 , or deflate .
encodingName is left as default or set to "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE",
"BIG5", "EUC-JP", "EUC-KR", "GB18030", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-
8859-2", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "WINDOWS-
1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254",
"WINDOWS-1255".
quoteChar is double quote , single quote , or empty string (no quote char).
For JSON format, direct copy only supports the case that sink Snowflake table only has single
column and the data type of this column is VARIANT , OBJECT , or ARRAY .
compression can be no compression , gzip , bzip2 , or deflate .
encodingName is left as default or set to utf-8 .
Column mapping is not specified.
In the Copy activity source:
additionalColumns is not specified.
If your source is a folder, recursive is set to true.
prefix , modifiedDateTimeStart , modifiedDateTimeEnd , and enablePartitionDiscovery are not
specified.
Example:

"activities":[
{
"name": "CopyToSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Snowflake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SnowflakeSink",
"importSettings": {
"type": "SnowflakeImportCopyCommand",
"copyOptions": {
"FORCE": "TRUE",
"ON_ERROR": "SKIP_FILE",
},
"fileFormatOptions": {
"DATE_FORMAT": "YYYY-MM-DD",
}
}
}
}
}
]

Staged copy to Snowflake


When your source data store or format is not natively compatible with the Snowflake COPY command, as
mentioned in the last section, enable the built-in staged copy using an interim Azure Blob storage instance. The
staged copy feature also provides you better throughput. Data Factory automatically converts the data to meet
the data format requirements of Snowflake. It then invokes the COPY command to load data into Snowflake.
Finally, it cleans up your temporary data from the blob storage. See Staged copy for details about copying data
using staging.
To use this feature, create an Azure Blob storage linked service that refers to the Azure storage account as the
interim staging. Then specify the enableStaging and stagingSettings properties in the Copy activity.
NOTE
The staging Azure Blob storage linked service needs to use shared access signature authentication, as required by the
Snowflake COPY command.

Example:

"activities":[
{
"name": "CopyToSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Snowflake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SnowflakeSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]

Mapping data flow properties


When transforming data in mapping data flow, you can read from and write to tables in Snowflake. For more
information, see the source transformation and sink transformation in mapping data flows. You can choose to
use a Snowflake dataset or an inline dataset as source and sink type.
Source transformation
The below table lists the properties supported by Snowflake source. You can edit these properties in the Source
options tab. The connector utilizes Snowflake internal data transfer.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Table | If you select Table as input, data flow will fetch all the data from the table specified in the Snowflake dataset or in the source options when using inline dataset. | No | String (for inline dataset only) | tableName, schemaName
Query | If you select Query as input, enter a query to fetch data from Snowflake. This setting overrides any table that you've chosen in dataset. If the names of the schema, table and columns contain lower case, quote the object identifier in query e.g. select * from "schema"."myTable". | No | String | query

Snowflake source script examples


When you use Snowflake dataset as source type, the associated data flow script is:

source(allowSchemaDrift: true,
validateSchema: false,
query: 'select * from MYTABLE',
format: 'query') ~> SnowflakeSource

If you use inline dataset, the associated data flow script is:

source(allowSchemaDrift: true,
validateSchema: false,
format: 'query',
query: 'select * from MYTABLE',
store: 'snowflake') ~> SnowflakeSource

Sink transformation
The below table lists the properties supported by Snowflake sink. You can edit these properties in the Settings
tab. When using inline dataset, you will see additional settings, which are the same as the properties described
in dataset properties section. The connector utilizes Snowflake internal data transfer.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Update method | Specify what operations are allowed on your Snowflake destination. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts and deletes, a key column or columns must be set to determine which row to alter. | No | Array | keys
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. None: No action will be done to the table. Recreate: The table will get dropped and recreated. Required if creating a new table dynamically. Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate

Snowflake sink script examples


When you use Snowflake dataset as sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,
validateSchema: false,
deletable:true,
insertable:true,
updateable:true,
upsertable:false,
keys:['movieId'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> SnowflakeSink

If you use inline dataset, the associated data flow script is:
IncomingStream sink(allowSchemaDrift: true,
validateSchema: false,
format: 'table',
tableName: 'table',
schemaName: 'schema',
deletable: true,
insertable: true,
updateable: true,
upsertable: false,
store: 'snowflake',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> SnowflakeSink

Lookup activity properties


For more information about the properties, see Lookup activity.
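As an illustrative sketch only (the activity name, query, and dataset reference are placeholders rather than an official sample; verify the shape against the Lookup activity article), a Lookup activity that runs a query against a Snowflake dataset could look like this:

{
    "name": "LookupFromSnowflake",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SnowflakeSource",
            "query": "SELECT COUNT(*) AS row_count FROM MYTABLE"
        },
        "dataset": {
            "referenceName": "<Snowflake dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}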

Next steps
For a list of data stores supported as sources and sinks by Copy activity in Data Factory, see supported data
stores and formats.
Copy data from Spark using Azure Data Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Spark. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Spark connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Spark to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Spark connector.
Linked service properties
The following properties are supported for Spark linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to: Spark | Yes
host | IP address or host name of the Spark server | Yes
port | The TCP port that the Spark server uses to listen for client connections. If you connect to Azure HDInsights, specify port as 443. | Yes
serverType | The type of Spark server. Allowed values are: SharkServer, SharkServer2, SparkThriftServer | No
thriftTransportProtocol | The transport protocol to use in the Thrift layer. Allowed values are: Binary, SASL, HTTP | No
authenticationType | The authentication method used to access the Spark server. Allowed values are: Anonymous, Username, UsernameAndPassword, WindowsAzureHDInsightService | Yes
username | The user name that you use to access Spark Server. | No
password | The password corresponding to the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | No
httpPath | The partial URL corresponding to the Spark server. | No
enableSsl | Specifies whether the connections to the server are encrypted using TLS. The default value is false. | No
trustedCertPath | The full path of the .pem file containing trusted CA certificates for verifying the server when connecting over TLS. This property can only be set when using TLS on self-hosted IR. The default value is the cacerts.pem file installed with the IR. | No
useSystemTrustStore | Specifies whether to use a CA certificate from the system trust store or from a specified PEM file. The default value is false. | No
allowHostNameCNMismatch | Specifies whether to require a CA-issued TLS/SSL certificate name to match the host name of the server when connecting over TLS. The default value is false. | No
allowSelfSignedServerCert | Specifies whether to allow self-signed certificates from the server. The default value is false. | No
connectVia | The Integration Runtime to be used to connect to the data store. Learn more from Prerequisites section. If not specified, it uses the default Azure Integration Runtime. | No

Example:

{
"name": "SparkLinkedService",
"properties": {
"type": "Spark",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
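If you prefer not to keep the password inline, you can reference a secret stored in Azure Key Vault instead, as noted in the password property description above. A minimal sketch, assuming an Azure Key Vault linked service already exists (the linked service and secret names are placeholders):

{
    "name": "SparkLinkedService",
    "properties": {
        "type": "Spark",
        "typeProperties": {
            "host": "<cluster>.azurehdinsight.net",
            "port": "<port>",
            "authenticationType": "WindowsAzureHDInsightService",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        }
    }
}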

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Spark dataset.
To copy data from Spark, set the type property of the dataset to SparkObject . The following properties are
supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to: SparkObject | Yes
schema | Name of the schema. | No (if "query" in activity source is specified)
table | Name of the table. | No (if "query" in activity source is specified)
tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. | No (if "query" in activity source is specified)

Example

{
"name": "SparkDataset",
"properties": {
"type": "SparkObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Spark linked service name>",
"type": "LinkedServiceReference"
}
}
}
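If you want the dataset to point at a specific table rather than supplying a query in the copy activity source, you can fill in the schema and table type properties. A minimal sketch with placeholder names:

{
    "name": "SparkDataset",
    "properties": {
        "type": "SparkObject",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Spark linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}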

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Spark source.
Spark as source
To copy data from Spark, set the source type in the copy activity to SparkSource . The following properties are
supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to: SparkSource | Yes
query | Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". | No (if "tableName" in dataset is specified)

Example:
"activities":[
{
"name": "CopyFromSpark",
"type": "Copy",
"inputs": [
{
"referenceName": "<Spark input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SparkSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data to and from SQL Server
by using Azure Data Factory
7/16/2021 • 26 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the copy activity in Azure Data Factory to copy data from and to SQL Server
database and use Data Flow to transform data in SQL Server database. To learn about Azure Data Factory, read
the introductory article.

Supported capabilities
This SQL Server connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
You can copy data from a SQL Server database to any supported sink data store. Or, you can copy data from any
supported source data store to a SQL Server database. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.
Specifically, this SQL Server connector supports:
SQL Server version 2005 and above.
Copying data by using SQL or Windows authentication.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to parallel copy
from SQL Server source, see the Parallel copy from SQL database section for details.
As a sink, automatically creating destination table if not exists based on the source schema; appending data
to a table or invoking a stored procedure with custom logic during copy.
SQL Server Express LocalDB is not supported.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the SQL Server database connector.

Linked service properties


The following properties are supported for the SQL Server linked service:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property must be set to SqlServer. | Yes
connectionString | Specify connectionString information that's needed to connect to the SQL Server database by using either SQL authentication or Windows authentication. Refer to the following samples. You also can put a password in Azure Key Vault. If it's SQL authentication, pull the password configuration out of the connection string. For more information, see the JSON example following the table and Store credentials in Azure Key Vault. | Yes
userName | Specify a user name if you use Windows authentication. An example is domainname\username. | No
password | Specify a password for the user account you specified for the user name. Mark this field as SecureString to store it securely in Azure Data Factory. Or, you can reference a secret stored in Azure Key Vault. | No
alwaysEncryptedSettings | Specify alwaysencryptedsettings information that's needed to enable Always Encrypted to protect sensitive data stored in SQL server by using either managed identity or service principal. For more information, see the JSON example following the table and Using Always Encrypted section. If not specified, the default always encrypted setting is disabled. | No
connectVia | This integration runtime is used to connect to the data store. Learn more from Prerequisites section. If not specified, the default Azure integration runtime is used. | No

NOTE
SQL Server Always Encrypted is not supported in data flow.

TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached," add Pooling=false to your connection string and try again.

Example 1: Use SQL authentication

{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 2: Use SQL authentication with a password in Azure Key Vault

{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example 3: Use Windows authentication

{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=True;",
"userName": "<domain\\username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example 4: Use Always Encr ypted

{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by the SQL Server dataset.
To copy data from and to a SQL Server database, the following properties are supported:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the dataset must be set to SqlServerTable. | Yes
schema | Name of the schema. | No for source, Yes for sink
table | Name of the table/view. | No for source, Yes for sink
tableName | Name of the table/view with schema. This property is supported for backward compatibility. For new workload, use schema and table. | No for source, Yes for sink

Example

{
"name": "SQLServerDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<SQL Server linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}

Copy activity properties


For a full list of sections and properties available for use to define activities, see the Pipelines article. This section
provides a list of properties supported by the SQL Server source and sink.
SQL Server as a source

TIP
To load data from SQL Server efficiently by using data partitioning, learn more from Parallel copy from SQL database.

To copy data from SQL Server, set the source type in the copy activity to SqlSource . The following properties
are supported in the copy activity source section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity source must be set to SqlSource. | Yes
sqlReaderQuery | Use the custom SQL query to read data. An example is select * from MyTable. | No
sqlReaderStoredProcedureName | This property is the name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No
storedProcedureParameters | These parameters are for the stored procedure. Allowed values are name or value pairs. The names and casing of parameters must match the names and casing of the stored procedure parameters. | No
isolationLevel | Specifies the transaction locking behavior for the SQL source. The allowed values are: ReadCommitted, ReadUncommitted, RepeatableRead, Serializable, Snapshot. If not specified, the database's default isolation level is used. Refer to this doc for more details. | No
partitionOptions | Specifies the data partitioning options used to load data from SQL Server. Allowed values are: None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism to concurrently load data from SQL Server is controlled by the parallelCopies setting on the copy activity. | No
partitionSettings | Specify the group of the settings for data partitioning. Apply when the partition option isn't None. | No

Under partitionSettings:

PROPERTY | DESCRIPTION | REQUIRED

partitionColumnName | Specify the name of the source column in integer or date/datetime type (int, smallint, bigint, date, smalldatetime, datetime, datetime2, or datetimeoffset) that will be used by range partitioning for parallel copy. If not specified, the index or the primary key of the table is auto-detected and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfDynamicRangePartitionCondition in the WHERE clause. For an example, see the Parallel copy from SQL database section. | No
partitionUpperBound | The maximum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, copy activity auto detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from SQL database section. | No
partitionLowerBound | The minimum value of the partition column for partition range splitting. This value is used to decide the partition stride, not for filtering the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, copy activity auto detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from SQL database section. | No

Note the following points:


If sqlReaderQuery is specified for SqlSource, the copy activity runs this query against the SQL Server
source to get the data. You also can specify a stored procedure by specifying
sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes
parameters.
When using a stored procedure in the source to retrieve data, note that if your stored procedure is designed to
return a different schema when a different parameter value is passed in, you may encounter a failure or see an
unexpected result when importing schema from the UI or when copying data to a SQL database with auto table
creation.
Example: Use SQL query
"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Example: Use a stored procedure

"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]

The stored procedure definition


CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

SQL Server as a sink

TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into SQL Server.

To copy data to SQL Server, set the sink type in the copy activity to SqlSink . The following properties are
supported in the copy activity sink section:

PROPERTY | DESCRIPTION | REQUIRED

type | The type property of the copy activity sink must be set to SqlSink. | Yes
preCopyScript | This property specifies a SQL query for the copy activity to run before writing data into SQL Server. It's invoked only once per copy run. You can use this property to clean up the preloaded data. | No
tableOption | Specifies whether to automatically create the sink table if not exists based on the source schema. Auto table creation is not supported when sink specifies stored procedure. Allowed values are: none (default), autoCreate. | No
sqlWriterStoredProcedureName | The name of the stored procedure that defines how to apply source data into a target table. This stored procedure is invoked per batch. For operations that run only once and have nothing to do with source data, for example, delete or truncate, use the preCopyScript property. See example from Invoke a stored procedure from a SQL sink. | No
storedProcedureTableTypeParameterName | The parameter name of the table type specified in the stored procedure. | No
sqlWriterTableType | The table type name to be used in the stored procedure. The copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data that's being copied with existing data. | No
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name and value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No
writeBatchSize | Number of rows to insert into the SQL table per batch. Allowed values are integers for the number of rows. By default, Azure Data Factory dynamically determines the appropriate batch size based on the row size. | No
writeBatchTimeout | This property specifies the wait time for the batch insert operation to complete before it times out. Allowed values are for the timespan. An example is "00:30:00" for 30 minutes. If no value is specified, the timeout defaults to "02:00:00". | No
maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No

Example 1: Append data


"activities":[
{
"name": "CopyToSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Server output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"tableOption": "autoCreate",
"writeBatchSize": 100000
}
}
}
]

Example 2: Invoke a stored procedure during copy


Learn more details from Invoke a stored procedure from a SQL sink.
"activities":[
{
"name": "CopyToSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<SQL Server output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters",
"storedProcedureTableTypeParameterName": "MyTable",
"sqlWriterTableType": "MyTableType",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
}
]

Parallel copy from SQL database


The SQL Server connector in copy activity provides built-in data partitioning to copy data in parallel. You can
find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, copy activity runs parallel queries against your SQL Server source to load
data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For
example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on
your specified partition option and settings, and each query retrieves a portion of data from your SQL Server.
It is recommended to enable parallel copy with data partitioning, especially when you load a large amount of data
from your SQL Server. The following are suggested configurations for different scenarios. When copying data
into a file-based data store, it's recommended to write to a folder as multiple files (only specify folder name), in
which case the performance is better than writing to a single file.

Scenario: Full load from large table, with physical partitions.
Suggested settings: Partition option: Physical partitions of table.
During execution, Data Factory automatically detects the physical partitions, and copies data by partitions.
To check whether your table has a physical partition or not, you can refer to this query.

Scenario: Full load from large table, without physical partitions, but with an integer or datetime column for data partitioning.
Suggested settings: Partition options: Dynamic range partition.
Partition column (optional): Specify the column used to partition data. If not specified, the primary key column is used.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values, and it can take a long time depending on MIN and MAX values. It is recommended to provide the upper bound and lower bound.
For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.

Scenario: Load a large amount of data by using a custom query, without physical partitions, but with an integer or date/datetime column for data partitioning.
Suggested settings: Partition options: Dynamic range partition.
Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value.
During execution, Data Factory replaces ?AdfRangePartitionColumnName with the actual column name and value ranges for each partition, and sends to SQL Server.
For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions - IDs in range <=20, [21, 50], [51, 80], and >=81, respectively.

Here are more sample queries for different scenarios:


1. Query the whole table:
SELECT * FROM <TableName> WHERE ?
AdfDynamicRangePartitionCondition
2. Query from a table with column selection and additional
where-clause filters:
SELECT <column_list> FROM <TableName> WHERE ?
AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>
3. Query with subqueries:
SELECT <column_list> FROM (<your_sub_query>) AS T
WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>
4. Query with partition in subquery:
SELECT <column_list> FROM (SELECT
<your_sub_query_column_list> FROM <TableName> WHERE
?AdfDynamicRangePartitionCondition) AS T

Best practices to load data with partition option:


1. Choose distinctive column as partition column (like primary key or unique key) to avoid data skew.
2. If the table has built-in partition, use partition option "Physical partitions of table" to get better performance.
3. If you use Azure Integration Runtime to copy data, you can set larger "Data Integration Units (DIU)" (>4) to
utilize more computing resource. Check the applicable scenarios there.
4. "Degree of copy parallelism" control the partition numbers, setting this number too large sometime hurts
the performance, recommend setting this number as (DIU or number of Self-hosted IR nodes) * (2 to 4).
Example: full load from large table with physical par titions

"source": {
"type": "SqlSource",
"partitionOption": "PhysicalPartitionsOfTable"
}

Example: query with dynamic range partition


"source": {
"type": "SqlSource",
"query":"SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride,
not as data filter>",
"partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride,
not as data filter>"
}
}

Sample query to check physical partition

SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, pf.name AS PartitionFunctionName, c.name AS
ColumnName, iif(pf.name is null, 'no', 'yes') AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id
= t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
LEFT JOIN sys.partition_functions pf ON pf.function_id = ps.function_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'

If the table has a physical partition, the query returns "HasPartition" as "yes".

Best practice for loading data into SQL Server


When you copy data into SQL Server, you might require different write behavior:
Append: My source data has only new records.
Upsert: My source data has both inserts and updates.
Overwrite: I want to reload the entire dimension table each time.
Write with custom logic: I need extra processing before the final insertion into the destination table.
See the respective sections for how to configure in Azure Data Factory and best practices.
Append data
Appending data is the default behavior of this SQL Server sink connector. Azure Data Factory does a bulk insert
to write to your table efficiently. You can configure the source and sink accordingly in the copy activity.
Upsert data
Option 1: When you have a large amount of data to copy, you can bulk load all records into a staging table by
using the copy activity, then run a stored procedure activity to apply a MERGE or INSERT/UPDATE statement in
one shot.
Copy activity currently doesn't natively support loading data into a database temporary table. There is an
advanced way to set it up with a combination of multiple activities; refer to Optimize SQL Database Bulk Upsert
scenarios. The following shows a sample of using a permanent table as staging.
As an example, in Azure Data Factory, you can create a pipeline with a Copy activity chained with a Stored
Procedure activity. The former copies data from your source store into a SQL Server staging table, for
example, UpsertStagingTable, as the table name in the dataset. Then the latter invokes a stored procedure to
merge source data from the staging table into the target table and clean up the staging table, as sketched below.
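A hedged sketch of the second activity in that chain (the activity name, linked service reference, and the assumption that the copy activity is named "CopyToStagingTable" are placeholders, not part of the official sample):

{
    "name": "MergeFromStagingTable",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "<SQL Server linked service name>",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[spMergeData]"
    },
    "dependsOn": [
        {
            "activity": "CopyToStagingTable",
            "dependencyConditions": [ "Succeeded" ]
        }
    ]
}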

In your database, define a stored procedure with MERGE logic, like the following example, which is pointed to
from the previous stored procedure activity. Assume that the target is the Marketing table with three columns:
ProfileID, State, and Category. Do the upsert based on the ProfileID column.

CREATE PROCEDURE [dbo].[spMergeData]
AS
BEGIN
    MERGE TargetTable AS target
    USING UpsertStagingTable AS source
    ON (target.[ProfileID] = source.[ProfileID])
    WHEN MATCHED THEN
        UPDATE SET State = source.State
    WHEN NOT MATCHED THEN
        INSERT ([ProfileID], [State], [Category])
        VALUES (source.ProfileID, source.State, source.Category);

    TRUNCATE TABLE UpsertStagingTable
END

Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Overwrite the entire table
You can configure the preCopyScript property in a copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
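For example, a minimal sketch of a copy activity sink that truncates the target table before each load (the table name is a placeholder; you could use a DELETE statement instead if you need finer control):

"sink": {
    "type": "SqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.MyTargetTable",
    "writeBatchSize": 100000
}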
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you
need to apply extra processing before the final insertion of source data into the destination table, you can load
to a staging table then invoke stored procedure activity, or invoke a stored procedure in copy activity sink to
apply data.

Invoke a stored procedure from a SQL sink


When you copy data into SQL Server database, you also can configure and invoke a user-specified stored
procedure with additional parameters on each batch of the source table. The stored procedure feature takes
advantage of table-valued parameters.
You can use a stored procedure when built-in copy mechanisms don't serve the purpose. An example is when
you want to apply extra processing before the final insertion of source data into the destination table. Some
extra processing examples are when you want to merge columns, look up additional values, and insert into
more than one table.
The following sample shows how to use a stored procedure to do an upsert into a table in the SQL Server
database. Assume that the input data and the sink Marketing table each have three columns: ProfileID , State ,
and Categor y . Do the upsert based on the ProfileID column, and only apply it for a specific category called
"ProductA".
1. In your database, define the table type with the same name as sqlWriterTableType . The schema of the
table type is the same as the schema returned by your input data.

CREATE TYPE [dbo].[MarketingType] AS TABLE(
[ProfileID] [varchar](256) NOT NULL,
[State] [varchar](256) NOT NULL,
[Category] [varchar](256) NOT NULL
)

2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.

CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @category varchar(256)
AS
BEGIN
MERGE [dbo].[Marketing] AS target
USING @Marketing AS source
ON (target.ProfileID = source.ProfileID and target.Category = @category)
WHEN MATCHED THEN
UPDATE SET State = source.State
WHEN NOT MATCHED THEN
INSERT (ProfileID, State, Category)
VALUES (source.ProfileID, source.State, source.Category);
END

3. In Azure Data Factory, define the SQL sink section in the copy activity as follows:

"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}

Mapping data flow properties


When transforming data in mapping data flow, you can read and write to tables from SQL Server Database. For
more information, see the source transformation and sink transformation in mapping data flows.

NOTE
To access an on-premises SQL Server, you need to use the Azure Data Factory managed virtual network with a private endpoint.
Refer to this tutorial for detailed steps.
Source transformation
The below table lists the properties supported by SQL Server source. You can edit these properties in the
Source options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Table | If you select Table as input, data flow fetches all the data from the table specified in the dataset. | No | - | -
Query | If you select Query as input, specify a SQL query to fetch data from source, which overrides any table you specify in dataset. Using queries is a great way to reduce rows for testing or lookups. Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow. Query example: Select * from MyTable where customerId > 1000 and customerId < 2000 | No | String | query
Batch size | Specify a batch size to chunk large data into reads. | No | Integer | batchSize
Isolation Level | Choose one of the following isolation levels: Read Committed, Read Uncommitted (default), Repeatable Read, Serializable, None (ignore isolation level) | No | READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE, NONE | isolationLevel

SQL Server source script example


When you use SQL Server as source type, the associated data flow script is:
source(allowSchemaDrift: true,
validateSchema: false,
isolationLevel: 'READ_UNCOMMITTED',
query: 'select * from MYTABLE',
format: 'query') ~> SQLSource

Sink transformation
The below table lists the properties supported by SQL Server sink. You can edit these properties in the Sink
options tab.

NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY

Update method | Specify what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts and deletes, key column(s) must be set to determine which row to alter. The column name that you pick as the key will be used as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. | No | Array | keys
Skip writing key columns | If you wish to not write the value to the key column, select "Skip writing key columns". | No | true or false | skipKeyWrites
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. None: No action will be done to the table. Recreate: The table will get dropped and recreated. Required if creating a new table dynamically. Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate
Batch size | Specify how many rows are being written in each batch. Larger batch sizes improve compression and memory optimization, but risk out of memory exceptions when caching data. | No | Integer | batchSize
Pre and Post SQL scripts | Specify multi-line SQL scripts that will execute before (pre-processing) and after (post-processing) data is written to your Sink database. | No | String | preSQLs, postSQLs

SQL Server sink script example


When you use SQL Server as sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,
validateSchema: false,
deletable:false,
insertable:true,
updateable:true,
upsertable:true,
keys:['keyColumn'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> SQLSink

Data type mapping for SQL Server


When you copy data from and to SQL Server, the following mappings are used from SQL Server data types to
Azure Data Factory interim data types. To learn how the copy activity maps the source schema and data type to
the sink, see Schema and data type mappings.
SQL SERVER DATA TYPE | AZURE DATA FACTORY INTERIM DATA TYPE

bigint | Int64
binary | Byte[]
bit | Boolean
char | String, Char[]
date | DateTime
Datetime | DateTime
datetime2 | DateTime
Datetimeoffset | DateTimeOffset
Decimal | Decimal
FILESTREAM attribute (varbinary(max)) | Byte[]
Float | Double
image | Byte[]
int | Int32
money | Decimal
nchar | String, Char[]
ntext | String, Char[]
numeric | Decimal
nvarchar | String, Char[]
real | Single
rowversion | Byte[]
smalldatetime | DateTime
smallint | Int16
smallmoney | Decimal
sql_variant | Object
text | String, Char[]
time | TimeSpan
timestamp | Byte[]
tinyint | Int16
uniqueidentifier | Guid
varbinary | Byte[]
varchar | String, Char[]
xml | String

NOTE
For data types that map to the Decimal interim type, currently Copy activity supports precision up to 28. If you have data
that requires precision larger than 28, consider converting to a string in a SQL query.

Lookup activity properties


To learn details about the properties, check Lookup activity.

GetMetadata activity properties


To learn details about the properties, check GetMetadata activity
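As an illustrative sketch only (the activity name, dataset reference, and field list are placeholders, not an official sample), a GetMetadata activity pointed at the SQL Server dataset defined earlier could look like this:

{
    "name": "GetSqlServerTableMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "<SQL Server dataset name>",
            "type": "DatasetReference"
        },
        "fieldList": [ "structure", "columnCount", "exists" ]
    }
}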

Using Always Encrypted


When you copy data from or to SQL Server with Always Encrypted, follow the steps below:
1. Store the Column Master Key (CMK) in an Azure Key Vault. Learn more on how to configure Always
Encrypted by using Azure Key Vault
2. Make sure to grant access to the key vault where the Column Master Key (CMK) is stored. Refer to this
article for required permissions.
3. Create linked service to connect to your SQL database and enable 'Always Encrypted' function by using
either managed identity or service principal.
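A hedged sketch of the linked service described in step 3, using a managed identity as the key provider (the property values are illustrative assumptions; verify them against the current linked service reference before relying on them):

{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
        },
        "alwaysEncryptedSettings": {
            "alwaysEncryptedAkvAuthType": "ManagedIdentity"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}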

NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either the source or the sink data store uses managed identity or service principal as the key provider authentication type.
2. Both source and sink data stores use managed identity as the key provider authentication type.
3. Both source and sink data stores use the same service principal as the key provider authentication type.

Troubleshoot connection issues


1. Configure your SQL Server instance to accept remote connections. Start SQL Server Management
Studio, right-click the server, and select Properties. Select Connections from the list, and select the Allow
remote connections to this server check box.

For detailed steps, see Configure the remote access server configuration option.
2. Start SQL Server Configuration Manager. Expand SQL Server Network Configuration for the
instance you want, and select Protocols for MSSQLSERVER. Protocols appear in the right pane. Enable
TCP/IP by right-clicking TCP/IP and selecting Enable.

For more information and alternate ways of enabling TCP/IP protocol, see Enable or disable a server
network protocol.
3. In the same window, double-click TCP/IP to launch the TCP/IP Properties window.
4. Switch to the IP Addresses tab. Scroll down to see the IPAll section. Write down the TCP Port. The
default is 1433.
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection: To connect to SQL Server by using a fully qualified name, use SQL Server
Management Studio from a different machine. An example is
"<machine>.<domain>.corp.<company>.com,1433".

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data from Square using Azure Data Factory
(Preview)
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Square. It builds on
the copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Square connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Square to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Square connector.

Linked service properties


The following properties are supported for Square linked service:

PROPERTY    DESCRIPTION    REQUIRED



type The type property must be set to: Yes


Square

connectionProperties A group of properties that defines how Yes


to connect to Square.

Under connectionProperties :

host The URL of the Square instance (e.g. Yes


mystore.mysquare.com)

clientId The client ID associated with your Yes


Square application.

clientSecret The client secret associated with your Yes


Square application. Mark this field as a
SecureString to store it securely in
Data Factory, or reference a secret
stored in Azure Key Vault.

accessToken The access token obtained from Yes


Square. Grants limited access to a
Square account by asking an
authenticated user for explicit
permissions. OAuth access tokens
expire 30 days after they are issued, but
refresh tokens do not expire. Access
tokens can be refreshed by using the
refresh token.
Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

refreshToken The refresh token obtained from No


Square. Used to obtain new access
tokens when the current one expires.
Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether to require the host No


name in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Square supports two types of access tokens: personal and OAuth .


Personal access tokens are used to get unlimited Connect API access to resources in your own Square
account.
OAuth access tokens are used to get authenticated and scoped Connect API access to any Square account.
Use them when your app accesses resources in other Square accounts on behalf of account owners. OAuth
access tokens can also be used to access resources in your own Square account.
In Data Factory, authentication via a personal access token needs only accessToken , while authentication via
OAuth requires both accessToken and refreshToken (a personal-token variant is sketched after the example below). Learn how to retrieve an access token from here.
Example:

{
"name": "SquareLinkedService",
"properties": {
"type": "Square",
"typeProperties": {
"connectionProperties":{
"host":"<e.g. mystore.mysquare.com>",
"clientId":"<client ID>",
"clientSecrect":{
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
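For personal access token authentication, a minimal sketch omits refreshToken, per the note above. The host and credential values below are placeholders, and the property set otherwise mirrors the example above.

{
    "name": "SquareLinkedService",
    "properties": {
        "type": "Square",
        "typeProperties": {
            "connectionProperties": {
                "host": "<e.g. mystore.mysquare.com>",
                "clientId": "<client ID>",
                "clientSecret": {
                    "type": "SecureString",
                    "value": "<clientSecret>"
                },
                "accessToken": {
                    "type": "SecureString",
                    "value": "<personal access token>"
                },
                "useEncryptedEndpoints": true,
                "useHostVerification": true,
                "usePeerVerification": true
            }
        }
    }
}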

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Square dataset.
To copy data from Square, set the type property of the dataset to SquareObject . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: SquareObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example
{
"name": "SquareDataset",
"properties": {
"type": "SquareObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Square linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Square source.
Square as source
To copy data from Square, set the source type in the copy activity to SquareSource . The following properties
are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: SquareSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Business" .

Example:

"activities":[
{
"name": "CopyFromSquare",
"type": "Copy",
"inputs": [
{
"referenceName": "<Square input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SquareSource",
"query": "SELECT * FROM Business"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Lookup activity properties
To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Sybase using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Sybase database. It
builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Sybase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Sybase database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Sybase connector supports:
SAP Sybase SQL Anywhere (ASA) version 16 and above.
Copying data using Basic or Windows authentication.
Sybase IQ and ASE are not supported. You can use the generic ODBC connector with a Sybase driver instead.

Prerequisites
To use this Sybase connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the data provider for Sybase iAnywhere.Data.SQLAnywhere 16 or above on the Integration Runtime
machine.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Sybase connector.

Linked service properties


The following properties are supported for Sybase linked service:
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Sybase

server Name of the Sybase server. Yes

database Name of the Sybase database. Yes

authenticationType Type of authentication used to connect Yes


to the Sybase database.
Allowed values are: Basic, and
Windows .

username Specify user name to connect to the Yes


Sybase database.

password Specify password for the user account Yes


you specified for the username. Mark
this field as a SecureString to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:

{
"name": "SybaseLinkedService",
"properties": {
"type": "Sybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
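A Windows authentication variant is sketched below, based on the allowed authenticationType values above. The username format (for example, whether a domain prefix is needed) depends on your environment, and all values are placeholders.

{
    "name": "SybaseLinkedService",
    "properties": {
        "type": "Sybase",
        "typeProperties": {
            "server": "<server>",
            "database": "<database>",
            "authenticationType": "Windows",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}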

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Sybase dataset.
To copy data from Sybase, the following properties are supported:
PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: SybaseTable

tableName Name of the table in the Sybase No (if "query" in activity source is
database. specified)

Example

{
"name": "SybaseDataset",
"properties": {
"type": "SybaseTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Sybase linked service name>",
"type": "LinkedServiceReference"
}
}
}

If you were using a RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Sybase source.
Sybase as source
To copy data from Sybase, the following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: SybaseSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .

Example:
"activities":[
{
"name": "CopyFromSybase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Sybase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SybaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

If you were using a RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.

Data type mapping for Sybase


When copying data from Sybase, the following mappings are used from Sybase data types to Azure Data
Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.
Sybase supports T-SQL types. For a mapping table from SQL types to Azure Data Factory interim data types, see
Azure SQL Database Connector - data type mapping section.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Teradata Vantage by using Azure
Data Factory
5/6/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the copy activity in Azure Data Factory to copy data from Teradata Vantage. It
builds on the copy activity overview.

Supported capabilities
This Teradata connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Teradata Vantage to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Teradata connector supports:
Teradata version 14.10, 15.0, 15.10, 16.0, 16.10, and 16.20 .
Copying data by using Basic , Windows , or LDAP authentication.
Parallel copying from a Teradata source. See the Parallel copy from Teradata section for details.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
If you use the self-hosted integration runtime, note that it provides a built-in Teradata driver starting from version 3.18.
You don't need to manually install any driver. The driver requires "Visual C++ Redistributable 2012 Update 4" on
the self-hosted integration runtime machine. If you don't yet have it installed, download it from here.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Teradata connector.

Linked service properties


The Teradata linked service supports the following properties:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to Yes


Teradata .

connectionString Specifies the information needed to Yes


connect to the Teradata instance. Refer
to the following samples.
You can also put a password in Azure
Key Vault, and pull the password
configuration out of the connection
string. Refer to Store credentials in
Azure Key Vault with more details.

username Specify a user name to connect to No


Teradata. Applies when you are using
Windows authentication.

password Specify a password for the user No


account you specified for the user
name. You can also choose to
reference a secret stored in Azure Key
Vault.
Applies when you are using Windows
authentication, or referencing a
password in Key Vault for basic
authentication.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

More connection properties you can set in connection string per your case:

PROPERTY    DESCRIPTION    DEFAULT VALUE

TdmstPortNumber The number of the port used to access 1025


Teradata database.
Do not change this value unless
instructed to do so by Technical
Support.

UseDataEncryption Specifies whether to encrypt all 0


communication with the Teradata
database. Allowed values are 0 or 1.

- 0 (disabled, default) : Encrypts


authentication information only.
- 1 (enabled) : Encrypts all data that is
passed between the driver and the
database.

CharacterSet The character set to use for the ASCII


session. E.g., CharacterSet=UTF16 .

This value can be a user-defined


character set, or one of the following
pre-defined character sets:
- ASCII
- UTF8
- UTF16
- LATIN1252_0A
- LATIN9_0A
- LATIN1_0A
- Shift-JIS (Windows, DOS compatible,
KANJISJIS_0S)
- EUC (Unix compatible, KANJIEC_0U)
- IBM Mainframe
(KANJIEBCDIC5035_0I)
- KANJI932_1S0
- BIG5 (TCHBIG5_1R0)
- GB (SCHGB2312_1T0)
- SCHINESE936_6R0
- TCHINESE950_8R0
- NetworkKorean
(HANGULKSC5601_2R4)
- HANGUL949_7R0
- ARABIC1256_6A0
- CYRILLIC1251_2A0
- HEBREW1255_5A0
- LATIN1250_1A0
- LATIN1254_7A0
- LATIN1258_8A0
- THAI874_4A0

MaxRespSize The maximum size of the response 65536


buffer for SQL requests, in kilobytes
(KBs). E.g., MaxRespSize=10485760 .

For Teradata Database version 16.00


or later, the maximum value is
7361536. For connections that use
earlier versions, the maximum value is
1048576.

MechanismName To use the LDAP protocol to N/A


authenticate the connection, specify
MechanismName=LDAP .

Example using basic authentication


{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"connectionString": "DBCName=<server>;Uid=<username>;Pwd=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example using Windows authentication

{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"connectionString": "DBCName=<server>",
"username": "<username>",
"password": "<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example using LDAP authentication

{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"connectionString": "DBCName=<server>;MechanismName=LDAP;Uid=<username>;Pwd=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
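A sketch that combines some of the additional connection properties listed above (data encryption, a UTF16 session character set, and a larger response buffer) might look like the following. The specific values are illustrative only; keep the defaults unless you have a reason to change them.

{
    "name": "TeradataLinkedService",
    "properties": {
        "type": "Teradata",
        "typeProperties": {
            "connectionString": "DBCName=<server>;Uid=<username>;Pwd=<password>;UseDataEncryption=1;CharacterSet=UTF16;MaxRespSize=1048576"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}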

NOTE
The following payload is still supported. Going forward, however, you should use the new one.

Previous payload:
{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "<Basic/Windows>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
This section provides a list of properties supported by the Teradata dataset. For a full list of sections and
properties available for defining datasets, see Datasets.
To copy data from Teradata, the following properties are supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to TeradataTable .

database The name of the Teradata instance. No (if "query" in activity source is
specified)

table The name of the table in the Teradata No (if "query" in activity source is
instance. specified)

Example:

{
"name": "TeradataDataset",
"properties": {
"type": "TeradataTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
}
}
}
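If you reference a table directly instead of supplying a query in the activity source, a sketch with the database and table properties populated might look like this (the names are placeholders):

{
    "name": "TeradataDataset",
    "properties": {
        "type": "TeradataTable",
        "typeProperties": {
            "database": "<database name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Teradata linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}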

NOTE
RelationalTable type dataset is still supported. However, we recommend that you use the new dataset.

Previous payload:
{
"name": "TeradataDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}

Copy activity properties


This section provides a list of properties supported by Teradata source. For a full list of sections and properties
available for defining activities, see Pipelines.
Teradata as source

TIP
To load data from Teradata efficiently by using data partitioning, learn more from Parallel copy from Teradata section.

To copy data from Teradata, the following properties are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to
TeradataSource .

query Use the custom SQL query to read No (if table in dataset is specified)
data. An example is
"SELECT * FROM MyTable" .
When you enable partitioned load, you
need to hook any corresponding built-
in partition parameters in your query.
For examples, see the Parallel copy
from Teradata section.

partitionOptions Specifies the data partitioning options No


used to load data from Teradata.
Allowed values are: None (default), Hash,
and DynamicRange.
When a partition option is enabled
(that is, not None ), the degree of
parallelism to concurrently load data
from Teradata is controlled by the
parallelCopies setting on the copy
activity.

partitionSettings Specify the group of the settings for No


data partitioning.
Apply when partition option isn't
None .

partitionColumnName Specify the name of the source column No


that will be used by range partition or
Hash partition for parallel copy. If not
specified, the primary index of the
table is autodetected and used as the
partition column.
Apply when the partition option is
Hash or DynamicRange . If you use a
query to retrieve the source data,
hook ?AdfHashPartitionCondition
or ?AdfRangePartitionColumnName
in WHERE clause. See example in
Parallel copy from Teradata section.

partitionUpperBound The maximum value of the partition No


column to copy data out.
Apply when partition option is
DynamicRange . If you use query to
retrieve source data, hook
?AdfRangePartitionUpbound in the
WHERE clause. For an example, see the
Parallel copy from Teradata section.

partitionLowerBound The minimum value of the partition No


column to copy data out.
Apply when the partition option is
DynamicRange . If you use a query to
retrieve the source data, hook
?AdfRangePartitionLowbound in the
WHERE clause. For an example, see the
Parallel copy from Teradata section.

NOTE
RelationalSource type copy source is still supported, but it doesn't support the new built-in parallel load from Teradata
(partition options). However, we recommend that you use the new dataset.

Example: copy data by using a basic query without partition


"activities":[
{
"name": "CopyFromTeradata",
"type": "Copy",
"inputs": [
{
"referenceName": "<Teradata input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "TeradataSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Parallel copy from Teradata


The Data Factory Teradata connector provides built-in data partitioning to copy data from Teradata in parallel.
You can find data partitioning options on the Source tab of the copy activity.

When you enable partitioned copy, Data Factory runs parallel queries against your Teradata source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Teradata.
We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data
from Teradata. The following are suggested configurations for different scenarios. When copying data into a
file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name), in which
case the performance is better than writing to a single file.

SCENARIO    SUGGESTED SETTINGS

Full load from large table. Partition option: Hash.

During execution, Data Factory automatically detects the


primary index column, applies a hash against it, and copies
data by partitions.

Load large amount of data by using a custom query. Partition option: Hash.
Query:
SELECT * FROM <TABLENAME> WHERE ?AdfHashPartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to apply the hash
partition. If not specified, Data Factory automatically detects
the PK column of the table you specified in the Teradata
dataset.

During execution, Data Factory replaces


?AdfHashPartitionCondition with the hash partition
logic, and sends it to Teradata.

Load large amount of data by using a custom query, having an integer column with evenly distributed values for range partitioning. Partition option: Dynamic range partition.
Query:
SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>.
Partition column: Specify the column used to partition
data. You can partition against a column with an integer data
type.
Partition upper bound and partition lower bound:
Specify these if you want to filter against the partition column to
retrieve data only between the lower and upper range.

During execution, Data Factory replaces


?AdfRangePartitionColumnName ,
?AdfRangePartitionUpbound , and
?AdfRangePartitionLowbound with the actual column
name and value ranges for each partition, and sends them to
Teradata.
For example, if your partition column "ID" set with the lower
bound as 1 and the upper bound as 80, with parallel copy
set as 4, Data Factory retrieves data by 4 partitions. Their
IDs are between [1,20], [21, 40], [41, 60], and [61, 80],
respectively.

Example: query with hash partition

"source": {
"type": "TeradataSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfHashPartitionCondition AND <your_additional_where_clause>",
"partitionOption": "Hash",
"partitionSettings": {
"partitionColumnName": "<hash_partition_column_name>"
}
}

Example: query with dynamic range partition


"source": {
"type": "TeradataSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?
AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<dynamic_range_partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column>",
"partitionLowerBound": "<lower_value_of_partition_column>"
}
}
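To make the degree of parallelism explicit, the sketch below sets parallelCopies to 4 on a copy activity that uses hash partitioning, following the statement above that parallelCopies controls the parallel degree. The activity name and all bracketed values are placeholders, and the sink is abbreviated.

"activities":[
    {
        "name": "ParallelCopyFromTeradata",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Teradata input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "parallelCopies": 4,
            "source": {
                "type": "TeradataSource",
                "query": "SELECT * FROM <TABLENAME> WHERE ?AdfHashPartitionCondition",
                "partitionOption": "Hash",
                "partitionSettings": {
                    "partitionColumnName": "<hash_partition_column_name>"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]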

Data type mapping for Teradata


When you copy data from Teradata, the following mappings apply. To learn about how the copy activity maps
the source schema and data type to the sink, see Schema and data type mappings.

TERADATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE

BigInt Int64

Blob Byte[]

Byte Byte[]

ByteInt Int16

Char String

Clob String

Date DateTime

Decimal Decimal

Double Double

Graphic Not supported. Apply explicit cast in source query.

Integer Int32

Interval Day Not supported. Apply explicit cast in source query.

Interval Day To Hour Not supported. Apply explicit cast in source query.

Interval Day To Minute Not supported. Apply explicit cast in source query.

Interval Day To Second Not supported. Apply explicit cast in source query.

Interval Hour Not supported. Apply explicit cast in source query.

Interval Hour To Minute Not supported. Apply explicit cast in source query.

Interval Hour To Second Not supported. Apply explicit cast in source query.

Interval Minute Not supported. Apply explicit cast in source query.

Interval Minute To Second Not supported. Apply explicit cast in source query.

Interval Month Not supported. Apply explicit cast in source query.

Interval Second Not supported. Apply explicit cast in source query.

Interval Year Not supported. Apply explicit cast in source query.

Interval Year To Month Not supported. Apply explicit cast in source query.

Number Double

Period (Date) Not supported. Apply explicit cast in source query.

Period (Time) Not supported. Apply explicit cast in source query.

Period (Time With Time Zone) Not supported. Apply explicit cast in source query.

Period (Timestamp) Not supported. Apply explicit cast in source query.

Period (Timestamp With Time Zone) Not supported. Apply explicit cast in source query.

SmallInt Int16

Time TimeSpan

Time With Time Zone TimeSpan

Timestamp DateTime

Timestamp With Time Zone DateTime

VarByte Byte[]

VarChar String

VarGraphic Not supported. Apply explicit cast in source query.

Xml Not supported. Apply explicit cast in source query.

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Vertica using Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Vertica. It builds on
the copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Vertica connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Vertica to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.

Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.

Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Vertica connector.

Linked service properties


The following properties are supported for Vertica linked service:

PROPERTY    DESCRIPTION    REQUIRED



type The type property must be set to: Yes


Vertica

connectionString An ODBC connection string to connect Yes


to Vertica.
You can also put password in Azure
Key Vault and pull the pwd
configuration out of the connection
string. Refer to the following samples
and Store credentials in Azure Key
Vault article with more details.

connectVia The Integration Runtime to be used to No


connect to the data store. Learn more
from Prerequisites section. If not
specified, it uses the default Azure
Integration Runtime.

Example:

{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Example: store password in Azure Key Vault

{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Vertica dataset.
To copy data from Vertica, set the type property of the dataset to VerticaTable. The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: VerticaTable

schema Name of the schema. No (if "query" in activity source is


specified)

table Name of the table. No (if "query" in activity source is


specified)

tableName Name of the table with schema. This No (if "query" in activity source is
property is supported for backward specified)
compatibility. Use schema and
table for new workload.

Example

{
"name": "VerticaDataset",
"properties": {
"type": "VerticaTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Vertica linked service name>",
"type": "LinkedServiceReference"
}
}
}
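If you point the dataset at a specific table rather than supplying a query in the activity source, a sketch with the schema and table properties filled in might look like this (the names are placeholders):

{
    "name": "VerticaDataset",
    "properties": {
        "type": "VerticaTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Vertica linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}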

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Vertica source.
Vertica as source
To copy data from Vertica, set the source type in the copy activity to VerticaSource. The following properties
are supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: VerticaSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .
Example:

"activities":[
{
"name": "CopyFromVertica",
"type": "Copy",
"inputs": [
{
"referenceName": "<Vertica input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "VerticaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Web table by using Azure Data
Factory
5/6/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Web table database.
It builds on the copy activity overview article that presents a general overview of copy activity.
The differences among this Web table connector, the REST connector, and the HTTP connector are:
The Web table connector extracts table content from an HTML webpage.
The REST connector specifically supports copying data from RESTful APIs.
The HTTP connector is generic and retrieves data from any HTTP endpoint, for example, to download a file.

Supported capabilities
This Web table connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Web table database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Web table connector supports extracting table content from an HTML page .

Prerequisites
To use this Web table connector, you need to set up a Self-hosted Integration Runtime. See Self-hosted
Integration Runtime article for details.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Web table connector.

Linked service properties


The following properties are supported for Web table linked service:
PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Web

url URL to the Web source Yes

authenticationType Allowed value is: Anonymous . Yes

connectVia The Integration Runtime to be used to Yes


connect to the data store. A Self-
hosted Integration Runtime is required
as mentioned in Prerequisites.

Example:

{
"name": "WebLinkedService",
"properties": {
"type": "Web",
"typeProperties": {
"url" : "https://en.wikipedia.org/wiki/",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Web table dataset.
To copy data from Web table, set the type property of the dataset to WebTable . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: WebTable

path A relative URL to the resource that No. When path is not specified, only
contains the table. the URL specified in the linked service
definition is used.

index The index of the table in the resource. Yes


See Get index of a table in an HTML
page section for steps to getting index
of a table in an HTML page.

Example:
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Web linked service name>",
"type": "LinkedServiceReference"
}
}
}
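When the linked service URL already points directly at the page that contains the table, path can be omitted, per the dataset property table above. A sketch with placeholder values:

{
    "name": "WebTableInput",
    "properties": {
        "type": "WebTable",
        "typeProperties": {
            "index": 1
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Web linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}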

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Web table source.
Web table as source
To copy data from Web table, set the source type in the copy activity to WebSource , no additional properties
are supported.
Example:

"activities":[
{
"name": "CopyFromWebTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<Web table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "WebSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Get index of a table in an HTML page


To get the index of a table that you need to configure in dataset properties, you can use, for example, Excel 2016 as the
tool as follows:
1. Launch Excel 2016 and switch to the Data tab.
2. Click New Query on the toolbar, point to From Other Sources and click From Web .

3. In the From Web dialog box, enter URL that you would use in linked service JSON (for example:
https://en.wikipedia.org/wiki/) along with path you would specify for the dataset (for example:
AFI%27s_100_Years...100_Movies), and click OK .

URL used in this example: https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies


4. If you see Access Web content dialog box, select the right URL , authentication , and click Connect .
5. Click a table item in the tree view to see content from the table and then click the Edit button at the bottom.

6. In the Query Editor window, click the Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.

If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See Connect to a web page
article for details. The steps are similar if you are using Microsoft Power BI for Desktop.

Lookup activity properties


To learn details about the properties, check Lookup activity.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Xero using Azure Data Factory
5/6/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Xero. It builds on the
copy activity overview article that presents a general overview of copy activity.

Supported capabilities
This Xero connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Xero to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Xero connector supports:
OAuth 2.0 and OAuth 1.0 authentication. For OAuth 1.0, the connector supports Xero private applications but
not public applications.
All Xero tables (API endpoints) except "Reports".

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Xero connector.

Linked service properties


The following properties are supported for Xero linked service:

PROPERTY    DESCRIPTION    REQUIRED

type The type property must be set to: Yes


Xero

connectionProperties A group of properties that defines how Yes


to connect to Xero.

Under connectionProperties :

host The endpoint of the Xero server ( Yes


api.xero.com ).

authenticationType Allowed values are OAuth_2.0 and Yes


OAuth_1.0 .

consumerKey For OAuth 2.0, specify the client ID Yes


for your Xero application.
For OAuth 1.0, specify the consumer
key associated with the Xero
application.
Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

privateKey For OAuth 2.0, specify the client Yes


secret for your Xero application.
For OAuth 1.0, specify the private key
from the .pem file that was generated
for your Xero private application, see
Create a public/private key pair. Note
that you must generate the privatekey.pem
with a numbits value of 512 by using
openssl genrsa -out privatekey.pem 512
; 1024 is not supported. Include all the
text from the .pem file, including the
Unix line endings (\n); see the sample
below.

Mark this field as a SecureString to


store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

tenantId The tenant ID associated with your Yes for OAuth 2.0 authentication
Xero application. Applicable for OAuth
2.0 authentication.
Learn how to get the tenant ID from
Check the tenants you're authorized to
access section.

refreshToken Applicable for OAuth 2.0 Yes for OAuth 2.0 authentication
authentication.
The OAuth 2.0 refresh token is
associated with the Xero application and
used to refresh the access token; the
access token expires after 30 minutes.
Learn about how the Xero
authorization flow works and how to
get the refresh token from this article.
To get a refresh token, you must
request the offline_access scope.
Known limitation: Xero resets
the refresh token after it's used to
refresh the access token. For an
operationalized workload, before each
copy activity run, you need to set a
valid refresh token for ADF to use.
Mark this field as a SecureString to
store it securely in Data Factory, or
reference a secret stored in Azure Key
Vault.

useEncryptedEndpoints Specifies whether the data source No


endpoints are encrypted using HTTPS.
The default value is true.

useHostVerification Specifies whether the host name is No


required in the server's certificate to
match the host name of the server
when connecting over TLS. The default
value is true.

usePeerVerification Specifies whether to verify the identity No


of the server when connecting over
TLS. The default value is true.

Example: OAuth 2.0 authentication


{
"name": "XeroLinkedService",
"properties": {
"type": "Xero",
"typeProperties": {
"connectionProperties": {
"host":"api.xero.com",
"authenticationType":"OAuth_2.0",
"consumerKey": {
"type": "SecureString",
"value": "<client ID>"
},
"privateKey": {
"type": "SecureString",
"value": "<client secret>"
},
"tenantId":"<tenant ID>",
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}

Example: OAuth 1.0 authentication

{
"name": "XeroLinkedService",
"properties": {
"type": "Xero",
"typeProperties": {
"connectionProperties": {
"host":"api.xero.com",
"authenticationType":"OAuth_1.0",
"consumerKey": {
"type": "SecureString",
"value": "<consumer key>"
},
"privateKey": {
"type": "SecureString",
"value": "<private key>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}

Sample private key value:


Include all the text from the .pem file including the Unix line endings(\n).
"-----BEGIN RSA PRIVATE KEY-----
\nMII***************************************************P\nbu***********************************************
*****s\nU/****************************************************B\nA******************************************
***********W\njH****************************************************e\nsx***********************************
******************l\nq******************************************************X\nh****************************
*************************i\nd*****************************************************s\nA**********************
*******************************dsfb\nN*****************************************************M\np*************
****************************************Ly\nK*****************************************************Y=\n-----
END RSA PRIVATE KEY-----"

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Xero dataset.
To copy data from Xero, set the type property of the dataset to XeroObject . The following properties are
supported:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to: XeroObject

tableName Name of the table. No (if "query" in activity source is


specified)

Example

{
"name": "XeroDataset",
"properties": {
"type": "XeroObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Xero linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Xero source.
Xero as source
To copy data from Xero, set the source type in the copy activity to XeroSource . The following properties are
supported in the copy activity source section:

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to: XeroSource

query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Contacts" .

Example:

"activities":[
{
"name": "CopyFromXero",
"type": "Copy",
"inputs": [
{
"referenceName": "<Xero input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "XeroSource",
"query": "SELECT * FROM Contacts"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Note the following when specifying the Xero query:


Tables with complex items will be split into multiple tables. For example, Bank transactions has a complex
data structure "LineItems", so the data of a bank transaction is mapped to the tables Bank_Transaction and
Bank_Transaction_Line_Items , with Bank_Transaction_ID as the foreign key to link them together.

Xero data is available through two schemas: Minimal (default) and Complete . The Complete schema
contains prerequisite call tables which require additional data (e.g. ID column) before making the desired
query.
The following tables have the same information in the Minimal and Complete schema. To reduce the number of
API calls, use Minimal schema (default).
Bank_Transactions
Contact_Groups
Contacts
Contacts_Sales_Tracking_Categories
Contacts_Phones
Contacts_Addresses
Contacts_Purchases_Tracking_Categories
Credit_Notes
Credit_Notes_Allocations
Expense_Claims
Expense_Claim_Validation_Errors
Invoices
Invoices_Credit_Notes
Invoices_Prepayments
Invoices_Overpayments
Manual_Journals
Overpayments
Overpayments_Allocations
Prepayments
Prepayments_Allocations
Receipts
Receipt_Validation_Errors
Tracking_Categories
The following tables can only be queried with the Complete schema (see the query sketch after this list):
Complete.Bank_Transaction_Line_Items
Complete.Bank_Transaction_Line_Item_Tracking
Complete.Contact_Group_Contacts
Complete.Contacts_Contact_Persons
Complete.Credit_Note_Line_Items
Complete.Credit_Notes_Line_Items_Tracking
Complete.Expense_Claim_Payments
Complete.Expense_Claim_Receipts
Complete.Invoice_Line_Items
Complete.Invoices_Line_Items_Tracking
Complete.Manual_Journal_Lines
Complete.Manual_Journal_Line_Tracking
Complete.Overpayment_Line_Items
Complete.Overpayment_Line_Items_Tracking
Complete.Prepayment_Line_Items
Complete.Prepayment_Line_Item_Tracking
Complete.Receipt_Line_Items
Complete.Receipt_Line_Item_Tracking
Complete.Tracking_Category_Options
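For example, a copy activity source that reads one of these Complete-schema tables could be sketched as follows; the table choice is illustrative, and the source follows the XeroSource pattern shown earlier.

"source": {
    "type": "XeroSource",
    "query": "SELECT * FROM Complete.Invoice_Line_Items"
}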

Lookup activity properties


To learn details about the properties, check Lookup activity.

Next steps
For a list of supported data stores by the copy activity, see supported data stores.
XML format in Azure Data Factory
5/14/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Follow this article when you want to parse XML files.
XML format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob,
Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google
Cloud Storage, HDFS, HTTP, Oracle Cloud Storage, and SFTP. It is supported as a source but not as a sink.

Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the XML dataset.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the dataset must Yes


be set to Xml.

location Location settings of the file(s). Each Yes


file-based connector has its own
location type and supported
properties under location . See
details in connector article ->
Dataset properties section.

encodingName The encoding type used to read/write No


text files.
Allowed values are as follows: "UTF-8",
"UTF-16", "UTF-16BE", "UTF-32", "UTF-
32BE", "US-ASCII", "UTF-7", "BIG5",
"EUC-JP", "EUC-KR", "GB2312",
"GB18030", "JOHAB", "SHIFT-JIS",
"CP875", "CP866", "IBM00858",
"IBM037", "IBM273", "IBM437",
"IBM500", "IBM737", "IBM775",
"IBM850", "IBM852", "IBM855",
"IBM857", "IBM860", "IBM861",
"IBM863", "IBM864", "IBM865",
"IBM869", "IBM870", "IBM01140",
"IBM01141", "IBM01142",
"IBM01143", "IBM01144",
"IBM01145", "IBM01146",
"IBM01147", "IBM01148",
"IBM01149", "ISO-2022-JP", "ISO-
2022-KR", "ISO-8859-1", "ISO-8859-
2", "ISO-8859-3", "ISO-8859-4", "ISO-
8859-5", "ISO-8859-6", "ISO-8859-7",
"ISO-8859-8", "ISO-8859-9", "ISO-
8859-13", "ISO-8859-15",
"WINDOWS-874", "WINDOWS-1250",
"WINDOWS-1251", "WINDOWS-
1252", "WINDOWS-1253",
"WINDOWS-1254", "WINDOWS-
1255", "WINDOWS-1256",
"WINDOWS-1257", "WINDOWS-
1258".

nullValue Specifies the string representation of No


null value.
The default value is empty string .

compression Group of properties to configure file No


compression. Configure this section
when you want to do
compression/decompression during
activity execution.

type The compression codec used to No.


(under compression ) read/write XML files.
Allowed values are bzip2 , gzip ,
deflate , ZipDeflate , TarGzip , Tar ,
snappy , or lz4 . Default is not
compressed.
Note currently Copy activity doesn't
support "snappy" & "lz4", and
mapping data flow doesn't support
"ZipDeflate", "TarGzip" and "Tar".
Note when using copy activity to
decompress ZipDeflate /TarGzip /Tar
file(s) and write to file-based sink data
store, by default files are extracted to
the folder:
<path specified in
dataset>/<folder named as source
compressed file>/
, use preserveZipFileNameAsFolder /
preserveCompressionFileNameAsFolder
on copy activity source to control
whether to preserve the name of the
compressed file(s) as folder structure.

level The compression ratio. No


(under compression ) Allowed values are Optimal or
Fastest .
- Fastest: The compression operation
should complete as quickly as possible,
even if the resulting file is not
optimally compressed.
- Optimal: The compression operation
should be optimally compressed, even
if the operation takes a longer time to
complete. For more information, see
Compression Level topic.

Below is an example of XML dataset on Azure Blob Storage:

{
"name": "XMLDataset",
"properties": {
"type": "Xml",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "ZipDeflate"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the XML source.
Learn about how to map XML data and sink data store/format from schema mapping. When previewing XML
files, data is shown with JSON hierarchy, and you use JSON path to point to the fields.
XML as source
The following properties are supported in the copy activity source section. Learn more from XML connector
behavior.

PROPERTY    DESCRIPTION    REQUIRED

type The type property of the copy activity Yes


source must be set to XmlSource .

formatSettings A group of properties. Refer to XML No


read settings table below.

storeSettings A group of properties on how to read No


data from a data store. Each file-based
connector has its own supported read
settings under storeSettings . See
details in connector article ->
Copy activity properties section.

Supported XML read settings under formatSettings :

PROPERTY    DESCRIPTION    REQUIRED

type The type of formatSettings must be set Yes


to XmlReadSettings .

validationMode Specifies whether to validate the XML No


schema.
Allowed values are none (default, no
validation), xsd (validate using XSD),
dtd (validate using DTD).

namespaces Whether to enable namespace when No


parsing the XML files. Allowed values
are: true (default), false .

namespacePrefixes Namespace URI to prefix mapping, No


which is used to name fields when
parsing the xml file.
If an XML file has namespace and
namespace is enabled, by default, the
field name is the same as it is in the
XML document.
If there is an item defined for the
namespace URI in this map, the field
name is prefix:fieldName .

detectDataType Whether to detect integer, double, and No


Boolean data types. Allowed values
are: true (default), false .

compressionProperties A group of properties on how to No


decompress data for a given
compression codec.

preserveZipFileNameAsFolder Applies when input dataset is No


(under compressionProperties -> configured with ZipDeflate
type as ZipDeflateReadSettings ) compression. Indicates whether to
preserve the source zip file name as
folder structure during copy.
- When set to true (default) , Data
Factory writes unzipped files to
<path specified in
dataset>/<folder named as source
zip file>/
.
- When set to false , Data Factory
writes unzipped files directly to
<path specified in dataset> .
Make sure you don't have duplicated
file names in different source zip files
to avoid racing or unexpected
behavior.

preserveCompressionFileNameAsFolde Applies when input dataset is No


r configured with TarGzip /Tar
(under compressionProperties -> compression. Indicates whether to
type as TarGZipReadSettings or preserve the source compressed file
TarReadSettings ) name as folder structure during copy.
- When set to true (default) , Data
Factory writes decompressed files to
<path specified in
dataset>/<folder named as source
compressed file>/
.
- When set to false , Data Factory
writes decompressed files directly to
<path specified in dataset> .
Make sure you don't have duplicated
file names in different source files to
avoid racing or unexpected behavior.
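For illustration, the following is a minimal sketch of a copy activity source section that uses these settings; the dataset references and the sink type are placeholders, and only a few of the optional read settings are shown.

"activities":[
    {
        "name": "CopyFromXML",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<XML dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "XmlSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                },
                "formatSettings": {
                    "type": "XmlReadSettings",
                    "validationMode": "none",
                    "namespaces": true,
                    "detectDataType": true
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]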

Mapping data flow properties


In mapping data flows, you can read XML format in the following data stores: Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2. You can point to XML files either using XML dataset or
using an inline dataset.
Source properties
The table below lists the properties supported by an XML source. You can edit these properties in the Source
options tab. Learn more from XML connector behavior. When using an inline dataset, you will see additional file
settings, which are the same as the properties described in the dataset properties section.

| NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY |
| --- | --- | --- | --- | --- |
| Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | No | String[] | wildcardPaths |
| Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | No | String | partitionRootPath |
| List of files | Whether your source is pointing to a text file that lists files to process. | No | true or false | fileList |
| Column to store file name | Create a new column with the source file name and path. | No | String | rowUrlColumn |
| After completion | Delete or move the files after processing. File path starts from the container root. | No | Delete: true or false. Move: ['<from>', '<to>'] | purgeFiles, moveFiles |
| Filter by last modified | Choose to filter files based upon when they were last altered. | No | Timestamp | modifiedAfter, modifiedBefore |
| Validation mode | Specifies whether to validate the XML schema. | No | none (default, no validation), xsd (validate using XSD), dtd (validate using DTD) | validationMode |
| Namespaces | Whether to enable namespace when parsing the XML files. | No | true (default) or false | namespaces |
| Namespace prefix pairs | Namespace URI to prefix mapping, which is used to name fields when parsing the XML file. If an XML file has a namespace and namespaces are enabled, by default the field name is the same as it is in the XML document. If there is an item defined for the namespace URI in this map, the field name is prefix:fieldName. | No | Array with pattern ['URI1'->'prefix1','URI2'->'prefix2'] | namespacePrefixes |
| Allow no files found | If true, an error is not thrown if no files are found. | No | true or false | ignoreNoFilesFound |

XML source script example


The following script is an example of an XML source configuration in mapping data flows using dataset mode.

source(allowSchemaDrift: true,
validateSchema: false,
validationMode: 'xsd',
namespaces: true) ~> XMLSource

The following script is an example of an XML source configuration using inline dataset mode.

source(allowSchemaDrift: true,
validateSchema: false,
format: 'xml',
fileSystem: 'filesystem',
folderPath: 'folder',
validationMode: 'xsd',
namespaces: true) ~> XMLSource

XML connector behavior


Note the following when using XML as source.
XML attributes:
Attributes of an element are parsed as the subfields of the element in the hierarchy.
The name of the attribute field follows the pattern @attributeName .
XML schema validation:
You can choose to not validate schema, or validate schema using XSD or DTD.
When using XSD or DTD to validate XML files, the XSD/DTD must be referred inside the XML files
through relative path.
Namespace handling:
Namespaces can be disabled when using data flow, in which case the attributes that define the
namespaces are parsed as normal attributes.
When namespace is enabled, the names of the element and attributes follow the pattern
namespaceUri,elementName and namespaceUri,@attributeName by default. You can define namespace
prefix for each namespace URI in source, in which case the names of the element and attributes follow
the pattern definedPrefix:elementName or definedPrefix:@attributeName instead.
Value column:
If an XML element has both a simple text value and attributes/child elements, the simple text value is
parsed as the value of a "value column" with the built-in field name _value_. It also inherits the
namespace of the element, if applicable.

Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Zoho using Azure Data Factory
(Preview)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Zoho. It builds on the
copy activity overview article that presents a general overview of copy activity.

IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.

Supported capabilities
This Zoho connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Zoho to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
This connector supports Zoho access token authentication and OAuth 2.0 authentication.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.

Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Zoho connector.

Linked service properties


The following properties are supported for Zoho linked service:
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property must be set to: Zoho | Yes |
| connectionProperties | A group of properties that defines how to connect to Zoho. | Yes |

Under connectionProperties :

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| endpoint | The endpoint of the Zoho server (crm.zoho.com/crm/private). | Yes |
| authenticationType | Allowed values are OAuth_2.0 and Access Token. | Yes |
| clientId | The client ID associated with your Zoho application. | Yes for OAuth 2.0 authentication |
| clientSecrect | The client secret associated with your Zoho application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes for OAuth 2.0 authentication |
| refreshToken | The OAuth 2.0 refresh token associated with your Zoho application, used to refresh the access token when it expires. The refresh token never expires. To get a refresh token, you must request the offline access_type; learn more from this article. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes for OAuth 2.0 authentication |
| accessToken | The access token for Zoho authentication. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| useEncryptedEndpoints | Specifies whether the data source endpoints are encrypted using HTTPS. The default value is true. | No |
| useHostVerification | Specifies whether to require the host name in the server's certificate to match the host name of the server when connecting over TLS. The default value is true. | No |
| usePeerVerification | Specifies whether to verify the identity of the server when connecting over TLS. The default value is true. | No |
Example: OAuth 2.0 authentication

{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"connectionProperties": {
"authenticationType":"OAuth_2.0",
"endpoint":"crm.zoho.com/crm/private",
"clientId":"<client ID>",
"clientSecrect":{
"type": "SecureString",
"value": "<client secret>"
},
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}

Example: access token authentication

{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"connectionProperties": {
"authenticationType":"Access Token",
"endpoint":"crm.zoho.com/crm/private",
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}

Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Zoho dataset.
To copy data from Zoho, set the type property of the dataset to ZohoObject . The following properties are
supported:
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the dataset must be set to: ZohoObject | Yes |
| tableName | Name of the table. | No (if "query" in activity source is specified) |

Example

{
"name": "ZohoDataset",
"properties": {
"type": "ZohoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Zoho linked service name>",
"type": "LinkedServiceReference"
}
}
}

Copy activity properties


For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Zoho source.
Zoho as source
To copy data from Zoho, set the source type in the copy activity to ZohoSource . The following properties are
supported in the copy activity source section:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to: ZohoSource | Yes |
| query | Use the custom SQL query to read data. For example: "SELECT * FROM Accounts". | No (if "tableName" in dataset is specified) |

Example:
"activities":[
{
"name": "CopyFromZoho",
"type": "Copy",
"inputs": [
{
"referenceName": "<Zoho input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ZohoSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]

Lookup activity properties


To learn details about the properties, check Lookup activity.
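For reference, a minimal sketch of a Lookup activity that reuses the ZohoSource type and a Zoho dataset described above; the activity name and dataset reference are placeholders.

{
    "name": "LookupZohoAccounts",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "ZohoSource",
            "query": "SELECT * FROM Accounts"
        },
        "dataset": {
            "referenceName": "<Zoho input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}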

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In Azure Data Factory, you can use the Copy activity to copy data among data stores located on-premises and in
the cloud. After you copy the data, you can use other activities to further transform and analyze it. You can also
use the Copy activity to publish transformation and analysis results for business intelligence (BI) and application
consumption.

The Copy activity is executed on an integration runtime. You can use different types of integration runtimes for
different data copy scenarios:
When you're copying data between two data stores that are publicly accessible through the internet from any
IP, you can use the Azure integration runtime for the copy activity. This integration runtime is secure, reliable,
scalable, and globally available.
When you're copying data to and from data stores that are located on-premises or in a network with access
control (for example, an Azure virtual network), you need to set up a self-hosted integration runtime.
An integration runtime needs to be associated with each source and sink data store. For information about how
the Copy activity determines which integration runtime to use, see Determining which IR to use.
To copy data from a source to a sink, the service that runs the Copy activity performs these steps:
1. Reads data from a source data store.
2. Performs serialization/deserialization, compression/decompression, column mapping, and so on. It performs
these operations based on the configuration of the input dataset, output dataset, and Copy activity.
3. Writes data to the sink/destination data store.

Supported data stores and formats


CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR

Azure Azure Blob ✓ ✓ ✓ ✓


storage

Azure Cognitive ✓ ✓ ✓
Search index

Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)

Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB

Azure Data ✓ ✓ ✓ ✓
Explorer

Azure Data Lake ✓ ✓ ✓ ✓


Storage Gen1

Azure Data Lake ✓ ✓ ✓ ✓


Storage Gen2

Azure Database ✓ ✓ ✓
for MariaDB

Azure Database ✓ ✓ ✓ ✓
for MySQL

Azure Database ✓ ✓ ✓ ✓
for PostgreSQL

Azure Databricks ✓ ✓ ✓ ✓
Delta Lake

Azure File ✓ ✓ ✓ ✓
Storage

Azure SQL ✓ ✓ ✓ ✓
Database

Azure SQL ✓ ✓ ✓ ✓
Managed
Instance

Azure Synapse ✓ ✓ ✓ ✓
Analytics

Azure Table ✓ ✓ ✓ ✓
storage

Database Amazon Redshift ✓ ✓ ✓

DB2 ✓ ✓ ✓

Drill ✓ ✓ ✓

Google ✓ ✓ ✓
BigQuery

Greenplum ✓ ✓ ✓

HBase ✓ ✓ ✓

Hive ✓ ✓ ✓

Apache Impala ✓ ✓ ✓

Informix ✓ ✓ ✓

MariaDB ✓ ✓ ✓

Microsoft Access ✓ ✓ ✓

MySQL ✓ ✓ ✓

Netezza ✓ ✓ ✓

Oracle ✓ ✓ ✓ ✓

Phoenix ✓ ✓ ✓

PostgreSQL ✓ ✓ ✓

Presto ✓ ✓ ✓

SAP Business ✓ ✓
Warehouse via
Open Hub

SAP Business ✓ ✓
Warehouse via
MDX

SAP HANA ✓ ✓ ✓

SAP table ✓ ✓

Snowflake ✓ ✓ ✓ ✓

Spark ✓ ✓ ✓

SQL Server ✓ ✓ ✓ ✓

Sybase ✓ ✓

Teradata ✓ ✓ ✓

Vertica ✓ ✓ ✓

NoSQL Cassandra ✓ ✓ ✓

Couchbase ✓ ✓ ✓
(Preview)

MongoDB ✓ ✓ ✓ ✓

MongoDB Atlas ✓ ✓ ✓ ✓

File Amazon S3 ✓ ✓ ✓

Amazon S3 ✓ ✓ ✓
Compatible
Storage

File system ✓ ✓ ✓ ✓

FTP ✓ ✓ ✓

Google Cloud ✓ ✓ ✓
Storage

HDFS ✓ ✓ ✓

Oracle Cloud ✓ ✓ ✓
Storage

SFTP ✓ ✓ ✓ ✓

Generic Generic HTTP ✓ ✓ ✓


protocol

Generic OData ✓ ✓ ✓

Generic ODBC ✓ ✓ ✓

Generic REST ✓ ✓ ✓ ✓

Ser vices and Amazon ✓ ✓ ✓


apps Marketplace
Web Service

Concur (Preview) ✓ ✓ ✓

Dataverse ✓ ✓ ✓ ✓

Dynamics 365 ✓ ✓ ✓ ✓

Dynamics AX ✓ ✓ ✓

Dynamics CRM ✓ ✓ ✓ ✓

Google AdWords ✓ ✓ ✓

HubSpot ✓ ✓ ✓

Jira ✓ ✓ ✓

Magento ✓ ✓ ✓
(Preview)

Marketo ✓ ✓ ✓
(Preview)

Microsoft 365 ✓ ✓ ✓

Oracle Eloqua ✓ ✓ ✓
(Preview)

Oracle ✓ ✓ ✓
Responsys
(Preview)

Oracle Service ✓ ✓ ✓
Cloud (Preview)

PayPal (Preview) ✓ ✓ ✓

QuickBooks ✓ ✓ ✓
(Preview)

Salesforce ✓ ✓ ✓ ✓

Salesforce ✓ ✓ ✓ ✓
Service Cloud

Salesforce ✓ ✓ ✓
Marketing Cloud

SAP Cloud for ✓ ✓ ✓ ✓


Customer (C4C)

SAP ECC ✓ ✓ ✓

ServiceNow ✓ ✓ ✓

SharePoint ✓ ✓ ✓
Online List

Shopify (Preview) ✓ ✓ ✓

Square (Preview) ✓ ✓ ✓

Web table ✓ ✓
(HTML table)

Xero ✓ ✓ ✓

Zoho (Preview) ✓ ✓ ✓

NOTE
If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, contact Azure support.

Supported file formats


Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
You can use the Copy activity to copy files as-is between two file-based data stores, in which case the data is
copied efficiently without any serialization or deserialization. In addition, you can also parse or generate files of a
given format, for example, you can perform the following:
Copy data from a SQL Server database and write to Azure Data Lake Storage Gen2 in Parquet format.
Copy files in text (CSV) format from an on-premises file system and write to Azure Blob storage in Avro
format.
Copy zipped files from an on-premises file system, decompress them on-the-fly, and write extracted files to
Azure Data Lake Storage Gen2.
Copy data in Gzip compressed-text (CSV) format from Azure Blob storage and write it to Azure SQL
Database (see the dataset sketch after this list).
Many more activities that require serialization/deserialization or compression/decompression.
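For example, reading Gzip compressed text is driven by format settings in the dataset. The following is a minimal sketch of such a delimited text dataset; the linked service name, container, and folder path are placeholders.

{
    "name": "GzipCsvDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}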

Supported regions
The service that enables the Copy activity is available globally in the regions and geographies listed in Azure
integration runtime locations. The globally available topology ensures efficient data movement that usually
avoids cross-region hops. See Products by region to check the availability of Data Factory and data movement in
a specific region.

Configuration
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
In general, to use the Copy activity in Azure Data Factory, you need to:
1. Create linked ser vices for the source data store and the sink data store. You can find the list of
supported connectors in the Supported data stores and formats section of this article. Refer to the connector
article's "Linked service properties" section for configuration information and supported properties.
2. Create datasets for the source and sink . Refer to the "Dataset properties" sections of the source and
sink connector articles for configuration information and supported properties.
3. Create a pipeline with the Copy activity. The next section provides an example.
Syntax
The following template of a Copy activity contains a complete list of supported properties. Specify the ones that
fit your scenario.
"activities":[
{
"name": "CopyActivityTemplate",
"type": "Copy",
"inputs": [
{
"referenceName": "<source dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<sink dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>",
<properties>
},
"sink": {
"type": "<sink type>"
<properties>
},
"translator":
{
"type": "TabularTranslator",
"columnMappings": "<column mapping>"
},
"dataIntegrationUnits": <number>,
"parallelCopies": <number>,
"enableStaging": true/false,
"stagingSettings": {
<properties>
},
"enableSkipIncompatibleRow": true/false,
"redirectIncompatibleRowSettings": {
<properties>
}
}
}
]

Syntax details

| PROPERTY | DESCRIPTION | REQUIRED? |
| --- | --- | --- |
| type | For a Copy activity, set to Copy. | Yes |
| inputs | Specify the dataset that you created that points to the source data. The Copy activity supports only a single input. | Yes |
| outputs | Specify the dataset that you created that points to the sink data. The Copy activity supports only a single output. | Yes |
| typeProperties | Specify properties to configure the Copy activity. | Yes |
| source | Specify the copy source type and the corresponding properties for retrieving data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats. | Yes |
| sink | Specify the copy sink type and the corresponding properties for writing data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats. | Yes |
| translator | Specify explicit column mappings from source to sink. This property applies when the default copy behavior doesn't meet your needs. For more information, see Schema mapping in copy activity. | No |
| dataIntegrationUnits | Specify a measure that represents the amount of power that the Azure integration runtime uses for data copy. These units were formerly known as cloud Data Movement Units (DMU). For more information, see Data Integration Units. | No |
| parallelCopies | Specify the parallelism that you want the Copy activity to use when reading data from the source and writing data to the sink. For more information, see Parallel copy. | No |
| preserve | Specify whether to preserve metadata/ACLs during data copy. For more information, see Preserve metadata. | No |
| enableStaging, stagingSettings | Specify whether to stage the interim data in Blob storage instead of directly copying data from source to sink. For information about useful scenarios and configuration details, see Staged copy. | No |
| enableSkipIncompatibleRow, redirectIncompatibleRowSettings | Choose how to handle incompatible rows when you copy data from source to sink. For more information, see Fault tolerance. | No |

Monitoring
You can monitor the Copy activity run in Azure Data Factory both visually and programmatically. For details,
see Monitor copy activity.

Incremental copy
Data Factory enables you to incrementally copy delta data from a source data store to a sink data store. For
details, see Tutorial: Incrementally copy data.

Performance and tuning


The copy activity monitoring experience shows you the copy performance statistics for each of your activity runs.
The Copy activity performance and scalability guide describes key factors that affect the performance of data
movement via the Copy activity in Azure Data Factory. It also lists the performance values observed during
testing and discusses how to optimize the performance of the Copy activity.

Resume from last failed run


Copy activity supports resume from the last failed run when you copy large files as-is in binary format
between file-based stores and choose to preserve the folder/file hierarchy from source to sink, for example, when migrating
data from Amazon S3 to Azure Data Lake Storage Gen2. It applies to the following file-based connectors:
Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake
Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, Oracle Cloud Storage, and SFTP.
You can leverage the copy activity resume in the following two ways:
Activity-level retry: You can set a retry count on the copy activity (see the sketch after this list). During pipeline execution, if this copy
activity run fails, the next automatic retry starts from the last trial's failure point.
Rerun from failed activity: After the pipeline execution completes, you can also trigger a rerun from the
failed activity in the ADF UI monitoring view or programmatically. If the failed activity is a copy activity,
the pipeline not only reruns from this activity, but also resumes from the previous run's failure point.
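As a sketch of the first option, activity-level retry is configured in the activity policy; the values below are illustrative, and the inputs, outputs, and dataset definitions are omitted for brevity.

{
    "name": "CopyLargeBinaryFiles",
    "type": "Copy",
    "policy": {
        "retry": 3,
        "retryIntervalInSeconds": 60
    },
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AmazonS3ReadSettings",
                "recursive": true
            }
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings"
            }
        }
    }
}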

A few points to note:


Resume happens at the file level. If the copy activity fails when copying a file, that specific file is re-copied
in the next run.
For resume to work properly, do not change the copy activity settings between the reruns.
When you copy data from Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, and Google Cloud Storage,
the copy activity can resume from an arbitrary number of copied files. For the other file-based connectors as
source, the copy activity currently supports resume from a limited number of files, usually in the range of tens of
thousands, varying with the length of the file paths; files beyond this number are re-copied
during reruns.
For scenarios other than binary file copy, a copy activity rerun starts from the beginning.

Preserve metadata along with data


While copying data from source to sink, in scenarios like data lake migration, you can also choose to preserve
the metadata and ACLs along with data using copy activity. See Preserve metadata for details.

Schema and data type mapping


See Schema and data type mapping for information about how the Copy activity maps your source data to your
sink.

Add additional columns during copy


In addition to copying data from the source data store to the sink, you can also configure additional data
columns to be copied along to the sink. For example:
When copying from a file-based source, store the relative file path as an additional column to trace which
file the data comes from.
Duplicate a specified source column as another column.
Add a column with an ADF expression, to attach ADF system variables like pipeline name/pipeline ID, or store
another dynamic value from an upstream activity's output.
Add a column with a static value to meet your downstream consumption needs.
You can find the following configuration on copy activity source tab. You can also map those additional columns
in copy activity schema mapping as usual by using your defined column names.

TIP
This feature works with the latest dataset model. If you don't see this option from the UI, try creating a new dataset.

To configure it programmatically, add the additionalColumns property in your copy activity source:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| additionalColumns | Add additional data columns to copy to sink. Each object under the additionalColumns array represents an extra column. The name defines the column name, and the value indicates the data value of that column. Allowed data values are: $$FILEPATH - a reserved variable that indicates storing the source files' relative path to the folder path specified in the dataset (applies to file-based sources); $$COLUMN:<source_column_name> - a reserved variable pattern that indicates duplicating the specified source column as another column; Expression; Static value. | No |

Example:
"activities":[
{
"name": "CopyWithAdditionalColumns",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "<source type>",
"additionalColumns": [
{
"name": "filePath",
"value": "$$FILEPATH"
},
{
"name": "newColName",
"value": "$$COLUMN:SourceColumnA"
},
{
"name": "pipelineName",
"value": {
"value": "@pipeline().Pipeline",
"type": "Expression"
}
},
{
"name": "staticValue",
"value": "sampleValue"
}
],
...
},
"sink": {
"type": "<sink type>"
}
}
}
]

Auto create sink tables


When copying data into SQL database/Azure Synapse Analytics, if the destination table does not exist, the copy
activity supports automatically creating it based on the source data. This helps you quickly get started loading
data into and evaluating SQL database/Azure Synapse Analytics. After the data ingestion, you can review and
adjust the sink table schema according to your needs.
This feature is supported when copying data from any source into the following sink data stores (see the sketch after this list). You can find
the option in the ADF authoring UI -> Copy activity sink -> Table option -> Auto create table, or via the tableOption
property in the copy activity sink payload.
Azure SQL Database
Azure SQL Database Managed Instance
Azure Synapse Analytics
SQL Server
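A minimal sketch of a copy activity sink payload with this option enabled; the sink type shown assumes an Azure SQL Database sink, and other supported sinks use their own sink type.

"sink": {
    "type": "AzureSqlSink",
    "tableOption": "autoCreate"
}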
Fault tolerance
By default, the Copy activity stops copying data and returns a failure when source data rows are incompatible
with sink data rows. To make the copy succeed, you can configure the Copy activity to skip and log the
incompatible rows and copy only the compatible data. See Copy activity fault tolerance for details.
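As a sketch, skipping and logging incompatible rows is configured on the copy activity's typeProperties; the linked service reference and path below are placeholders.

"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "<sink type>"
    },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "<Azure Storage or Azure Data Lake Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "path": "redirectcontainer/errorqueue"
    }
}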

Data consistency verification


When you move data from a source to a destination store, the Azure Data Factory copy activity provides an option to
perform additional data consistency verification, so that the data is not only successfully copied but also verified
to be consistent between the source and destination stores. When inconsistent
files are found during the data movement, you can either abort the copy activity or continue copying the
rest by enabling the fault tolerance setting to skip inconsistent files. You can get the skipped file names by enabling the
session log setting in the copy activity. See Data consistency verification in copy activity for details.

Session log
You can log the names of the files that were copied. Reviewing the copy activity session logs helps you further confirm
that the data was not only successfully copied from the source to the destination store, but is also consistent
between the two. See Session log in copy activity for details.
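The following is a combined sketch of how data consistency verification and the session log can be enabled in a copy activity payload, assuming the property names described in the linked articles; the linked service reference and path are placeholders.

"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "<sink type>"
    },
    "validateDataConsistency": true,
    "skipErrorFile": {
        "dataInconsistency": true
    },
    "logSettings": {
        "enableCopyActivityLog": true,
        "copyActivityLogSettings": {
            "logLevel": "Warning",
            "enableReliableLogging": false
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "<Azure Blob Storage linked service name>",
                "type": "LinkedServiceReference"
            },
            "path": "sessionlog/"
        }
    }
}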

Next steps
See the following quickstarts, tutorials, and samples:
Copy data from one location to another location in the same Azure Blob storage account
Copy data from Azure Blob storage to Azure SQL Database
Copy data from a SQL Server database to Azure
Monitor copy activity

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to monitor the copy activity execution in Azure Data Factory. It builds on the copy
activity overview article that presents a general overview of copy activity.

Monitor visually
Once you've created and published a pipeline in Azure Data Factory, you can associate it with a trigger or
manually kick off an ad hoc run. You can monitor all of your pipeline runs natively in the Azure Data Factory user
experience. Learn about Azure Data Factory monitoring in general from Visually monitor Azure Data Factory.
To monitor the Copy activity run, go to your data factory's Author & Monitor UI. On the Monitor tab, you see a
list of pipeline runs; click the pipeline name link to access the list of activity runs in the pipeline run.

At this level, you can see links to copy activity input, output, and errors (if the Copy activity run fails), as well as
statistics like duration/status. Clicking the Details button (eyeglasses) next to the copy activity name will give
you deep details on your copy activity execution.

In this graphical monitoring view, Azure Data Factory presents you the copy activity execution information,
including data read/written volume, number of files/rows of data copied from source to sink, throughput, the
configurations applied for your copy scenario, steps the copy activity goes through with corresponding
durations and details, and more. Refer to this table on each possible metric and its detailed description.
In some scenarios, when you run a Copy activity in Data Factory, you'll see "Performance tuning tips" at the
top of the copy activity monitoring view as shown in the example. The tips tell you the bottleneck identified by
ADF for the specific copy run, along with suggestion on what to change to boost copy throughput. Learn more
about auto performance tuning tips.
The bottom execution details and durations describes the key steps your copy activity goes through, which
is especially useful for troubleshooting the copy performance. The bottleneck of your copy run is the one with
the longest duration. Refer to Troubleshoot copy activity performance for what each stage represents and
detailed troubleshooting guidance.
Example: Copy from Amazon S3 to Azure Data Lake Storage Gen2

Monitor programmatically
Copy activity execution details and performance characteristics are also returned in the Copy Activity run
result > Output section, which is used to render the UI monitoring view. Following is a complete list of
properties that might be returned. You'll see only the properties that are applicable to your copy scenario. For
information about how to monitor activity runs programmatically in general, see Programmatically monitor an
Azure data factory.
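For example, a downstream activity can consume these properties through the expression language. The sketch below assumes a pipeline variable named rowsCopied has been defined, and the preceding copy activity name is a placeholder.

{
    "name": "CaptureRowsCopied",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "<copy activity name>",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "rowsCopied",
        "value": {
            "value": "@string(activity('<copy activity name>').output.rowsCopied)",
            "type": "Expression"
        }
    }
}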

| PROPERTY NAME | DESCRIPTION | UNIT IN OUTPUT |
| --- | --- | --- |
| dataRead | The actual amount of data read from the source. | Int64 value, in bytes |
| dataWritten | The actual amount of data written/committed to the sink. The size may be different from the dataRead size, as it relates to how each data store stores the data. | Int64 value, in bytes |
| filesRead | The number of files read from the file-based source. | Int64 value (no unit) |
| filesWritten | The number of files written/committed to the file-based sink. | Int64 value (no unit) |
| filesSkipped | The number of files skipped from the file-based source. | Int64 value (no unit) |
| dataConsistencyVerification | Details of data consistency verification, where you can see if your copied data has been verified to be consistent between source and destination store. Learn more from this article. | Array |
| sourcePeakConnections | Peak number of concurrent connections established to the source data store during the Copy activity run. | Int64 value (no unit) |
| sinkPeakConnections | Peak number of concurrent connections established to the sink data store during the Copy activity run. | Int64 value (no unit) |
| rowsRead | Number of rows read from the source. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. | Int64 value (no unit) |
| rowsCopied | Number of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. | Int64 value (no unit) |
| rowsSkipped | Number of incompatible rows that were skipped. You can enable incompatible rows to be skipped by setting enableSkipIncompatibleRow to true. | Int64 value (no unit) |
| copyDuration | Duration of the copy run. | Int32 value, in seconds |
| throughput | Rate of data transfer, calculated by dataRead divided by copyDuration. | Floating point number, in KBps |
| sourcePeakConnections | Peak number of concurrent connections established to the source data store during the Copy activity run. | Int32 value (no unit) |
| sinkPeakConnections | Peak number of concurrent connections established to the sink data store during the Copy activity run. | Int32 value (no unit) |
| sqlDwPolyBase | Whether PolyBase is used when data is copied into Azure Synapse Analytics. | Boolean |
| redshiftUnload | Whether UNLOAD is used when data is copied from Redshift. | Boolean |
| hdfsDistcp | Whether DistCp is used when data is copied from HDFS. | Boolean |
| effectiveIntegrationRuntime | The integration runtime (IR) or runtimes used to power the activity run, in the format <IR name> (<region if it's Azure IR>). | Text (string) |
| usedDataIntegrationUnits | The effective Data Integration Units during copy. | Int32 value |
| usedParallelCopies | The effective parallelCopies during copy. | Int32 value |
| logPath | Path to the session log of skipped data in the blob storage. See Fault tolerance. | Text (string) |
| executionDetails | More details on the stages the Copy activity goes through and the corresponding steps, durations, configurations, and so on. We don't recommend that you parse this section because it might change. To better understand how it helps you understand and troubleshoot copy performance, refer to the Monitor visually section. | Array |
| perfRecommendation | Copy performance tuning tips. See Performance tuning tips for details. | Array |
| billingReference | The billing consumption for the given run. Learn more from Monitor consumption at activity-run level. | Object |
| durationInQueue | Queueing duration in seconds before the copy activity starts to execute. | Object |

Example:

"output": {
"dataRead": 1180089300500,
"dataWritten": 1180089300500,
"filesRead": 110,
"filesWritten": 110,
"filesSkipped": 0,
"sourcePeakConnections": 640,
"sinkPeakConnections": 1024,
"copyDuration": 388,
"throughput": 2970183,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"usedDataIntegrationUnits": 128,
"billingReference": "{\"activityType\":\"DataMovement\",\"billableDuration\":
[{\"Managed\":11.733333333333336}]}",
"usedParallelCopies": 64,
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "None"
},
"executionDetails": [
{
"source": {
"type": "AmazonS3"
},
"sink": {
"type": "AzureBlobFS",
"region": "East US",
"throttlingErrors": 6
},
"status": "Succeeded",
"start": "2020-03-04T02:13:25.1454206Z",
"duration": 388,
"usedDataIntegrationUnits": 128,
"usedParallelCopies": 64,
"profile": {
"queue": {
"status": "Completed",
"duration": 2
},
"transfer": {
"status": "Completed",
"duration": 386,
"details": {
"listingSource": {
"type": "AmazonS3",
"workingDuration": 0
},
"readingFromSource": {
"type": "AmazonS3",
"workingDuration": 301
},
"writingToSink": {
"type": "AzureBlobFS",
"workingDuration": 335
}
}
}
},
"detailedDurations": {
"queuingDuration": 2,
"transferDuration": 386
}
}
],
"perfRecommendation": [
{
"Tip": "6 write operations were throttled by the sink data store. To achieve better performance,
you are suggested to check and increase the allowed request rate for Azure Data Lake Storage Gen2, or reduce
the number of concurrent copy runs and other data access, or reduce the DIU or parallel copy.",
"ReferUrl": "https://go.microsoft.com/fwlink/?linkid=2102534 ",
"RuleName": "ReduceThrottlingErrorPerfRecommendationRule"
}
],
"durationInQueue": {
"integrationRuntimeQueue": 0
}
}

Next steps
See the other Copy Activity articles:
- Copy activity overview
- Copy activity performance
Delete Activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can use the Delete Activity in Azure Data Factory to delete files or folders from on-premises storage stores
or cloud storage stores. Use this activity to clean up or archive files when they are no longer needed.

WARNING
Deleted files or folders cannot be restored (unless the storage has soft-delete enabled). Be cautious when using the Delete
activity to delete files or folders.

Best practices
Here are some recommendations for using the Delete activity:
Back up your files before deleting them with the Delete activity in case you need to restore them in the
future.
Make sure that Data Factory has write permissions to delete folders or files from the storage store.
Make sure you are not deleting files that are being written at the same time.
If you want to delete files or folders from an on-premises system, make sure you are using a self-hosted
integration runtime with a version greater than 3.14.

Supported data stores


Azure Blob storage
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Azure File Storage
File System
FTP
SFTP
Amazon S3
Amazon S3 Compatible Storage
Google Cloud Storage
Oracle Cloud Storage
HDFS

Syntax
{
"name": "DeleteActivity",
"type": "Delete",
"typeProperties": {
"dataset": {
"referenceName": "<dataset name>",
"type": "DatasetReference"
},
"storeSettings": {
"type": "<source type>",
"recursive": true/false,
"maxConcurrentConnections": <number>
},
"enableLogging": true/false,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
},
"path": "<path to save log file>"
}
}
}

Type properties
| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| dataset | Provides the dataset reference to determine which files or folders are to be deleted. | Yes |
| recursive | Indicates whether the files are deleted recursively from the subfolders or only from the specified folder. | No. The default is false. |
| maxConcurrentConnections | The number of connections used to connect to the storage store concurrently for deleting folders or files. | No. The default is 1. |
| enableLogging | Indicates whether you need to record the folder or file names that have been deleted. If true, you need to further provide a storage account to save the log file, so that you can track the behavior of the Delete activity by reading the log file. | No |
| logStorageSettings | Only applicable when enableLogging = true. A group of storage properties that can be specified to define where you want to save the log file containing the folder or file names that have been deleted by the Delete activity. | No |
| linkedServiceName | Only applicable when enableLogging = true. The linked service of Azure Storage, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2 used to store the log file that contains the folder or file names that have been deleted by the Delete activity. Be aware that it must be configured with the same type of Integration Runtime as the one used by the Delete activity to delete files. | No |
| path | Only applicable when enableLogging = true. The path for saving the log file in your storage account. If you do not provide a path, the service creates a container for you. | No |

Monitoring
There are two places where you can see and monitor the results of the Delete activity:
From the output of the Delete activity.
From the log file.
Sample output of the Delete activity

{
"datasetName": "AmazonS3",
"type": "AmazonS3Object",
"prefix": "test",
"bucketName": "adf",
"recursive": true,
"isWildcardUsed": false,
"maxConcurrentConnections": 2,
"filesDeleted": 4,
"logPath": "https://sample.blob.core.windows.net/mycontainer/5c698705-a6e2-40bf-911e-e0a927de3f07",
"effectiveIntegrationRuntime": "MyAzureIR (West Central US)",
"executionDuration": 650
}

Sample log file of the Delete activity


| NAME | CATEGORY | STATUS | ERROR |
| --- | --- | --- | --- |
| test1/yyy.json | File | Deleted | |
| test2/hello789.txt | File | Deleted | |
| test2/test3/hello000.txt | File | Deleted | |
| test2/test3/zzz.json | File | Deleted | |


Examples of using the Delete activity
Delete specific folders or files
The store has the following folder structure:
Root/
Folder_A_1/
1.txt
2.txt
3.csv
Folder_A_2/
4.txt
5.csv
Folder_B_1/
6.txt
7.csv
Folder_B_2/
8.txt
Now you are using the Delete activity to delete folder or files by the combination of different property value
from the dataset and the Delete activity:

folderPath | fileName | recursive | Output

Root/ Folder_A_2 NULL False Root/


Folder_A_1/
1.txt
2.txt
3.csv
Folder_A_2/
4.txt
5.csv
Folder_B_1/
6.txt
7.csv
Folder_B_2/
8.txt

Root/ Folder_A_2 NULL True Root/


Folder_A_1/
1.txt
2.txt
3.csv
Folder_A_2/
4.txt
5.csv
Folder_B_1/
6.txt
7.csv
Folder_B_2/
8.txt

Root/ Folder_A_2 *.txt False Root/


Folder_A_1/
1.txt
2.txt
3.csv
Folder_A_2/
4.txt
5.csv
Folder_B_1/
6.txt
7.csv
Folder_B_2/
8.txt

Root/ Folder_A_2 *.txt True Root/


Folder_A_1/
1.txt
2.txt
3.csv
Folder_A_2/
4.txt
5.csv
Folder_B_1/
6.txt
7.csv
Folder_B_2/
8.txt

Periodically clean up the time -partitioned folder or files


You can create a pipeline to periodically clean up time-partitioned folders or files. For example, the folder
structure is similar to /mycontainer/2018/12/14/*.csv . You can leverage ADF system variables from the schedule
trigger to identify which folders or files should be deleted in each pipeline run.
Sample pipeline
{
"name":"cleanup_time_partitioned_folder",
"properties":{
"activities":[
{
"name":"DeleteOneFolder",
"type":"Delete",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"dataset":{
"referenceName":"PartitionedFolder",
"type":"DatasetReference",
"parameters":{
"TriggerTime":{
"value":"@formatDateTime(pipeline().parameters.TriggerTime, 'yyyy/MM/dd')",
"type":"Expression"
}
}
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"mycontainer/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
}
}
}
],
"parameters":{
"TriggerTime":{
"type":"string"
}
},
"annotations":[

]
}
}

Sample dataset
{
"name":"PartitionedFolder",
"properties":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"TriggerTime":{
"type":"string"
}
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":{
"value":"@dataset().TriggerTime",
"type":"Expression"
},
"container":{
"value":"mycontainer",
"type":"Expression"
}
}
}
}
}

Sample trigger
{
"name": "DailyTrigger",
"properties": {
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "cleanup_time_partitioned_folder",
"type": "PipelineReference"
},
"parameters": {
"TriggerTime": "@trigger().scheduledTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2018-12-13T00:00:00.000Z",
"timeZone": "UTC",
"schedule": {
"minutes": [
59
],
"hours": [
23
]
}
}
}
}
}

Clean up the expired files that were last modified before 2018.1.1
You can create a pipeline to clean up the old or expired files by leveraging file attribute filter: “LastModified” in
dataset.
Sample pipeline
{
"name":"CleanupExpiredFiles",
"properties":{
"activities":[
{
"name":"DeleteFilebyLastModified",
"type":"Delete",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"dataset":{
"referenceName":"BlobFilesLastModifiedBefore201811",
"type":"DatasetReference"
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"mycontainer/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true,
"modifiedDatetimeEnd":"2018-01-01T00:00:00.000Z"
}
}
}
],
"annotations":[

]
}
}

Sample dataset
{
"name":"BlobFilesLastModifiedBefore201811",
"properties":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"*",
"folderPath":"mydirectory",
"container":"mycontainer"
}
}
}
}

Move files by chaining the Copy activity and the Delete activity
You can move a file by using a copy activity to copy a file and then a delete activity to delete a file in a pipeline.
When you want to move multiple files, you can use the GetMetadata activity + Filter activity + Foreach activity +
Copy activity + Delete activity as in the following sample:

NOTE
Be very careful if you want to move an entire folder by defining a dataset that contains only a folder path and then using a copy activity
and a Delete activity that both reference the same dataset representing the folder. You must make sure
that no new files arrive in the folder between the copy operation and the delete
operation. If new files arrive in the folder after the copy activity has completed the copy job
but before the Delete activity has started, the Delete activity may delete the newly arrived files, which have
NOT yet been copied to the destination, when it deletes the entire folder.

Sample pipeline

{
"name":"MoveFiles",
"properties":{
"activities":[
{
"name":"GetFileList",
"type":"GetMetadata",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"dataset":{
"referenceName":"OneSourceFolder",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
}
}
},
"fieldList":[
"childItems"
],
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
},
"formatSettings":{
"type":"BinaryReadSettings"
}
}
},
{
"name":"FilterFiles",
"type":"Filter",
"dependsOn":[
{
"activity":"GetFileList",
"dependencyConditions":[
"Succeeded"
]
}
],
"userProperties":[

],
"typeProperties":{
"items":{
"value":"@activity('GetFileList').output.childItems",
"type":"Expression"
},
"condition":{
"value":"@equals(item().type, 'File')",
"type":"Expression"
}
}
},
{
"name":"ForEachFile",
"type":"ForEach",
"dependsOn":[
{
"activity":"FilterFiles",
"dependencyConditions":[
"Succeeded"
]
}
],
"userProperties":[

],
"typeProperties":{
"items":{
"value":"@activity('FilterFiles').output.value",
"type":"Expression"
},
"batchCount":20,
"activities":[
{
"name":"CopyAFile",
"type":"Copy",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"source":{
"type":"BinarySource",
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":false,
"deleteFilesAfterCompletion":false
},
"formatSettings":{
"type":"BinaryReadSettings"
},
"recursive":false
},
"sink":{
"type":"BinarySink",
"storeSettings":{
"type":"AzureBlobStorageWriteSettings"
}
},
"enableStaging":false,
"dataIntegrationUnits":0
},
"inputs":[
{
"referenceName":"OneSourceFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
}
],
"outputs":[
{
"referenceName":"OneDestinationFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.DestinationStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.DestinationStore_Directory",
"value":"@pipeline().parameters.DestinationStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
}
]
},
{
"name":"DeleteAFile",
"type":"Delete",
"dependsOn":[
{
"activity":"CopyAFile",
"dependencyConditions":[
"Succeeded"
]
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"dataset":{
"referenceName":"OneSourceFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"container/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
}
}
}
]
}
}
],
"parameters":{
"parameters":{
"SourceStore_Location":{
"type":"String"
},
"SourceStore_Directory":{
"type":"String"
},
"DestinationStore_Location":{
"type":"String"
},
"DestinationStore_Directory":{
"type":"String"
}
},
"annotations":[

]
}
}

Sample datasets
Dataset used by GetMetadata activity to enumerate the file list.

{
"name":"OneSourceFolder",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
}
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}

Dataset for data source used by copy activity and the Delete activity.
{
"name":"OneSourceFile",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
},
"filename":{
"type":"string"
}
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":{
"value":"@dataset().filename",
"type":"Expression"
},
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}

Dataset for data destination used by copy activity.


{
"name":"OneDestinationFile",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
},
"filename":{
"type":"string"
}
},
"annotations":[

],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":{
"value":"@dataset().filename",
"type":"Expression"
},
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}

You can also get the template to move files from here.

Known limitation
Delete activity does not support deleting a list of folders described by a wildcard.
When using the file attribute filters modifiedDatetimeStart and modifiedDatetimeEnd in the Delete activity to
select files to be deleted, make sure to also set "wildcardFileName": "*" in the Delete activity (see the sketch below).
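For instance, a storeSettings sketch for the second point (the dates are illustrative):

"storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFileName": "*",
    "modifiedDatetimeStart": "2018-01-01T00:00:00.000Z",
    "modifiedDatetimeEnd": "2018-06-01T00:00:00.000Z"
}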

Next steps
Learn more about moving files in Azure Data Factory.
Copy Data tool in Azure Data Factory
Copy Data tool in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Data Factory Copy Data tool eases and optimizes the process of ingesting data into a data lake, which
is usually a first step in an end-to-end data integration scenario. It saves time, especially when you use Azure
Data Factory to ingest data from a data source for the first time. Some of the benefits of using this tool are:
When using the Azure Data Factory Copy Data tool, you do not need to understand Data Factory definitions for
linked services, datasets, pipelines, activities, and triggers.
The flow of Copy Data tool is intuitive for loading data into a data lake. The tool automatically creates all the
necessary Data Factory resources to copy data from the selected source data store to the selected
destination/sink data store.
The Copy Data tool helps you validate the data that is being ingested at authoring time, which helps
you avoid potential errors from the start.
If you need to implement complex business logic to load data into a data lake, you can still edit the Data
Factory resources created by the Copy Data tool by using the per-activity authoring in Data Factory UI.
The following table provides guidance on when to use the Copy Data tool vs. per-activity authoring in Data
Factory UI:

| COPY DATA TOOL | PER ACTIVITY (COPY ACTIVITY) AUTHORING |
| --- | --- |
| You want to easily build a data loading task without learning about Azure Data Factory entities (linked services, datasets, pipelines, etc.) | You want to implement complex and flexible logic for loading data into a lake. |
| You want to quickly load a large number of data artifacts into a data lake. | You want to chain the Copy activity with subsequent activities for cleansing or processing data. |

To start the Copy Data tool, click the Ingest tile on the home page of your data factory.

After you launch the Copy Data tool, you will see two types of tasks: the built-in copy task and the
metadata-driven copy task . The built-in copy task lets you create a pipeline within five minutes to
replicate data without learning about Azure Data Factory entities. The metadata-driven copy task eases the
journey of creating parameterized pipelines and an external control table in order to manage copying large
amounts of objects (for example, thousands of tables) at scale. You can see more details in metadata driven copy
data.
Intuitive flow for loading data into a data lake
This tool allows you to easily move data from a wide variety of sources to destinations in minutes with an
intuitive flow:
1. Configure settings for the source .
2. Configure settings for the destination .
3. Configure advanced settings for the copy operation such as column mapping, performance settings,
and fault tolerance settings.
4. Specify a schedule for the data loading task.
5. Review summar y of Data Factory entities to be created.
6. Edit the pipeline to update settings for the copy activity as needed.
The tool is designed with big data in mind from the start, with support for diverse data and object types.
You can use it to move hundreds of folders, files, or tables. The tool supports automatic data preview,
schema capture and automatic mapping, and data filtering as well.

Automatic data preview


You can preview part of the data from the selected source data store, which allows you to validate the data that
is being copied. In addition, if the source data is in a text file, the Copy Data tool parses the text file to
automatically detect the row and column delimiters, and schema.
After the detection, select Preview data :
Schema capture and automatic mapping
The schema of data source may not be same as the schema of data destination in many cases. In this scenario,
you need to map columns from the source schema to columns from the destination schema.
The Copy Data tool monitors and learns your behavior when you are mapping columns between source and
destination stores. After you pick one or a few columns from the source data store and map them to the destination
schema, the Copy Data tool starts to analyze the pattern of the column pairs you picked from both sides. It then
applies the same pattern to the rest of the columns, so all the columns are mapped to the
destination the way you want after only a few clicks. If you are not satisfied with the column mapping
chosen by the Copy Data tool, you can ignore it and continue mapping the columns manually. Meanwhile,
the Copy Data tool constantly learns and updates the pattern, and ultimately reaches the right pattern for the
column mapping you want to achieve.

NOTE
When copying data from SQL Server or Azure SQL Database into Azure Synapse Analytics, if the table does not exist in
the destination store, the Copy Data tool can create the table automatically by using the source schema.

Filter data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces
the volume of data to be copied and therefore enhances the throughput of the copy operation. The Copy Data
tool provides a flexible way to filter data in a relational database by using the SQL query language, or to filter
files in an Azure blob folder.
Filter data in a database
The following screenshot shows a SQL query to filter the data.
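For context, the filter ends up as the query on the copy activity's source. The sketch below shows roughly what such a generated source could look like; the source type, table, and column names (AzureSqlSource, dbo.Customers, LastModifiedDate) are illustrative assumptions, not values produced by the tool for your scenario.

"source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT CustomerId, Name, LastModifiedDate FROM dbo.Customers WHERE LastModifiedDate >= '2021-01-01'"
}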

Filter data in an Azure blob folder


You can use variables in the folder path to copy data from a folder. The supported variables are: {year},
{month}, {day}, {hour}, and {minute}. For example: inputfolder/{year}/{month}/{day}.
Suppose that you have input folders in the following format:

2016/03/01/01
2016/03/01/02
2016/03/01/03
...

Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02),
and click Choose. You should see 2016/03/01/02 in the text box.
Then, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key.
When you select Incremental load: time-partitioned folder/file names in the File loading behavior
section and you select Schedule or Tumbling window on the Properties page, you should see drop-down
lists to select the format for these four variables.
The Copy Data tool generates parameters with expressions, functions, and system variables that can be used to
represent {year}, {month}, {day}, {hour}, and {minute} when creating the pipeline.
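As a rough illustration of what such a parameterized path can look like, the sketch below shows a dataset whose folder path is built from a dataset parameter with an expression. The dataset name, linked service name, container, and date format are assumptions for illustration only; the pipelines generated by the Copy Data tool may structure this differently.

{
    "name": "TimePartitionedSourceDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "windowStart": { "type": "String" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "inputcontainer",
                "folderPath": {
                    "value": "@concat('inputfolder/', formatDateTime(dataset().windowStart, 'yyyy/MM/dd/HH'))",
                    "type": "Expression"
                }
            }
        }
    }
}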

Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). These options can be used for
the connectors across different environments, including on-premises, cloud, and local desktop.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data
of any size and any supported format. The scheduled copy allows you to copy data on a recurrence that you
specify. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.
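For a scheduled copy, the tool attaches a trigger to the generated pipeline. A minimal sketch of a daily schedule trigger is shown below; the trigger name, pipeline name, and start time are placeholder assumptions, not values the tool produces for your scenario.

{
    "name": "DailyCopyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-08-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPipeline_generated",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}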
Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Build large-scale data copy pipelines with
metadata-driven approach in copy data tool
(Preview)
7/20/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When you want to copy a huge number of objects (for example, thousands of tables) or load data from a large
variety of sources, the appropriate approach is to list the objects with their required copy behaviors in a control
table, and then use parameterized pipelines to read that list from the control table and apply it to the copy jobs
accordingly. By doing so, you can maintain (for example, add or remove) the list of objects to be copied simply by
updating the object names in the control table instead of redeploying the pipelines. What's more, you have a
single place to check which objects are copied by which pipelines or triggers, with which copy behaviors.
The Copy Data tool in ADF eases the journey of building such metadata-driven data copy pipelines. After you go
through an intuitive wizard-based flow, the tool generates parameterized pipelines and SQL scripts for you to
create the external control tables. After you run the generated scripts to create the control table in your SQL
database, your pipelines read the metadata from the control table and apply it to the copy jobs automatically.

Create metadata-driven copy jobs from copy data tool


1. Select Metadata-driven copy task in the Copy Data tool.
You need to input the connection and table name of your control table, so that the generated pipeline can
read metadata from it.

2. Input the connection of your source database. You can use a parameterized linked service as well.
3. Select the table name to copy.

NOTE
If you select a tabular data store, you will have the chance to further select either full load or incremental load on the
next page. If you select a storage store, you can select full load only on the next page; incrementally loading
new files only from a storage store is currently not supported.

4. Choose loading behavior .

TIP
If you want to do a full copy of all the tables, select Full load all tables . If you want to do an incremental copy,
select Configure for each table individually , and then select Delta load as well as the watermark column name
and starting value for each table.

5. Select Destination data store .


6. On the Settings page, you can decide the maximum number of copy activities that copy data from your source store
concurrently via Number of concurrent copy tasks . The default value is 20.
7. After pipeline deployment, you can copy or download the SQL scripts from the UI for creating the control tables
and stored procedure.
You will see two SQL scripts.
The first SQL script is used to create two control tables. The main control table stores the table list, file
paths, and copy behaviors. The connection control table stores the connection values of your data stores if
you used a parameterized linked service.
The second SQL script is used to create a stored procedure. It is used to update the watermark value in the
main control table each time an incremental copy job completes.
8. Open SSMS to connect to your control table server, and run the two SQL scripts to create the control tables
and stored procedure.

9. Query the main control table and connection control table to review the metadata in it.
Main control table
Connection control table

10. Go back to the ADF portal to view and debug pipelines. You will see a folder named
"MetadataDrivenCopyTask_#########". Click the pipeline named
"MetadataDrivenCopyTask###_TopLevel" and click Debug run.
You are required to input the following parameters:

MaxNumberOfConcurrentTasks: You can always change the max number of concurrent copy activities run before the pipeline run. The default value is the one you input in the Copy Data tool.

MainControlTableName: You can always change the main control table name, so the pipeline will get the metadata from that table before the run.

ConnectionControlTableName: You can always change the connection control table name (optional), so the pipeline will get the metadata related to the data store connection before the run.

MaxNumberOfObjectsReturnedFromLookupActivity: To avoid reaching the output limit of the Lookup activity, there is a way to define the max number of objects returned by the Lookup activity. In most cases, the default value does not need to be changed.

windowStart: When you input a dynamic value (for example, yyyy/mm/dd) as the folder path, this parameter is used to pass the current trigger time to the pipeline in order to fill the dynamic folder path. When the pipeline is triggered by a schedule trigger or tumbling window trigger, users do not need to input the value of this parameter (see the sketch after this list). Sample value: 2021-01-25T01:49:28Z
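As a hedged sketch of how a trigger can supply windowStart automatically, the fragment below shows the pipelines section of a schedule trigger passing the trigger's scheduled time; the pipeline name is a placeholder, and a tumbling window trigger would use @trigger().outputs.windowStartTime instead.

"pipelines": [
    {
        "pipelineReference": {
            "referenceName": "MetadataDrivenCopyTask_xxx_TopLevel",
            "type": "PipelineReference"
        },
        "parameters": {
            "windowStart": "@trigger().scheduledTime"
        }
    }
]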

11. Enable the trigger to operationalize the pipelines.


Update control table by copy data tool
You can always directly update the control table by adding or removing the object to be copied or changing the
copy behavior for each table. We also create UI experience in copy data tool to ease the journey of editing the
control table.
1. Right-click the top-level pipeline: MetadataDrivenCopyTask_xxx_TopLevel , and then select Edit
control table .

2. Select rows from the control table to edit.


3. Go through the Copy Data tool again, and it will come up with a new SQL script for you. Rerun the SQL script
to update your control table.

NOTE
The pipeline will NOT be redeployed. The newly created SQL script only helps you update the control table.

Control tables
Main control table
Each row in the control table contains the metadata for one object (for example, one table) to be copied; a hypothetical illustration of such a row follows the column list below.

Id: Unique ID of the object to be copied.

SourceObjectSettings: Metadata of the source dataset, such as the schema name and table name.

SourceConnectionSettingsName: The name of the source connection setting in the connection control table. It is optional.

CopySourceSettings: Metadata of the source property in the copy activity, such as the query and partitions.

SinkObjectSettings: Metadata of the destination dataset, such as the file name, folder path, and table name. If a dynamic folder path is specified, the variable value is not written here in the control table.

SinkConnectionSettingsName: The name of the destination connection setting in the connection control table. It is optional.

CopySinkSettings: Metadata of the sink property in the copy activity, such as preCopyScript and tableOption.

CopyActivitySettings: Metadata of the translator property in the copy activity. It is used to define column mapping.

TopLevelPipelineName: Name of the top-level pipeline that can copy this object.

TriggerName: Name of the trigger that can trigger the pipeline to copy this object. For a debug run, the name is Sandbox. For a manual execution, the name is Manual.

DataLoadingBehaviorSettings: Full load vs. delta load.

TaskId: Objects are copied following the order of TaskId in the control table (ORDER BY [TaskId] DESC). If you have a huge number of objects to be copied but only a limited number of concurrent copies allowed, you can change the TaskId for each object to decide which objects are copied earlier. The default value is 0.
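A hypothetical illustration of the kind of values a single row might hold is shown below. These fragments are invented for illustration only; the exact shape of the values is defined by the SQL scripts that the Copy Data tool generates for your scenario.

SourceObjectSettings:        { "schema": "dbo", "table": "SalesOrders" }
CopySourceSettings:          { "partitionOption": "None" }
SinkObjectSettings:          { "fileName": "SalesOrders.parquet", "folderPath": "raw/salesorders" }
DataLoadingBehaviorSettings: { "dataLoadingBehavior": "FullLoad" }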

Connection control table


Each row in the connection control table contains one connection setting for a data store.

Name: Name of the parameterized connection in the main control table.

ConnectionSettings: The connection settings, such as the database name and server name.

Pipelines
The Copy Data tool generates three levels of pipelines.
MetadataDrivenCopyTask_xxx_TopLevel
This pipeline calculates the total number of objects (tables and so on) required to be copied in this run, comes up
with the number of sequential batches based on the maximum allowed concurrent copy tasks, and then executes
another pipeline to copy the batches sequentially.
Parameters
MaxNumberOfConcurrentTasks: You can always change the max number of concurrent copy activities run before the pipeline run. The default value is the one you input in the Copy Data tool.

MainControlTableName: The table name of the main control table. The pipeline gets the metadata from this table before the run.

ConnectionControlTableName: The table name of the connection control table (optional). The pipeline gets the metadata related to the data store connection before the run.

MaxNumberOfObjectsReturnedFromLookupActivity: To avoid reaching the output limit of the Lookup activity, there is a way to define the max number of objects returned by the Lookup activity. In most cases, the default value does not need to be changed.

windowStart: When you input a dynamic value (for example, yyyy/mm/dd) as the folder path, this parameter is used to pass the current trigger time to the pipeline in order to fill the dynamic folder path. When the pipeline is triggered by a schedule trigger or tumbling window trigger, users do not need to input the value of this parameter. Sample value: 2021-01-25T01:49:28Z

Activities

GetSumOfObjectsToCopy (Lookup): Calculates the total number of objects (tables and so on) required to be copied in this run.

CopyBatchesOfObjectsSequentially (ForEach): Comes up with the number of sequential batches based on the max allowed concurrent copy tasks, and then executes another pipeline to copy different batches sequentially.

CopyObjectsInOneBtach (Execute Pipeline): Executes another pipeline to copy one batch of objects. The objects belonging to this batch are copied in parallel.

MetadataDrivenCopyTask_xxx_MiddleLevel
This pipeline copies one batch of objects. The objects belonging to this batch are copied in parallel.
Parameters

MaxNumberOfObjectsReturnedFromLookupActivity: To avoid reaching the output limit of the Lookup activity, there is a way to define the max number of objects returned by the Lookup activity. In most cases, the default value does not need to be changed.

TopLayerPipelineName: The name of the top-layer pipeline.

TriggerName: The name of the trigger.

CurrentSequentialNumberOfBatch: The ID of the sequential batch.

SumOfObjectsToCopy: The total number of objects to copy.

SumOfObjectsToCopyForCurrentBatch: The number of objects to copy in the current batch.

MainControlTableName: The name of the main control table.

ConnectionControlTableName: The name of the connection control table.

Activities

DivideOneBatchIntoMultipleGroups (ForEach): Divides objects from a single batch into multiple parallel groups to avoid reaching the output limit of the Lookup activity.

GetObjectsPerGroupToCopy (Lookup): Gets objects (tables and so on) from the control table that are required to be copied in this group. The order of objects to be copied follows the TaskId in the control table (ORDER BY [TaskId] DESC).

CopyObjectsInOneGroup (Execute Pipeline): Executes another pipeline to copy objects from one group. The objects belonging to this group are copied in parallel.

MetadataDrivenCopyTask_xxx_BottomLevel
This pipeline copies objects from one group. The objects belonging to this group are copied in parallel.
Parameters

ObjectsPerGroupToCopy: The number of objects to copy in the current group.

ConnectionControlTableName: The name of the connection control table.

windowStart: Used to pass the current trigger time to the pipeline in order to fill the dynamic folder path, if configured by the user.

Activities

ListObjectsFromOneGroup (ForEach): Lists objects from one group and iterates over each of them in downstream activities.

RouteJobsBasedOnLoadingBehavior (Switch): Checks the loading behavior for each object. If it is the default or FullLoad case, do a full load. If it is the DeltaLoad case, do an incremental load via the watermark column to identify changes.

FullLoadOneObject (Copy): Takes a full snapshot of this object and copies it to the destination.

DeltaLoadOneObject (Copy): Copies only the data that changed since last time, by comparing values in the watermark column to identify changes.

GetMaxWatermarkValue (Lookup): Queries the source object to get the max value of the watermark column.

UpdateWatermarkColumnValue (Stored Procedure): Writes the new watermark value back to the control table, to be used next time.

Known limitations
The Copy Data tool does not currently support metadata-driven ingestion for incrementally copying new files
only. You can bring your own parameterized pipelines to achieve that.
IR name, database type, and file format type cannot be parameterized in ADF. For example, if you want to ingest
data from both Oracle Server and SQL Server, you will need two different parameterized pipelines. However, the
single control table can be shared by the two sets of pipelines.

Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Supported file formats and compression codecs by
copy activity in Azure Data Factory
5/14/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article applies to the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure
Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud
Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
You can use the Copy activity to copy files as-is between two file-based data stores, in which case the data is
copied efficiently without any serialization or deserialization.
In addition, you can also parse or generate files of a given format. For example, you can perform the following:
Copy data from a SQL Server database and write to Azure Data Lake Storage Gen2 in Parquet format.
Copy files in text (CSV) format from an on-premises file system and write to Azure Blob storage in Avro
format.
Copy zipped files from an on-premises file system, decompress them on-the-fly, and write extracted files to
Azure Data Lake Storage Gen2.
Copy data in Gzip compressed-text (CSV) format from Azure Blob storage and write it to Azure SQL
Database.
Many more scenarios that require serialization/deserialization or compression/decompression. A minimal sketch of the first scenario follows this list.
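For example, assuming datasets named SqlServerTableDataset (a SQL Server table) and AdlsParquetDataset (Parquet files on Data Lake Storage Gen2) already exist, the first scenario might be sketched roughly as follows; the names and property values shown are assumptions for illustration, not a definitive configuration.

"activities": [
    {
        "name": "CopySqlToParquet",
        "type": "Copy",
        "inputs": [
            { "referenceName": "SqlServerTableDataset", "type": "DatasetReference" }
        ],
        "outputs": [
            { "referenceName": "AdlsParquetDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
            "source": { "type": "SqlServerSource" },
            "sink": { "type": "ParquetSink" }
        }
    }
]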

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Copy activity performance and scalability guide
5/6/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Sometimes you want to perform a large-scale data migration from a data lake or enterprise data warehouse
(EDW) to Azure. Other times you want to ingest large amounts of data from different sources into Azure for big
data analytics. In each case, it is critical to achieve optimal performance and scalability.
Azure Data Factory (ADF) provides a mechanism to ingest data. ADF has the following advantages:
Handles large amounts of data
Is highly performant
Is cost-effective
These advantages make ADF an excellent fit for data engineers who want to build scalable data ingestion
pipelines that are highly performant.
After reading this article, you will be able to answer the following questions:
What level of performance and scalability can I achieve using ADF copy activity for data migration and data
ingestion scenarios?
What steps should I take to tune the performance of ADF copy activity?
What ADF perf optimization knobs can I utilize to optimize performance for a single copy activity run?
What other factors outside ADF to consider when optimizing copy performance?

NOTE
If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF


ADF offers a serverless architecture that allows parallelism at different levels.
This architecture allows you to develop pipelines that maximize data movement throughput for your
environment. These pipelines fully utilize the following resources:
Network bandwidth between the source and destination data stores
Source or destination data store input/output operations per second (IOPS) and bandwidth
This full utilization means you can estimate the overall throughput by measuring the minimum throughput
available with the following resources:
Source data store
Destination data store
Network bandwidth in between the source and destination data stores
The table below shows the calculation of data movement duration. The duration in each cell is calculated based
on a given network and data store bandwidth and a given data payload size.
NOTE
The durations provided below represent achievable performance in an end-to-end data integration solution
implemented with ADF by using one or more of the performance optimization techniques described in Copy performance
optimization features, including using ForEach to partition and spawn off multiple concurrent copy activities. We
recommend that you follow the steps laid out in Performance tuning steps to optimize copy performance for your specific
dataset and system configuration. Use the numbers obtained in your performance tuning tests for production
deployment planning, capacity planning, and billing projection.

DATA SIZE \ BANDWIDTH | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps
1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min
10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min
100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs
1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs
10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days
100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days
1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo
10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo

ADF copy is scalable at different levels:


ADF control flow can start multiple copy activities in parallel, for example using For Each loop.
A single copy activity can take advantage of scalable compute resources.
When using Azure integration runtime (IR), you can specify up to 256 data integration units (DIUs) for
each copy activity, in a serverless manner.
When using self-hosted IR, you can take either of the following approaches:
Manually scale up the machine.
Scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file
set across all nodes.
A single copy activity reads from and writes to the data store using multiple threads in parallel.

Performance tuning steps


Take the following steps to tune the performance of your Azure Data Factory service with the copy activity:
1. Pick up a test dataset and establish a baseline.
During development, test your pipeline by using the copy activity against a representative data sample.
The dataset you choose should represent your typical data patterns along the following attributes:
Folder structure
File pattern
Data schema
And your dataset should be big enough to evaluate copy performance. A good size takes at least 10
minutes for copy activity to complete. Collect execution details and performance characteristics following
copy activity monitoring.
2. How to maximize the performance of a single copy activity :
We recommend that you first maximize performance using a single copy activity.
If the copy activity is executed on an Azure integration runtime:
Start with default values for the Data Integration Units (DIU) and parallel copy settings.
If the copy activity is executed on a self-hosted integration runtime:
We recommend that you use a dedicated machine to host the IR, separate from the server hosting the
data store. Start with default values for the parallel copy setting and a single node for the
self-hosted IR.
Conduct a performance test run. Take a note of the performance achieved. Include the actual values used,
such as DIUs and parallel copies. Refer to copy activity monitoring on how to collect run results and
performance settings used. Learn how to troubleshoot copy activity performance to identify and resolve
the bottleneck.
Iterate to conduct additional performance test runs following the troubleshooting and tuning guidance.
Once single copy activity runs cannot achieve better throughput, consider whether to maximize
aggregate throughput by running multiple copies concurrently. This option is discussed in the next
numbered bullet.
3. How to maximize aggregate throughput by running multiple copies concurrently:
By now you have maximized the performance of a single copy activity. If you have not yet achieved the
throughput upper limits of your environment, you can run multiple copy activities in parallel by using ADF
control flow constructs, such as the ForEach loop (a minimal sketch follows these steps). For more
information, see the following articles about solution templates:
Copy files from multiple containers
Migrate data from Amazon S3 to ADLS Gen2
Bulk copy with a control table
4. Expand the configuration to your entire dataset.
When you're satisfied with the execution results and performance, you can expand the definition and
pipeline to cover your entire dataset.
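As a minimal sketch of running copies concurrently with a ForEach loop, the fragment below iterates over a pipeline parameter and runs up to 8 copies at a time. The pipeline parameter, dataset names, and batch count are assumptions for illustration only.

"activities": [
    {
        "name": "CopyFoldersInParallel",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": false,
            "batchCount": 8,
            "items": {
                "value": "@pipeline().parameters.folderList",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopyOneFolder",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "ParameterizedSourceDataset",
                            "type": "DatasetReference",
                            "parameters": { "folderPath": "@item()" }
                        }
                    ],
                    "outputs": [
                        { "referenceName": "SinkDataset", "type": "DatasetReference" }
                    ],
                    "typeProperties": {
                        "source": { "type": "DelimitedTextSource" },
                        "sink": { "type": "ParquetSink" }
                    }
                }
            ]
        }
    }
]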

Troubleshoot copy activity performance


Follow the Performance tuning steps to plan and conduct a performance test for your scenario, and learn how to
troubleshoot each copy activity run's performance issue in Azure Data Factory from Troubleshoot copy activity
performance.

Copy performance optimization features


Azure Data Factory provides the following performance optimization features:
Data Integration Units
Self-hosted integration runtime scalability
Parallel copy
Staged copy
Data Integration Units
A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory.
Power is a combination of CPU, memory, and network resource allocation. DIU only applies to Azure integration
runtime. DIU does not apply to self-hosted integration runtime. Learn more here.
Self-hosted integration runtime scalability
You might want to host an increasing concurrent workload. Or you might want to achieve higher performance in
your present workload level. You can enhance the scale of processing by the following approaches:
You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node.
Scaling up works only if the processor and memory of the node are less than fully utilized.
You can scale out the self-hosted IR, by adding more nodes (machines).
For more information, see:
Copy activity performance optimization features: Self-hosted integration runtime scalability
Create and configure a self-hosted integration runtime: Scale considerations
Parallel copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of
this property as the maximum number of threads within the copy activity. The threads operate in parallel. The
threads either read from your source, or write to your sink data stores. Learn more.
Staged copy
A data copy operation can send the data directly to the sink data store. Alternatively, you can choose to use Blob
storage as an interim staging store. Learn more.

Next steps
See the other copy activity articles:
Copy activity overview
Troubleshoot copy activity performance
Copy activity performance optimization features
Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
Migrate data from Amazon S3 to Azure Storage
Troubleshoot copy activity performance
7/6/2021 • 15 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines how to troubleshoot copy activity performance issues in Azure Data Factory.
After you run a copy activity, you can collect the run result and performance statistics in copy activity monitoring
view. The following is an example.

Performance tuning tips


In some scenarios, when you run a copy activity in Data Factory, you'll see "Performance tuning tips" at the
top, as shown in the above example. The tips tell you the bottleneck identified by ADF for this particular copy
run, along with suggestions on how to boost copy throughput. Try making the recommended change, then run
the copy again.
As a reference, currently the performance tuning tips provide suggestions for the following cases:

Data store specific:
- Loading data into Azure Synapse Analytics: suggest using PolyBase or the COPY statement if it's not used.
- Copying data from/to Azure SQL Database: when DTU is under high utilization, suggest upgrading to a higher tier.
- Copying data from/to Azure Cosmos DB: when RU is under high utilization, suggest upgrading to a larger RU.
- Copying data from SAP Table: when copying a large amount of data, suggest leveraging the SAP connector's partition option to enable parallel load and increase the max partition number.
- Ingesting data from Amazon Redshift: suggest using UNLOAD if it's not used.

Data store throttling:
- If a number of read/write operations are throttled by the data store during copy, suggest checking and increasing the allowed request rate for the data store, or reducing the concurrent workload.

Integration runtime:
- If you use a Self-hosted Integration Runtime (IR) and the copy activity waits long in the queue until the IR has available resources to execute, suggest scaling out/up your IR.
- If you use an Azure Integration Runtime that is in a non-optimal region, resulting in slow read/write, suggest configuring to use an IR in another region.

Fault tolerance:
- If you configure fault tolerance and skipping incompatible rows results in slow performance, suggest ensuring source and sink data are compatible.

Staged copy:
- If staged copy is configured but not helpful for your source-sink pair, suggest removing it.

Resume:
- When the copy activity is resumed from the last failure point but you changed the DIU setting after the original run, note that the new DIU setting doesn't take effect.

Understand copy activity execution details


The execution details and durations at the bottom of the copy activity monitoring view describe the key stages
your copy activity goes through (see the example at the beginning of this article), which is especially useful for
troubleshooting copy performance. The bottleneck of your copy run is the stage with the longest duration.
Refer to the following table for each stage's definition, and learn how to use this information in Troubleshoot
copy activity on Azure IR and Troubleshoot copy activity on Self-hosted IR.

Queue: The elapsed time until the copy activity actually starts on the integration runtime.

Pre-copy script: The elapsed time between the copy activity starting on the IR and the copy activity finishing executing the pre-copy script in the sink data store. It applies when you configure a pre-copy script for database sinks, for example, to clean up Azure SQL Database before copying new data (see the sketch after this table).

Transfer: The elapsed time between the end of the previous step and the IR transferring all the data from source to sink. The sub-steps under Transfer run in parallel, and some operations, such as parsing/generating the file format, are not shown.
- Time to first byte: The time elapsed between the end of the previous step and the time when the IR receives the first byte from the source data store. Applies to non-file-based sources.
- Listing source: The amount of time spent enumerating source files or data partitions. The latter applies when you configure partition options for database sources, for example, when copying data from databases like Oracle, SAP HANA, Teradata, or Netezza.
- Reading from source: The amount of time spent retrieving data from the source data store.
- Writing to sink: The amount of time spent writing data to the sink data store. Note that some connectors do not have this metric at the moment, including Azure Cognitive Search, Azure Data Explorer, Azure Table storage, Oracle, SQL Server, Common Data Service, Dynamics 365, Dynamics CRM, and Salesforce/Salesforce Service Cloud.
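As a hedged sketch of a sink configured with a pre-copy script (the sink type and the TRUNCATE statement below are illustrative assumptions, not a prescribed setup):

"sink": {
    "type": "AzureSqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.StagingSalesOrders"
}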

Troubleshoot copy activity on Azure IR


Follow the Performance tuning steps to plan and conduct a performance test for your scenario.
When copy activity performance doesn't meet your expectations, troubleshoot a single copy activity running
on the Azure Integration Runtime as follows: if performance tuning tips show up in the copy monitoring view, apply
the suggestion and try again. Otherwise, understand the copy activity execution details, check which stage has the
longest duration, and apply the guidance below to boost copy performance:
"Pre-copy script" experienced long duration: it means the pre-copy script running on sink database
takes long to finish. Tune the specified pre-copy script logic to enhance the performance. If you need
further help on improving the script, contact your database team.
"Transfer - Time to first byte" experienced long working duration : it means your source query
takes long to return any data. Check and optimize the query or server. If you need further help, contact
your data store team.
"Transfer - Listing source" experienced long working duration : it means enumerating source files
or source database data partitions is slow.
When copying data from a file-based source, if you use a wildcard filter on the folder path or file name (
wildcardFolderPath or wildcardFileName ), or use the file last modified time filter (
modifiedDatetimeStart or modifiedDatetimeEnd ), note that such a filter results in the copy activity listing
all the files under the specified folder on the client side and then applying the filter. Such file enumeration
can become the bottleneck, especially when only a small set of files matches the filter rule.
Check whether you can copy files based on a datetime-partitioned file path or name. Doing so
doesn't put a burden on the listing source side.
Check if you can use the data store's native filter instead, specifically "prefix " for Amazon
S3/Azure Blob/Azure File Storage and "listAfter/listBefore " for ADLS Gen1. Those filters
are data store server-side filters and have much better performance.
Consider splitting a single large data set into several smaller data sets, and let those copy jobs
run concurrently, each tackling a portion of the data. You can do this with Lookup/GetMetadata +
ForEach + Copy. Refer to the Copy files from multiple containers or Migrate data from Amazon
S3 to ADLS Gen2 solution templates as a general example.
Check whether ADF reports any throttling error on the source, or whether your data store is under high
utilization. If so, either reduce your workloads on the data store, or try contacting your data store
administrator to increase the throttling limit or available resources.
Use an Azure IR in the same region as, or close to, your source data store region.
"Transfer - reading from source" experienced long working duration :
Adopt connector-specific data loading best practice if applies. For example, when copying data
from Amazon Redshift, configure to use Redshift UNLOAD.
Check if ADF reports any throttling error on source or if your data store is under high utilization. If
so, either reduce your workloads on the data store, or try contacting your data store administrator
to increase the throttling limit or available resource.
Check your copy source and sink pattern:
If your copy pattern supports larger than 4 Data Integration Units (DIUs) - refer to this
section on details, generally you can try increasing DIUs to get better performance.
Otherwise, consider to split single large data set into several smaller data sets, and let those
copy jobs run concurrently each tackles portion of data. You can do this with
Lookup/GetMetadata + ForEach + Copy. Refer to Copy files from multiple containers,
Migrate data from Amazon S3 to ADLS Gen2, or Bulk copy with a control table solution
templates as general example.
Use Azure IR in the same or close to your source data store region.
"Transfer - writing to sink" experienced long working duration :
Adopt connector-specific data loading best practice if applies. For example, when copying data into
Azure Synapse Analytics, use PolyBase or COPY statement.
Check if ADF reports any throttling error on sink or if your data store is under high utilization. If so,
either reduce your workloads on the data store, or try contacting your data store administrator to
increase the throttling limit or available resource.
Check your copy source and sink pattern:
If your copy pattern supports larger than 4 Data Integration Units (DIUs) - refer to this
section on details, generally you can try increasing DIUs to get better performance.
Otherwise, gradually tune the parallel copies, note that too many parallel copies may even
hurt the performance.
Use Azure IR in the same or close to your sink data store region.

Troubleshoot copy activity on Self-hosted IR


Follow the Performance tuning steps to plan and conduct a performance test for your scenario.
When copy performance doesn't meet your expectations, troubleshoot a single copy activity running on the
Self-hosted Integration Runtime as follows: if performance tuning tips show up in the copy monitoring view, apply the
suggestion and try again. Otherwise, understand the copy activity execution details, check which stage has the
longest duration, and apply the guidance below to boost copy performance:
"Queue" experienced long duration: it means the copy activity waits long in the queue until your
Self-hosted IR has resource to execute. Check the IR capacity and usage, and scale up or out according to
your workload.
"Transfer - Time to first byte" experienced long working duration : it means your source query
takes long to return any data. Check and optimize the query or server. If you need further help, contact
your data store team.
"Transfer - Listing source" experienced long working duration : it means enumerating source files
or source database data partitions is slow.
Check if the Self-hosted IR machine has low latency connecting to source data store. If your source
is in Azure, you can use this tool to check the latency from the Self-hosted IR machine to the Azure
region, the less the better.
When copying data from file-based source, if you use wildcard filter on folder path or file name (
wildcardFolderPath or wildcardFileName ), or use file last modified time filter (
modifiedDatetimeStart or modifiedDatetimeEnd ), note such filter would result in copy activity listing
all the files under the specified folder to client side then apply the filter. Such file enumeration
could become the bottleneck especially when only small set of files met the filter rule.
Check whether you can copy files based on datetime partitioned file path or name. Such
way doesn't bring burden on listing source side.
Check if you can use data store's native filter instead, specifically "prefix " for Amazon
S3/Azure Blob/Azure File Storage and "listAfter/listBefore " for ADLS Gen1. Those filters
are data store server-side filter and would have much better performance.
Consider to split single large data set into several smaller data sets, and let those copy jobs
run concurrently each tackles portion of data. You can do this with Lookup/GetMetadata +
ForEach + Copy. Refer to Copy files from multiple containers or Migrate data from Amazon
S3 to ADLS Gen2 solution templates as general example.
Check if ADF reports any throttling error on source or if your data store is under high utilization
state. If so, either reduce your workloads on the data store, or try contacting your data store
administrator to increase the throttling limit or available resource.
"Transfer - reading from source" experienced long working duration :
Check if the Self-hosted IR machine has low latency connecting to source data store. If your source
is in Azure, you can use this tool to check the latency from the Self-hosted IR machine to the Azure
regions, the less the better.
Check if the Self-hosted IR machine has enough inbound bandwidth to read and transfer the data
efficiently. If your source data store is in Azure, you can use this tool to check the download speed.
Check the Self-hosted IR's CPU and memory usage trend in Azure portal -> your data factory ->
overview page. Consider to scale up/out IR if the CPU usage is high or available memory is low.
Adopt connector-specific data loading best practice if applies. For example:
When copying data from Oracle, Netezza, Teradata, SAP HANA, SAP Table, and SAP Open
Hub), enable data partition options to copy data in parallel.
When copying data from HDFS, configure to use DistCp.
When copying data from Amazon Redshift, configure to use Redshift UNLOAD.
Check if ADF reports any throttling error on source or if your data store is under high utilization. If
so, either reduce your workloads on the data store, or try contacting your data store administrator
to increase the throttling limit or available resource.
Check your copy source and sink pattern:
If you copy data from partition-option-enabled data stores, consider to gradually tune the
parallel copies, note that too many parallel copies may even hurt the performance.
Otherwise, consider to split single large data set into several smaller data sets, and let those
copy jobs run concurrently each tackles portion of data. You can do this with
Lookup/GetMetadata + ForEach + Copy. Refer to Copy files from multiple containers,
Migrate data from Amazon S3 to ADLS Gen2, or Bulk copy with a control table solution
templates as general example.
"Transfer - writing to sink" experienced long working duration :
Adopt connector-specific data loading best practice if applies. For example, when copying data into
Azure Synapse Analytics, use PolyBase or COPY statement.
Check if the Self-hosted IR machine has low latency connecting to sink data store. If your sink is in
Azure, you can use this tool to check the latency from the Self-hosted IR machine to the Azure
region, the less the better.
Check if the Self-hosted IR machine has enough outbound bandwidth to transfer and write the
data efficiently. If your sink data store is in Azure, you can use this tool to check the upload speed.
Check if the Self-hosted IR's CPU and memory usage trend in Azure portal -> your data factory ->
overview page. Consider to scale up/out IR if the CPU usage is high or available memory is low.
Check if ADF reports any throttling error on sink or if your data store is under high utilization. If so,
either reduce your workloads on the data store, or try contacting your data store administrator to
increase the throttling limit or available resource.
Consider to gradually tune the parallel copies, note that too many parallel copies may even hurt
the performance.

Connector and IR performance


This section explores some performance troubleshooting guides for particular connector type or integration
runtime.
Activity execution time varies using Azure IR vs Azure VNet IR
Activity execution time varies when the dataset is based on different integration runtimes.
Symptoms : Simply toggling the Linked Service dropdown in the dataset runs the same pipeline
activities but with drastically different run times. When the dataset is based on the Managed Virtual
Network Integration Runtime, it takes more time on average than the run based on the Default
Integration Runtime.
Cause : Checking the details of the pipeline runs, you can see that the slow pipeline is running on the Managed
VNet (Virtual Network) IR while the normal one is running on the Azure IR. By design, the Managed VNet IR
takes a longer queue time than the Azure IR because we do not reserve one compute node per data factory, so
there is a warm-up for each copy activity to start, and it occurs primarily on the VNet join rather than on the Azure
IR.
Low performance when loading data into Azure SQL Database
Symptoms : Copying data into Azure SQL Database is slow.
Cause : The root cause of the issue is mostly a bottleneck on the Azure SQL Database side.
Following are some possible causes:
The Azure SQL Database tier is not high enough.
Azure SQL Database DTU usage is close to 100%. You can monitor the performance and consider
upgrading the Azure SQL Database tier.
Indexes are not set properly. Remove all the indexes before the data load and recreate them after the load
completes.
WriteBatchSize is not large enough to fit the schema row size. Try enlarging the property for the issue (see
the sketch after this list).
Instead of bulk insert, a stored procedure is being used, which is expected to have worse
performance.
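As a hedged sketch of tuning the batch size on a SQL sink (the sink type and values shown are illustrative assumptions; suitable values depend on your row size and database tier):

"sink": {
    "type": "AzureSqlSink",
    "writeBatchSize": 100000,
    "writeBatchTimeout": "00:30:00"
}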
Timeout or slow performance when parsing large Excel file
Symptoms :
When you create an Excel dataset and import the schema from connection/store, preview data, list, or
refresh worksheets, you may hit a timeout error if the Excel file is large.
When you use the copy activity to copy data from a large Excel file (>= 100 MB) into another data store,
you may experience slow performance or an OOM issue.
Cause :
For operations like importing the schema, previewing data, and listing worksheets on an Excel dataset,
the timeout is 100 seconds and static. For a large Excel file, these operations may not finish within the
timeout value.
The ADF copy activity reads the whole Excel file into memory, then locates the specified worksheet and
cells to read data. This behavior is due to the underlying SDK that ADF uses.
Resolution :
For importing the schema, you can generate a smaller sample file, which is a subset of the original file,
and choose "import schema from sample file" instead of "import schema from connection/store".
For listing worksheets, in the worksheet dropdown, you can click "Edit" and input the sheet
name/index instead.
To copy a large Excel file (>100 MB) into another store, you can use the Data Flow Excel source, which
supports streaming read and performs better.

Other references
Here is performance monitoring and tuning references for some of the supported data stores:
Azure Blob storage: Scalability and performance targets for Blob storage and Performance and scalability
checklist for Blob storage.
Azure Table storage: Scalability and performance targets for Table storage and Performance and scalability
checklist for Table storage.
Azure SQL Database: You can monitor the performance and check the Database Transaction Unit (DTU)
percentage.
Azure Synapse Analytics: Its capability is measured in Data Warehouse Units (DWUs). See Manage compute
power in Azure Synapse Analytics (Overview).
Azure Cosmos DB: Performance levels in Azure Cosmos DB.
SQL Server: Monitor and tune for performance.
On-premises file server: Performance tuning for file servers.

Next steps
See the other copy activity articles:
Copy activity overview
Copy activity performance and scalability guide
Copy activity performance optimization features
Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
Migrate data from Amazon S3 to Azure Storage
Copy activity performance optimization features
5/6/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article outlines the copy activity performance optimization features that you can leverage in Azure Data
Factory.

Data Integration Units


A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network
resource allocation) of a single unit in Azure Data Factory. Data Integration Unit only applies to Azure integration
runtime, but not self-hosted integration runtime.
The allowed DIU range for a copy activity run is between 2 and 256. If not specified, or if you choose "Auto"
in the UI, Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data
pattern. The following table lists the supported DIU ranges and default behavior in different copy scenarios:

Between file stores
- Supported DIU range: Copy from or to a single file: 2-4. Copy from and to multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files and choose to preserve the hierarchy, the max effective DIU is 16; when you choose to merge files, the max effective DIU is 4.
- Default DIUs determined by service: Between 4 and 32, depending on the number and size of the files.

From file store to non-file store
- Supported DIU range: Copy from a single file: 2-4. Copy from multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files, the max effective DIU is 16.
- Default DIUs determined by service: Copy into Azure SQL Database or Azure Cosmos DB: between 4 and 16, depending on the sink tier (DTUs/RUs) and source file pattern. Copy into Azure Synapse Analytics using PolyBase or COPY statement: 2. Other scenario: 4.

From non-file store to file store
- Supported DIU range: Copy from partition-option-enabled data stores (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by service: Copy from REST or HTTP: 1. Copy from Amazon Redshift using UNLOAD: 2. Other scenario: 4.

Between non-file stores
- Supported DIU range: Copy from partition-option-enabled data stores (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
- Default DIUs determined by service: Copy from REST or HTTP: 1. Other scenario: 4.

You can see the DIUs used for each copy run in the copy activity monitoring view or activity output. For more
information, see Copy activity monitoring. To override this default, specify a value for the dataIntegrationUnits
property as follows. The actual number of DIUs that the copy operation uses at run time is equal to or less than
the configured value, depending on your data pattern.
You will be charged # of used DIUs * copy duration * unit price/DIU-hour . See the current prices here.
Local currency and separate discounting may apply per subscription type.
Example:

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"dataIntegrationUnits": 128
}
}
]

Self-hosted integration runtime scalability


If you would like to achieve higher throughput, you can either scale up or scale out the Self-hosted IR:
If the CPU and available memory on the Self-hosted IR node are not fully utilized, but the execution of
concurrent jobs is reaching the limit, you should scale up by increasing the number of concurrent jobs that
can run on a node. See here for instructions.
If on the other hand, the CPU is high on the Self-hosted IR node or available memory is low, you can add a
new node to help scale out the load across the multiple nodes. See here for instructions.
Note in the following scenarios, single copy activity execution can leverage multiple Self-hosted IR nodes:
Copy data from file-based stores, depending on the number and size of the files.
Copy data from partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed
Instance, Azure Synapse Analytics, Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, and
Teradata), depending on the number of data partitions.
Parallel copy
You can set parallel copy ( parallelCopies property) on copy activity to indicate the parallelism that you want
the copy activity to use. You can think of this property as the maximum number of threads within the copy
activity that read from your source or write to your sink data stores in parallel.
The parallel copy is orthogonal to Data Integration Units or Self-hosted IR nodes. It is counted across all the DIUs
or Self-hosted IR nodes.
For each copy activity run, by default Azure Data Factory dynamically applies the optimal parallel copy setting
based on your source-sink pair and data pattern.

TIP
The default behavior of parallel copy usually gives you the best throughput, which is auto-determined by ADF based on
your source-sink pair, data pattern and number of DIUs or the Self-hosted IR's CPU/memory/node count. Refer to
Troubleshoot copy activity performance on when to tune parallel copy.

The following table lists the parallel copy behavior:

Between file stores: parallelCopies determines the parallelism at the file level. The chunking within each file happens underneath, automatically and transparently. It's designed to use the best suitable chunk size for a given data store type to load data in parallel. The actual number of parallel copies the copy activity uses at run time is no more than the number of files you have. If the copy behavior is mergeFile into a file sink, the copy activity can't take advantage of file-level parallelism.

From file store to non-file store: When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs). When copying data into Azure Table, the default parallel copy is 4.

From non-file store to file store: When copying data from a partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, and Teradata), the default parallel copy is 4. The actual number of parallel copies the copy activity uses at run time is no more than the number of data partitions you have. When using a Self-hosted Integration Runtime and copying to Azure Blob/ADLS Gen2, note that the max effective parallel copy is 4 or 5 per IR node. For other scenarios, parallel copy doesn't take effect; even if parallelism is specified, it's not applied.

Between non-file stores: When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs). When copying data from a partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, and Teradata), the default parallel copy is 4. When copying data into Azure Table, the default parallel copy is 4.

To control the load on machines that host your data stores, or to tune copy performance, you can override the
default value and specify a value for the parallelCopies property. The value must be an integer greater than or
equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the
value that you set.
When you specify a value for the parallelCopies property, take the load increase on your source and sink data
stores into account. Also consider the load increase to the self-hosted integration runtime if the copy activity is
empowered by it. This load increase happens especially when you have multiple activities or concurrent runs of
the same activities that run against the same data store. If you notice that either the data store or the self-hosted
integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
Example:

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 32
}
}
]

Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Azure Blob storage
or Azure Data Lake Storage Gen2 as an interim staging store. Staging is especially useful in the following cases:
You want to ingest data from various data stores into Azure Synapse Analytics via PolyBase,
copy data from/to Snowflake, or ingest data from Amazon Redshift/HDFS performantly. Learn
more details from:
Use PolyBase to load data into Azure Synapse Analytics.
Snowflake connector
Amazon Redshift connector
HDFS connector
You don't want to open ports other than port 80 and port 443 in your firewall because of
corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL
Database or an Azure Synapse Analytics, you need to activate outbound TCP communication on port 1433
for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage
of the self-hosted integration runtime to first copy data to a staging storage over HTTP or HTTPS on port 443,
then load the data from staging into SQL Database or Azure Synapse Analytics. In this flow, you don't need to
enable port 1433.
Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-
premises data store to a cloud data store) over a slow network connection. To improve
performance, you can use staged copy to compress the data on-premises so that it takes less time to move
data to the staging data store in the cloud. Then you can decompress the data in the staging store before you
load into the destination data store.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging storage
(bring your own Azure Blob or Azure Data Lake Storage Gen2). Next, the data is copied from the staging to the
sink data store. Azure Data Factory copy activity automatically manages the two-stage flow for you, and also
cleans up temporary data from the staging storage after the data movement is complete.

When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before you move data from the source data store to the staging store and then decompressed
before you move data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or
without staged copy. For such a scenario, you can configure two explicitly chained copy activities to copy
from source to staging and then from staging to sink, as sketched below.
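Here is a minimal, illustrative sketch of two chained copy activities; the activity and dataset names are placeholders (not taken from this article), and the source and sink types must match your actual stores. The second activity uses dependsOn so that it runs only after the first copy succeeds:

"activities":[
    {
        "name": "CopyFromSourceToStaging",
        "type": "Copy",
        "inputs": [ { "referenceName": "<dataset on the first self-hosted IR>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<staging Blob/ADLS Gen2 dataset>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BinarySource" },
            "sink": { "type": "BinarySink" }
        }
    },
    {
        "name": "CopyFromStagingToSink",
        "type": "Copy",
        "dependsOn": [
            { "activity": "CopyFromSourceToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "inputs": [ { "referenceName": "<staging Blob/ADLS Gen2 dataset>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<dataset on the second self-hosted IR>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BinarySource" },
            "sink": { "type": "BinarySink" }
        }
    }
]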
Configuration
Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in
storage before you load it into a destination data store. When you set enableStaging to TRUE , specify the
additional properties listed in the following table.

- enableStaging: Specify whether you want to copy data via an interim staging store. Default value: False. Required: No.
- linkedServiceName: Specify the name of an Azure Blob storage or Azure Data Lake Storage Gen2 linked service, which refers to the instance of Storage that you use as an interim staging store. Default value: N/A. Required: Yes, when enableStaging is set to TRUE.
- path: Specify the path that you want to contain the staged data. If you don't provide a path, the service creates a container to store temporary data. Default value: N/A. Required: No.
- enableCompression: Specifies whether data should be compressed before it's copied to the destination. This setting reduces the volume of data being transferred. Default value: False. Required: No.

NOTE
If you use staged copy with compression enabled, the service principal or MSI authentication for staging blob linked
service isn't supported.

Here's a sample definition of a copy activity with the properties that are described in the preceding table:

"activities":[
{
"name": "CopyActivityWithStaging",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "OracleSource",
},
"sink": {
"type": "SqlDWSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
},
"path": "stagingcontainer/path"
}
}
}
]

Staged copy billing impact


You're charged based on two steps: copy duration and copy type.
When you use staging during a cloud copy, which is copying data from a cloud data store to another cloud
data store, both stages empowered by Azure integration runtime, you're charged the [sum of copy duration
for step 1 and step 2] x [cloud copy unit price].
When you use staging during a hybrid copy, which is copying data from an on-premises data store to a cloud
data store, one stage empowered by a self-hosted integration runtime, you're charged for [hybrid copy
duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].
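As an illustrative worked example (the durations and unit prices here are hypothetical, not actual Azure pricing): suppose a staged hybrid copy spends 2 hours in step 1 (the self-hosted integration runtime copying from on-premises to the staging store) and 1 hour in step 2 (the Azure integration runtime loading from staging into the sink). You would be charged approximately 2 x [hybrid copy unit price] + 1 x [cloud copy unit price]. If both steps were cloud copies instead, the charge would be (2 + 1) x [cloud copy unit price].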
Next steps
See the other copy activity articles:
Copy activity overview
Copy activity performance and scalability guide
Troubleshoot copy activity performance
Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
Migrate data from Amazon S3 to Azure Storage
Preserve metadata and ACLs using copy activity in
Azure Data Factory
5/6/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When you use Azure Data Factory copy activity to copy data from source to sink, in the following scenarios, you
can also preserve the metadata and ACLs along with the data.

Preserve metadata for lake migration


When you migrate data from one data lake to another including Amazon S3, Azure Blob, Azure Data Lake
Storage Gen2, and Azure File Storage, you can choose to preserve the file metadata along with data.
Copy activity supports preserving the following attributes during data copy:
All the customer-specified metadata.
And the following five data store built-in system properties: contentType, contentLanguage (except for
Amazon S3), contentEncoding, contentDisposition, cacheControl.
Handle differences in metadata: Amazon S3 and Azure Storage allow different sets of characters in the keys
of customer-specified metadata. When you choose to preserve metadata using copy activity, ADF automatically
replaces the invalid characters with '_'.
When you copy files as-is from Amazon S3/Azure Data Lake Storage Gen2/Azure Blob/Azure File Storage to
Azure Data Lake Storage Gen2/Azure Blob/Azure File Storage with binary format, you can find the Preserve
option on the Copy Activity > Settings tab for activity authoring or the Settings page in Copy Data Tool.

Here's an example of copy activity JSON configuration (see preserve ):


"activities":[
{
"name": "CopyAndPreserveMetadata",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AmazonS3ReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
},
"preserve": [
"Attributes"
]
},
"inputs": [
{
"referenceName": "<Binary dataset Amazon S3/Azure Blob/ADLS Gen2 source>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Binary dataset for Azure Blob/ADLS Gen2 sink>",
"type": "DatasetReference"
}
]
}
]

Preserve ACLs from Data Lake Storage Gen1/Gen2 to Gen2


When you upgrade from Azure Data Lake Storage Gen1 to Gen2 or copy data between ADLS Gen2, you can
choose to preserve the POSIX access control lists (ACLs) along with data files. For more information on access
control, see Access control in Azure Data Lake Storage Gen1 and Access control in Azure Data Lake Storage
Gen2.
Copy activity supports preserving the following types of ACLs during data copy. You can select one or more
types:
ACL : Copy and preserve POSIX access control lists on files and directories. It copies the full existing ACLs
from source to sink.
Owner : Copy and preserve the owning user of files and directories. Super-user access to sink Data Lake
Storage Gen2 is required.
Group : Copy and preserve the owning group of files and directories. Super-user access to sink Data Lake
Storage Gen2 or the owning user (if the owning user is also a member of the target group) is required.
If you specify to copy from a folder, Data Factory replicates the ACLs for that given folder and the files and
directories under it, if recursive is set to true. If you specify to copy from a single file, the ACLs on that file are
copied.
NOTE
When you use ADF to preserve ACLs from Data Lake Storage Gen1/Gen2 to Gen2, the existing ACLs on sink Gen2's
corresponding folder/files will be overwritten.

IMPORTANT
When you choose to preserve ACLs, make sure you grant high enough permissions for Data Factory to operate against
your sink Data Lake Storage Gen2 account. For example, use account key authentication or assign the Storage Blob Data
Owner role to the service principal or managed identity.

When you configure source as Data Lake Storage Gen1/Gen2 with binary format or the binary copy option, and
sink as Data Lake Storage Gen2 with binary format or the binary copy option, you can find the Preserve option
on the Settings page in Copy Data Tool or on the Copy Activity > Settings tab for activity authoring.

Here's an example of copy activity JSON configuration (see preserve ):


"activities":[
{
"name": "CopyAndPreserveACLs",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
},
"preserve": [
"ACL",
"Owner",
"Group"
]
},
"inputs": [
{
"referenceName": "<Binary dataset name for Azure Data Lake Storage Gen1/Gen2 source>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Binary dataset name for Azure Data Lake Storage Gen2 sink>",
"type": "DatasetReference"
}
]
}
]

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Schema and data type mapping in copy activity
5/6/2021 • 13 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how the Azure Data Factory copy activity performs schema mapping and data type
mapping from source data to sink data.

Schema mapping
Default mapping
By default, copy activity maps source data to sink by column names in a case-sensitive manner. If the sink doesn't
exist, for example, when writing to file(s), the source field names are persisted as sink names. If the sink already
exists, it must contain all the columns being copied from the source. Such default mapping supports flexible
schemas and schema drift from source to sink from execution to execution - all the data returned by the source
data store can be copied to the sink.
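As an illustration of default mapping, the following minimal sketch omits the translator property entirely, so columns are matched by name; the activity name and dataset references are placeholders:

"activities":[
    {
        "name": "CopyWithDefaultMapping",
        "type": "Copy",
        "inputs": [ { "referenceName": "<source dataset>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "AzureSqlSource" },
            "sink": { "type": "AzureSqlSink" }
        }
    }
]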
If your source is a text file without a header line, explicit mapping is required because the source doesn't contain
column names.
Explicit mapping
You can also specify explicit mapping to customize the column/field mapping from source to sink based on your
needs. With explicit mapping, you can copy only partial source data to sink, map source data to sink with
different names, or reshape tabular/hierarchical data. Copy activity:
1. Reads the data from the source and determines the source schema.
2. Applies your defined mapping.
3. Writes the data to the sink.
Learn more about:
Tabular source to tabular sink
Hierarchical source to tabular sink
Tabular/Hierarchical source to hierarchical sink
You can configure the mapping on the Data Factory authoring UI -> copy activity -> mapping tab, or
programmatically specify the mapping in the copy activity -> translator property. The following properties are
supported in the translator -> mappings array -> objects -> source and sink, which point to the specific
column/field to map data.

- name: Name of the source or sink column/field. Apply for tabular source and sink. Required: Yes.
- ordinal: Column index. Start from 1. Apply and required when using delimited text without a header line. Required: No.
- path: JSON path expression for each field to extract or map. Apply for hierarchical source and sink, for example, Cosmos DB, MongoDB, or REST connectors. For fields under the root object, the JSON path starts with root $; for fields inside the array chosen by the collectionReference property, the JSON path starts from the array element without $. Required: No.
- type: Data Factory interim data type of the source or sink column. In general, you don't need to specify or change this property. Learn more about data type mapping. Required: No.
- culture: Culture of the source or sink column. Apply when type is Datetime or Datetimeoffset. The default is en-us. In general, you don't need to specify or change this property. Learn more about data type mapping. Required: No.
- format: Format string to be used when type is Datetime or Datetimeoffset. Refer to Custom Date and Time Format Strings on how to format datetime. In general, you don't need to specify or change this property. Learn more about data type mapping. Required: No.

The following properties are supported under translator in addition to mappings:

- collectionReference: Apply when copying data from a hierarchical source, for example, Cosmos DB, MongoDB, or REST connectors. If you want to iterate and extract data from the objects inside an array field with the same pattern and convert to per row per object, specify the JSON path of that array to do cross-apply. Required: No.

Tabular source to tabular sink


For example, to copy data from Salesforce to Azure SQL Database and explicitly map three columns:
1. On copy activity -> mapping tab, click the Import schemas button to import both source and sink schemas.
2. Map the needed fields and exclude/delete the rest.
The same mapping can be configured as the following in copy activity payload (see translator ):

{
"name": "CopyActivityTabularToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "SalesforceSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "name": "Id" },
"sink": { "name": "CustomerID" }
},
{
"source": { "name": "Name" },
"sink": { "name": "LastName" }
},
{
"source": { "name": "LastModifiedDate" },
"sink": { "name": "ModifiedDate" }
}
]
}
},
...
}

To copy data from delimited text file(s) without a header line, the columns are represented by ordinal instead of
names.
{
"name": "CopyActivityTabularToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "ordinal": "1" },
"sink": { "name": "CustomerID" }
},
{
"source": { "ordinal": "2" },
"sink": { "name": "LastName" }
},
{
"source": { "ordinal": "3" },
"sink": { "name": "ModifiedDate" }
}
]
}
},
...
}

Hierarchical source to tabular sink


When copying data from hierarchical source to tabular sink, copy activity supports the following capabilities:
Extract data from objects and arrays.
Cross apply multiple objects with the same pattern from an array, in which case one JSON object is converted
into multiple records in the tabular result.
For more advanced hierarchical-to-tabular transformation, you can use Data Flow.
For example, if you have a source MongoDB document with the following content:

{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}

And you want to copy it into a text file in the following format with a header line, by flattening the data inside the
array (order_pd and order_price) and cross joining with the common root info (number, date, and city):
orderNumber | orderDate | order_pd | order_price | city
01 | 20170122 | P1 | 23 | Seattle
01 | 20170122 | P2 | 13 | Seattle
01 | 20170122 | P3 | 231 | Seattle

You can define such mapping on Data Factory authoring UI:


1. On copy activity -> mapping tab, click the Import schemas button to import both source and sink schemas.
As Data Factory samples the top few objects when importing schema, if any field doesn't show up, you
can add it to the correct layer in the hierarchy - hover on an existing field name and choose to add a
node, an object, or an array.
2. Select the array from which you want to iterate and extract data. It will be auto-populated as Collection
reference. Note that only a single array is supported for such operation.
3. Map the needed fields to sink. Data Factory automatically determines the corresponding JSON paths for
the hierarchical side.

NOTE
For records where the array marked as collection reference is empty and the check box is selected, the entire record is
skipped.

You can also switch to Advanced editor , in which case you can directly see and edit the fields' JSON paths. If
you choose to add new mapping in this view, specify the JSON path.
The same mapping can be configured as the following in copy activity payload (see translator ):

{
"name": "CopyActivityHierarchicalToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "MongoDbV2Source" },
"sink": { "type": "DelimitedTextSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "path": "$['number']" },
"sink": { "name": "orderNumber" }
},
{
"source": { "path": "$['date']" },
"sink": { "name": "orderDate" }
},
{
"source": { "path": "['prod']" },
"sink": { "name": "order_pd" }
},
{
"source": { "path": "['price']" },
"sink": { "name": "order_price" }
},
{
"source": { "path": "$['city'][0]['name']" },
"sink": { "name": "city" }
}
],
"collectionReference": "$['orders']"
}
},
...
}

Tabular/Hierarchical source to hierarchical sink


The user experience flow is similar to Hierarchical source to tabular sink.
When copying data from a tabular source to a hierarchical sink, writing to an array inside an object isn't supported.
When copying data from a hierarchical source to a hierarchical sink, you can additionally preserve an entire layer's
hierarchy, by selecting the object/array and mapping it to the sink without touching the inner fields (see the sketch below).
For more advanced data reshape transformation, you can use Data Flow.
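The following is a minimal, illustrative sketch of such a mapping, based on the path property described earlier; the field names are hypothetical, and the second entry maps the city array as a whole so its inner hierarchy is carried over without listing the inner fields:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "path": "$['number']" },
            "sink": { "path": "$['orderNumber']" }
        },
        {
            "source": { "path": "$['city']" },
            "sink": { "path": "$['city']" }
        }
    ]
}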
Parameterize mapping
If you want to create a templatized pipeline to copy a large number of objects dynamically, determine whether
you can leverage the default mapping or you need to define explicit mapping for the respective objects.
If explicit mapping is needed, you can:
1. Define a parameter with object type at the pipeline level, for example, mapping .
2. Parameterize the mapping: on copy activity -> mapping tab, choose to add dynamic content and select
the above parameter. The activity payload would be as the following:

{
"name": "CopyActivityHierarchicalToTabular",
"type": "Copy",
"typeProperties": {
"source": {...},
"sink": {...},
"translator": {
"value": "@pipeline().parameters.mapping",
"type": "Expression"
},
...
}
}

3. Construct the value to pass into the mapping parameter. It should be the entire object of translator
definition, refer to the samples in explicit mapping section. For example, for tabular source to tabular sink
copy, the value should be
{"type":"TabularTranslator","mappings":[{"source":{"name":"Id"},"sink":{"name":"CustomerID"}},
{"source":{"name":"Name"},"sink":{"name":"LastName"}},{"source":{"name":"LastModifiedDate"},"sink":
{"name":"ModifiedDate"}}]}
.

Data type mapping


Copy activity performs source types to sink types mapping with the following flow:
1. Convert from source native data types to Azure Data Factory interim data types.
2. Automatically convert interim data type as needed to match corresponding sink types, applicable for both
default mapping and explicit mapping.
3. Convert from Azure Data Factory interim data types to sink native data types.
Copy activity currently supports the following interim data types: Boolean, Byte, Byte array, Datetime,
DatetimeOffset, Decimal, Double, GUID, Int16, Int32, Int64, SByte, Single, String, Timespan, UInt16, UInt32, and
UInt64.
The following data type conversions are supported between the interim types from source to sink.

The supported conversions between interim types, from source to sink, are:
- Boolean -> Boolean, Decimal, Float-point, Integer, String
- Byte array -> Byte array, String
- Date/Time -> Date/Time, String
- Decimal -> Boolean, Decimal, Float-point, Integer, String
- Float-point -> Boolean, Decimal, Float-point, Integer, String
- GUID -> GUID, String
- Integer -> Boolean, Decimal, Float-point, Integer, String
- String -> Boolean, Byte array, Date/Time, Decimal, Float-point, GUID, Integer, String, TimeSpan
- TimeSpan -> String, TimeSpan

Date/Time includes DateTime and DateTimeOffset. Float-point includes Single and Double. Integer includes
SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, and UInt64.

NOTE
Currently such data type conversion is supported when copying between tabular data. Hierarchical sources/sinks are
not supported, which means there is no system-defined data type conversion between source and sink interim types.
This feature works with the latest dataset model. If you don't see this option from the UI, try creating a new dataset.

The following properties are supported in copy activity for data type conversion (under the translator section for
programmatic authoring):

- typeConversion: Enable the new data type conversion experience. The default value is false due to backward compatibility. For new copy activities created via the Data Factory authoring UI since late June 2020, this data type conversion is enabled by default for the best experience, and you can see the type conversion settings on the copy activity -> mapping tab for applicable scenarios. To create a pipeline programmatically, you need to explicitly set the typeConversion property to true to enable it. For existing copy activities created before this feature was released, you won't see type conversion options on the Data Factory authoring UI, for backward compatibility. Required: No.
- typeConversionSettings: A group of type conversion settings. Apply when typeConversion is set to true. The following properties are all under this group. Required: No.

Under typeConversionSettings:

- allowDataTruncation: Allow data truncation when converting source data to a sink with a different type during copy, for example, from decimal to integer, from DatetimeOffset to Datetime. The default value is true. Required: No.
- treatBooleanAsNumber: Treat booleans as numbers, for example, true as 1. The default value is false. Required: No.
- dateTimeFormat: Format string when converting between dates without time zone offset and strings, for example, yyyy-MM-dd HH:mm:ss.fff. Refer to Custom Date and Time Format Strings for detailed information. Required: No.
- dateTimeOffsetFormat: Format string when converting between dates with time zone offset and strings, for example, yyyy-MM-dd HH:mm:ss.fff zzz. Refer to Custom Date and Time Format Strings for detailed information. Required: No.
- timeSpanFormat: Format string when converting between time periods and strings, for example, dd\.hh\:mm. Refer to Custom TimeSpan Format Strings for detailed information. Required: No.
- culture: Culture information to be used when converting types, for example, en-us or fr-fr. Required: No.

Example:

{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "ParquetSource"
},
"sink": {
"type": "SqlSink"
},
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": true,
"dateTimeFormat": "yyyy-MM-dd HH:mm:ss.fff",
"dateTimeOffsetFormat": "yyyy-MM-dd HH:mm:ss.fff zzz",
"timeSpanFormat": "dd\.hh\:mm",
"culture": "en-gb"
}
}
},
...
}

Legacy models
NOTE
The following models to map source columns/fields to sink are still supported as is for backward compatibility. We suggest
that you use the new model mentioned in schema mapping. Data Factory authoring UI has switched to generating the
new model.

Alternative column-mapping (legacy model)


You can specify copy activity -> translator -> columnMappings to map between tabular-shaped data. In this
case, the "structure" section is required for both input and output datasets. Column mapping supports mapping
all or subset of columns in the source dataset "structure" to all columns in the sink dataset
"structure" . The following are error conditions that result in an exception:
Source data store query result does not have a column name that is specified in the input dataset "structure"
section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output
dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
In the following example, the input dataset has a structure, and it points to a table in an on-premises Oracle
database.

{
"name": "OracleDataset",
"properties": {
"structure":
[
{ "name": "UserId"},
{ "name": "Name"},
{ "name": "Group"}
],
"type": "OracleTable",
"linkedServiceName": {
"referenceName": "OracleLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTable"
}
}
}

In this sample, the output dataset has a structure and it points to a table in Salesforce.

{
"name": "SalesforceDataset",
"properties": {
"structure":
[
{ "name": "MyUserId"},
{ "name": "MyName" },
{ "name": "MyGroup"}
],
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "SalesforceLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SinkTable"
}
}
}

The following JSON defines a copy activity in a pipeline. The columns from the source are mapped to columns in
the sink by using the translator -> columnMappings property.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "OracleDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SalesforceDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "OracleSource" },
"sink": { "type": "SalesforceSink" },
"translator":
{
"type": "TabularTranslator",
"columnMappings":
{
"UserId": "MyUserId",
"Group": "MyGroup",
"Name": "MyName"
}
}
}
}

If you are using the syntax of "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName" to specify
column mapping, it is still supported as-is, as shown in the sketch below.
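For reference, here is what that legacy string syntax looks like in the translator section; it is equivalent to the object form shown in the preceding example:

"translator":
{
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}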
Alternative schema-mapping (legacy model)
You can specify copy activity -> translator -> schemaMapping to map between hierarchical-shaped data and
tabular-shaped data, for example, copy from MongoDB/REST to text file and copy from Oracle to Azure Cosmos
DB's API for MongoDB. The following properties are supported in copy activity translator section:

- type: The type property of the copy activity translator must be set to: TabularTranslator. Required: Yes.
- schemaMapping: A collection of key-value pairs, which represents the mapping relation from source side to sink side.
  - Key: represents source. For tabular source, specify the column name as defined in the dataset structure; for hierarchical source, specify the JSON path expression for each field to extract and map.
  - Value: represents sink. For tabular sink, specify the column name as defined in the dataset structure; for hierarchical sink, specify the JSON path expression for each field to extract and map.
  In the case of hierarchical data, for fields under the root object, the JSON path starts with root $; for fields inside the array chosen by the collectionReference property, the JSON path starts from the array element. Required: Yes.
- collectionReference: If you want to iterate and extract data from the objects inside an array field with the same pattern and convert to per row per object, specify the JSON path of that array to do cross-apply. This property is supported only when hierarchical data is the source. Required: No.

Example: copy from MongoDB to Oracle:


For example, if you have a MongoDB document with the following content:

{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array
(order_pd and order_price) and cross joining with the common root info (number, date, and city):
orderNumber | orderDate | order_pd | order_price | city
01 | 20170122 | P1 | 23 | Seattle
01 | 20170122 | P2 | 13 | Seattle
01 | 20170122 | P3 | 231 | Seattle

Configure the schema-mapping rule as the following copy activity JSON sample:

{
"name": "CopyFromMongoDBToOracle",
"type": "Copy",
"typeProperties": {
"source": {
"type": "MongoDbV2Source"
},
"sink": {
"type": "OracleSink"
},
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"$.number": "orderNumber",
"$.date": "orderDate",
"prod": "order_pd",
"price": "order_price",
"$.city[0].name": "city"
},
"collectionReference": "$.orders"
}
}
}

Next steps
See the other Copy Activity articles:
Copy activity overview
Fault tolerance of copy activity in Azure Data
Factory
6/10/2021 • 12 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When you copy data from a source to a destination store, Azure Data Factory copy activity provides a certain level
of fault tolerance to prevent interruption from failures in the middle of data movement. For example, suppose you
are copying millions of rows from source to destination store, where a primary key has been created in the
destination database, but the source database does not have any primary keys defined. If you happen to copy
duplicated rows from the source to the destination, you will hit a PK violation failure on the destination database.
At this moment, copy activity offers you two ways to handle such errors:
You can abort the copy activity once any failure is encountered.
You can continue to copy the rest by enabling fault tolerance to skip the incompatible data. For example, skip
the duplicated row in this case. In addition, you can log the skipped data by enabling the session log within copy
activity. You can refer to session log in copy activity for more details.

Copying binary files


ADF supports the following fault tolerance scenarios when copying binary files. You can choose to abort the
copy activity or continue to copy the rest in the following scenarios:
1. The files to be copied by ADF are being deleted by other applications at the same time.
2. ADF is not allowed to access some particular folders or files because the ACLs of those files or folders require a
higher permission level than the connection information configured in ADF.
3. One or more files are not verified to be consistent between source and destination store if you enable the data
consistency verification setting in ADF.
Configuration
When you copy binary files between storage stores, you can enable fault tolerance as follows:
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureDataLakeStoreWriteSettings"
}
},
"skipErrorFile": {
"fileMissing": true,
"fileForbidden": true,
"dataInconsistency": true,
"invalidFileName": true
},
"validateDataConsistency": true,
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
}

- skipErrorFile: A group of properties to specify the types of failures you want to skip during the data movement. Required: No.
- fileMissing: One of the key-value pairs within the skipErrorFile property bag to determine if you want to skip files that are being deleted by other applications while ADF is copying them.
  - True: you want to copy the rest by skipping the files being deleted by other applications.
  - False: you want to abort the copy activity once any files are being deleted from the source store in the middle of data movement.
  Be aware this property is set to true as default. Allowed values: True (default), False. Required: No.
- fileForbidden: One of the key-value pairs within the skipErrorFile property bag to determine if you want to skip particular files when the ACLs of those files or folders require a higher permission level than the connection configured in ADF.
  - True: you want to copy the rest by skipping the files.
  - False: you want to abort the copy activity once you get the permission issue on folders or files.
  Allowed values: True, False (default). Required: No.
- dataInconsistency: One of the key-value pairs within the skipErrorFile property bag to determine if you want to skip the inconsistent data between source and destination store.
  - True: you want to copy the rest by skipping inconsistent data.
  - False: you want to abort the copy activity once inconsistent data is found.
  Be aware this property is only valid when you set validateDataConsistency as True. Allowed values: True, False (default). Required: No.
- invalidFileName: One of the key-value pairs within the skipErrorFile property bag to determine if you want to skip particular files when the file names are invalid for the destination store.
  - True: you want to copy the rest by skipping the files having invalid file names.
  - False: you want to abort the copy activity once any files have invalid file names.
  Be aware this property works only when copying binary files from any storage store to ADLS Gen2, or when copying binary files from AWS S3 to any storage store. Allowed values: True, False (default). Required: No.
- logSettings: A group of properties that can be specified when you want to log the skipped object names. Required: No.
- linkedServiceName: The linked service of Azure Blob Storage or Azure Data Lake Storage Gen2 to store the session log files. Allowed values: the name of an AzureBlobStorage or AzureBlobFS type linked service, which refers to the instance that you use to store the log file. Required: No.
- path: The path of the log files. Specify the path that you use to store the log files. If you do not provide a path, the service creates a container for you. Required: No.

NOTE
The following are the prerequisites for enabling fault tolerance in copy activity when copying binary files. For skipping
particular files when they are being deleted from the source store:
The source dataset and sink dataset have to be in binary format, and the compression type cannot be specified.
The supported data store types are Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage
Gen2, Azure File Storage, File System, FTP, SFTP, Amazon S3, Google Cloud Storage and HDFS.
Only when you specify multiple files in the source dataset, which can be a folder, wildcard, or a list of files, can copy
activity skip the particular error files. If a single file is specified in the source dataset to be copied to the destination,
copy activity will fail if any error occurs.
For skipping particular files when their access is forbidden from the source store:
The source dataset and sink dataset have to be in binary format, and the compression type cannot be specified.
The supported data store types are Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage
Gen2, Azure File Storage, SFTP, Amazon S3 and HDFS.
Only when you specify multiple files in the source dataset, which can be a folder, wildcard, or a list of files, can copy
activity skip the particular error files. If a single file is specified in the source dataset to be copied to the destination,
copy activity will fail if any error occurs.
For skipping particular files when they are verified to be inconsistent between source and destination store:
You can get more details from the data consistency doc here.

Monitoring
Output from copy activity
You can get the number of files being read, written, and skipped via the output of each copy activity run.
"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}

Session log from copy activity


If you configure to log the skipped file names, you can find the log file from this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-
name]/[copy-activity-run-id]/[auto-generated-GUID].csv
.
The log files are CSV files. The schema of the log file is as follows:

- Timestamp: The timestamp when ADF skips the file.
- Level: The log level of this item. It will be 'Warning' level for the item showing the file skipping.
- OperationName: ADF copy activity operational behavior on each file. It will be 'FileSkip' to specify the file to be skipped.
- OperationItem: The file names to be skipped.
- Message: More information to illustrate why the file is being skipped.

An example of a log file is as follows:

Timestamp,Level,OperationName,OperationItem,Message
2020-03-24 05:35:41.0209942,Warning,FileSkip,"bigfile.csv","File is skipped after read 322961408 bytes:
ErrorCode=UserErrorSourceBlobNotExist,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Mes
sage=The required Blob is missing. ContainerName:
https://transferserviceonebox.blob.core.windows.net/skipfaultyfile, path:
bigfile.csv.,Source=Microsoft.DataTransfer.ClientLibrary,'."
2020-03-24 05:38:41.2595989,Warning,FileSkip,"3_nopermission.txt","File is skipped after read 0 bytes:
ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message
=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Forbidden'. Account:
'adlsgen2perfsource'. FileSystem: 'skipfaultyfilesforbidden'. Path: '3_nopermission.txt'. ErrorCode:
'AuthorizationPermissionMismatch'. Message: 'This request is not authorized to perform this operation using
this permission.'. RequestId: '35089f5d-101f-008c-489e-
01cce4000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.DataTransfer.Common.Shared.Hybr
idDeliveryException,Message=Operation returned an invalid status code
'Forbidden',Source=,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message='Type=Microsoft.
Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code
'Forbidden',Source=Microsoft.DataTransfer.ClientLibrary,',Source=Microsoft.DataTransfer.ClientLibrary,'."

From the log above, you can see that bigfile.csv was skipped because another application deleted the file while
ADF was copying it, and 3_nopermission.txt was skipped because ADF is not allowed to access it due to a
permission issue.

Copying tabular data


Supported scenarios
Copy activity supports three scenarios for detecting, skipping, and logging incompatible tabular data:
Incompatibility between the source data type and the sink native type .
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains three INT type columns. The CSV file rows that contain numeric data, such as 123,456,789 are
copied successfully to the sink store. However, the rows that contain non-numeric values, such as
123,456, abc are detected as incompatible and are skipped.
Mismatch in the number of columns between the source and the sink .
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store.
The CSV file rows that contain more than six columns are detected as incompatible and are skipped.
Primary key violation when writing to SQL Server/Azure SQL Database/Azure Cosmos DB.
For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL
database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in
the source cannot be copied to the sink. Copy activity copies only the first row of the source data into the
sink. The subsequent source rows that contain the duplicated primary key value are detected as
incompatible and are skipped.

NOTE
To load data into Azure Synapse Analytics using PolyBase, configure PolyBase's native fault tolerance settings by
specifying reject policies via "polyBaseSettings" in the copy activity (see the sketch after this note). You can still
enable redirecting PolyBase incompatible rows to Blob or ADLS as normal as shown below.
This feature doesn't apply when copy activity is configured to invoke Amazon Redshift Unload.
This feature doesn't apply when copy activity is configured to invoke a stored procedure from a SQL sink.

Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in copy activity:
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "AzureSqlSink"
},
"enableSkipIncompatibleRow": true,
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
},

- enableSkipIncompatibleRow: Specifies whether to skip incompatible rows during copy or not. Allowed values: True, False (default). Required: No.
- logSettings: A group of properties that can be specified when you want to log the incompatible rows. Required: No.
- linkedServiceName: The linked service of Azure Blob Storage or Azure Data Lake Storage Gen2 to store the log that contains the skipped rows. Allowed values: the name of an AzureBlobStorage or AzureBlobFS type linked service, which refers to the instance that you use to store the log file. Required: No.
- path: The path of the log files that contain the skipped rows. Specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. Required: No.

Monitor skipped rows


After the copy activity run completes, you can see the number of skipped rows in the output of the copy activity:
"output": {
"dataRead": 95,
"dataWritten": 186,
"rowsCopied": 9,
"rowsSkipped": 2,
"copyDuration": 16,
"throughput": 0.01,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"errors": []
},

If you configure to log the incompatible rows, you can find the log file from this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-
name]/[copy-activity-run-id]/[auto-generated-GUID].csv
.
The log files are CSV files. The schema of the log file is as follows:

- Timestamp: The timestamp when ADF skips the incompatible rows.
- Level: The log level of this item. It will be 'Warning' level if this item shows the skipped rows.
- OperationName: ADF copy activity operational behavior on each row. It will be 'TabularRowSkip' to specify that the particular incompatible row has been skipped.
- OperationItem: The skipped rows from the source data store.
- Message: More information to illustrate why this particular row is incompatible.

An example of the log file content is as follows:

Timestamp, Level, OperationName, OperationItem, Message


2020-02-26 06:22:32.2586581, Warning, TabularRowSkip, """data1"", ""data2"", ""data3""," "Column 'Prop_2'
contains an invalid value 'data3'. Cannot convert 'data3' to type 'DateTime'."
2020-02-26 06:22:33.2586351, Warning, TabularRowSkip, """data4"", ""data5"", ""data6"",", "Violation of
PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert duplicate key in object
'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4)."

From the sample log file above, you can see one row "data1, data2, data3" has been skipped due to a type
conversion issue from source to destination store. Another row "data4, data5, data6" has been skipped due to a
PK violation issue from source to destination store.

Copying tabular data (legacy):


The following approach is the legacy way to enable fault tolerance for copying tabular data only. If you are
creating a new pipeline or activity, you are encouraged to start from the non-legacy approach described above instead.
Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in copy activity:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "<Azure Storage or Data Lake Store linked service>",
"type": "LinkedServiceReference"
},
"path": "redirectcontainer/erroroutput"
}
}

- enableSkipIncompatibleRow: Specifies whether to skip incompatible rows during copy or not. Allowed values: True, False (default). Required: No.
- redirectIncompatibleRowSettings: A group of properties that can be specified when you want to log the incompatible rows. Required: No.
- linkedServiceName: The linked service of Azure Storage or Azure Data Lake Store to store the log that contains the skipped rows. Allowed values: the name of an AzureStorage or AzureDataLakeStore type linked service, which refers to the instance that you want to use to store the log file. Required: No.
- path: The path of the log file that contains the skipped rows. Specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. Required: No.

Monitor skipped rows


After the copy activity run completes, you can see the number of skipped rows in the output of the copy activity:

"output": {
"dataRead": 95,
"dataWritten": 186,
"rowsCopied": 9,
"rowsSkipped": 2,
"copyDuration": 16,
"throughput": 0.01,
"redirectRowPath": "https://myblobstorage.blob.core.windows.net//myfolder/a84bf8d4-233f-4216-
8cb5-45962831cd1b/",
"errors": []
},

If you configure to log the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-
generated-GUID].csv
.
The log files can only be CSV files. The original data being skipped will be logged with a comma as the column
delimiter if needed. Two more columns, "ErrorCode" and "ErrorMessage", are added in addition to the original
source data in the log file, where you can see the root cause of the incompatibility. The ErrorCode and ErrorMessage
will be quoted by double quotes.
An example of the log file content is as follows:

data1, data2, data3, "UserErrorInvalidDataValue", "Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'."
data4, data5, data6, "2627", "Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot
insert duplicate key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4)."

Next steps
See the other copy activity articles:
Copy activity overview
Copy activity performance
Data consistency verification in copy activity
3/5/2021 • 5 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When you move data from source to destination store, Azure Data Factory copy activity provides an option for
you to do additional data consistency verification to ensure the data is not only successfully copied from source
to destination store, but also verified to be consistent between source and destination store. Once inconsistent
files have been found during the data movement, you can either abort the copy activity or continue to copy the
rest by enabling fault tolerance setting to skip inconsistent files. You can get the skipped file names by enabling
session log setting in copy activity. You can refer to session log in copy activity for more details.

Supported data stores and scenarios


Data consistency verification is supported by all the connectors except FTP, SFTP, and HTTP.
Data consistency verification is not supported in staging copy scenario.
When copying binary files, data consistency verification is only available when 'PreserveHierarchy' behavior
is set in copy activity.
When copying multiple binary files in single copy activity with data consistency verification enabled, you
have an option to either abort the copy activity or continue to copy the rest by enabling fault tolerance
setting to skip inconsistent files.
When copying a table in single copy activity with data consistency verification enabled, copy activity fails if
the number of rows read from the source is different from the number of rows copied to the destination plus
the number of incompatible rows that were skipped.

Configuration
The following example provides a JSON definition to enable data consistency verification in Copy Activity:
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureDataLakeStoreWriteSettings"
}
},
"validateDataConsistency": true,
"skipErrorFile": {
"dataInconsistency": true
},
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
}

- validateDataConsistency: If you set true for this property, when copying binary files, copy activity will check file size, lastModifiedDate, and MD5 checksum for each binary file copied from source to destination store to ensure the data consistency between source and destination store. When copying tabular data, copy activity will check the total row count after the job completes to ensure the total number of rows read from the source is the same as the number of rows copied to the destination plus the number of incompatible rows that were skipped. Be aware the copy performance will be affected by enabling this option. Allowed values: True, False (default). Required: No.
- dataInconsistency: One of the key-value pairs within the skipErrorFile property bag to determine if you want to skip the inconsistent files.
  - True: you want to copy the rest by skipping inconsistent files.
  - False: you want to abort the copy activity once an inconsistent file is found.
  Be aware this property is only valid when you are copying binary files and set validateDataConsistency as True. Allowed values: True, False (default). Required: No.
- logSettings: A group of properties that can be specified to enable the session log to log skipped files. Required: No.
- linkedServiceName: The linked service of Azure Blob Storage or Azure Data Lake Storage Gen2 to store the session log files. Allowed values: the name of an AzureBlobStorage or AzureBlobFS type linked service, which refers to the instance that you use to store the log files. Required: No.
- path: The path of the log files. Specify the path that you want to store the log files. If you do not provide a path, the service creates a container for you. Required: No.

NOTE
When copying binary files from, or to Azure Blob or Azure Data Lake Storage Gen2, ADF does block level MD5
checksum verification leveraging the Azure Blob API and Azure Data Lake Storage Gen2 API. If ContentMD5 exists on files
on Azure Blob or Azure Data Lake Storage Gen2 as data sources, ADF does file level MD5 checksum verification after
reading the files as well. After copying files to Azure Blob or Azure Data Lake Storage Gen2 as data destination, ADF
writes ContentMD5 to Azure Blob or Azure Data Lake Storage Gen2 which can be further consumed by downstream
applications for data consistency verification.
ADF does file size verification when copying binary files between any storage stores.

Monitoring
Output from copy activity
After the copy activity runs completely, you can see the result of data consistency verification from the output of
each copy activity run:
"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}

You can see the details of data consistency verification from the dataConsistencyVerification property.
Value of VerificationResult :
Verified : Your copied data has been verified to be consistent between source and destination store.
NotVerified : Your copied data has not been verified to be consistent because you have not enabled
validateDataConsistency in copy activity.
Unsupported : Your copied data has not been verified to be consistent because data consistency verification
is not supported for this particular copy pair.
Value of InconsistentData :
Found : ADF copy activity has found inconsistent data.
Skipped : ADF copy activity has found and skipped inconsistent data.
None : ADF copy activity has not found any inconsistent data. It can be either because your data has been
verified to be consistent between source and destination store or because you disabled
validateDataConsistency in copy activity.
Session log from copy activity
If you configure to log the inconsistent file, you can find the log file from this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-
name]/[copy-activity-run-id]/[auto-generated-GUID].csv
. The log files are CSV files.
The schema of a log file is as follows:

- Timestamp: The timestamp when ADF skips the inconsistent files.
- Level: The log level of this item. It will be 'Warning' level for the item showing the file skipping.
- OperationName: ADF copy activity operational behavior on each file. It will be 'FileSkip' to specify the file to be skipped.
- OperationItem: The file name to be skipped.
- Message: More information to illustrate why files are being skipped.

An example of a log file is as follows:


Timestamp, Level, OperationName, OperationItem, Message
2020-02-26 06:22:56.3190846, Warning, FileSkip, "sample1.csv", "File is skipped after read 548000000 bytes:
ErrorCode=DataConsistencySourceDataChanged,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryExceptio
n,Message=Source file 'sample1.csv' is changed by other clients during the copy activity run.,Source=,'."

From the log file above, you can see that sample1.csv has been skipped because it failed to be verified as
consistent between source and destination store. From the message, you can see that sample1.csv became
inconsistent because it was being changed by other applications while the ADF copy activity was copying it at
the same time.

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity fault tolerance
Session log in copy activity
5/17/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can log your copied file names in copy activity, which can help you to further ensure the data is not only
successfully copied from source to destination store, but also consistent between source and destination store,
by reviewing the copied files in the copy activity session logs.
When you enable the fault tolerance setting in copy activity to skip faulty data, the skipped files and skipped rows
can also be logged. You can get more details from fault tolerance in copy activity.
Because enabling the session log lets you capture all the file names copied by the ADF copy activity, it is helpful
in the following scenarios:
After you use ADF copy activities to copy the files from one storage to another, you see some files in the
destination store that should not be there. You can scan the copy activity session logs to see which copy
activity actually copied those files and when they were copied. That way, you can easily find the root cause
and fix your configurations in ADF.
After you use ADF copy activities to copy the files from one storage to another, you feel the files copied to the
destination are not the same as the ones from the source store. You can scan the copy activity session logs to
get the timestamp of the copy jobs as well as the metadata of the files when ADF copy activities read them from
the source store. That way, you can know whether those files were updated by other applications in the source
store after being copied by ADF.

Configuration
The following example provides a JSON definition to enable session log in Copy Activity:
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
},
"formatSettings": {
"type": "BinaryReadSettings"
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
},
"skipErrorFile": {
"fileForbidden": true,
"dataInconsistency": true
},
"validateDataConsistency": true,
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
}

- enableCopyActivityLog: When set to true, you will have the opportunity to log copied files, skipped files, or skipped rows. Allowed values: True, False (default). Required: No.
- logLevel: "Info" will log all the copied files, skipped files, and skipped rows. "Warning" will log skipped files and skipped rows only. Allowed values: Info, Warning (default). Required: No.
- enableReliableLogging: When it is true, copy activity in reliable mode will flush logs immediately once each file is copied to the destination. When you are copying huge amounts of files with reliable logging mode enabled in copy activity, you should expect the copy throughput to be impacted, since double write operations are required for each file copied: one request to the destination store and another request to the log storage store. Copy activity in best-effort mode will flush logs with a batch of records within a period of time, where the copy throughput will be much less impacted. The completeness and timeliness of logging is not guaranteed in this mode, since there is a possibility that the last batch of log events has not been flushed to the log file when the copy activity fails. In that case, you will see that a few files copied to the destination are not logged. Allowed values: True, False (default). Required: No.
- logLocationSettings: A group of properties that can be used to specify the location to store the session logs. Required: No.
- linkedServiceName: The linked service of Azure Blob Storage or Azure Data Lake Storage Gen2 to store the session log files. Allowed values: the name of an AzureBlobStorage or AzureBlobFS type linked service, which refers to the instance that you use to store the log files. Required: No.
- path: The path of the log files. Specify the path that you want to store the log files. If you do not provide a path, the service creates a container for you. Required: No.

Monitoring
Output from copy activity
After the copy activity run completes, you can see the path of the log files in the output of each copy activity
run. You can find the log files at the path:
https://[your-blob-account].blob.core.windows.net/[logFilePath]/copyactivity-logs/[copy-activity-name]/[copy-activity-run-id]/[auto-generated-GUID].txt
The generated log files have the .txt extension, and their data is in CSV format.

"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}

NOTE
When the enableCopyActivityLog property is set to Enabled , the log file names are system generated.
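
If you want to capture the log location for use later in the pipeline, one option is to read the logFilePath value
from the copy activity output in a subsequent activity. The following is a minimal sketch (not from the original
article) that stores it in a pipeline variable; the activity name "MyCopyActivity" and the string variable "logPath"
are hypothetical placeholders:

{
    "name": "CaptureSessionLogPath",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "MyCopyActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "logPath",
        "value": "@activity('MyCopyActivity').output.logFilePath"
    }
}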

The schema of the log file


The following is the schema of a log file.

COLUMN | DESCRIPTION
Timestamp | The timestamp when ADF reads, writes, or skips the object.
Level | The log level of this item. It can be 'Warning' or 'Info'.
OperationName | The ADF copy activity's operational behavior on each object. It can be 'FileRead', 'FileWrite', 'FileSkip', or 'TabularRowSkip'.
OperationItem | The file names or the skipped rows.
Message | More information showing whether the file was read from the source store or written to the destination store. It can also show why the file or rows were skipped.

The following is an example of a log file.


Timestamp, Level, OperationName, OperationItem, Message
2020-10-19 08:39:13.6688152,Info,FileRead,"sample1.csv","Start to read file:
{""Path"":""sample1.csv"",""ItemType"":""File"",""Size"":104857620,""LastModified"":""2020-10-
19T08:22:31Z"",""ETag"":""\""0x8D874081F80C01A\"""",""ContentMD5"":""dGKVP8BVIy6AoTtKnt+aYQ=="",""ObjectName
"":null}"
2020-10-19 08:39:56.3190846, Warning, FileSkip, "sample1.csv", "File is skipped after read 548000000 bytes:
ErrorCode=DataConsistencySourceDataChanged,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryExceptio
n,Message=Source file 'sample1.csv' is changed by other clients during the copy activity run.,Source=,'."
2020-10-19 08:40:13.6688152,Info,FileRead,"sample2.csv","Start to read file:
{""Path"":""sample2.csv"",""ItemType"":""File"",""Size"":104857620,""LastModified"":""2020-10-
19T08:22:31Z"",""ETag"":""\""0x8D874081F80C01A\"""",""ContentMD5"":""dGKVP8BVIy6AoTtKnt+aYQ=="",""ObjectName
"":null}"
2020-10-19 08:40:13.9003981,Info,FileWrite,"sample2.csv","Start to write file from source file:
sample2.csv."
2020-10-19 08:45:17.6508407,Info,FileRead,"sample2.csv","Complete reading file successfully. "
2020-10-19 08:45:28.7390083,Info,FileWrite,"sample2.csv","Complete writing file from source file:
sample2.csv. File is successfully copied."

From the log file above, you can see that sample1.csv was skipped because it could not be verified to be
consistent between the source and destination stores; it became inconsistent because it was being changed by
another application while the ADF copy activity was copying it. You can also see that sample2.csv was
successfully copied from the source to the destination store.
You can use multiple analysis engines to further analyze the log files. The examples below use SQL queries to
analyze the log file after importing the CSV log file into a SQL database, where the table name can be
SessionLogDemo.
Give me the copied file list.

select OperationItem from SessionLogDemo where Message like '%File is successfully copied%'

Give me the file list copied within a particular time range.

select OperationItem from SessionLogDemo where TIMESTAMP >= '<start time>' and TIMESTAMP <= '<end time>' and
Message like '%File is successfully copied%'

Give me a particular file with its copied time and metadata.

select * from SessionLogDemo where OperationItem='<file name>'

Give me a list of files with their metadata copied within a time range.

select * from SessionLogDemo where OperationName='FileRead' and Message like 'Start to read%' and
OperationItem in (select OperationItem from SessionLogDemo where TIMESTAMP >= '<start time>' and TIMESTAMP
<= '<end time>' and Message like '%File is successfully copied%')

Give me the skipped file list.

select OperationItem from SessionLogDemo where OperationName='FileSkip'

Give me the reason why a particular file was skipped.

select TIMESTAMP, OperationItem, Message from SessionLogDemo where OperationName='FileSkip'


Give me the list of files skipped due to the same reason: "blob file does not exist".

select TIMESTAMP, OperationItem, Message from SessionLogDemo where OperationName='FileSkip' and Message like
'%UserErrorSourceBlobNotExist%'

Give me the file name that took the longest time to copy.

select top 1 OperationItem, CopyDuration=DATEDIFF(SECOND, min(TIMESTAMP), max(TIMESTAMP)) from
SessionLogDemo group by OperationItem order by CopyDuration desc

Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity fault tolerance
Copy activity data consistency
Supported file formats and compression codecs in
Azure Data Factory (legacy)

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article applies to the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure
Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

IMPORTANT
Data Factory introduced a new format-based dataset model; see the corresponding format article for details:
- Avro format
- Binary format
- Delimited text format
- JSON format
- ORC format
- Parquet format
The remaining configurations mentioned in this article are still supported as-is for backward compatibility. You are encouraged to
use the new model going forward.

Text format (legacy)


NOTE
Learn about the new model in the Delimited text format article. The following configurations on a file-based data store dataset are
still supported as-is for backward compatibility. You are encouraged to use the new model going forward.

If you want to read from a text file or write to a text file, set the type property in the format section of the
dataset to TextFormat . You can also specify the following optional properties in the format section. See
TextFormat example section on how to configure.

columnDelimiter
  Description: The character used to separate columns in a file. Consider using a rare unprintable character that may not exist in your data; for example, specify "\u0001", which represents Start of Heading (SOH).
  Allowed values: Only one character is allowed. The default value is comma (','). To use a Unicode character, refer to Unicode Characters to get the corresponding code for it.
  Required: No

rowDelimiter
  Description: The character used to separate rows in a file.
  Allowed values: Only one character is allowed. The default value is any of the following values on read: ["\r\n", "\r", "\n"], and "\r\n" on write.
  Required: No

escapeChar
  Description: The special character used to escape a column delimiter in the content of the input file. You cannot specify both escapeChar and quoteChar for a table.
  Allowed values: Only one character is allowed. No default value. Example: if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define '$' as the escape character and use the string "Hello$, world" in the source.
  Required: No

quoteChar
  Description: The character used to quote a string value. The column and row delimiters inside the quote characters are treated as part of the string value. This property is applicable to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table.
  Allowed values: Only one character is allowed. No default value. For example, if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: <Hello, world>), you can define " (double quote) as the quote character and use the string "Hello, world" in the source.
  Required: No

nullValue
  Description: One or more characters used to represent a null value.
  Allowed values: One or more characters. The default values are "\N" and "NULL" on read, and "\N" on write.
  Required: No

encodingName
  Description: Specify the encoding name.
  Allowed values: A valid encoding name; see the Encoding.EncodingName property. Examples: windows-1250 or shift_jis. The default value is UTF-8.
  Required: No

firstRowAsHeader
  Description: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
  Allowed values: True, False (default)
  Required: No

skipLineCount
  Description: Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
  Allowed values: Integer
  Required: No

treatEmptyAsNull
  Description: Specifies whether to treat a null or empty string as a null value when reading data from an input file.
  Allowed values: True (default), False
  Required: No

TextFormat example
In the following JSON definition for a dataset, some of the optional properties are specified.

"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},

To use an escapeChar instead of quoteChar, replace the quoteChar line with the following escapeChar:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount


You are copying from a non-file source to a text file and would like to add a header line containing the
schema metadata (for example: SQL schema). Specify firstRowAsHeader as true in the output dataset for this
scenario.
You are copying from a text file containing a header line to a non-file sink and would like to drop that line.
Specify firstRowAsHeader as true in the input dataset.
You are copying from a text file and want to skip a few lines at the beginning that contain no data or header
information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file
contains a header line, you can also specify firstRowAsHeader . If both skipLineCount and firstRowAsHeader
are specified, the lines are skipped first and then the header information is read from the input file
JSON format (legacy)
NOTE
Learn about the new model in the JSON format article. The following configurations on a file-based data store dataset are still
supported as-is for backward compatibility. You are encouraged to use the new model going forward.

To import/export a JSON file as-is into/from Azure Cosmos DB, see the Import/export JSON documents
section in the Move data to/from Azure Cosmos DB article.
If you want to parse the JSON files or write the data in JSON format, set the type property in the format
section to JsonFormat . You can also specify the following optional properties in the format section. See
JsonFormat example section on how to configure.

filePattern
  Description: Indicates the pattern of data stored in each JSON file. Allowed values are setOfObjects and arrayOfObjects. The default value is setOfObjects. See the JSON file patterns section for details about these patterns.
  Required: No

jsonNodeReference
  Description: If you want to iterate and extract data from the objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files.
  Required: No

jsonPathDefinition
  Description: Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from an object or array. For fields under the root object, start with root $; for fields inside the array chosen by the jsonNodeReference property, start from the array element. See the JsonFormat example section for how to configure it.
  Required: No

encodingName
  Description: Specify the encoding name. For the list of valid encoding names, see the Encoding.EncodingName property. For example: windows-1250 or shift_jis. The default value is UTF-8.
  Required: No

nestingSeparator
  Description: Character that is used to separate nesting levels. The default value is '.' (dot).
  Required: No
NOTE
To cross-apply data in an array into multiple rows (case 1 -> sample 2 in the JsonFormat examples), you can only
choose to expand a single array by using the jsonNodeReference property.

JSON file patterns


Copy activity can parse the following patterns of JSON files:
Type I: setOfObjects
Each file contains a single object, or multiple line-delimited or concatenated objects. When this option is
chosen in an output dataset, the copy activity produces a single JSON file with one object per line (line-
delimited).
single object JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}

line-delimited JSON example

{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":
"567834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":
"789037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":
"345626404","switch1":"Germany","switch2":"UK"}

concatenated JSON example


{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}

Type II: arrayOfObjects


Each file contains an array of objects.

[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]

JsonFormat example
Case 1: Copying data from JSON files
Sample 1: extract data from object and array
In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a
JSON file with the following content:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}

and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects
and array:

ID | DEVICETYPE | TARGETRESOURCETYPE | RESOURCEMANAGEMENTPROCESSRUNID | OCCURRENCETIME
ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM

The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
The structure section defines the customized column names and the corresponding data types while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To
copy data from an array, you can use array[x].property to extract the value of the given property from the xth
object, or you can use array[*].property to find the value from any object containing such a property.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagementProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type",
"targetResourceType": "$.context.custom.dimensions[0].TargetResourceType", "resourceManagementProcessRunId":
"$.context.custom.dimensions[1].ResourceManagementProcessRunId", "occurrenceTime": "
$.context.custom.dimensions[2].OccurrenceTime"}
}
}
}

Sample 2: cross apply multiple objects with the same pattern from array
In this sample, you expect to transform one root JSON object into multiple records in tabular result. If you have
a JSON file with the following content:

{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array
and cross joining it with the common root info:
ORDERNUMBER | ORDERDATE | ORDER_PD | ORDER_PRICE | CITY
01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}]
01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}]
01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}]

The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array
orderlines .
jsonPathDefinition specifies the JSON path for each column indicating where to extract the data from. In
this example, ordernumber , orderdate , and city are under root object with JSON path starting with $. ,
while order_pd and order_price are defined with path derived from the array element without $. .

"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}

Note the following points:


If the structure and jsonPathDefinition are not defined in the Data Factory dataset, the Copy Activity
detects the schema from the first object and flattens the whole object.
If the JSON input has an array, by default the Copy Activity converts the entire array value into a string. You
can choose to extract data from it using jsonNodeReference and/or jsonPathDefinition , or skip it by not
specifying it in jsonPathDefinition .
If there are duplicate names at the same level, the Copy Activity picks the last one.
Property names are case-sensitive. Two properties with same name but different casings are treated as two
separate properties.
Case 2: Writing data to JSON file
If you have the following table in SQL Database:

ID | ORDER_DATE | ORDER_PRICE | ORDER_BY
1 | 20170119 | 2000 | David
2 | 20170120 | 3500 | Patrick
3 | 20170121 | 4000 | Jason

and for each record, you expect to write to a JSON object in the following format:

{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}

The output dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting layer from the name. This section is optional
unless you want to change the property name compared with the source column name, or nest some of the
properties.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}

Parquet format (legacy)


NOTE
Learn about the new model in the Parquet format article. The following configurations on a file-based data store dataset are still
supported as-is for backward compatibility. You are encouraged to use the new model going forward.

If you want to parse the Parquet files or write the data in Parquet format, set the format type property to
ParquetFormat . You do not need to specify any properties in the Format section within the typeProperties
section. Example:

"format":
{
"type": "ParquetFormat"
}

Note the following points:


Complex data types (MAP, LIST) are not supported.
White space in column names is not supported.
A Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory
supports reading data from a Parquet file in any of these compressed formats except LZO - it uses the
compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory
chooses SNAPPY, which is the default for the Parquet format. Currently, there is no option to override this
behavior.
IMPORTANT
For copies empowered by the Self-hosted Integration Runtime, e.g. between on-premises and cloud data stores, if you are not
copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your
IR machine. See the following paragraph for more details.

For copies running on a Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE; if it is not found, it then checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires the 64-bit JRE. You can find it from here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.

TIP
If you copy data to/from the Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM to empower such
copies, and then rerun the pipeline.

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means the JVM is started with Xms amount of memory and can use a maximum of Xmx amount of memory.
By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Data type mapping for Parquet files
DATA FACTORY INTERIM DATA TYPE | PARQUET PRIMITIVE TYPE | PARQUET ORIGINAL TYPE (DESERIALIZE) | PARQUET ORIGINAL TYPE (SERIALIZE)
Boolean | Boolean | N/A | N/A
SByte | Int32 | Int8 | Int8
Byte | Int32 | UInt8 | Int16
Int16 | Int32 | Int16 | Int16
UInt16 | Int32 | UInt16 | Int32
Int32 | Int32 | Int32 | Int32
UInt32 | Int64 | UInt32 | Int64
Int64 | Int64 | Int64 | Int64
UInt64 | Int64/Binary | UInt64 | Decimal
Single | Float | N/A | N/A
Double | Double | N/A | N/A
Decimal | Binary | Decimal | Decimal
String | Binary | Utf8 | Utf8
DateTime | Int96 | N/A | N/A
TimeSpan | Int96 | N/A | N/A
DateTimeOffset | Int96 | N/A | N/A
ByteArray | Binary | N/A | N/A
Guid | Binary | Utf8 | Utf8
Char | Binary | Utf8 | Utf8
CharArray | Not supported | N/A | N/A

ORC format (legacy)


NOTE
Learn about the new model in the ORC format article. The following configurations on a file-based data store dataset are still
supported as-is for backward compatibility. You are encouraged to use the new model going forward.

If you want to parse the ORC files or write the data in ORC format, set the format type property to
OrcFormat . You do not need to specify any properties in the Format section within the typeProperties section.
Example:

"format":
{
"type": "OrcFormat"
}

Note the following points:


Complex data types (STRUCT, MAP, LIST, UNION) are not supported.
White space in column names is not supported.
An ORC file has three compression-related options: NONE, ZLIB, SNAPPY. Data Factory supports reading data
from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read
the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC.
Currently, there is no option to override this behavior.

IMPORTANT
For copies empowered by the Self-hosted Integration Runtime, e.g. between on-premises and cloud data stores, if you are not
copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph for more details.

For copies running on a Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if it is
not found, it then checks the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires the 64-bit JRE. You can find it from here.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
Data type mapping for ORC files
DATA FACTORY INTERIM DATA TYPE | ORC TYPES
Boolean | Boolean
SByte | Byte
Byte | Short
Int16 | Short
UInt16 | Int
Int32 | Int
UInt32 | Long
Int64 | Long
UInt64 | String
Single | Float
Double | Double
Decimal | Decimal
String | String
DateTime | Timestamp
DateTimeOffset | Timestamp
TimeSpan | Timestamp
ByteArray | Binary
Guid | String
Char | Char(1)

AVRO format (legacy)


NOTE
Learn about the new model in the Avro format article. The following configurations on a file-based data store dataset are still
supported as-is for backward compatibility. You are encouraged to use the new model going forward.

If you want to parse the Avro files or write the data in Avro format, set the format type property to
AvroFormat . You do not need to specify any properties in the Format section within the typeProperties section.
Example:

"format":
{
"type": "AvroFormat",
}

To use Avro format in a Hive table, you can refer to Apache Hive's tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).

Compression support (legacy)


Azure Data Factory supports compressing and decompressing data during copy. When you specify the compression property
in an input dataset, the copy activity reads the compressed data from the source and decompresses it; when
you specify the property in an output dataset, the copy activity compresses the data and then writes it to the sink. Here are a
few sample scenarios:
Read GZIP compressed data from an Azure blob, decompress it, and write result data to Azure SQL Database.
You define the input Azure Blob dataset with the compression type property as GZIP.
Read data from a plain-text file from on-premises File System, compress it using GZip format, and write the
compressed data to an Azure blob. You define an output Azure Blob dataset with the compression type
property as GZip.
Read .zip file from FTP server, decompress it to get the files inside, and land those files in Azure Data Lake
Store. You define an input FTP dataset with the compression type property as ZipDeflate.
Read a GZIP-compressed data from an Azure blob, decompress it, compress it using BZIP2, and write result
data to an Azure blob. You define the input Azure Blob dataset with compression type set to GZIP and the
output dataset with compression type set to BZIP2.

To specify compression for a dataset, use the compression property in the dataset JSON as in the following
example:
{
"name": "AzureBlobDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"fileName": "pagecounts.csv.gz",
"folderPath": "compression/file/",
"format": {
"type": "TextFormat"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}

The compression section has two properties:


Type: the compression codec, which can be GZIP , Deflate , BZIP2 , or ZipDeflate . Note when using copy
activity to decompress ZipDeflate file(s) and write to file-based sink data store, files will be extracted to
the folder: <path specified in dataset>/<folder named as source zip file>/ .
Level: the compression ratio, which can be Optimal or Fastest .
Fastest: The compression operation should complete as quickly as possible, even if the resulting
file is not optimally compressed.
Optimal : The compression operation should be optimally compressed, even if the operation takes
a longer time to complete.
For more information, see Compression Level topic.

NOTE
Compression settings are not supported for data in the AvroFormat , OrcFormat , or ParquetFormat . When reading
files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in
these formats, Data Factory chooses the default compression codec for that format. For example, ZLIB for OrcFormat and
SNAPPY for ParquetFormat.

Unsupported file types and compression formats


You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options
include Azure Functions and custom tasks by using Azure Batch.
You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see
Azure Functions activity.
You can also build this functionality using a custom dotnet activity. Further information is available here

Next steps
Learn the latest supported file formats and compressions from Supported file formats and compressions.
Transform data in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Overview
This article explains data transformation activities in Azure Data Factory that you can use to transform and
process your raw data into predictions and insights at scale. A transformation activity executes in a computing
environment such as Azure Databricks or Azure HDInsight. It provides links to articles with detailed information
on each transformation activity.
Data Factory supports the following data transformation activities that can be added to pipelines either
individually or chained with another activity.

Transform natively in Azure Data Factory with data flows


Mapping data flows
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data
engineers to develop graphical data transformation logic without writing code. The resulting data flows are
executed as activities within Azure Data Factory pipelines that use scaled-out Spark clusters. Data flow activities
can be operationalized via existing Data Factory scheduling, control, flow, and monitoring capabilities. For more
information, see mapping data flows.
Data wrangling
Power Query in Azure Data Factory enables cloud-scale data wrangling, which allows you to do code-free data
preparation at cloud scale iteratively. Data wrangling integrates with Power Query Online and makes Power
Query M functions available for data wrangling at cloud scale via spark execution. For more information, see
data wrangling in ADF.

External transformations
Optionally, you can hand-code transformations and manage the external compute environment yourself.
HDInsight Hive activity
The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Hive activity article for details about this activity.
HDInsight Pig activity
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Pig activity article for details about this activity.
HDInsight MapReduce activity
The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-
demand Windows/Linux-based HDInsight cluster. See MapReduce activity article for details about this activity.
HDInsight Streaming activity
The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand Windows/Linux-based HDInsight cluster. See HDInsight Streaming activity for details about this
activity.
HDInsight Spark activity
The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster.
For details, see Invoke Spark programs from Azure Data Factory.
Azure Machine Learning Studio (classic) activities
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning Studio
(classic) web service for predictive analytics. Using the Batch Execution activity in an Azure Data Factory pipeline,
you can invoke a Studio (classic) web service to make predictions on the data in batch.
Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new
input datasets. After you are done with retraining, you want to update the scoring web service with the retrained
machine learning model. You can use the Update Resource activity to update the web service with the newly
trained model.
See Use Azure Machine Learning Studio (classic) activities for details about these Studio (classic) activities.
Stored procedure activity
You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in
one of the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server Database in your
enterprise or an Azure VM. See Stored Procedure activity article for details.
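
For orientation only, a Stored Procedure activity in pipeline JSON generally follows the shape of the minimal
sketch below; the linked service, procedure, and parameter names are hypothetical placeholders, and the Stored
Procedure activity article remains the authoritative reference:

{
    "name": "RunCleanupProcedure",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "<your Azure SQL Database linked service>",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "usp_CleanupStaging",
        "storedProcedureParameters": {
            "RetentionDays": {
                "value": "30",
                "type": "Int32"
            }
        }
    }
}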
Data Lake Analytics U-SQL activity
The Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See the Data
Lake Analytics U-SQL activity article for details.
Synapse Notebook activity
The Azure Synapse Notebook Activity in a Synapse pipeline runs a Synapse notebook in your Azure Synapse
workspace. See Transform data by running a Synapse notebook.
Databricks Notebook activity
The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure
Databricks workspace. Azure Databricks is a managed platform for running Apache Spark. See Transform data
by running a Databricks notebook.
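
As a quick illustration, a Databricks Notebook activity is typically defined in pipeline JSON along the lines of the
following minimal sketch; the linked service name, notebook path, and parameters are hypothetical placeholders,
and the linked article remains the authoritative reference:

{
    "name": "TransformWithDatabricksNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "<your Azure Databricks linked service>",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform-sales-data",
        "baseParameters": {
            "inputPath": "raw/sales",
            "outputPath": "curated/sales"
        }
    }
}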
Databricks Jar activity
The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster.
Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Jar activity
in Azure Databricks.
Databricks Python activity
The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks
cluster. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a
Python activity in Azure Databricks.
Custom activity
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity
with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET
activity to run using either an Azure Batch service or an Azure HDInsight cluster. See Use custom activities article
for details.
You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script
using Azure Data Factory.
Compute environments
You create a linked service for the compute environment and then use the linked service when defining a
transformation activity. There are two types of compute environments supported by Data Factory.
On-Demand : In this case, the computing environment is fully managed by Data Factory. It is automatically
created by the Data Factory service before a job is submitted to process data and removed when the job is
completed. You can configure and control granular settings of the on-demand compute environment for job
execution, cluster management, and bootstrapping actions.
Bring Your Own : In this case, you can register your own computing environment (for example HDInsight
cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data
Factory service uses it to execute the activities.
See Compute Linked Services article to learn about compute services supported by Data Factory.

Next steps
See the following tutorial for an example of using a transformation activity: Tutorial: transform data using Spark
Data Flow activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the Data Flow activity to transform and move data via mapping data flows. If you're new to data flows, see
Mapping Data Flow overview

Syntax
{
"name": "MyDataFlowActivity",
"type": "ExecuteDataFlow",
"typeProperties": {
"dataflow": {
"referenceName": "MyDataFlow",
"type": "DataFlowReference"
},
"compute": {
"coreCount": 8,
"computeType": "General"
},
"traceLevel": "Fine",
"runConcurrently": true,
"continueOnError": true,
"staging": {
"linkedService": {
"referenceName": "MyStagingLinkedService",
"type": "LinkedServiceReference"
},
"folderPath": "my-container/my-folder"
},
"integrationRuntime": {
"referenceName": "MyDataFlowIntegrationRuntime",
"type": "IntegrationRuntimeReference"
}
}
}

Type properties
dataflow
  Description: The reference to the data flow being executed.
  Allowed values: DataFlowReference
  Required: Yes

integrationRuntime
  Description: The compute environment the data flow runs on. If not specified, the auto-resolve Azure integration runtime is used.
  Allowed values: IntegrationRuntimeReference
  Required: No

compute.coreCount
  Description: The number of cores used in the Spark cluster. Can only be specified if the auto-resolve Azure integration runtime is used.
  Allowed values: 8, 16, 32, 48, 80, 144, 272
  Required: No

compute.computeType
  Description: The type of compute used in the Spark cluster. Can only be specified if the auto-resolve Azure integration runtime is used.
  Allowed values: "General", "ComputeOptimized", "MemoryOptimized"
  Required: No

staging.linkedService
  Description: If you're using an Azure Synapse Analytics source or sink, specify the storage account used for PolyBase staging. If your Azure Storage is configured with a VNet service endpoint, you must use managed identity authentication with "allow trusted Microsoft service" enabled on the storage account; refer to Impact of using VNet Service Endpoints with Azure storage. Also learn the needed configurations for Azure Blob and Azure Data Lake Storage Gen2, respectively.
  Allowed values: LinkedServiceReference
  Required: Only if the data flow reads or writes to an Azure Synapse Analytics

staging.folderPath
  Description: If you're using an Azure Synapse Analytics source or sink, the folder path in the blob storage account used for PolyBase staging.
  Allowed values: String
  Required: Only if the data flow reads or writes to Azure Synapse Analytics

traceLevel
  Description: Set the logging level of your data flow activity execution.
  Allowed values: Fine, Coarse, None
  Required: No

Dynamically size data flow compute at runtime


The Core Count and Compute Type properties can be set dynamically to adjust to the size of your incoming
source data at runtime. Use pipeline activities like Lookup or Get Metadata in order to find the size of the source
dataset data. Then, use Add Dynamic Content in the Data Flow activity properties.
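
As an illustrative sketch only (not from the original article), the compute properties could be assigned dynamic
content roughly as follows, assuming a hypothetical Lookup activity named "GetRowCount" that returns a
recordCount column; the exact expression depends on your own metadata:

"compute": {
    "coreCount": {
        "value": "@if(greater(int(activity('GetRowCount').output.firstRow.recordCount), 10000000), 32, 8)",
        "type": "Expression"
    },
    "computeType": "General"
}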

NOTE
When choosing driver and worker node cores in Synapse Data Flows, a minimum of 3 nodes will always be utilized.

Here is a brief video tutorial explaining this technique


Data Flow integration runtime
Choose which Integration Runtime to use for your Data Flow activity execution. By default, Data Factory will use
the auto-resolve Azure Integration runtime with four worker cores. This IR has a general purpose compute type
and runs in the same region as your factory. For operationalized pipelines, it is highly recommended that you
create your own Azure Integration Runtimes that define specific regions, compute type, core counts, and TTL for
your data flow activity execution.
A compute type of General Purpose (compute optimized is not recommended for large workloads) with an
8+8 (16 total v-cores) configuration and a 10-minute time to live (TTL) is the minimum recommendation for most
production workloads. By setting a small TTL, the Azure IR can maintain a warm cluster that does not incur the
several minutes of start time of a cold cluster. You can speed up the execution of your data flows even more by
selecting "Quick re-use" in the Azure IR data flow configurations. For more information, see Azure integration
runtime.
IMPORTANT
The Integration Runtime selection in the Data Flow activity only applies to triggered executions of your pipeline.
Debugging your pipeline with data flows runs on the cluster specified in the debug session.

PolyBase
If you're using an Azure Synapse Analytics as a sink or source, you must choose a staging location for your
PolyBase batch load. PolyBase allows for batch loading in bulk instead of loading the data row-by-row. PolyBase
drastically reduces the load time into Azure Synapse Analytics.

Logging level
If you do not require every pipeline execution of your data flow activities to fully log all verbose telemetry logs,
you can optionally set your logging level to "Basic" or "None". When executing your data flows in "Verbose"
mode (default), you are requesting ADF to fully log activity at each individual partition level during your data
transformation. This can be an expensive operation, so only enabling verbose when troubleshooting can
improve your overall data flow and pipeline performance. "Basic" mode will only log transformation durations
while "None" will only provide a summary of durations.
Sink properties
The grouping feature in data flows allow you to both set the order of execution of your sinks as well as to group
sinks together using the same group number. To help manage groups, you can ask ADF to run sinks, in the same
group, in parallel. You can also set the sink group to continue even after one of the sinks encounters an error.
The default behavior of data flow sinks is to execute each sink sequentially, in a serial manner, and to fail the data
flow when an error is encountered in the sink. Additionally, all sinks are defaulted to the same group unless you
go into the data flow properties and set different priorities for the sinks.
First row only
This option is only available for data flows that have cache sinks enabled for "Output to activity". The output
from the data flow that is injected directly into your pipeline is limited to 2MB. Setting "first row only" helps you
to limit the data output from data flow when injecting the data flow activity output directly to your pipeline.

Parameterizing Data Flows


Parameterized datasets
If your data flow uses parameterized datasets, set the parameter values in the Settings tab.

Parameterized data flows


If your data flow is parameterized, set the dynamic values of the data flow parameters in the Parameters tab.
You can use either the ADF pipeline expression language or the data flow expression language to assign
dynamic or literal parameter values. For more information, see Data Flow Parameters.
Parameterized compute properties.
You can parameterize the core count or compute type if you use the auto-resolve Azure Integration runtime and
specify values for compute.coreCount and compute.computeType.
Pipeline debug of Data Flow activity
To execute a debug pipeline run with a Data Flow activity, you must switch on data flow debug mode via the
Data Flow Debug slider on the top bar. Debug mode lets you run the data flow against an active Spark cluster.
For more information, see Debug Mode.

The debug pipeline runs against the active debug cluster, not the integration runtime environment specified in
the Data Flow activity settings. You can choose the debug compute environment when starting up debug mode.

Monitoring the Data Flow activity


The Data Flow activity has a special monitoring experience where you can view partitioning, stage time, and data
lineage information. Open the monitoring pane via the eyeglasses icon under Actions . For more information,
see Monitoring Data Flows.
Use Data Flow activity results in a subsequent activity
The data flow activity outputs metrics regarding the number of rows written to each sink and rows read from
each source. These results are returned in the output section of the activity run result. The metrics returned are
in the format of the below json.
{
"runStatus": {
"metrics": {
"<your sink name1>": {
"rowsWritten": <number of rows written>,
"sinkProcessingTime": <sink processing time in ms>,
"sources": {
"<your source name1>": {
"rowsRead": <number of rows read>
},
"<your source name2>": {
"rowsRead": <number of rows read>
},
...
}
},
"<your sink name2>": {
...
},
...
}
}
}

For example, to get the number of rows written to a sink named 'sink1' in an activity named 'dataflowActivity',
use @activity('dataflowActivity').output.runStatus.metrics.sink1.rowsWritten .
To get the number of rows read from a source named 'source1' that was used in that sink, use
@activity('dataflowActivity').output.runStatus.metrics.sink1.sources.source1.rowsRead .

NOTE
If a sink has zero rows written, it will not show up in metrics. Existence can be verified using the contains function. For
example, contains(activity('dataflowActivity').output.runStatus.metrics, 'sink1') will check whether any
rows were written to sink1.

Next steps
See control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Power query activity in data factory

The Power Query activity allows you to build and execute Power Query mash-ups to execute data wrangling at
scale in a Data Factory pipeline. You can create a new Power Query mash-up from the New resources menu
option or by adding a Power Activity to your pipeline.

Previously, data wrangling in Azure Data Factory was authored from the Data Flow menu option. This has been
changed to authoring from a new Power Query activity. You can work directly inside of the Power Query mash-
up editor to perform interactive data exploration and then save your work. Once complete, you can take your
Power Query activity and add it to a pipeline. Azure Data Factory will automatically scale it out and
operationalize your data wrangling using Azure Data Factory's data flow Spark environment.

Translation to data flow script


To achieve scale with your Power Query activity, Azure Data Factory translates your M script into a data flow
script so that you can execute your Power Query at scale using the Azure Data Factory data flow Spark
environment. Author your wrangling data flow using code-free data preparation. For the list of available
functions, see transformation functions.

Next steps
Learn more about data wrangling concepts using Power Query in Azure Data Factory
Azure Function activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Function activity allows you to run Azure Functions in a Data Factory pipeline. To run an Azure
Function, you need to create a linked service connection and an activity that specifies the Azure Function that
you plan to execute.
For an eight-minute introduction and demonstration of this feature, watch the following video:

Azure Function linked service


The return type of the Azure function has to be a valid JObject . (Keep in mind that JArray is not a JObject .)
Any return type other than JObject fails and raises the user error Response Content is not a valid JObject.

type
  Description: The type property must be set to: AzureFunction
  Required: yes

function app url
  Description: URL for the Azure Function App. The format is https://<accountname>.azurewebsites.net. This URL is the value under the URL section when you view your Function App in the Azure portal.
  Required: yes

function key
  Description: Access key for the Azure Function. Click the Manage section for the respective function, and copy either the Function Key or the Host key. Find out more here: Azure Functions HTTP triggers and bindings.
  Required: yes
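
Putting these properties together, a minimal sketch of the linked service JSON might look like the following;
the account name and key are placeholders, and the JSON property names functionAppUrl and functionKey are
assumed here rather than taken from the text above:

{
    "name": "AzureFunctionLinkedService",
    "properties": {
        "type": "AzureFunction",
        "typeProperties": {
            "functionAppUrl": "https://<accountname>.azurewebsites.net",
            "functionKey": {
                "type": "SecureString",
                "value": "<function or host key>"
            }
        }
    }
}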

Azure Function activity


name
  Description: Name of the activity in the pipeline.
  Allowed values: String
  Required: yes

type
  Description: Type of activity is 'AzureFunctionActivity'.
  Allowed values: String
  Required: yes

linked service
  Description: The Azure Function linked service for the corresponding Azure Function App.
  Allowed values: Linked service reference
  Required: yes

function name
  Description: Name of the function in the Azure Function App that this activity calls.
  Allowed values: String
  Required: yes

method
  Description: REST API method for the function call.
  Allowed values: String. Supported types: "GET", "POST", "PUT"
  Required: yes

header
  Description: Headers that are sent to the request. For example, to set the language and type on a request: "headers": { "Accept-Language": "en-us", "Content-Type": "application/json" }
  Allowed values: String (or expression with resultType of string)
  Required: No

body
  Description: Body that is sent along with the request to the function API method.
  Allowed values: String (or expression with resultType of string) or object
  Required: Required for PUT/POST methods

See the schema of the request payload in Request payload schema section.

Routing and queries


The Azure Function Activity supports routing . For example, if your Azure Function has the endpoint
https://functionAPP.azurewebsites.net/api/<functionName>/<value>?code=<secret> , then the functionName to use
in the Azure Function Activity is <functionName>/<value> . You can parameterize this function to provide the
desired functionName at runtime.
The Azure Function Activity also supports queries . A query has to be included as part of the functionName . For
example, when the function name is HttpTriggerCSharp and the query that you want to include is name=hello ,
then you can construct the functionName in the Azure Function Activity as HttpTriggerCSharp?name=hello . This
function can be parameterized so the value can be determined at runtime.
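
To tie the routing and query examples back to the activity definition, here is a minimal sketch of an Azure
Function activity; the names and values are placeholders, and the exact payload depends on your function:

{
    "name": "CallMyAzureFunction",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "HttpTriggerCSharp?name=hello",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "message": "hello world"
        }
    }
}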

Timeout and long running functions


Azure Functions times out after 230 seconds regardless of the functionTimeout setting you've configured in the
settings. For more information, see this article. To work around this behavior, follow an async pattern or use
Durable Functions. The benefit of Durable Functions is that they offer their own state-tracking mechanism, so
you won't have to implement your own.
Learn more about Durable Functions in this article. You can set up an Azure Function Activity to call the Durable
Function, which will return a response with a different URI, such as this example. Because statusQueryGetUri
returns HTTP Status 202 while the function is running, you can poll the status of the function by using a Web
Activity. Simply set up a Web Activity with the url field set to
@activity('<AzureFunctionActivityName>').output.statusQueryGetUri . When the Durable Function completes, the
output of the function will be the output of the Web Activity.
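
As a rough sketch of the polling step described above (with a hypothetical Durable Function activity named
"StartDurableFunction"), the Web activity that checks the status could be configured as follows; in practice you
would typically place it inside an Until loop:

{
    "name": "PollDurableFunctionStatus",
    "type": "WebActivity",
    "typeProperties": {
        "url": "@activity('StartDurableFunction').output.statusQueryGetUri",
        "method": "GET"
    }
}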
Sample
You can find a sample of a Data Factory that uses an Azure Function to extract the content of a tar file here.

Next steps
Learn more about activities in Data Factory in Pipelines and activities in Azure Data Factory.
Use custom activities in an Azure Data Factory
pipeline

APPLIES TO: Azure Data Factory Azure Synapse Analytics


There are two types of activities that you can use in an Azure Data Factory pipeline.
Data movement activities to move data between supported source and sink data stores.
Data transformation activities to transform data using compute services such as Azure HDInsight, Azure
Batch, and Azure Machine Learning.
To move data to/from a data store that Data Factory does not support, or to transform/process data in a way
that isn't supported by Data Factory, you can create a Custom activity with your own data movement or
transformation logic and use the activity in a pipeline. The custom activity runs your customized code logic on
an Azure Batch pool of virtual machines.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

See the following articles if you are new to the Azure Batch service:


Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account (or) Azure portal to create the Azure Batch
account using Azure portal. See Using PowerShell to manage Azure Batch Account article for detailed
instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.

IMPORTANT
When creating a new Azure Batch pool, 'VirtualMachineConfiguration' must be used and NOT
'CloudServiceConfiguration'. For more details, refer to the Azure Batch pool migration guidance.

Azure Batch linked service


The following JSON defines a sample Azure Batch linked service. For details, see Compute environments
supported by Azure Data Factory
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}

To learn more about Azure Batch linked service, see Compute linked services article.

Custom activity
The following JSON snippet defines a pipeline with a simple Custom Activity. The activity definition has a
reference to the Azure Batch linked service.

{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "helloworld.exe",
"folderPath": "customactv2/helloworld",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}]
}
}

In this sample, helloworld.exe is a custom application stored in the customactv2/helloworld folder of the
Azure Storage account used in the resourceLinkedService. The Custom activity submits this custom application
to be executed on Azure Batch. You can replace the command with any preferred application that can be executed
on the target operating system of the Azure Batch pool nodes.
The following table describes names and descriptions of properties that are specific to this activity.

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Custom activity, the activity type is Custom. Required: Yes.

linkedServiceName: Linked service to Azure Batch. To learn about this linked service, see the Compute linked services article. Required: Yes.

command: Command of the custom application to be executed. If the application is already available on the Azure Batch pool node, the resourceLinkedService and folderPath can be skipped. For example, you can specify the command to be cmd /c dir, which is natively supported by the Windows Batch pool node. Required: Yes.

resourceLinkedService: Azure Storage linked service to the storage account where the custom application is stored. Required: No*.

folderPath: Path to the folder of the custom application and all its dependencies. If you have dependencies stored in subfolders - that is, in a hierarchical folder structure under folderPath - the folder structure is currently flattened when the files are copied to Azure Batch. That is, all files are copied into a single folder with no subfolders. To work around this behavior, consider compressing the files, copying the compressed file, and then unzipping it with custom code in the desired location. Required: No*.

referenceObjects: An array of existing linked services and datasets. The referenced linked services and datasets are passed to the custom application in JSON format so your custom code can reference resources of the data factory. Required: No.

extendedProperties: User-defined properties that can be passed to the custom application in JSON format so your custom code can reference additional properties. Required: No.

retentionTimeInDays: The retention time for the files submitted for the custom activity. Default value is 30 days. Required: No.

* The properties resourceLinkedService and folderPath must either both be specified or both be omitted.
NOTE
If you are passing linked services as referenceObjects in a Custom Activity, it is a good security practice to pass an Azure
Key Vault enabled linked service (since it does not contain any secure strings) and to fetch the credentials by secret name
directly from Key Vault in your code. You can find an example here that references an AKV enabled linked service, retrieves
the credentials from Key Vault, and then accesses the storage in the code.

Custom activity permissions


The custom activity sets the Azure Batch auto-user account to Non-admin access with task scope (the default
auto-user specification). You can't change the permission level of the auto-user account. For more info, see Run
tasks under user accounts in Batch | Auto-user accounts.

Executing commands
You can directly execute a command using Custom Activity. The following example runs the "echo hello world"
command on the target Azure Batch Pool nodes and prints the output to stdout.

{
"name": "MyCustomActivity",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "cmd /c echo hello world"
}
}]
}
}

Passing objects and properties


This sample shows how you can use the referenceObjects and extendedProperties to pass Data Factory objects
and user-defined properties to your custom application.
{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "SampleApp.exe",
"folderPath": "customactv2/SampleApp",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"referenceObjects": {
"linkedServices": [{
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
}]
},
"extendedProperties": {
"connectionString": {
"type": "SecureString",
"value": "aSampleSecureString"
},
"PropertyBagPropertyName1": "PropertyBagValue1",
"propertyBagPropertyName2": "PropertyBagValue2",
"dateTime1": "2015-04-12T12:13:14Z"
}
}
}]
}
}

When the activity is executed, referenceObjects and extendedProperties are stored in following files that are
deployed to the same execution folder of the SampleApp.exe:
activity.json

Stores extendedProperties and properties of the custom activity.


linkedServices.json

Stores an array of Linked Services defined in the referenceObjects property.


datasets.json

Stores an array of Datasets defined in the referenceObjects property.


The following sample code demonstrates how SampleApp.exe can access the required information from the JSON
files:
using Newtonsoft.Json;
using System;
using System.IO;

namespace SampleApp
{
class Program
{
static void Main(string[] args)
{
//From Extend Properties
dynamic activity = JsonConvert.DeserializeObject(File.ReadAllText("activity.json"));
Console.WriteLine(activity.typeProperties.extendedProperties.connectionString.value);

// From LinkedServices
dynamic linkedServices = JsonConvert.DeserializeObject(File.ReadAllText("linkedServices.json"));
Console.WriteLine(linkedServices[0].properties.typeProperties.accountName);
}
}
}

Retrieve execution outputs


You can start a pipeline run using the following PowerShell command:

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName


$resourceGroupName -PipelineName $pipelineName

When the pipeline is running, you can check the execution output using the following commands:

while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-
Date).AddMinutes(30)

if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}

Write-Host "Activity `Output` section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "Activity `Error` section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

The stdout and stderr of your custom application are saved to the adfjobs container in the Azure Storage
linked service you defined when creating the Azure Batch linked service, under a folder named with the GUID of the task. You can get the
detailed path from the activity run output, as shown in the following snippet:
Pipeline ' MyCustomActivity' run finished. Result:

ResourceGroupName : resourcegroupname
DataFactoryName : datafactoryname
ActivityName : MyCustomActivity
PipelineRunId : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PipelineName : MyCustomActivity
Input : {command}
Output : {exitcode, outputs, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 10/5/2017 3:33:06 PM
ActivityRunEnd : 10/5/2017 3:33:28 PM
DurationInMs : 21203
Status : Succeeded
Error : {errorCode, message, failureType, target}

Activity Output section:


"exitcode": 0
"outputs": [
"https://<container>.blob.core.windows.net/adfjobs/<GUID>/output/stdout.txt",
"https://<container>.blob.core.windows.net/adfjobs/<GUID>/output/stderr.txt"
]
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
Activity Error section:
"errorCode": ""
"message": ""
"failureType": ""
"target": "MyCustomActivity"

If you would like to consume the content of stdout.txt in downstream activities, you can get the path to the
stdout.txt file in expression "@activity('MyCustomActivity').output.outputs[0]".

IMPORTANT
The activity.json, linkedServices.json, and datasets.json are stored in the runtime folder of the Batch task. For this
example, the activity.json, linkedServices.json, and datasets.json are stored in
https://adfv2storage.blob.core.windows.net/adfjobs/<GUID>/runtime/ path. If needed, you need to clean them
up separately.
For linked services that use the Self-Hosted Integration Runtime, sensitive information like keys or passwords is
encrypted by the Self-Hosted Integration Runtime to ensure that credentials stay in the customer-defined private network
environment. Some sensitive fields could therefore be missing when referenced by your custom application code in this way. Use
a SecureString in extendedProperties instead of a linked service reference if needed.

Pass outputs to another activity


You can send custom values from your code in a Custom Activity back to Azure Data Factory. You can do so by
writing them into outputs.json from your application. Data Factory copies the content of outputs.json and
appends it into the Activity Output as the value of the customOutput property. (The size limit is 2MB.) If you want
to consume the content of outputs.json in downstream activities, you can get the value by using the expression
@activity('<MyCustomActivity>').output.customOutput .
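As a minimal sketch (assuming Newtonsoft.Json is available to the application, as in the earlier SampleApp.exe example; the property names are arbitrary examples), the custom application could write outputs.json to its working directory like this:

using System.IO;
using Newtonsoft.Json;

namespace SampleApp
{
    class OutputWriter
    {
        // Data Factory picks up a file named outputs.json from the task's working directory
        // and surfaces its content as activity('<activity name>').output.customOutput (2-MB limit).
        static void WriteOutputs()
        {
            var customOutput = new
            {
                rowsProcessed = 42,                   // example value
                outputPath = "processed/2021/06/01"   // example value
            };
            File.WriteAllText("outputs.json", JsonConvert.SerializeObject(customOutput));
        }
    }
}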

Retrieve SecureString outputs


Sensitive property values designated as type SecureString, as shown in some of the examples in this article, are
masked out in the Monitoring tab in the Data Factory user interface. In actual pipeline execution, however, a
SecureString property is serialized as JSON within the activity.json file as plain text. For example:
"extendedProperties": {
"connectionString": {
"type": "SecureString",
"value": "aSampleSecureString"
}
}

This serialization is not truly secure, and is not intended to be secure. The intent is to hint to Data Factory to
mask the value in the Monitoring tab.
To access properties of type SecureString from a custom activity, read the activity.json file, which is placed in
the same folder as your .EXE, deserialize the JSON, and then access the JSON property (extendedProperties =>
[propertyName] => value).

Compare v2 Custom Activity and version 1 (Custom) DotNet Activity


In Azure Data Factory version 1, you implement a (Custom) DotNet Activity by creating a .NET Class Library
project with a class that implements the Execute method of the IDotNetActivity interface. The Linked Services,
Datasets, and Extended Properties in the JSON payload of a (Custom) DotNet Activity are passed to the
execution method as strongly-typed objects. For details about the version 1 behavior, see (Custom) DotNet in
version 1. Because of this implementation, your version 1 DotNet Activity code has to target .NET Framework
4.5.2. The version 1 DotNet Activity also has to be executed on Windows-based Azure Batch Pool nodes.
In the Azure Data Factory V2 Custom Activity, you are not required to implement a .NET interface. You can now
directly run commands, scripts, and your own custom code, compiled as an executable. To configure this
implementation, you specify the Command property together with the folderPath property. The Custom Activity
uploads the executable and its dependencies to folderpath and executes the command for you.
The Linked Services, Datasets (defined in referenceObjects), and Extended Properties defined in the JSON
payload of a Data Factory v2 Custom Activity can be accessed by your executable as JSON files. You can access
the required properties using a JSON serializer as shown in the preceding SampleApp.exe code sample.
With the changes introduced in the Data Factory V2 Custom Activity, you can write your custom code logic in
your preferred language and execute it on the Windows and Linux operating systems supported by Azure Batch.
The following table describes the differences between the Data Factory V2 Custom Activity and the Data Factory
version 1 (Custom) DotNet Activity:

How custom logic is defined: Custom Activity - by providing an executable. Version 1 (Custom) DotNet Activity - by implementing a .NET DLL.

Execution environment of the custom logic: Custom Activity - Windows or Linux. Version 1 (Custom) DotNet Activity - Windows (.NET Framework 4.5.2).

Executing scripts: Custom Activity - supports executing scripts directly (for example, "cmd /c echo hello world" on a Windows VM). Version 1 (Custom) DotNet Activity - requires implementation in the .NET DLL.

Dataset required: Custom Activity - optional. Version 1 (Custom) DotNet Activity - required to chain activities and pass information.

Pass information from activity to custom logic: Custom Activity - through ReferenceObjects (LinkedServices and Datasets) and ExtendedProperties (custom properties). Version 1 (Custom) DotNet Activity - through ExtendedProperties (custom properties), Input, and Output Datasets.

Retrieve information in custom logic: Custom Activity - parses activity.json, linkedServices.json, and datasets.json stored in the same folder as the executable. Version 1 (Custom) DotNet Activity - through the .NET SDK (.NET Framework 4.5.2).

Logging: Custom Activity - writes directly to STDOUT. Version 1 (Custom) DotNet Activity - implements a Logger in the .NET DLL.

If you have existing .NET code written for a version 1 (Custom) DotNet Activity, you need to modify your code
for it to work with the current version of the Custom Activity. Update your code by following these high-level
guidelines:
Change the project from a .NET Class Library to a Console App.
Start your application with the Main method. The Execute method of the IDotNetActivity interface is no
longer required.
Read and parse the Linked Services, Datasets and Activity with a JSON serializer, and not as strongly-typed
objects. Pass the values of required properties to your main custom code logic. Refer to the preceding
SampleApp.exe code as an example.
The Logger object is no longer supported. Output from your executable can be printed to the console and is
saved to stdout.txt.
The Microsoft.Azure.Management.DataFactories NuGet package is no longer required.
Compile your code, upload the executable and its dependencies to Azure Storage, and define the path in the
folderPath property.
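For the last step above, you could, for example, upload the compiled output with the Az PowerShell module. This is only a sketch; the storage account name, key, container, and folder are placeholders and should match your resourceLinkedService and folderPath values:

# Placeholder values - replace with your own storage account, key, container, and folder.
$storageContext = New-AzStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<storage key>"

# Upload the console app executable and its dependencies to customactv2/helloworld.
Get-ChildItem ".\bin\Release\*" -File | ForEach-Object {
    Set-AzStorageBlobContent -File $_.FullName `
        -Container "customactv2" `
        -Blob ("helloworld/" + $_.Name) `
        -Context $storageContext `
        -Force
}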

For a complete sample of how the end-to-end DLL and pipeline sample described in the Data Factory version 1
article Use custom activities in an Azure Data Factory pipeline can be rewritten as a Data Factory Custom
Activity, see Data Factory Custom Activity sample.

Auto-scaling of Azure Batch


You can also create an Azure Batch pool with the autoscale feature. For example, you could create an Azure Batch
pool with 0 dedicated VMs and an autoscale formula based on the number of pending tasks.
The sample formula here achieves the following behavior: when the pool is initially created, it starts with 1 VM.
The $PendingTasks metric defines the number of tasks in the running plus active (queued) state. The formula finds the
average number of pending tasks in the last 180 seconds and sets TargetDedicated accordingly, ensuring that
TargetDedicated never goes beyond 25 VMs. As new tasks are submitted, the pool automatically grows; as
tasks complete, VMs become free one by one and autoscaling shrinks the pool. startingNumberOfVMs
and maxNumberofVMs can be adjusted to your needs.
Autoscale formula:

startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 *
TimeInterval_Second));
$TargetDedicated=min(maxNumberofVMs,pendingTaskSamples);
See Automatically scale compute nodes in an Azure Batch pool for details.
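As a sketch of how the formula might be applied to an existing pool with the Az PowerShell module (the account, resource group, and pool names are placeholders):

# Placeholder names - replace with your own Batch account, resource group, and pool ID.
$context = Get-AzBatchAccountKey -AccountName "batchaccount" -ResourceGroupName "resourcegroupname"

$formula = @'
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberofVMs, pendingTaskSamples);
'@

# Enable (or update) autoscaling on the pool with the formula above.
Enable-AzBatchAutoScale -Id "poolname" -AutoScaleFormula $formula -BatchContext $context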
If the pool is using the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool is using a different autoScaleEvaluationInterval,
the Batch service could take autoScaleEvaluationInterval + 10 minutes.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data by running a Jar activity in Azure
Databricks
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster.
This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Databricks Jar activity definition


Here's the sample JSON definition of a Databricks Jar Activity:

{
"name": "SparkJarActivity",
"type": "DatabricksSparkJar",
"linkedServiceName": {
"referenceName": "AzureDatabricks",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mainClassName": "org.apache.spark.examples.SparkPi",
"parameters": [ "10" ],
"libraries": [
{
"jar": "dbfs:/docs/sparkpi.jar"
}
]
}
}

Databricks Jar activity properties


The following table describes the JSON properties used in the JSON definition:

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Databricks Jar Activity, the activity type is DatabricksSparkJar. Required: Yes.

linkedServiceName: Name of the Databricks linked service on which the Jar activity runs. To learn about this linked service, see the Compute linked services article. Required: Yes.

mainClassName: The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. A JAR file can contain multiple classes, and each of the classes can contain a main method. Required: Yes.

parameters: Parameters that will be passed to the main method. This property is an array of strings. Required: No.

libraries: A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. Required: Yes (at least one containing the mainClassName method).

NOTE
Known issue - When using the same interactive cluster for running concurrent Databricks Jar activities (without a cluster
restart), there is a known issue in Databricks where the parameters of the first activity are also used by the following activities,
resulting in incorrect parameters being passed to the subsequent jobs. To mitigate this, use a Job cluster
instead.

Supported libraries for databricks activities


In the previous Databricks activity definition, you specified these library types: jar , egg , maven , pypi , cran .
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more information, see the Databricks documentation for library types.

How to upload a library in Databricks


You can use the Workspace UI:
1. Use the Databricks workspace UI
2. To obtain the dbfs path of the library added using UI, you can use Databricks CLI.
Typically the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through
the CLI: databricks fs ls dbfs:/FileStore/job-jars
Or you can use the Databricks CLI:
1. Follow Copy the library using Databricks CLI
2. Use Databricks CLI (installation steps)
As an example, to copy a JAR to dbfs: dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar

Next steps
For an eleven-minute introduction and demonstration of this feature, watch the video.
Transform data by running a Databricks notebook
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure
Databricks workspace. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities. Azure Databricks is a managed
platform for running Apache Spark.

Databricks Notebook activity definition


Here is the sample JSON definition of a Databricks Notebook Activity:

{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksNotebook",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"notebookPath": "/Users/[email protected]/ScalaExampleNotebook",
"baseParameters": {
"inputpath": "input/folder1/",
"outputpath": "output/"
},
"libraries": [
{
"jar": "dbfs:/docs/library.jar"
}
]
}
}
}

Databricks Notebook activity properties


The following table describes the JSON properties used in the JSON definition:

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Databricks Notebook Activity, the activity type is DatabricksNotebook. Required: Yes.

linkedServiceName: Name of the Databricks linked service on which the Databricks notebook runs. To learn about this linked service, see the Compute linked services article. Required: Yes.

notebookPath: The absolute path of the notebook to be run in the Databricks workspace. This path must begin with a slash. Required: Yes.

baseParameters: An array of key-value pairs. Base parameters can be used for each activity run. If the notebook takes a parameter that is not specified, the default value from the notebook will be used. Find more on parameters in Databricks Notebooks. Required: No.

libraries: A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. Required: No.

Supported libraries for Databricks activities


In the above Databricks activity definition, you specify these library types: jar, egg, whl, maven, pypi, cran.
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
},
{
"whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more details, see the Databricks documentation for library types.

Passing parameters between notebooks and Data Factory


You can pass Data Factory parameters to notebooks by using the baseParameters property in the Databricks activity.
In certain cases, you might need to pass values from the notebook back to Data Factory; these can be
used for control flow (conditional checks) in Data Factory or be consumed by downstream activities (the size limit is
2 MB).
1. In your notebook, you can call dbutils.notebook.exit("returnValue"), and the corresponding "returnValue" is
returned to Data Factory (see the sketch after the note below).
2. You can consume the output in Data Factory by using an expression such as
@{activity('databricks notebook activity name').output.runOutput} .

IMPORTANT
If you are passing JSON object you can retrieve values by appending property names. Example:
@{activity('databricks notebook activity name').output.runOutput.PropertyName}
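As a minimal sketch of both sides (the notebook cell and the activity name in the expression are examples):

# In the Databricks notebook (dbutils is provided by the Databricks runtime):
import json

result = {"rowsProcessed": 100, "status": "OK"}   # example payload
dbutils.notebook.exit(json.dumps(result))

In the pipeline, a downstream activity could then read a single property with an expression such as
@{activity('databricks notebook activity name').output.runOutput.status} .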

How to upload a library in Databricks


You can use the Workspace UI:
1. Use the Databricks workspace UI
2. To obtain the dbfs path of the library added using UI, you can use Databricks CLI.
Typically the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through
the CLI: databricks fs ls dbfs:/FileStore/job-jars
Or you can use the Databricks CLI:
1. Follow Copy the library using Databricks CLI
2. Use Databricks CLI (installation steps)
As an example, to copy a JAR to dbfs: dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar
Transform data by running a Python activity in
Azure Databricks
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks
cluster. This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. Azure Databricks is a managed platform for running
Apache Spark.
For an eleven-minute introduction and demonstration of this feature, watch the following video:

Databricks Python activity definition


Here is the sample JSON definition of a Databricks Python Activity:

{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksSparkPython",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"pythonFile": "dbfs:/docs/pi.py",
"parameters": [
"10"
],
"libraries": [
{
"pypi": {
"package": "tensorflow"
}
}
]
}
}
}

Databricks Python activity properties


The following table describes the JSON properties used in the JSON definition:

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Databricks Python Activity, the activity type is DatabricksSparkPython. Required: Yes.

linkedServiceName: Name of the Databricks linked service on which the Python activity runs. To learn about this linked service, see the Compute linked services article. Required: Yes.

pythonFile: The URI of the Python file to be executed. Only DBFS paths are supported. Required: Yes.

parameters: Command-line parameters that will be passed to the Python file. This is an array of strings. Required: No.

libraries: A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. Required: No.
Supported libraries for databricks activities


In the above Databricks activity definition you specify these library types: jar, egg, maven, pypi, cran.

{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}

For more details, see the Databricks documentation for library types.

How to upload a library in Databricks


You can use the Workspace UI:
1. Use the Databricks workspace UI
2. To obtain the dbfs path of the library added using UI, you can use Databricks CLI.
Typically the Jar libraries are stored under dbfs:/FileStore/jars while using the UI. You can list all through
the CLI: databricks fs ls dbfs:/FileStore/job-jars
Or you can use the Databricks CLI:
1. Follow Copy the library using Databricks CLI
2. Use Databricks CLI (installation steps)
As an example, to copy a JAR to dbfs: dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar
Process data by running U-SQL scripts on Azure
Data Lake Analytics
3/5/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


A pipeline in an Azure data factory processes data in linked storage services by using linked compute services. It
contains a sequence of activities where each activity performs a specific processing operation. This article
describes the Data Lake Analytics U-SQL Activity that runs a U-SQL script on an Azure Data Lake
Analytics compute linked service.
Create an Azure Data Lake Analytics account before creating a pipeline with a Data Lake Analytics U-SQL
Activity. To learn about Azure Data Lake Analytics, see Get started with Azure Data Lake Analytics.

Azure Data Lake Analytics linked service


You create an Azure Data Lake Analytics linked service to link an Azure Data Lake Analytics compute service
to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service.
The following table provides descriptions for the generic properties used in the JSON definition.

type: The type property should be set to AzureDataLakeAnalytics. Required: Yes.

accountName: Azure Data Lake Analytics account name. Required: Yes.

dataLakeAnalyticsUri: Azure Data Lake Analytics URI. Required: No.

subscriptionId: Azure subscription ID. Required: No.

resourceGroupName: Azure resource group name. Required: No.

Service principal authentication


The Azure Data Lake Analytics linked service requires service principal authentication to connect to the Azure
Data Lake Analytics service. To use service principal authentication, register an application entity in Azure Active
Directory (Azure AD) and grant it access to both the Data Lake Analytics account and the Data Lake Store it uses. For
detailed steps, see Service-to-service authentication. Make note of the following values, which you use to define
the linked service:
Application ID
Application key
Tenant ID
Grant the service principal permission to your Azure Data Lake Analytics account by using the Add User Wizard.
Use service principal authentication by specifying the following properties:
servicePrincipalId: Specify the application's client ID. Required: Yes.

servicePrincipalKey: Specify the application's key. Required: Yes.

tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse over the upper-right corner of the Azure portal. Required: Yes.

Example: Service principal authentication

{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "<account name>",
"dataLakeAnalyticsUri": "<azure data lake analytics URI>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

To learn more about the linked service, see Compute linked services.

Data Lake Analytics U-SQL Activity


The following JSON snippet defines a pipeline with a Data Lake Analytics U-SQL Activity. The activity definition
has a reference to the Azure Data Lake Analytics linked service you created earlier. To execute a Data Lake
Analytics U-SQL script, Data Factory submits the script you specify to Data Lake Analytics; the required
inputs and outputs are defined in the script for Data Lake Analytics to fetch and output.
{
"name": "ADLA U-SQL Activity",
"description": "description",
"type": "DataLakeAnalyticsU-SQL",
"linkedServiceName": {
"referenceName": "<linked service name of Azure Data Lake Analytics>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "<linked service name of Azure Data Lake Store or Azure Storage which contains
the U-SQL script>",
"type": "LinkedServiceReference"
},
"scriptPath": "scripts\\kona\\SearchLogProcessing.txt",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
}
}

The following table describes names and descriptions of properties that are specific to this activity.

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Data Lake Analytics U-SQL activity, the activity type is DataLakeAnalyticsU-SQL. Required: Yes.

linkedServiceName: Linked service to Azure Data Lake Analytics. To learn about this linked service, see the Compute linked services article. Required: Yes.

scriptPath: Path to the folder that contains the U-SQL script. The name of the file is case-sensitive. Required: Yes.

scriptLinkedService: Linked service that links the Azure Data Lake Store or Azure Storage that contains the script to the data factory. Required: Yes.

degreeOfParallelism: The maximum number of nodes simultaneously used to run the job. Required: No.

priority: Determines which jobs out of all that are queued should be selected to run first. The lower the number, the higher the priority. Required: No.

parameters: Parameters to pass into the U-SQL script. Required: No.

runtimeVersion: Runtime version of the U-SQL engine to use. Required: No.

compilationMode: Compilation mode of U-SQL. Must be one of these values: Semantic (only perform semantic checks and necessary sanity checks), Full (perform the full compilation, including syntax check, optimization, code generation, and so on), or SingleBox (perform the full compilation, with the TargetType setting set to SingleBox). If you don't specify a value for this property, the server determines the optimal compilation mode. Required: No.

See SearchLogProcessing.txt for the script definition.

Sample U-SQL script


@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv(nullEscape:"#NULL#");

@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";

@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start <= DateTime.Parse("2012/02/19");

OUTPUT @rs1
TO @out
USING Outputters.Tsv(quoting:false, dateTimeFormat:null);

In the above script example, the input and output of the script are defined by the @in and @out parameters. The values
for the @in and @out parameters in the U-SQL script are passed dynamically by Data Factory using the
'parameters' section.
You can also specify other properties, such as degreeOfParallelism and priority, in your pipeline definition for
the jobs that run on the Azure Data Lake Analytics service.

Dynamic parameters
In the sample pipeline definition, in and out parameters are assigned with hard-coded values.

"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}

It is possible to use dynamic parameters instead. For example:

"parameters": {
"in": "/datalake/input/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/data.tsv",
"out": "/datalake/output/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/result.tsv"
}

In this case, input files are still picked up from the /datalake/input folder and output files are generated in the
/datalake/output folder. The file names are dynamic, based on the window start time that is passed in when the
pipeline is triggered.
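For example, assuming the pipeline defines a WindowStart parameter, you could pass its value when you trigger the run. This is a sketch using the Az PowerShell module, with placeholder variable values:

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName `
    -ResourceGroupName $resourceGroupName -PipelineName $pipelineName `
    -Parameter @{ "WindowStart" = "2021-06-01T00:00:00Z" }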

Next steps
See the following articles that explain how to transform data in other ways:
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Hive activity in Azure
Data Factory
3/21/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
HDInsight cluster. This article builds on the data transformation activities article, which presents a general
overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

Syntax
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveSript.hql",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name: Name of the activity. Required: Yes.

description: Text describing what the activity is used for. Required: No.

type: For the Hive Activity, the activity type is HDInsightHive. Required: Yes.

linkedServiceName: Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.

scriptLinkedService: Reference to an Azure Storage linked service used to store the Hive script to be executed. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.

scriptPath: Provide the path to the script file stored in the Azure Storage referred to by scriptLinkedService. The file name is case-sensitive. Required: Yes.

getDebugInfo: Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.

arguments: Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.

defines: Specify parameters as key/value pairs for referencing within the Hive script. Required: No.

queryTimeout: Query timeout value (in minutes). Applicable when the HDInsight cluster has the Enterprise Security Package enabled. Required: No.

NOTE
The default value for queryTimeout is 120 minutes.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop MapReduce activity
in Azure Data Factory
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The HDInsight MapReduce activity in a Data Factory pipeline invokes MapReduce program on your own or on-
demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial:
Tutorial: transform data before reading this article.
See Pig and Hive for details about running Pig/Hive scripts on a HDInsight cluster from a pipeline by using
HDInsight Pig and Hive activities.

Syntax
{
"name": "Map Reduce Activity",
"description": "Description",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.myorg.SampleClass",
"jarLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "MyAzureStorage/jars/sample.jar",
"getDebugInfo": "Failure",
"arguments": [
"-SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name: Name of the activity. Required: Yes.

description: Text describing what the activity is used for. Required: No.

type: For the MapReduce Activity, the activity type is HDInsightMapReduce. Required: Yes.

linkedServiceName: Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.

className: Name of the class to be executed. Required: Yes.

jarLinkedService: Reference to an Azure Storage linked service used to store the Jar files. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.

jarFilePath: Provide the path to the Jar files stored in the Azure Storage referred to by jarLinkedService. The file name is case-sensitive. Required: Yes.

jarlibs: String array of the path to the Jar library files referenced by the job, stored in the Azure Storage defined in jarLinkedService. The file name is case-sensitive. Required: No.

getDebugInfo: Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by jarLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.

arguments: Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.

defines: Specify parameters as key/value pairs for referencing within the MapReduce job. Required: No.

Example
You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In the
following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout JAR file.
{
"name": "MapReduce Activity for Mahout",
"description": "Custom MapReduce to generate Mahout result",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"jarLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
"arguments": [
"-s",
"SIMILARITY_LOGLIKELIHOOD",
"--input",
"wasb://[email protected]/Mahout/input",
"--output",
"wasb://[email protected]/Mahout/output/",
"--maxSimilaritiesPerItem",
"500",
"--tempDir",
"wasb://[email protected]/Mahout/temp/mahout"
]
}
}

You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a
few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. To differentiate your
arguments with the MapReduce arguments, consider using both option and value as arguments as shown in the
following example (-s,--input,--output etc., are options immediately followed by their values).

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Pig activity in Azure
Data Factory
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

Syntax
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\PigScripts\\MyPigSript.pig",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name: Name of the activity. Required: Yes.

description: Text describing what the activity is used for. Required: No.

type: For the Pig Activity, the activity type is HDInsightPig. Required: Yes.

linkedServiceName: Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.

scriptLinkedService: Reference to an Azure Storage linked service used to store the Pig script to be executed. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.

scriptPath: Provide the path to the script file stored in the Azure Storage referred to by scriptLinkedService. The file name is case-sensitive. Required: No.

getDebugInfo: Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.

arguments: Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.

defines: Specify parameters as key/value pairs for referencing within the Pig script. Required: No.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Spark activity in Azure Data
Factory
6/10/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight
cluster. This article builds on the data transformation activities article, which presents a general overview of data
transformation and the supported transformation activities. When you use an on-demand Spark linked service,
Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then deletes the
cluster once the processing is complete.

Spark activity properties


Here is the sample JSON definition of a Spark Activity:

{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}

The following table describes the JSON properties used in the JSON definition:

name: Name of the activity in the pipeline. Required: Yes.

description: Text describing what the activity does. Required: No.

type: For the Spark Activity, the activity type is HDInsightSpark. Required: Yes.

linkedServiceName: Name of the HDInsight Spark linked service on which the Spark program runs. To learn about this linked service, see the Compute linked services article. Required: Yes.

sparkJobLinkedService: The Azure Storage linked service that holds the Spark job file, dependencies, and logs. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you do not specify a value for this property, the storage associated with the HDInsight cluster is used. The value of this property can only be an Azure Storage linked service. Required: No.

rootPath: The Azure Blob container and folder that contains the Spark file. The file name is case-sensitive. Refer to the folder structure section (next section) for details about the structure of this folder. Required: Yes.

entryFilePath: Relative path to the root folder of the Spark code/package. The entry file must be either a Python file or a .jar file. Required: Yes.

className: The application's Java/Spark main class. Required: No.

arguments: A list of command-line arguments to the Spark program. Required: No.

proxyUser: The user account to impersonate to execute the Spark program. Required: No.

sparkConfig: Specify values for Spark configuration properties listed in the topic: Spark Configuration - Application properties. Required: No.

getDebugInfo: Specifies when the Spark log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by sparkJobLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.

Folder structure
Spark jobs are more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies such
as jar packages (placed in the java CLASSPATH), python files (placed on the PYTHONPATH), and any other files.
Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service. Then,
upload the dependent files to the appropriate subfolders in the root folder represented by entryFilePath. For
example, upload Python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At
runtime, the Data Factory service expects the following folder structure in the Azure Blob storage:
. (root): The root path of the Spark job in the storage linked service. Required: Yes. Type: Folder.

<user defined>: The path pointing to the entry file of the Spark job. Required: Yes. Type: File.

./jars: All files under this folder are uploaded and placed on the Java classpath of the cluster. Required: No. Type: Folder.

./pyFiles: All files under this folder are uploaded and placed on the PYTHONPATH of the cluster. Required: No. Type: Folder.

./files: All files under this folder are uploaded and placed in the executor working directory. Required: No. Type: Folder.

./archives: All files under this folder are uncompressed. Required: No. Type: Folder.

./logs: The folder that contains logs from the Spark cluster. Required: No. Type: Folder.

Here is an example for a storage containing two Spark job files in the Azure Blob Storage referenced by the
HDInsight linked service.

SparkJob1
main.jar
files
input1.txt
input2.txt
jars
package1.jar
package2.jar
logs

archives

pyFiles

SparkJob2
main.py
pyFiles
scrip1.py
script2.py
logs

archives

jars

files

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Streaming activity in
Azure Data Factory
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a
general overview of data transformation and the supported transformation activities.
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial:
transform data before reading this article.

JSON sample
{
"name": "Streaming Activity",
"description": "Description",
"type": "HDInsightStreaming",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mapper": "MyMapper.exe",
"reducer": "MyReducer.exe",
"combiner": "MyCombiner.exe",
"fileLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"filePaths": [
"<containername>/example/apps/MyMapper.exe",
"<containername>/example/apps/MyReducer.exe",
"<containername>/example/apps/MyCombiner.exe"
],
"input": "wasb://<containername>@<accountname>.blob.core.windows.net/example/input/MapperInput.txt",
"output":
"wasb://<containername>@<accountname>.blob.core.windows.net/example/output/ReducerOutput.txt",
"commandEnvironment": [
"CmdEnvVarName=CmdEnvVarValue"
],
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}

Syntax details
name: Name of the activity. Required: Yes.

description: Text describing what the activity is used for. Required: No.

type: For the Hadoop Streaming Activity, the activity type is HDInsightStreaming. Required: Yes.

linkedServiceName: Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. Required: Yes.

mapper: Specifies the name of the mapper executable. Required: Yes.

reducer: Specifies the name of the reducer executable. Required: Yes.

combiner: Specifies the name of the combiner executable. Required: No.

fileLinkedService: Reference to an Azure Storage linked service used to store the Mapper, Combiner, and Reducer programs to be executed. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. Required: No.

filePaths: Provide an array of paths to the Mapper, Combiner, and Reducer programs stored in the Azure Storage referred to by fileLinkedService. The paths are case-sensitive. Required: Yes.

input: Specifies the WASB path to the input file for the Mapper. Required: Yes.

output: Specifies the WASB path to the output file for the Reducer. Required: Yes.

getDebugInfo: Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by fileLinkedService. Allowed values: None, Always, or Failure. Default value: None. Required: No.

arguments: Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. Required: No.

defines: Specify parameters as key/value pairs for referencing within the script. Required: No.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Execute Azure Machine Learning pipelines in Azure
Data Factory
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Run your Azure Machine Learning pipelines as a step in your Azure Data Factory pipelines. The Machine
Learning Execute Pipeline activity enables batch prediction scenarios such as identifying possible loan defaults,
determining sentiment, and analyzing customer behavior patterns.
The below video features a six-minute introduction and demonstration of this feature.

Syntax
{
"name": "Machine Learning Execute Pipeline",
"type": "AzureMLExecutePipeline",
"linkedServiceName": {
"referenceName": "AzureMLService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mlPipelineId": "machine learning pipeline ID",
"experimentName": "experimentName",
"mlPipelineParameters": {
"mlParameterName": "mlParameterValue"
}
}
}

Type properties
name: Name of the activity in the pipeline. Allowed values: String. Required: Yes.

type: Type of the activity is 'AzureMLExecutePipeline'. Allowed values: String. Required: Yes.

linkedServiceName: Linked service to Azure Machine Learning. Allowed values: Linked service reference. Required: Yes.

mlPipelineId: ID of the published Azure Machine Learning pipeline. Allowed values: String (or expression with resultType of string). Required: Yes.

experimentName: Run history experiment name of the Machine Learning pipeline run. Allowed values: String (or expression with resultType of string). Required: No.

mlPipelineParameters: Key/value pairs to be passed to the published Azure Machine Learning pipeline endpoint. Keys must match the names of pipeline parameters defined in the published Machine Learning pipeline. Allowed values: Object with key/value pairs (or expression with resultType object). Required: No.

mlParentRunId: The parent Azure Machine Learning pipeline run ID. Allowed values: String (or expression with resultType of string). Required: No.

dataPathAssignments: Dictionary used for changing data paths in Azure Machine Learning. Enables the switching of data paths. Allowed values: Object with key/value pairs. Required: No.

continueOnStepFailure: Whether to continue execution of other steps in the Machine Learning pipeline run if a step fails. Allowed values: Boolean. Required: No.

NOTE
To populate the dropdown items for the Machine Learning pipeline name and ID, the user needs permission to list ML
pipelines. The ADF UX calls AzureMLService APIs directly by using the logged-in user's credentials.

Next steps
See the following articles that explain how to transform data in other ways:
Execute Data Flow activity
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Create a predictive pipeline using Azure Machine
Learning Studio (classic) and Azure Data Factory
3/5/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Machine Learning Studio (classic) enables you to build, test, and deploy predictive analytics solutions.
From a high-level point of view, it is done in three steps:
1. Create a training experiment. You do this step by using the Azure Machine Learning Studio (classic). Azure Machine Learning Studio (classic) is a collaborative visual development environment that you use to train and test a predictive analytics model using training data.
2. Convert it to a predictive experiment. Once your model has been trained with existing data and you are ready to use it to score new data, you prepare and streamline your experiment for scoring.
3. Deploy it as a web service. You can publish your scoring experiment as an Azure web service. You can send data to your model via this web service endpoint and receive result predictions from the model.
Data Factory and Azure Machine Learning Studio (classic) together
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning Studio
(classic) web service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory
pipeline, you can invoke an Azure Machine Learning Studio (classic) web service to make predictions on the data
in batch.
Over time, the predictive models in the Azure Machine Learning Studio (classic) scoring experiments need to be
retrained using new input datasets. You can retrain a model from a Data Factory pipeline by doing the following
steps:
1. Publish the training experiment (not predictive experiment) as a web service. You do this step in the Azure
Machine Learning Studio (classic) as you did to expose predictive experiment as a web service in the previous
scenario.
2. Use the Azure Machine Learning Studio (classic) Batch Execution Activity to invoke the web service for the
training experiment. Basically, you can use the Azure Machine Learning Studio (classic) Batch Execution
activity to invoke both training web service and scoring web service.
After you are done with retraining, update the scoring web service (the predictive experiment exposed as a web service) with the newly trained model by using the Azure Machine Learning Studio (classic) Update Resource Activity. See the Updating models using Update Resource Activity article for details.

Azure Machine Learning Studio (classic) linked service


You create an Azure Machine Learning Studio (classic) linked service to link an Azure Machine Learning
Studio (classic) Web Service to an Azure data factory. The Linked Service is used by Azure Machine Learning
Studio (classic) Batch Execution Activity and Update Resource Activity.
{
"type" : "linkedServices",
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "URL to Azure ML Predictive Web Service",
"apiKey": {
"type": "SecureString",
"value": "api key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

See Compute linked services article for descriptions about properties in the JSON definition.
Azure Machine Learning Studio (classic) supports both Classic Web Services and New Web Services for your
predictive experiment. You can choose the right one to use from Data Factory. To get the information required to
create the Azure Machine Learning Studio (classic) Linked Service, go to https://services.azureml.net, where all
your (new) Web Services and Classic Web Services are listed. Click the Web Service you would like to access,
and click the Consume page. Copy the Primary Key for the apiKey property, and Batch Requests for the mlEndpoint property.

Azure Machine Learning Studio (classic) Batch Execution activity


The following JSON snippet defines an Azure Machine Learning Studio (classic) Batch Execution activity. The
activity definition has a reference to the Azure Machine Learning Studio (classic) linked service you created
earlier.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"<web service input name 1>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"path1"
},
"<web service input name 2>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"path2"
}
},
"webServiceOutputs": {
"<web service output name 1>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"path3"
},
"<web service output name 2>": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"path4"
}
},
"globalParameters": {
"<Parameter 1 Name>": "<parameter value>",
"<parameter 2 name>": "<parameter 2 value>"
}
}
}

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For the Azure Machine Learning Studio (classic) Batch Execution activity, the activity type is AzureMLBatchExecution. | Yes
linkedServiceName | Linked service reference to the Azure Machine Learning Studio (classic) linked service. To learn about this linked service, see the Compute linked services article. | Yes
webServiceInputs | Key, Value pairs, mapping the names of Azure Machine Learning Studio (classic) Web Service inputs. Key must match the input parameters defined in the published Azure Machine Learning Studio (classic) Web Service. Value is an Azure Storage linked service and FilePath properties pair specifying the input Blob locations. | No
webServiceOutputs | Key, Value pairs, mapping the names of Azure Machine Learning Studio (classic) Web Service outputs. Key must match the output parameters defined in the published Azure Machine Learning Studio (classic) Web Service. Value is an Azure Storage linked service and FilePath properties pair specifying the output Blob locations. | No
globalParameters | Key, Value pairs to be passed to the Azure Machine Learning Studio (classic) Batch Execution Service endpoint. Keys must match the names of web service parameters defined in the published Azure Machine Learning Studio (classic) web service. Values are passed in the GlobalParameters property of the Azure Machine Learning Studio (classic) batch execution request. | No

Scenario 1: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage
In this scenario, the Azure Machine Learning Studio (classic) Web service makes predictions using data from a file in an Azure blob storage and stores the prediction results in the blob storage. The following JSON defines a Data Factory pipeline with an AzureMLBatchExecution activity. The input and output data in Azure Blob Storage is referenced using a LinkedServiceName and FilePath pair. In the sample, the linked services for the inputs and outputs are different; you can use different linked services for each of your inputs/outputs so that Data Factory can pick up the right files and send them to the Azure Machine Learning Studio (classic) Web Service.

IMPORTANT
In your Azure Machine Learning Studio (classic) experiment, web service input and output ports, and global parameters
have default names ("input1", "input2") that you can customize. The names you use for webServiceInputs,
webServiceOutputs, and globalParameters settings must exactly match the names in the experiments. You can view the
sample request payload on the Batch Execution Help page for your Azure Machine Learning Studio (classic) endpoint to
verify the expected mapping.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in1.csv"
},
"input2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in2.csv"
}
},
"webServiceOutputs": {
"outputName1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out1.csv"
},
"outputName2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out2.csv"
}
}
}
}

Scenario 2: Experiments using Reader/Writer Modules to refer to data in various storages


Another common scenario when creating Azure Machine Learning Studio (classic) experiments is to use the Import Data and Output Data modules. The Import Data module is used to load data into an experiment, and the Output Data module is used to save data from your experiments. For details about the Import Data and Output Data modules, see the Import Data and Output Data topics on MSDN Library.
When using the Import Data and Output Data modules, it is good practice to use a Web service parameter for each property of these modules. These Web service parameters enable you to configure the values during runtime. For example, you could create an experiment with an Import Data module that uses an Azure SQL Database: XXX.database.windows.net. After the web service has been deployed, you want to enable the consumers of the web service to specify another logical SQL server called YYY.database.windows.net. You can use a Web service parameter to allow this value to be configured.
NOTE
Web service input and output are different from Web service parameters. In the first scenario, you have seen how an
input and output can be specified for an Azure Machine Learning Studio (classic) Web service. In this scenario, you pass
parameters for a Web service that correspond to properties of Import Data/Output Data modules.

Let's look at a scenario for using Web service parameters. You have a deployed Azure Machine Learning Studio
(classic) web service that uses a reader module to read data from one of the data sources supported by Azure
Machine Learning Studio (classic) (for example: Azure SQL Database). After the batch execution is performed, the
results are written using a Writer module (Azure SQL Database). No web service inputs and outputs are defined
in the experiments. In this case, we recommend that you configure relevant web service parameters for the
reader and writer modules. This configuration allows the reader/writer modules to be configured when using
the AzureMLBatchExecution activity. You specify Web service parameters in the globalParameters section in
the activity JSON as follows.

"typeProperties": {
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
}

NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones
exposed by the Web service.
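
Putting this together, the following is a minimal sketch of a complete Batch Execution activity for this scenario: no web service inputs or outputs are defined, and only Web service parameters are passed. The linked service name and the parameter names are placeholders that must match your deployed Azure Machine Learning Studio (classic) web service.

{
    "name": "AzureMLBatchExecutionWithGlobalParameters",
    "description": "Invoke a web service that reads and writes data through Import Data/Output Data modules",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": {
        "referenceName": "AzureMLLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "globalParameters": {
            "Database server name": "<myserver>.database.windows.net",
            "Database name": "<database>",
            "Server user account name": "<user name>",
            "Server user account password": "<password>"
        }
    }
}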

After you are done with retraining, update the scoring web service (the predictive experiment exposed as a web service) with the newly trained model by using the Azure Machine Learning Studio (classic) Update Resource Activity. See the Updating models using Update Resource Activity article for details.

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Update Azure Machine Learning Studio (classic)
models by using Update Resource activity
3/22/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article complements the main Azure Data Factory - Azure Machine Learning Studio (classic) integration
article: Create predictive pipelines using Azure Machine Learning Studio (classic) and Azure Data Factory. If you
haven't already done so, review the main article before reading through this article.

Overview
As part of the process of operationalizing Azure Machine Learning Studio (classic) models, your model is trained
and saved. You then use it to create a predictive Web service. The Web service can then be consumed in web
sites, dashboards, and mobile apps.
Models you create using Azure Machine Learning Studio (classic) are typically not static. As new data becomes available, or when the consumer of the API has their own data, the model needs to be retrained.
Retraining may occur frequently. With the Batch Execution activity and the Update Resource activity, you can operationalize Azure Machine Learning Studio (classic) model retraining and update the predictive Web Service by using Data Factory.
The following picture depicts the relationship between training and predictive Web Services.

Azure Machine Learning Studio (classic) update resource activity


The following JSON snippet defines an Azure Machine Learning Studio (classic) Update Resource activity.
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "description",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ModelName",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "ilearner file path"
}
}

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline | Yes
description | Text describing what the activity does. | No
type | For Azure Machine Learning Studio (classic) Update Resource activity, the activity type is AzureMLUpdateResource. | Yes
linkedServiceName | Azure Machine Learning Studio (classic) linked service that contains the updateResourceEndpoint property. | Yes
trainedModelName | Name of the Trained Model module in the Web Service experiment to be updated | Yes
trainedModelLinkedServiceName | Name of the Azure Storage linked service holding the ilearner file that is uploaded by the update operation | Yes
trainedModelFilePath | The relative file path in trainedModelLinkedService to represent the ilearner file that is uploaded by the update operation | Yes

End-to-end workflow
The entire process of operationalizing the retraining of a model and updating the predictive Web Service involves the following steps:
Invoke the training Web Service by using the Batch Execution activity. Invoking a training Web Service is the same as invoking a predictive Web Service described in Create predictive pipelines using Azure Machine Learning Studio (classic) and Data Factory Batch Execution activity. The output of the training Web Service is an iLearner file that you can use to update the predictive Web Service.
Invoke the update resource endpoint of the predictive Web Service by using the Update Resource activity to update the Web Service with the newly trained model.
Azure Machine Learning Studio (classic) linked service
For the above-mentioned end-to-end workflow to work, you need to create two Azure Machine Learning Studio (classic) linked services:
1. An Azure Machine Learning Studio (classic) linked service to the training web service. This linked service is used by the Batch Execution activity in the same way as described in Create predictive pipelines using Azure Machine Learning Studio (classic) and Data Factory Batch Execution activity. The difference is that the output of the training web service is an iLearner file, which is then used by the Update Resource activity to update the predictive web service.
2. An Azure Machine Learning Studio (classic) linked service to the update resource endpoint of the predictive web service. This linked service is used by the Update Resource activity to update the predictive web service using the iLearner file returned from the step above.
For the second Azure Machine Learning Studio (classic) linked service, the configuration differs depending on whether your Azure Machine Learning Studio (classic) Web Service is a classic Web Service or a new Web Service. The differences are discussed separately in the following sections.

Web service is new Azure Resource Manager web service


If the web service is the new type of web service that exposes an Azure Resource Manager endpoint, you do not
need to add the second non-default endpoint. The updateResourceEndpoint in the linked service is of the
format:

https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview

You can get values for the placeholders in the URL when querying the web service on the Azure Machine Learning Studio (classic) Web Services Portal.
The new type of update resource endpoint requires service principal authentication. To use service principal authentication, register an application entity in Azure Active Directory (Azure AD) and grant it the Contributor or Owner role of the subscription or the resource group that the web service belongs to. See How to create a service principal and assign permissions to manage Azure resources. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
Here is a sample linked service definition:
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"description": "The linked service for AML web service.",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/0000000000000000
000000000000000000000/services/0000000000000000000000000000000000000/jobs?api-version=2.0",
"apiKey": {
"type": "SecureString",
"value": "APIKeyOfEndpoint1"
},
"updateResourceEndpoint":
"https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview",
"servicePrincipalId": "000000000-0000-0000-0000-0000000000000",
"servicePrincipalKey": {
"type": "SecureString",
"value": "servicePrincipalKey"
},
"tenant": "mycompany.com"
}
}
}

The following scenario provides more details. It has an example for retraining and updating Azure Machine
Learning Studio (classic) models from an Azure Data Factory pipeline.

Sample: Retraining and updating an Azure Machine Learning Studio


(classic) model
This section provides a sample pipeline that uses the Azure Machine Learning Studio (classic) Batch
Execution activity to retrain a model. The pipeline also uses the Azure Machine Learning Studio (classic)
Update Resource activity to update the model in the scoring web service. The section also provides JSON
snippets for all the linked services, datasets, and pipeline in the example.
Azure Blob storage linked service:
The Azure Storage holds the following data:
training data. The input data for the Azure Machine Learning Studio (classic) training web service.
iLearner file. The output from the Azure Machine Learning Studio (classic) training web service. This file is
also the input to the Update Resource activity.
Here is the sample JSON definition of the linked service:

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=name;AccountKey=key"
}
}
}

Linked service for Azure Machine Learning Studio (classic) training endpoint
The following JSON snippet defines an Azure Machine Learning Studio (classic) linked service that points to the
default endpoint of the training web service.
{
"name": "trainingEndpoint",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--training
experiment--/jobs",
"apiKey": "myKey"
}
}
}

In Azure Machine Learning Studio (classic), do the following to get values for mlEndpoint and apiKey:
1. Click WEB SERVICES on the left menu.
2. Click the training web service in the list of web services.
3. Click copy next to the API key text box. Paste the key from the clipboard into the Data Factory JSON editor.
4. In Azure Machine Learning Studio (classic), click the BATCH EXECUTION link.
5. Copy the Request URI from the Request section and paste it into the Data Factory JSON editor.
Linked service for Azure Machine Learning Studio (classic) updatable scoring endpoint:
The following JSON snippet defines an Azure Machine Learning Studio (classic) linked service that points to
updatable endpoint of the scoring web service.

{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/00000000
026734a5889e02fbb1f65cefd/jobs?api-version=2.0",
"apiKey":
"sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview",
"servicePrincipalId": "fe200044-c008-4008-a005-94000000731",
"servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=",
"tenant": "mycompany.com"
}
}
}

Pipeline
The pipeline has two activities: AzureMLBatchExecution and AzureMLUpdateResource. The Batch Execution activity takes the training data as input and produces an iLearner file as an output. The Update Resource activity then takes this iLearner file and uses it to update the predictive web service.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "amlBEGetilearner",
"description": "Use AML BES to get the ileaner file from training web service",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "trainingEndpoint",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
},
"input2": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
}
},
"webServiceOutputs": {
"output1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/output"
}
}
}
},
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "Use AML Update Resource to update the predict web service",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ADFV2Sample Model [trained model]",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "azuremltesting/output/newModelForArm.ilearner"
},
"dependsOn": [
{
"activity": "amlbeGetilearner",
"dependencyConditions": [ "Succeeded" ]
}
]
}
]
}
}
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Transform data by using the SQL Server Stored
Procedure activity in Azure Data Factory
6/24/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You use data transformation activities in a Data Factory pipeline to transform and process raw data into
predictions and insights. The Stored Procedure Activity is one of the transformation activities that Data Factory
supports. This article builds on the transform data article, which presents a general overview of data
transformation and the supported transformation activities in Data Factory.

NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Tutorial:
transform data before reading this article.

You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM):
Azure SQL Database
Azure Synapse Analytics
SQL Server Database. If you are using SQL Server, install Self-hosted integration runtime on the same
machine that hosts the database or on a separate machine that has access to the database. Self-Hosted
integration runtime is a component that connects data sources on-premises/on Azure VM with cloud
services in a secure and managed way. See Self-hosted integration runtime article for details.

IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For details about the property, see following
connector articles: Azure SQL Database, SQL Server. Invoking a stored procedure while copying data into an Azure
Synapse Analytics by using a copy activity is not supported. But, you can use the stored procedure activity to invoke a
stored procedure in Azure Synapse Analytics.
When copying data from Azure SQL Database or SQL Server or Azure Synapse Analytics, you can configure SqlSource in
copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure Synapse Analytics
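
For illustration only, a copy activity source fragment that uses the sqlReaderStoredProcedureName property mentioned above might look like the following sketch. The SqlSource type, the stored procedure name usp_GetChangedRows, and its parameter are assumptions for this example; see the connector articles above for the exact options your source supports.

"source": {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "usp_GetChangedRows",
    "storedProcedureParameters": {
        "cutoffDate": { "value": "2021-06-01", "type": "Datetime" }
    }
}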

Syntax details
Here is the JSON format for defining a Stored Procedure Activity:
{
"name": "Stored Procedure Activity",
"description":"Description",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "usp_sample",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }

}
}
}

The following table describes these JSON properties:

PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity | Yes
description | Text describing what the activity is used for | No
type | For Stored Procedure Activity, the activity type is SqlServerStoredProcedure | Yes
linkedServiceName | Reference to the Azure SQL Database, Azure Synapse Analytics, or SQL Server registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. | Yes
storedProcedureName | Specify the name of the stored procedure to invoke. | Yes
storedProcedureParameters | Specify the values for stored procedure parameters. Use "param1": { "value": "param1Value", "type": "param1Type" } to pass parameter values and their type supported by the data source. If you need to pass null for a parameter, use "param1": { "value": null } (all lower case). | No

Parameter data type mapping


The data type you specify for the parameter is the Azure Data Factory type that maps to the data type in the data
source you are using. You can find the data type mappings for your data source described in the connectors
documentation. For example:
Azure Synapse Analytics
Azure SQL Database data type mapping
Oracle data type mapping
SQL Server data type mapping
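
For example, a hypothetical stored procedure with an integer parameter, a datetime parameter, and a nullable string parameter could be invoked with the following storedProcedureParameters fragment. The parameter names are illustrative, and the type names used here (Int, Datetime) should be confirmed against the mapping tables linked above for your data source.

"storedProcedureParameters": {
    "RetryCount": { "value": "3", "type": "Int" },
    "CutoffDate": { "value": "2021-06-24T00:00:00Z", "type": "Datetime" },
    "Comment": { "value": null }
}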

Next steps
See the following articles that explain how to transform data in other ways:
U-SQL Activity
Hive Activity
Pig Activity
MapReduce Activity
Hadoop Streaming Activity
Spark Activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution Activity
Stored procedure activity
Compute environments supported by Azure Data
Factory
5/28/2021 • 22 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article explains different compute environments that you can use to process or transform data. It also
provides details about different configurations (on-demand vs. bring your own) supported by Data Factory
when configuring linked services linking these compute environments to an Azure data factory.
The following table provides a list of compute environments supported by Data Factory and the activities that
can run on them.

COMPUTE ENVIRONMENT | ACTIVITIES
On-demand HDInsight cluster or your own HDInsight cluster | Hive, Pig, Spark, MapReduce, Hadoop Streaming
Azure Batch | Custom
Azure Machine Learning Studio (classic) | Machine Learning Studio (classic) activities: Batch Execution and Update Resource
Azure Machine Learning | Azure Machine Learning Execute Pipeline
Azure Data Lake Analytics | Data Lake Analytics U-SQL
Azure SQL, Azure Synapse Analytics, SQL Server | Stored Procedure
Azure Databricks | Notebook, Jar, Python
Azure Function | Azure Function activity

HDInsight compute environment


Refer to the following table for details about the storage linked service types supported for configuration in the on-demand and BYOC (bring your own compute) environments.

COMPUTE LINKED SERVICE | PROPERTY NAME | DESCRIPTION | BLOB | ADLS GEN 2 | AZURE SQL DB | ADLS GEN 1
On-demand | linkedServiceName | Azure Storage linked service to be used by the on-demand cluster for storing and processing data. | Yes | Yes | No | No
On-demand | additionalLinkedServiceNames | Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. | Yes | No | No | No
On-demand | hcatalogLinkedServiceName | The name of the Azure SQL linked service that points to the HCatalog database. The on-demand HDInsight cluster is created by using the Azure SQL database as the metastore. | No | No | Yes | No
BYOC | linkedServiceName | The Azure Storage linked service reference. | Yes | Yes | No | No
BYOC | additionalLinkedServiceNames | Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. | No | No | No | No
BYOC | hcatalogLinkedServiceName | A reference to the Azure SQL linked service that points to the HCatalog database. | No | No | No | No

Azure HDInsight on-demand linked service


In this type of configuration, the computing environment is fully managed by the Azure Data Factory service. It
is automatically created by the Data Factory service before a job is submitted to process data and removed
when the job is completed. You can create a linked service for the on-demand compute environment, configure
it, and control granular settings for job execution, cluster management, and bootstrapping actions.

NOTE
The on-demand configuration is currently supported only for Azure HDInsight clusters. Azure Databricks also supports
on-demand jobs using job clusters. For more information, see Azure databricks linked service.

The Azure Data Factory service can automatically create an on-demand HDInsight cluster to process data. The
cluster is created in the same region as the storage account (linkedServiceName property in the JSON)
associated with the cluster. The storage account must be a general-purpose standard Azure Storage account.
Note the following important points about the on-demand HDInsight linked service:
The on-demand HDInsight cluster is created under your Azure subscription. You are able to see the cluster in
your Azure portal when the cluster is up and running.
The logs for jobs that are run on an on-demand HDInsight cluster are copied to the storage account
associated with the HDInsight cluster. The clusterUserName, clusterPassword, clusterSshUserName,
clusterSshPassword defined in your linked service definition are used to log in to the cluster for in-depth
troubleshooting during the lifecycle of the cluster.
You are charged only for the time when the HDInsight cluster is up and running jobs.
You can use a Script Action with the Azure HDInsight on-demand linked service.

IMPORTANT
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand.

Example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster to process the required activity.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterType": "hadoop",
"clusterSize": 1,
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenent id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed, unless there is an existing live cluster (timeToLive); the cluster is deleted when the processing is done.
As more activity runs occur, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern:
adf**yourdatafactoryname**-**linkedservicename**-datetimestamp . Use tools such as Microsoft Azure Storage
Explorer to delete containers in your Azure blob storage.

Properties

PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to HDInsightOnDemand. | Yes
clusterSize | Number of worker/data nodes in the cluster. The HDInsight cluster is created with 2 head nodes along with the number of worker nodes you specify for this property. The nodes are of size Standard_D3 that has 4 cores, so a 4 worker node cluster takes 24 cores (4*4 = 16 cores for worker nodes, plus 2*4 = 8 cores for head nodes). See Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more for details. | Yes
linkedServiceName | Azure Storage linked service to be used by the on-demand cluster for storing and processing data. The HDInsight cluster is created in the same region as this Azure Storage account. Azure HDInsight has a limitation on the total number of cores you can use in each Azure region it supports. Make sure you have enough core quotas in that Azure region to meet the required clusterSize. For details, refer to Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more. Currently, you cannot create an on-demand HDInsight cluster that uses Azure Data Lake Storage (Gen 2) as the storage. If you want to store the result data from HDInsight processing in Azure Data Lake Storage (Gen 2), use a Copy Activity to copy the data from the Azure Blob Storage to the Azure Data Lake Storage (Gen 2). | Yes
clusterResourceGroup | The HDInsight cluster is created in this resource group. | Yes
timetolive | The allowed idle time for the on-demand HDInsight cluster. Specifies how long the on-demand HDInsight cluster stays alive after completion of an activity run if there are no other active jobs in the cluster. The minimal allowed value is 5 minutes (00:05:00). For example, if an activity run takes 6 minutes and timetolive is set to 5 minutes, the cluster stays alive for 5 minutes after the 6 minutes of processing the activity run. If another activity run is executed within that window, it is processed by the same cluster. Creating an on-demand HDInsight cluster is an expensive operation (it could take a while), so use this setting as needed to improve the performance of a data factory by reusing an on-demand HDInsight cluster. If you set the timetolive value to 0, the cluster is deleted as soon as the activity run completes. Whereas, if you set a high value, the cluster may stay idle for you to log on for some troubleshooting purpose, but it could result in high costs. Therefore, it is important that you set the appropriate value based on your needs. If the timetolive property value is appropriately set, multiple pipelines can share the instance of the on-demand HDInsight cluster. | Yes
clusterType | The type of the HDInsight cluster to be created. Allowed values are "hadoop" and "spark". If not specified, the default value is hadoop. An Enterprise Security Package enabled cluster cannot be created on-demand; instead, use an existing cluster (bring your own compute). | No
version | Version of the HDInsight cluster. If not specified, it uses the current HDInsight defined default version. | No
hostSubscriptionId | The Azure subscription ID used to create the HDInsight cluster. If not specified, it uses the subscription ID of your Azure login context. | No
clusterNamePrefix | The prefix of the HDI cluster name; a timestamp is automatically appended at the end of the cluster name. | No
sparkVersion | The version of Spark if the cluster type is "Spark". | No
additionalLinkedServiceNames | Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. These storage accounts must be in the same region as the HDInsight cluster, which is created in the same region as the storage account specified by linkedServiceName. | No
osType | Type of operating system. Allowed values are: Linux and Windows (for HDInsight 3.3 only). Default is Linux. | No
hcatalogLinkedServiceName | The name of the Azure SQL linked service that points to the HCatalog database. The on-demand HDInsight cluster is created by using the Azure SQL Database as the metastore. | No
connectVia | The Integration Runtime to be used to dispatch the activities to this HDInsight linked service. The on-demand HDInsight linked service only supports Azure Integration Runtime. If not specified, it uses the default Azure Integration Runtime. | No
clusterUserName | The username to access the cluster. | No
clusterPassword | The password in type of secure string to access the cluster. | No
clusterSshUserName | The username to SSH remotely connect to the cluster's node (for Linux). | No
clusterSshPassword | The password in type of secure string to SSH remotely connect to the cluster's node (for Linux). | No
scriptActions | Specify scripts for HDInsight cluster customizations during on-demand cluster creation. Currently, Azure Data Factory's User Interface authoring tool supports specifying only 1 script action, but you can get around this limitation in the JSON (specify multiple script actions in the JSON). | No
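
As a hedged sketch of the scriptActions property, the fragment below shows the general shape of a script action entry (a name, a URI to the script, and optional parameters). The script name, URI, and parameters are placeholders; verify the exact fields against the HDInsight script action documentation before using them.

"scriptActions": [
    {
        "name": "installCustomComponent",
        "uri": "https://<storageaccount>.blob.core.windows.net/scripts/install.sh",
        "parameters": "<optional arguments>"
    }
]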
IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific version
of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution.
The list of supported HDInsight versions keeps being updated to provide latest Hadoop ecosystem components and fixes.
Make sure you always refer to latest information of Supported HDInsight version and OS Type to ensure you are using
supported version of HDInsight.

IMPORTANT
Currently, HDInsight linked services does not support HBase, Interactive Query (Hive LLAP), Storm.

additionalLinkedServiceNames JSON example

"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]

Service principal authentication


The On-Demand HDInsight linked service requires a service principal authentication to create HDInsight clusters
on your behalf. To use service principal authentication, register an application entity in Azure Active Directory
(Azure AD) and grant it the Contributor role of the subscription or the resource group in which the HDInsight
cluster is created. For detailed steps, see Use portal to create an Azure Active Directory application and service
principal that can access resources. Make note of the following values, which you use to define the linked
service:
Application ID
Application key
Tenant ID
Use service principal authentication by specifying the following properties:

PROPERTY | DESCRIPTION | REQUIRED
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. | Yes
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes

Advanced Properties
You can also specify the following properties for the granular configuration of the on-demand HDInsight cluster.

PROPERTY | DESCRIPTION | REQUIRED
coreConfiguration | Specifies the core configuration parameters (as in core-site.xml) for the HDInsight cluster to be created. | No
hBaseConfiguration | Specifies the HBase configuration parameters (hbase-site.xml) for the HDInsight cluster. | No
hdfsConfiguration | Specifies the HDFS configuration parameters (hdfs-site.xml) for the HDInsight cluster. | No
hiveConfiguration | Specifies the Hive configuration parameters (hive-site.xml) for the HDInsight cluster. | No
mapReduceConfiguration | Specifies the MapReduce configuration parameters (mapred-site.xml) for the HDInsight cluster. | No
oozieConfiguration | Specifies the Oozie configuration parameters (oozie-site.xml) for the HDInsight cluster. | No
stormConfiguration | Specifies the Storm configuration parameters (storm-site.xml) for the HDInsight cluster. | No
yarnConfiguration | Specifies the Yarn configuration parameters (yarn-site.xml) for the HDInsight cluster. | No

Example – On-demand HDInsight cluster configuration with advanced properties


{
"name": " HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 16,
"timeToLive": "01:30:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenent id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"coreConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"hiveConfiguration": {
"templeton.mapper.memory.mb": "5000"
},
"mapReduceConfiguration": {
"mapreduce.reduce.java.opts": "-Xmx4000m",
"mapreduce.map.java.opts": "-Xmx4000m",
"mapreduce.map.memory.mb": "5000",
"mapreduce.reduce.memory.mb": "5000",
"mapreduce.job.reduce.slowstart.completedmaps": "0.8"
},
"yarnConfiguration": {
"yarn.app.mapreduce.am.resource.mb": "5000",
"mapreduce.map.memory.mb": "5000"
},
"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}

Node sizes
You can specify the sizes of head, data, and zookeeper nodes using the following properties:

PROPERTY | DESCRIPTION | REQUIRED
headNodeSize | Specifies the size of the head node. The default value is: Standard_D3. See the Specifying node sizes section for details. | No
dataNodeSize | Specifies the size of the data node. The default value is: Standard_D3. | No
zookeeperNodeSize | Specifies the size of the ZooKeeper node. The default value is: Standard_D3. | No

Specifying node sizes
See the Sizes of Virtual Machines article for the string values you need to specify for the properties mentioned in the previous section. The values need to conform to the CMDLETs & APIs referenced in the article. As you can see in the article, the data node of Large (default) size has 7-GB memory, which may not be good enough for your scenario.
If you want to create D4 sized head nodes and worker nodes, specify Standard_D4 as the value for the headNodeSize and dataNodeSize properties.

"headNodeSize": "Standard_D4",
"dataNodeSize": "Standard_D4",

If you specify a wrong value for these properties, you may receive the following error: Failed to create cluster. Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind state: 'Error'. Message: 'PreClusterCreationValidationFailure'. When you receive this error, ensure that you are using the CMDLET & APIs name from the table in the Sizes of Virtual Machines article.
Bring your own compute environment
In this type of configuration, users can register an already existing computing environment as a linked service in
Data Factory. The computing environment is managed by the user and the Data Factory service uses it to
execute the activities.
This type of configuration is supported for the following compute environments:
Azure HDInsight
Azure Batch
Azure Machine Learning
Azure Data Lake Analytics
Azure SQL DB, Azure Synapse Analytics, SQL Server

Azure HDInsight linked service


You can create an Azure HDInsight linked service to register your own HDInsight cluster with Data Factory.
Example
{
"name": "HDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/",
"userName": "username",
"password": {
"value": "passwordvalue",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to HDInsight. | Yes
clusterUri | The URI of the HDInsight cluster. | Yes
username | Specify the name of the user to be used to connect to an existing HDInsight cluster. | Yes
password | Specify the password for the user account. | Yes
linkedServiceName | Name of the Azure Storage linked service that refers to the Azure blob storage used by the HDInsight cluster. Currently, you cannot specify an Azure Data Lake Storage (Gen 2) linked service for this property. If the HDInsight cluster has access to the Data Lake Store, you may access data in the Azure Data Lake Storage (Gen 2) from Hive/Pig scripts. | Yes
isEspEnabled | Specify 'true' if the HDInsight cluster is Enterprise Security Package enabled. Default is 'false'. | No
connectVia | The Integration Runtime to be used to dispatch the activities to this linked service. You can use Azure Integration Runtime or Self-hosted Integration Runtime. If not specified, it uses the default Azure Integration Runtime. For an Enterprise Security Package (ESP) enabled HDInsight cluster, use a self-hosted integration runtime that has a line of sight to the cluster, or it should be deployed inside the same Virtual Network as the ESP HDInsight cluster. | No

IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific version
of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution.
The list of supported HDInsight versions keeps being updated to provide latest Hadoop ecosystem components and fixes.
Make sure you always refer to latest information of Supported HDInsight version and OS Type to ensure you are using
supported version of HDInsight.

IMPORTANT
Currently, HDInsight linked services does not support HBase, Interactive Query (Hive LLAP), Storm.

Azure Batch linked service


NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) to a data factory.
You can run Custom activity using Azure Batch.
See the following articles if you are new to the Azure Batch service:
Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account (or) Azure portal to create the Azure Batch
account using Azure portal. See Using PowerShell to manage Azure Batch Account article for detailed
instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.

IMPORTANT
When creating a new Azure Batch pool, ‘VirtualMachineConfiguration’ must be used and NOT
‘CloudServiceConfiguration'. For more details refer Azure Batch Pool migration guidance.

Example
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to AzureBatch. | Yes
accountName | Name of the Azure Batch account. | Yes
accessKey | Access key for the Azure Batch account. | Yes
batchUri | URL to your Azure Batch account, in the format https://batchaccountname.region.batch.azure.com. | Yes
poolName | Name of the pool of virtual machines. | Yes
linkedServiceName | Name of the Azure Storage linked service associated with this Azure Batch linked service. This linked service is used for staging files required to run the activity. | Yes
connectVia | The Integration Runtime to be used to dispatch the activities to this linked service. You can use Azure Integration Runtime or Self-hosted Integration Runtime. If not specified, it uses the default Azure Integration Runtime. | No

Azure Machine Learning Studio (classic) linked service


You create an Azure Machine Learning Studio (classic) linked service to register a Machine Learning Studio
(classic) batch scoring endpoint to a data factory.
Example

{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": {
"type": "SecureString",
"value": "access key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to: AzureML. | Yes
mlEndpoint | The batch scoring URL. | Yes
apiKey | The published workspace model's API key. | Yes
updateResourceEndpoint | The Update Resource URL for an Azure Machine Learning Studio (classic) Web Service endpoint used to update the predictive Web Service with the trained model file. | No
servicePrincipalId | Specify the application's client ID. | Required if updateResourceEndpoint is specified
servicePrincipalKey | Specify the application's key. | Required if updateResourceEndpoint is specified
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Required if updateResourceEndpoint is specified
connectVia | The Integration Runtime to be used to dispatch the activities to this linked service. You can use Azure Integration Runtime or Self-hosted Integration Runtime. If not specified, it uses the default Azure Integration Runtime. | No

Azure Machine Learning linked service


You create an Azure Machine Learning linked service to connect an Azure Machine Learning workspace to a data
factory.

NOTE
Currently only service principal authentication is supported for the Azure Machine Learning linked service.

Example

{
"name": "AzureMLServiceLinkedService",
"properties": {
"type": "AzureMLService",
"typeProperties": {
"subscriptionId": "subscriptionId",
"resourceGroupName": "resourceGroupName",
"mlWorkspaceName": "mlWorkspaceName",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID"
},
"connectVia": {
"referenceName": "<name of Integration Runtime?",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to: AzureMLService. | Yes
subscriptionId | Azure subscription ID | Yes
resourceGroupName | Azure resource group name | Yes
mlWorkspaceName | Azure Machine Learning workspace name | Yes
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. | Yes
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes
connectVia | The Integration Runtime to be used to dispatch the activities to this linked service. You can use Azure Integration Runtime or Self-hosted Integration Runtime. If not specified, it uses the default Azure Integration Runtime. | No

Azure Data Lake Analytics linked service


You create an Azure Data Lake Analytics linked service to link an Azure Data Lake Analytics compute service
to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service.
Example

{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics URI",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID",
"subscriptionId": "<optional, subscription ID of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
type | The type property should be set to: AzureDataLakeAnalytics. | Yes
accountName | Azure Data Lake Analytics account name. | Yes
dataLakeAnalyticsUri | Azure Data Lake Analytics URI. | No
subscriptionId | Azure subscription ID | No
resourceGroupName | Azure resource group name | No
servicePrincipalId | Specify the application's client ID. | Yes
servicePrincipalKey | Specify the application's key. | Yes
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes
connectVia | The Integration Runtime to be used to dispatch the activities to this linked service. You can use Azure Integration Runtime or Self-hosted Integration Runtime. If not specified, it uses the default Azure Integration Runtime. | No

Azure Databricks linked service


You can create an Azure Databricks linked service to register the Databricks workspace that you use to run the Databricks workloads (notebook, jar, python).

IMPORTANT
The Databricks linked service supports instance pools and system-assigned managed identity authentication.

Example - Using new job cluster in Databricks

{
"name": "AzureDatabricks_LS",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://eastus.azuredatabricks.net",
"newClusterNodeType": "Standard_D3_v2",
"newClusterNumOfWorker": "1:10",
"newClusterVersion": "4.0.x-scala2.11",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c721144c3a790b35000b57f7124f"
}
}
}
}

Example - Using existing Interactive cluster in Databricks


{
"name": "AzureDataBricksLinkedService",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://westeurope.azuredatabricks.net",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c72344c3a790b35000b57f7124f"
},
"existingClusterId": "{clusterId}"
}
}
}

Properties
PROPERTY | DESCRIPTION | REQUIRED
name | Name of the linked service. | Yes
type | The type property should be set to: AzureDatabricks. | Yes
domain | Specify the Azure region URL based on the region of the Databricks workspace. Example: https://eastus.azuredatabricks.net | Yes
accessToken | An access token is required for Data Factory to authenticate to Azure Databricks. The access token needs to be generated from the Databricks workspace. More detailed steps to find the access token can be found here. | No
MSI | Use Data Factory's managed identity (system-assigned) to authenticate to Azure Databricks. You do not need an access token when using 'MSI' authentication. | No
existingClusterId | Cluster ID of an existing cluster on which to run all jobs. This should be an already created interactive cluster. You may need to manually restart the cluster if it stops responding. Databricks suggests running jobs on new clusters for greater reliability. You can find the cluster ID of an interactive cluster in the Databricks workspace -> Clusters -> Interactive Cluster Name -> Configuration -> Tags. More details | No
instancePoolId | Instance pool ID of an existing pool in the Databricks workspace. | No
newClusterVersion | The Spark version of the cluster. It creates a job cluster in Databricks. | No
newClusterNumOfWorker | Number of worker nodes that this cluster should have. A cluster has one Spark driver and num_workers executors, for a total of num_workers + 1 Spark nodes. A string-formatted Int32: "1" means numOfWorker is 1, and "1:10" means autoscale from 1 as the minimum to 10 as the maximum. | No
newClusterNodeType | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. This field is required for a new cluster. | No
newClusterSparkConf | A set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively. | No
newClusterInitScripts | A set of optional, user-defined initialization scripts for the new cluster. Specify the DBFS path to the init scripts. | No
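Example - Using an existing instance pool (a minimal sketch)
The following sketch combines instancePoolId with the new-job-cluster properties from the table above. The domain, pool ID, worker counts, Spark version, and access token are placeholder values, and your workspace may require different or additional settings.

{
    "name": "AzureDatabricksInstancePool_LS",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://eastus.azuredatabricks.net",
            "instancePoolId": "<instance pool id>",
            "newClusterVersion": "4.0.x-scala2.11",
            "newClusterNumOfWorker": "1:10",
            "accessToken": {
                "type": "SecureString",
                "value": "<access token>"
            }
        }
    }
}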

Azure SQL Database linked service


You create an Azure SQL linked service and use it with the Stored Procedure Activity to invoke a stored
procedure from a Data Factory pipeline. See Azure SQL Connector article for details about this linked service.
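For orientation, a minimal Azure SQL Database linked service definition might look like the following sketch. The connection string is a placeholder, and the connector article covers the complete set of properties and authentication options (SQL authentication, service principal, managed identity).

{
    "name": "AzureSqlDatabaseLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>;Password=<password>;"
        }
    }
}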

Azure Synapse Analytics linked service


You create an Azure Synapse Analytics linked service and use it with the Stored Procedure Activity to invoke a
stored procedure from a Data Factory pipeline. See Azure Synapse Analytics Connector article for details about
this linked service.

SQL Server linked service


You create a SQL Server linked service and use it with the Stored Procedure Activity to invoke a stored
procedure from a Data Factory pipeline. See SQL Server connector article for details about this linked service.

Azure Function linked service


You create an Azure Function linked service and use it with the Azure Function activity to run Azure Functions in
a Data Factory pipeline. The return type of the Azure function has to be a valid JObject. (Keep in mind that
JArray is not a JObject.) Any return type other than JObject fails and raises the user error Response Content is
not a valid JObject.
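For example, a function whose body returns an object payload such as the following (illustrative values) succeeds, whereas a function that returns a top-level array like ["item1", "item2"] fails with the error above:

{
    "status": "Succeeded",
    "rowsProcessed": 42
}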
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzureFunction. | Yes
function app url | URL for the Azure Function App. Format is https://<accountname>.azurewebsites.net. This URL is the value under the URL section when viewing your Function App in the Azure portal. | Yes
function key | Access key for the Azure Function. Click on the Manage section for the respective function, and copy either the Function Key or the Host key. Find out more here: Azure Functions HTTP triggers and bindings | Yes
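A linked service definition that uses these properties typically looks like the following sketch; the JSON property names functionAppUrl and functionKey reflect common usage, and the URL and key values are placeholders.

{
    "name": "AzureFunctionLinkedService",
    "properties": {
        "type": "AzureFunction",
        "typeProperties": {
            "functionAppUrl": "https://<accountname>.azurewebsites.net",
            "functionKey": {
                "type": "SecureString",
                "value": "<function or host key>"
            }
        }
    }
}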

Next steps
For a list of the transformation activities supported by Azure Data Factory, see Transform data.
Append Variable Activity in Azure Data Factory
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the Append Variable activity to add a value to an existing array variable defined in a Data Factory pipeline.

Type properties
PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | Activity type is AppendVariable. | Yes
value | String literal or expression object value used to append into the specified variable. | Yes
variableName | Name of the variable that will be modified by the activity; the variable must be of type 'Array'. | Yes
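Putting these properties together, an Append Variable activity that appends the current ForEach item to an array variable named results (an illustrative name) could look like this sketch:

{
    "name": "AppendToResults",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "results",
        "value": {
            "value": "@item()",
            "type": "Expression"
        }
    }
}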

Next steps
Learn about a related control flow activity supported by Data Factory:
Set Variable Activity
Execute Pipeline activity in Azure Data Factory
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.

Syntax
{
"name": "MyPipeline",
"properties": {
"activities": [
{
"name": "ExecutePipelineActivity",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"mySourceDatasetFolderPath": {
"value": "@pipeline().parameters.mySourceDatasetFolderPath",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "<InvokedPipelineName>",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
],
"parameters": [
{
"mySourceDatasetFolderPath": {
"type": "String"
}
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Execute Pipeline activity. | String | Yes
type | Must be set to: ExecutePipeline. | String | Yes
pipeline | Pipeline reference to the dependent pipeline that this pipeline invokes. A pipeline reference object has two properties: referenceName and type. The referenceName property specifies the name of the reference pipeline. The type property must be set to PipelineReference. | PipelineReference | Yes
parameters | Parameters to be passed to the invoked pipeline. | A JSON object that maps parameter names to argument values | No
waitOnCompletion | Defines whether activity execution waits for the dependent pipeline execution to finish. Default is false. | Boolean | No

Sample
This scenario has two pipelines:
Master pipeline - This pipeline has one Execute Pipeline activity that calls the invoked pipeline. The master
pipeline takes two parameters: masterSourceBlobContainer , masterSinkBlobContainer .
Invoked pipeline - This pipeline has one Copy activity that copies data from an Azure Blob source to Azure
Blob sink. The invoked pipeline takes two parameters: sourceBlobContainer , sinkBlobContainer .
Master pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "MyExecutePipelineActivity"
}
],
"parameters": {
"masterSourceBlobContainer": {
"type": "String"
},
"masterSinkBlobContainer": {
"type": "String"
}
}
}
}

Invoked pipeline definition


{
"name": "invokedPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"name": "CopyBlobtoBlob",
"inputs": [
{
"referenceName": "SourceBlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sinkBlobDataset",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceBlobContainer": {
"type": "String"
},
"sinkBlobContainer": {
"type": "String"
}
}
}
}

Linked ser vice

{
"name": "BlobStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=*****;AccountKey=*****"
}
}
}

Source dataset
{
"name": "SourceBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sourceBlobContainer",
"type": "Expression"
},
"fileName": "salesforce.txt"
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

Sink dataset

{
"name": "sinkBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sinkBlobContainer",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}

Running the pipeline


To run the master pipeline in this example, the following values are passed for the masterSourceBlobContainer
and masterSinkBlobContainer parameters:

{
"masterSourceBlobContainer": "executetest",
"masterSinkBlobContainer": "executesink"
}

The master pipeline forwards these values to the invoked pipeline as shown in the following example:
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},

....
}

Next steps
See other control flow activities supported by Data Factory:
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Filter activity in Azure Data Factory
4/22/2021 • 2 minutes to read • Edit Online

You can use a Filter activity in a pipeline to apply a filter expression to an input array.
APPLIES TO: Azure Data Factory Azure Synapse Analytics

Syntax
{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "<condition>",
"items": "<input array>"
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Filter activity. | String | Yes
type | Must be set to filter. | String | Yes
condition | Condition to be used for filtering the input. | Expression | Yes
items | Input array on which the filter should be applied. | Expression | Yes

Example
In this example, the pipeline has two activities: Filter and ForEach . The Filter activity is configured to filter the
input array for items with a value greater than 3. The ForEach activity then iterates over the filtered values and
sets the variable test to the current value.
{
"name": "PipelineName",
"properties": {
"activities": [{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "@greater(item(),3)",
"items": "@pipeline().parameters.inputs"
}
},
{
"name": "MyForEach",
"type": "ForEach",
"dependsOn": [
{
"activity": "MyFilterActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('MyFilterActivity').output.value",
"type": "Expression"
},
"isSequential": "false",
"batchCount": 1,
"activities": [
{
"name": "Set Variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "test",
"value": {
"value": "@string(item())",
"type": "Expression"
}
}
}
]
}
}],
"parameters": {
"inputs": {
"type": "Array",
"defaultValue": [1, 2, 3, 4, 5, 6]
}
},
"variables": {
"test": {
"type": "String"
}
},
"annotations": []
}
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
ForEach activity in Azure Data Factory
6/23/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The ForEach activity defines a repeating control flow in your pipeline. This activity is used to iterate over a
collection and execute specified activities in a loop. The loop implementation of this activity is similar to the
foreach looping structure in programming languages.

Syntax
The properties are described later in this article. The items property is the collection, and each item in the
collection is referred to by using @item(), as shown in the following syntax:

{
"name":"MyForEachActivityName",
"type":"ForEach",
"typeProperties":{
"isSequential":"true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
"type": "Expression"
},
"activities":[
{
"name":"MyCopyActivity",
"type":"Copy",
"typeProperties":{
...
},
"inputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@item()"
}
}
]
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the ForEach activity. | String | Yes
type | Must be set to ForEach. | String | Yes
isSequential | Specifies whether the loop should be executed sequentially or in parallel. A maximum of 50 loop iterations can be executed at once in parallel. For example, if you have a ForEach activity iterating over a copy activity with 10 different source and sink datasets with isSequential set to False, all copies are executed at once. Default is False. If "isSequential" is set to False, ensure that there is a correct configuration to run multiple executables. Otherwise, this property should be used with caution to avoid incurring write conflicts. For more information, see the Parallel execution section. | Boolean | No. Default is False.
batchCount | Batch count to be used for controlling the number of parallel executions (when isSequential is set to false). This is the upper concurrency limit, but the ForEach activity will not always execute at this number. | Integer (maximum 50) | No. Default is 20.
items | An expression that returns a JSON array to be iterated over. | Expression (which returns a JSON array) | Yes
activities | The activities to be executed. | List of activities | Yes

Parallel execution
If isSequential is set to false, the activity iterates in parallel with a maximum of 50 concurrent iterations. This
setting should be used with caution. If the concurrent iterations are writing to the same folder but to different
files, this approach is fine. If the concurrent iterations are writing concurrently to the exact same file, this
approach most likely causes an error.
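For example, the following fragment requests parallel execution but caps concurrency at 10 iterations at a time. The item collection and inner activities are omitted, and the numbers are illustrative.

{
    "name": "MyForEachActivity",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 10,
        "items": {
            "value": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
            "type": "Expression"
        },
        "activities": [ ]
    }
}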

Iteration expression language


In the ForEach activity, provide an array to be iterated over for the items property. Use @item() to refer to
a single enumeration in the ForEach activity. For example, if items is the array [1, 2, 3], @item() returns 1 in the first
iteration, 2 in the second iteration, and 3 in the third iteration. You can also use an expression like @range(0,10) to
iterate ten times, starting with 0 and ending with 9.
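For instance, to loop a fixed number of times instead of over a pipeline parameter, the items property can be set to a range expression, as in this fragment:

"items": {
    "value": "@range(0,10)",
    "type": "Expression"
}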

Iterating over a single activity


Scenario: Copy from the same source file in Azure Blob to multiple destination files in Azure Blob.
Pipeline definition
{
"name": "<MyForEachPipeline>",
"properties": {
"activities": [
{
"name": "<MyForEachActivity>",
"type": "ForEach",
"typeProperties": {
"isSequential": "true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPath",
"type": "Expression"
},
"activities": [
{
"name": "MyCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": "false"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
},
"inputs": [
{
"referenceName": "<MyDataset>",
"type": "DatasetReference",
"parameters": {
"MyFolderPath": "@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs": [
{
"referenceName": "MyDataset",
"type": "DatasetReference",
"parameters": {
"MyFolderPath": "@item()"
}
}
]
}
]
}
}
],
"parameters": {
"mySourceDatasetFolderPath": {
"type": "String"
},
"mySinkDatasetFolderPath": {
"type": "String"
}
}
}
}

Blob dataset definition


{
"name":"<MyDataset>",
"properties":{
"type":"AzureBlob",
"typeProperties":{
"folderPath":{
"value":"@dataset().MyFolderPath",
"type":"Expression"
}
},
"linkedServiceName":{
"referenceName":"StorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"MyFolderPath":{
"type":"String"
}
}
}
}

Run parameter values

{
"mySourceDatasetFolderPath": "input/",
"mySinkDatasetFolderPath": [ "outputs/file1", "outputs/file2" ]
}

Iterate over multiple activities


It's possible to iterate over multiple activities (for example: copy and web activities) in a ForEach activity. In this
scenario, we recommend that you abstract out multiple activities into a separate pipeline. Then, you can use the
ExecutePipeline activity in the pipeline with ForEach activity to invoke the separate pipeline with multiple
activities.
Syntax
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "<MyForEachMultipleActivities>"
"typeProperties": {
"isSequential": true,
"items": {
...
},
"activities": [
{
"type": "ExecutePipeline",
"name": "<MyInnerPipeline>"
"typeProperties": {
"pipeline": {
"referenceName": "<copyHttpPipeline>",
"type": "PipelineReference"
},
"parameters": {
...
},
"waitOnCompletion": true
}
}
]
}
}
],
"parameters": {
...
}
}
}

Example
Scenario: Iterate over an inner pipeline within a ForEach activity by using the Execute Pipeline activity. The inner pipeline
copies data between tables, with the schema definitions passed in as parameters.
Master Pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "MyForEachActivity",
"typeProperties": {
"isSequential": true,
"items": {
"value": "@pipeline().parameters.inputtables",
"type": "Expression"
},
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "InnerCopyPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceTableName": {
"value": "@item().SourceTable",
"type": "Expression"
},
"sourceTableStructure": {
"value": "@item().SourceTableStructure",
"type": "Expression"
},
"sinkTableName": {
"value": "@item().DestTable",
"type": "Expression"
},
"sinkTableStructure": {
"value": "@item().DestTableStructure",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "ExecuteCopyPipeline"
}
]
}
}
],
"parameters": {
"inputtables": {
"type": "Array"
}
}
}
}

Inner pipeline definition

{
"name": "InnerCopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"type": "SqlSource",
}
},
"sink": {
"type": "SqlSink"
}
},
"name": "CopyActivity",
"inputs": [
{
"referenceName": "sqlSourceDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sourceTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sourceTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sqlSinkDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sinkTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sinkTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceTableName": {
"type": "String"
},
"sourceTableStructure": {
"type": "String"
},
"sinkTableName": {
"type": "String"
},
"sinkTableStructure": {
"type": "String"
}
}
}
}

Source dataset definition


{
"name": "sqlSourceDataset",
"properties": {
"type": "SqlServerTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "sqlserverLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}

Sink dataset definition

{
"name": "sqlSinkDataSet",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "azureSqlLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}

Master pipeline parameters


{
"inputtables": [
{
"SourceTable": "department",
"SourceTableStructure": [
{
"name": "departmentid",
"type": "int"
},
{
"name": "departmentname",
"type": "string"
}
],
"DestTable": "department2",
"DestTableStructure": [
{
"name": "departmentid",
"type": "int"
},
{
"name": "departmentname",
"type": "string"
}
]
}
]
}

Aggregating outputs
To aggregate outputs of the ForEach activity, use Variables and the Append Variable activity.
First, declare an array variable in the pipeline. Then, invoke an Append Variable activity inside each ForEach iteration.
Afterward, you can retrieve the aggregated values from the array. A sketch of this pattern follows.
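A minimal sketch of this pattern, assuming an array variable named aggregatedOutputs is declared on the pipeline. Here each iteration appends the current item; in practice you would typically append a value derived from an activity output inside the loop.

{
    "name": "CollectOutputs",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@pipeline().parameters.inputs",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "AppendOutput",
                "type": "AppendVariable",
                "typeProperties": {
                    "variableName": "aggregatedOutputs",
                    "value": {
                        "value": "@item()",
                        "type": "Expression"
                    }
                }
            }
        ]
    }
}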

Limitations and workarounds


Here are some limitations of the ForEach activity and suggested workarounds.

LIMITATION | WORKAROUND
You can't nest a ForEach loop inside another ForEach loop (or an Until loop). | Design a two-level pipeline where the outer pipeline with the outer ForEach loop iterates over an inner pipeline with the nested loop.
The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items. | Design a two-level pipeline where the outer pipeline with the ForEach activity iterates over an inner pipeline.
SetVariable can't be used inside a ForEach activity that runs in parallel, because the variables are global to the whole pipeline; they are not scoped to a ForEach or any other activity. | Consider using a sequential ForEach, or use Execute Pipeline inside the ForEach (with the variable/parameter handled in the child pipeline).

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
Get Metadata Activity
Lookup Activity
Web Activity
Get Metadata activity in Azure Data Factory
5/14/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can use the Get Metadata activity to retrieve the metadata of any data in Azure Data Factory. You can use the
output from the Get Metadata activity in conditional expressions to perform validation, or consume the
metadata in subsequent activities.

Supported capabilities
The Get Metadata activity takes a dataset as an input and returns metadata information as output. Currently, the
following connectors and the corresponding retrievable metadata are supported. The maximum size of returned
metadata is 4 MB .
Supported connectors
File storage

CONNECTOR/METADATA | itemName (file/folder) | itemType (file/folder) | size (file) | created (file/folder) | lastModified (file/folder) 1 | childItems (folder) | contentMD5 (file) | structure (file) 2 | columnCount (file) 2 | exists (file/folder) 3
Amazon S3 | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
Amazon S3 Compatible Storage | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
Google Cloud Storage | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
Oracle Cloud Storage | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
Azure Blob storage | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | ✓ | ✓ | ✓ | ✓/✓
Azure Data Lake Storage Gen1 | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
Azure Data Lake Storage Gen2 | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | ✓ | ✓ | ✓ | ✓/✓
Azure Files | ✓/✓ | ✓/✓ | ✓ | ✓/✓ | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
File system | ✓/✓ | ✓/✓ | ✓ | ✓/✓ | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
SFTP | ✓/✓ | ✓/✓ | ✓ | x/x | ✓/✓ | ✓ | x | ✓ | ✓ | ✓/✓
FTP | ✓/✓ | ✓/✓ | ✓ | x/x | x/x | ✓ | x | ✓ | ✓ | ✓/✓

1 Metadata lastModified :
For Amazon S3, Amazon S3 Compatible Storage, Google Cloud Storage and Oracle Cloud Storage,
lastModified applies to the bucket and the key but not to the virtual folder, and exists applies to the
bucket and the key but not to the prefix or virtual folder.
For Azure Blob storage, lastModified applies to the container and the blob but not to the virtual folder.
2 Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML
files.
3 Metadata exists : For Amazon S3, Amazon S3 Compatible Storage, Google Cloud Storage and Oracle Cloud
Storage, exists applies to the bucket and the key but not to the prefix or virtual folder.
Note the following:
When using Get Metadata activity against a folder, make sure you have LIST/EXECUTE permission to the
given folder.
Wildcard filter on folders/files is not supported for Get Metadata activity.
modifiedDatetimeStart and modifiedDatetimeEnd filters set on the connector:
These two properties are used to filter the child items when getting metadata from a folder. They do not
apply when getting metadata from a file.
When such a filter is used, the childItems in the output includes only the files that are modified within the
specified range, but not folders.
To apply such a filter, the Get Metadata activity enumerates all the files in the specified folder and checks
the modified time. Avoid pointing to a folder with a large number of files, even if the expected qualified
file count is small.
Relational database

CONNECTOR/METADATA | structure | columnCount | exists
Azure SQL Database | ✓ | ✓ | ✓
Azure SQL Managed Instance | ✓ | ✓ | ✓
Azure Synapse Analytics | ✓ | ✓ | ✓
SQL Server | ✓ | ✓ | ✓

Metadata options
You can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding
information:

METADATA TYPE | DESCRIPTION
itemName | Name of the file or folder.
itemType | Type of the file or folder. Returned value is File or Folder.
size | Size of the file, in bytes. Applicable only to files.
created | Created datetime of the file or folder.
lastModified | Last modified datetime of the file or folder.
childItems | List of subfolders and files in the given folder. Applicable only to folders. Returned value is a list of the name and type of each child item.
contentMD5 | MD5 of the file. Applicable only to files.
structure | Data structure of the file or relational database table. Returned value is a list of column names and column types.
columnCount | Number of columns in the file or relational table.
exists | Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won't fail even if the file, folder, or table doesn't exist. Instead, exists: false is returned in the output.

TIP
When you want to validate that a file, folder, or table exists, specify exists in the Get Metadata activity field list. You can
then check the exists: true/false result in the activity output. If exists isn't specified in the field list, the Get
Metadata activity will fail if the object isn't found.
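For example, a field list that only checks for existence and enumerates child items (a common validation step before a copy) looks like this fragment of the activity's typeProperties:

"fieldList": [
    "exists",
    "childItems"
]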
NOTE
When you get metadata from file stores and configure modifiedDatetimeStart or modifiedDatetimeEnd , the
childItems in the output includes only files in the specified path that have a last modified time within the specified
range. Items in subfolders are not included.

NOTE
For the Structure field list to provide the actual data structure for delimited text and Excel format datasets, you must
enable the First Row as Header property, which is supported only for these data sources.

Syntax
Get Metadata activity

{
"name":"MyActivity",
"type":"GetMetadata",
"dependsOn":[

],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[

],
"typeProperties":{
"dataset":{
"referenceName":"MyDataset",
"type":"DatasetReference"
},
"fieldList":[
"size",
"lastModified",
"structure"
],
"storeSettings":{
"type":"AzureBlobStorageReadSettings"
},
"formatSettings":{
"type":"JsonReadSettings"
}
}
}

Dataset
{
"name":"MyDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[

],
"type":"Json",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"file.json",
"folderPath":"folder",
"container":"container"
}
}
}
}

Type properties
The Get Metadata activity supports the following properties:

PROPERTY | DESCRIPTION | REQUIRED
fieldList | The types of metadata information required. For details on supported metadata, see the Metadata options section of this article. | Yes
dataset | The reference dataset whose metadata is to be retrieved by the Get Metadata activity. See the Supported capabilities section for information on supported connectors. Refer to the specific connector topics for dataset syntax details. | Yes
formatSettings | Apply when using a format type dataset. | No
storeSettings | Apply when using a format type dataset. | No

Sample output
The Get Metadata results are shown in the activity output. Following are two samples showing extensive
metadata options. To use the results in a subsequent activity, use this pattern:
@{activity('MyGetMetadataActivity').output.itemName} .
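For instance, a downstream If Condition activity could branch on the exists flag returned by a Get Metadata activity named MyGetMetadataActivity (an illustrative name) with an expression like the following:

"expression": {
    "value": "@activity('MyGetMetadataActivity').output.exists",
    "type": "Expression"
}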

Get a file's metadata


{
"exists": true,
"itemName": "test.csv",
"itemType": "File",
"size": 104857600,
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"contentMD5": "cMauY+Kz5zDm3eWa9VpoyQ==",
"structure": [
{
"name": "id",
"type": "Int64"
},
{
"name": "name",
"type": "String"
}
],
"columnCount": 2
}

Get a folder's metadata

{
"exists": true,
"itemName": "testFolder",
"itemType": "Folder",
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"childItems": [
{
"name": "test.avro",
"type": "File"
},
{
"name": "folder hello",
"type": "Folder"
}
]
}

Next steps
Learn about other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
Lookup activity
Web activity
If Condition activity in Azure Data Factory
5/28/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The If Condition activity provides the same functionality that an if statement provides in programming
languages. It executes a set of activities when the condition evaluates to true and another set of activities when
the condition evaluates to false .

Syntax
{
"name": "<Name of the activity>",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},

"ifTrueActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
],

"ifFalseActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the If Condition activity. | String | Yes
type | Must be set to IfCondition. | String | Yes
expression | Expression that must evaluate to true or false. | Expression with result type Boolean | Yes
ifTrueActivities | Set of activities that are executed when the expression evaluates to true. | Array | Yes
ifFalseActivities | Set of activities that are executed when the expression evaluates to false. | Array | Yes

Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is
determined by the value of pipeline parameter: routeSelection. If the value of routeSelection is true, the data is
copied to outputPath1. And, if the value of routeSelection is false, the data is copied to outputPath2.

NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.

Pipeline with IF -Condition activity (Adfv2QuickStartPipeline.json)

{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@bool(pipeline().parameters.routeSelection)",
"type": "Expression"
},

"ifTrueActivities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath1"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"ifFalseActivities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath2"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}

Another example for expression is:


"expression": {
"value": "@equals(pipeline().parameters.routeSelection,1)",
"type": "Expression"
}

Azure Storage linked service (AzureStorageLinkedService.json)

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account
name>;AccountKey=<Azure Storage account key>"
}
}
}

Parameterized Azure Blob dataset (BlobDataset.json)


The pipeline sets the folderPath to the value of either outputPath1 or outputPath2 parameter of the pipeline.

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

Pipeline parameter JSON (PipelineParameters.json)

{
"inputPath": "adftutorial/input",
"outputPath1": "adftutorial/outputIf",
"outputPath2": "adftutorial/outputElse",
"routeSelection": "false"
}

PowerShell commands

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
These commands assume that you have saved the JSON files into the folder: C:\ADF.

Connect-AzAccount
Select-AzSubscription "<Your subscription name>"

$resourceGroupName = "<Resource Group Name>"


$dataFactoryName = "<Data Factory Name. Must be globally unique>";
Remove-AzDataFactoryV2 $dataFactoryName -ResourceGroupName $resourceGroupName -force

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName


Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -
Name "AzureStorageLinkedService" -DefinitionFile "C:\ADF\AzureStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"BlobDataset" -DefinitionFile "C:\ADF\BlobDataset.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name
"Adfv2QuickStartPipeline" -DefinitionFile "C:\ADF\Adfv2QuickStartPipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineName "Adfv2QuickStartPipeline" -ParameterFile C:\ADF\PipelineParameters.json
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId

if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}

Start-Sleep -Seconds 30
}
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-
Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "\nActivity 'Error' section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Lookup activity in Azure Data Factory
5/6/2021 • 7 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Lookup activity can retrieve a dataset from any of the Azure Data Factory-supported data sources. You can use it
to dynamically determine which objects to operate on in a subsequent activity, instead of hard-coding the object
name. Some object examples are files and tables.
The Lookup activity reads and returns the content of a configuration file or table. It also returns the result of
executing a query or stored procedure. The output can be a singleton value or an array of attributes, which can
be consumed in subsequent copy, transformation, or control flow activities like the ForEach activity.

Supported capabilities
Note the following:
The Lookup activity can return up to 5,000 rows; if the result set contains more records, the first 5,000 rows
will be returned.
The Lookup activity output supports up to 4 MB in size; the activity fails if the size exceeds the limit.
The longest duration for the Lookup activity before timeout is 24 hours.
When you use a query or stored procedure to look up data, make sure to return exactly one result set.
Otherwise, the Lookup activity fails.
The following data sources are supported for Lookup activity.

CATEGORY | DATA STORE
Azure | Azure Blob storage, Azure Cosmos DB (SQL API), Azure Data Explorer, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Databricks Delta Lake, Azure Files, Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Azure Table storage
Database | Amazon Redshift, DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Apache Impala, Informix, MariaDB, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, PostgreSQL, Presto, SAP Business Warehouse Open Hub, SAP Business Warehouse via MDX, SAP HANA, SAP Table, Snowflake, Spark, SQL Server, Sybase, Teradata, Vertica
NoSQL | Cassandra, Couchbase (Preview)
File | Amazon S3, Amazon S3 Compatible Storage, File System, FTP, Google Cloud Storage, HDFS, Oracle Cloud Storage, SFTP
Generic protocol | Generic HTTP, Generic OData, Generic ODBC
Services and apps | Amazon Marketplace Web Service, Concur (Preview), Dataverse, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, HubSpot, Jira, Magento (Preview), Marketo (Preview), Oracle Eloqua (Preview), Oracle Responsys (Preview), Oracle Service Cloud (Preview), PayPal (Preview), QuickBooks (Preview), Salesforce, Salesforce Service Cloud, Salesforce Marketing Cloud, SAP Cloud for Customer (C4C), SAP ECC, ServiceNow, Shopify (Preview), SharePoint Online List, Square (Preview), Web Table (HTML table), Xero, Zoho (Preview)

NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a dependency
on preview connectors in your solution, please contact Azure support.

Syntax
{
"name":"LookupActivity",
"type":"Lookup",
"typeProperties":{
"source":{
"type":"<source type>"
},
"dataset":{
"referenceName":"<source dataset name>",
"type":"DatasetReference"
},
"firstRowOnly":<true or false>
}
}

Type properties
NAME | DESCRIPTION | TYPE | REQUIRED?
dataset | Provides the dataset reference for the lookup. Get details from the Dataset properties section in each corresponding connector article. | Key/value pair | Yes
source | Contains dataset-specific source properties, the same as the Copy Activity source. Get details from the Copy Activity properties section in each corresponding connector article. | Key/value pair | Yes
firstRowOnly | Indicates whether to return only the first row or all rows. | Boolean | No. The default is true.

NOTE
Source columns with ByteArray type aren't supported.
Structure isn't supported in dataset definitions. For text-format files, use the header row to provide the column name.
If your lookup source is a JSON file, the jsonPathDefinition setting for reshaping the JSON object isn't supported.
The entire object will be retrieved.

Use the Lookup activity result


The lookup result is returned in the output section of the activity run result.
When firstRowOnly is set to true (default) , the output format is as shown in the following code. The
lookup result is under a fixed firstRow key. To use the result in subsequent activity, use the pattern of
@{activity('LookupActivity').output.firstRow.table} .
{
"firstRow":
{
"Id": "1",
"schema":"dbo",
"table":"Table1"
}
}

When firstRowOnly is set to false , the output format is as shown in the following code. A count
field indicates how many records are returned. Detailed values are displayed under a fixed value array.
In such a case, the Lookup activity is followed by a Foreach activity. You pass the value array to the
ForEach activity items field by using the pattern of @activity('MyLookupActivity').output.value . To
access elements in the value array, use the following syntax:
@{activity('lookupActivity').output.value[zero based index].propertyname} . An example is
@{activity('lookupActivity').output.value[0].schema} .

{
"count": "2",
"value": [
{
"Id": "1",
"schema":"dbo",
"table":"Table1"
},
{
"Id": "2",
"schema":"dbo",
"table":"Table2"
}
]
}

Example
In this example, the pipeline contains two activities: Lookup and Copy . The Copy Activity copies data from a
SQL table in your Azure SQL Database instance to Azure Blob storage. The name of the SQL table is stored in a
JSON file in Blob storage. The Lookup activity looks up the table name at runtime. JSON is modified dynamically
by using this approach. You don't need to redeploy pipelines or datasets.
This example demonstrates lookup for the first row only. For lookup for all rows and to chain the results with
ForEach activity, see the samples in Copy multiple tables in bulk by using Azure Data Factory.
Pipeline
The Lookup activity is configured to use LookupDataset , which refers to a location in Azure Blob storage.
The Lookup activity reads the name of the SQL table from a JSON file in this location.
The Copy Activity uses the output of the Lookup activity, which is the name of the SQL table. The tableName
property in the SourceDataset is configured to use the output from the Lookup activity. Copy Activity
copies data from the SQL table to a location in Azure Blob storage. The location is specified by the
SinkDataset property.

{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "JsonSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
},
"formatSettings": {
"type": "JsonReadSettings"
}
},
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
},
"firstRowOnly": true
}
},
{
"name": "CopyActivity",
"type": "Copy",
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": {
"value": "select * from [@{activity('LookupActivity').output.firstRow.schema}].
[@{activity('LookupActivity').output.firstRow.table}]",
"type": "Expression"
},
"queryTimeout": "02:00:00",
"partitionOption": "None"
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference",
"parameters": {
"schemaName": {
"value": "@activity('LookupActivity').output.firstRow.schema",
"type": "Expression"
},
"tableName": {
"value": "@activity('LookupActivity').output.firstRow.table",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"schema": {
"value": "@activity('LookupActivity').output.firstRow.schema",
"type": "Expression"
},
"table": {
"value": "@activity('LookupActivity').output.firstRow.table",
"type": "Expression"
}
}
}
]
}
],
"annotations": [],
"lastPublishTime": "2020-08-17T10:48:25Z"
}
}

Lookup dataset
The lookup dataset is the sourcetable.json file in the Azure Storage lookup folder specified by the
AzureBlobStorageLinkedSer vice type.
{
"name": "LookupDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Json",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "sourcetable.json",
"container": "lookup"
}
}
}
}

Source dataset for Copy Activity


The source dataset uses the output of the Lookup activity, which is the name of the SQL table. Copy Activity
copies data from this SQL table to a location in Azure Blob storage. The location is specified by the sink dataset.

{
"name": "SourceDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureSqlDatabase",
"type": "LinkedServiceReference"
},
"parameters": {
"schemaName": {
"type": "string"
},
"tableName": {
"type": "string"
}
},
"annotations": [],
"type": "AzureSqlTable",
"schema": [],
"typeProperties": {
"schema": {
"value": "@dataset().schemaName",
"type": "Expression"
},
"table": {
"value": "@dataset().tableName",
"type": "Expression"
}
}
}
}

Sink dataset for Copy Activity


Copy Activity copies data from the SQL table to the filebylookup.csv file in the csv folder in Azure Storage. The
file is specified by the AzureBlobStorageLinkedSer vice property.
{
"name": "SinkDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"schema": {
"type": "string"
},
"table": {
"type": "string"
}
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": {
"value": "@{dataset().schema}_@{dataset().table}.csv",
"type": "Expression"
},
"container": "csv"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}

sourcetable.json
You can use following two kinds of formats for sourcetable.json file.
Set of objects

{
"Id":"1",
"schema":"dbo",
"table":"Table1"
}
{
"Id":"2",
"schema":"dbo",
"table":"Table2"
}

Array of objects
[
{
"Id": "1",
"schema":"dbo",
"table":"Table1"
},
{
"Id": "2",
"schema":"dbo",
"table":"Table2"
}
]

Limitations and workarounds


Here are some limitations of the Lookup activity and suggested workarounds.

LIMITATION | WORKAROUND
The Lookup activity has a maximum of 5,000 rows, and a maximum size of 4 MB. | Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
GetMetadata activity
Web activity
Set Variable Activity in Azure Data Factory
6/17/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the Set Variable activity to set the value of an existing variable of type String, Bool, or Array defined in a Data
Factory pipeline.

Type properties
PROPERTY | DESCRIPTION | REQUIRED
name | Name of the activity in the pipeline. | Yes
description | Text describing what the activity does. | No
type | Must be set to SetVariable. | Yes
value | String literal or expression object value that the variable is assigned to. | Yes
variableName | Name of the variable that is set by this activity. | Yes

Incrementing a variable
A common scenario involving variables in Azure Data Factory is using a variable as an iterator within an Until or
ForEach activity. In a Set Variable activity, you cannot reference the variable being set in the value field. To
work around this limitation, set a temporary variable and then create a second Set Variable activity. The second
Set Variable activity sets the value of the iterator to the temporary variable.
Below is an example of this pattern:
{
"name": "pipeline3",
"properties": {
"activities": [
{
"name": "Set I",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Increment J",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "i",
"value": {
"value": "@variables('j')",
"type": "Expression"
}
}
},
{
"name": "Increment J",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "j",
"value": {
"value": "@string(add(int(variables('i')), 1))",
"type": "Expression"
}
}
}
],
"variables": {
"i": {
"type": "String",
"defaultValue": "0"
},
"j": {
"type": "String",
"defaultValue": "0"
}
},
"annotations": []
}
}

Variables are currently scoped at the pipeline level. This means that they are not thread safe and can cause
unexpected and undesired behavior if they are accessed from within a parallel iteration activity such as a foreach
loop, especially when the value is also being modified within that foreach activity.

Next steps
Learn about a related control flow activity supported by Data Factory:
Append Variable Activity
Switch activity in Azure Data Factory
7/12/2021 • 4 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Switch activity provides the same functionality that a switch statement provides in programming languages.
It evaluates a set of activities corresponding to a case that matches the condition evaluation.

Syntax
{
"name": "<Name of the activity>",
"type": "Switch",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to some string value>",
"type": "Expression"
},
"cases": [
{
"value": "<string value that matches expression evaluation>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
],
"defaultActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}

Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
name | Name of the Switch activity. | String | Yes
type | Must be set to Switch. | String | Yes
expression | Expression that must evaluate to a string value. | Expression with result type string | Yes
cases | Set of cases that contain a value and a set of activities to execute when the value matches the expression evaluation. Must provide at least one case. There's a max limit of 25 cases. | Array of Case objects | Yes
defaultActivities | Set of activities that are executed when the expression evaluation isn't satisfied. | Array of activities | Yes

Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is
determined by the value of pipeline parameter: routeSelection.

NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.

Pipeline with Switch activity (Adfv2QuickStartPipeline.json)

{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MySwitch",
"type": "Switch",
"typeProperties": {
"expression": {
"value": "@pipeline().parameters.routeSelection",
"type": "Expression"
},
"cases": [
{
"value": "1",
"activities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath1",
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
},
{
"value": "2",
"activities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath",
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath2",
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
},
{
"value": "3",
"activities": [
{
"name": "CopyFromBlobToBlob3",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath",
},
"type": "DatasetReference"
}
],
"outputs": [
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath3",
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
}
],
"defaultActivities": []
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"outputPath3": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}

Azure Storage linked service (AzureStorageLinkedService.json)

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account
name>;AccountKey=<Azure Storage account key>"
}
}
}

Parameterized Azure Blob dataset (BlobDataset.json)


The pipeline sets the folderPath to the value of either outputPath1 or outputPath2 parameter of the pipeline.
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

Pipeline parameter JSON (PipelineParameters.json)

{
"inputPath": "adftutorial/input",
"outputPath1": "adftutorial/outputCase1",
"outputPath2": "adftutorial/outputCase2",
"outputPath2": "adftutorial/outputCase3",
"routeSelection": "1"
}

PowerShell commands

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

These commands assume that you've saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"

$resourceGroupName = "<Resource Group Name>"
$dataFactoryName = "<Data Factory Name. Must be globally unique>"

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName -Force

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -DefinitionFile "C:\ADF\AzureStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "BlobDataset" -DefinitionFile "C:\ADF\BlobDataset.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "Adfv2QuickStartPipeline" -DefinitionFile "C:\ADF\Adfv2QuickStartPipeline.json"

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName "Adfv2QuickStartPipeline" -ParameterFile "C:\ADF\PipelineParameters.json"

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -ForegroundColor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -ForegroundColor "Yellow"
    }

    Start-Sleep -Seconds 30
}

Write-Host "Activity run details:" -ForegroundColor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -ForegroundColor "Yellow"
$result.Output -join "`r`n"

Write-Host "`nActivity 'Error' section:" -ForegroundColor "Yellow"
$result.Error -join "`r`n"

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true.
You can specify a timeout value for the until activity in Data Factory.

Syntax
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"timeout": "<time out for the loop. for example: 00:01:00 (1 minute)>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
},
"name": "MyUntilActivity"
}

Type properties
name: Name of the Until activity. (String; required)
type: Must be set to Until. (String; required)
expression: Expression that must evaluate to true or false. (Expression; required)
timeout: The do-until loop times out after the specified time. Format is d.hh:mm:ss or hh:mm:ss. The default value is 7 days; the maximum value is 90 days. (String; optional)
activities: Set of activities that are executed until the expression evaluates to true. (Array of activities; required)

Example 1
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.

Pipeline with Until activity


In this example, the pipeline has two activities: Until and Wait . The Wait activity waits for the specified period of
time before running the Web activity in the loop. To learn about expressions and functions in Data Factory, see
Expression language and functions.
{
"name": "DoUntilPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('Failed', coalesce(body('MyUnauthenticatedActivity')?.status,
actions('MyUnauthenticatedActivity')?.status, 'null'))",
"type": "Expression"
},
"timeout": "00:00:01",
"activities": [
{
"name": "MyUnauthenticatedActivity",
"type": "WebActivity",
"typeProperties": {
"method": "get",
"url": "https://www.fake.com/",
"headers": {
"Content-Type": "application/json"
}
},
"dependsOn": [
{
"activity": "MyWaitActivity",
"dependencyConditions": [ "Succeeded" ]
}
]
},
{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
},
"name": "MyWaitActivity"
}
]
},
"name": "MyUntilActivity"
}
]
}
}

Example 2
The pipeline in this sample copies data from an input folder to an output folder in a loop. The loop terminates
when the value for the repeat parameter is set to false or it times out after one minute.
Pipeline with Until activity (Adfv2QuickStartPipeline.json)

{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('false', pipeline().parameters.repeat)",
"type": "Expression"
},
"timeout": "00:01:00",
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"retry": 1,
"timeout": "00:10:00",
"retryIntervalInSeconds": 60
}
}
]
},
"name": "MyUntilActivity"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
},
"repeat": {
"type": "String"
}
}
}
}

Azure Storage linked service (AzureStorageLinkedService.json)


{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account
name>;AccountKey=<Azure Storage account key>"
}
}
}

Parameterized Azure Blob dataset (BlobDataset.json)


The pipeline sets the folderPath to the value of either the inputPath or the outputPath parameter of the pipeline.

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

Pipeline parameter JSON (PipelineParameters.json)

{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/outputUntil",
"repeat": "true"
}

PowerShell commands

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

These commands assume that you have saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"

$resourceGroupName = "<Resource Group Name>"
$dataFactoryName = "<Data Factory Name. Must be globally unique>"

Remove-AzDataFactoryV2 -Name $dataFactoryName -ResourceGroupName $resourceGroupName -Force

Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureStorageLinkedService" -DefinitionFile "C:\ADF\AzureStorageLinkedService.json"
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "BlobDataset" -DefinitionFile "C:\ADF\BlobDataset.json"
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "Adfv2QuickStartPipeline" -DefinitionFile "C:\ADF\Adfv2QuickStartPipeline.json"

$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName "Adfv2QuickStartPipeline" -ParameterFile "C:\ADF\PipelineParameters.json"

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -ForegroundColor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -ForegroundColor "Yellow"
        Write-Host "Activity run details:" -ForegroundColor "Yellow"
        $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
        $result

        Write-Host "Activity 'Output' section:" -ForegroundColor "Yellow"
        $result.Output -join "`r`n"
    }

    Start-Sleep -Seconds 15
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Validation activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can use a Validation activity in a pipeline to ensure that the pipeline continues execution only after it has validated that the
attached dataset reference exists and meets the specified criteria, or after the timeout has been reached.

Syntax
{
"name": "Validation_Activity",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_File",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"minimumSize": 20
}
},
{
"name": "Validation_Activity_Folder",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_Folder",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"childItems": true
}
}

Type properties
name: Name of the Validation activity. (String; required)
type: Must be set to Validation. (String; required)
dataset: The activity blocks execution until it has validated that this dataset reference exists and that it meets the specified criteria, or the timeout has been reached. The dataset provided should support the "MinimumSize" or "ChildItems" property. (Dataset reference; required)
timeout: Specifies the timeout for the activity to run. If no value is specified, the default value is 7 days ("7.00:00:00"). Format is d.hh:mm:ss. (String; optional)
sleep: A delay in seconds between validation attempts. If no value is specified, the default value is 10 seconds. (Integer; optional)
childItems: Checks whether the folder has child items. Set to true to validate that the folder exists and has items; the activity blocks until at least one item is present in the folder or the timeout value is reached. Set to false to validate that the folder exists and is empty; the activity blocks until the folder is empty or the timeout value is reached. If no value is specified, the activity blocks until the folder exists or the timeout is reached. (Boolean; optional)
minimumSize: Minimum size of a file in bytes. If no value is specified, the default value is 0 bytes. (Integer; optional)
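
For example, a Validation activity is commonly placed in front of the activity that consumes the validated file, using an activity dependency. The following minimal sketch (the pipeline, dataset, and activity names are illustrative and not part of this article) blocks until a blob of at least 1 KB appears, and only then runs a copy:

{
    "name": "ValidateThenCopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "WaitForInputFile",
                "type": "Validation",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "InputBlobDataset",
                        "type": "DatasetReference"
                    },
                    "timeout": "0.01:00:00",
                    "sleep": 30,
                    "minimumSize": 1024
                }
            },
            {
                "name": "CopyValidatedFile",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "WaitForInputFile",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "inputs": [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "OutputBlobDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                }
            }
        ]
    }
}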

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute wait activity in Azure Data Factory

When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing
with execution of subsequent activities.
APPLIES TO: Azure Data Factory Azure Synapse Analytics

Syntax
{
"name": "MyWaitActivity",
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
}
}

Type properties
name: Name of the Wait activity. (String; required)
type: Must be set to Wait. (String; required)
waitTimeInSeconds: The number of seconds that the pipeline waits before continuing with the processing. (Integer; required)
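
Because a Wait activity simply pauses the pipeline, it is typically chained between two dependent activities. The following minimal sketch (the pipeline, activity, and dataset names are illustrative) copies data, waits 30 seconds, and then runs a second copy:

{
    "name": "PauseBetweenCopiesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyStagingData",
                "type": "Copy",
                "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "StagingDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                }
            },
            {
                "name": "WaitBeforeNextCopy",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "CopyStagingData", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "waitTimeInSeconds": 30
                }
            },
            {
                "name": "CopyToFinalStore",
                "type": "Copy",
                "dependsOn": [
                    { "activity": "WaitBeforeNextCopy", "dependencyConditions": [ "Succeeded" ] }
                ],
                "inputs": [ { "referenceName": "StagingDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "FinalDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                }
            }
        ]
    }
}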

Example
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.

Pipeline with Wait activity


In this example, the pipeline has two activities: Until and Wait . The Wait activity is configured to wait for one
second. The pipeline runs the Web activity in a loop with one second waiting time between each run.
{
"name": "DoUntilPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('Failed', coalesce(body('MyUnauthenticatedActivity')?.status,
actions('MyUnauthenticatedActivity')?.status, 'null'))",
"type": "Expression"
},
"timeout": "00:00:01",
"activities": [
{
"name": "MyUnauthenticatedActivity",
"type": "WebActivity",
"typeProperties": {
"method": "get",
"url": "https://www.fake.com/",
"headers": {
"Content-Type": "application/json"
}
},
"dependsOn": [
{
"activity": "MyWaitActivity",
"dependencyConditions": [ "Succeeded" ]
}
]
},
{
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
},
"name": "MyWaitActivity"
}
]
},
"name": "MyUntilActivity"
}
]
}
}

Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Web activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Web Activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets
and linked services to be consumed and accessed by the activity.

NOTE
Web Activity is supported for invoking URLs that are hosted in a private virtual network as well by leveraging self-hosted
integration runtime. The integration runtime should have a line of sight to the URL endpoint.

NOTE
The maximum supported output response payload size is 4 MB.

Syntax
{
"name":"MyWebActivity",
"type":"WebActivity",
"typeProperties":{
"method":"Post",
"url":"<URLEndpoint>",
"connectVia": {
"referenceName": "<integrationRuntimeName>",
"type": "IntegrationRuntimeReference"
},
"headers":{
"Content-Type":"application/json"
},
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
},
"datasets":[
{
"referenceName":"<ConsumedDatasetName>",
"type":"DatasetReference",
"parameters":{
...
}
}
],
"linkedServices":[
{
"referenceName":"<ConsumedLinkedServiceName>",
"type":"LinkedServiceReference"
}
]
}
}
Type properties
name: Name of the web activity. (String; required)
type: Must be set to WebActivity. (String; required)
method: REST API method for the target endpoint. Supported types: "GET", "POST", "PUT". (String; required)
url: Target endpoint and path. The activity will time out at 1 minute with an error if it does not receive a response from the endpoint. (String, or expression with resultType of string; required)
headers: Headers that are sent with the request. For example, to set the language and type on a request: "headers": { "Accept-Language": "en-us", "Content-Type": "application/json" }. (String, or expression with resultType of string; required. The Content-Type header is required: "headers": { "Content-Type": "application/json" }.)
body: Represents the payload that is sent to the endpoint. See the schema of the request payload in the Request payload schema section. (String, or expression with resultType of string; required for POST/PUT methods)
authentication: Authentication method used for calling the endpoint. Supported types are "Basic" or "ClientCertificate". For more information, see the Authentication section. If authentication is not required, exclude this property. (String, or expression with resultType of string; optional)
datasets: List of datasets passed to the endpoint. Can be an empty array. (Array of dataset references; required)
linkedServices: List of linked services passed to the endpoint. Can be an empty array. (Array of linked service references; required)
connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. (Integration runtime reference; optional)

NOTE
REST endpoints that the web activity invokes must return a response of type JSON. The activity will timeout at 1 minute
with an error if it does not receive a response from the endpoint.

The following table shows the requirements for JSON content:

JSON object: Supported in the request body; supported in the response body.
JSON array: Supported in the request body; unsupported in the response body. (At present, JSON arrays don't work as a result of a bug. A fix is in progress.)
JSON value: Supported in the request body; unsupported in the response body.
Non-JSON type: Unsupported in the request body; unsupported in the response body.

Authentication
Below are the supported authentication types in the web activity.
None
If authentication is not required, do not include the "authentication" property.
Basic
Specify user name and password to use with the basic authentication.

"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}

Client certificate
Specify base64-encoded contents of a PFX file and the password.
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}

Managed Identity
Specify the resource uri for which the access token will be requested using the managed identity for the data
factory. To call the Azure Resource Management API, use https://management.azure.com/ . For more information
about how managed identities work, see the managed identities for Azure resources overview page.

"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}
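
Putting the properties together, a complete Web activity that calls the Azure Resource Management API with the data factory's managed identity could look like the sketch below. The subscription ID placeholder and the api-version value are illustrative; substitute values appropriate for your environment.

{
    "name": "GetResourceGroups",
    "type": "WebActivity",
    "typeProperties": {
        "method": "GET",
        "url": "https://management.azure.com/subscriptions/<subscription-id>/resourcegroups?api-version=2021-04-01",
        "headers": {
            "Content-Type": "application/json"
        },
        "authentication": {
            "type": "MSI",
            "resource": "https://management.azure.com/"
        },
        "datasets": [],
        "linkedServices": []
    }
}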

NOTE
If your data factory is configured with a git repository, you must store your credentials in Azure Key Vault to use basic or
client certificate authentication. Azure Data Factory doesn't store passwords in git.

Request payload schema


When you use the POST/PUT method, the body property represents the payload that is sent to the endpoint. You
can pass linked services and datasets as part of the payload. Here is the schema for the payload:

{
"body": {
"myMessage": "Sample",
"datasets": [{
"name": "MyDataset1",
"properties": {
...
}
}],
"linkedServices": [{
"name": "MyStorageLinkedService1",
"properties": {
...
}
}]
}
}

Example
In this example, the web activity in the pipeline calls a REST end point. It passes an Azure SQL linked service and
an Azure SQL dataset to the endpoint. The REST end point uses the Azure SQL connection string to connect to
the logical SQL server and returns the name of the instance of SQL server.
Pipeline definition
{
"name": "<MyWebActivityPipeline>",
"properties": {
"activities": [
{
"name": "<MyWebActivity>",
"type": "WebActivity",
"typeProperties": {
"method": "Post",
"url": "@pipeline().parameters.url",
"headers": {
"Content-Type": "application/json"
},
"authentication": {
"type": "ClientCertificate",
"pfx": "*****",
"password": "*****"
},
"datasets": [
{
"referenceName": "MySQLDataset",
"type": "DatasetReference",
"parameters": {
"SqlTableName": "@pipeline().parameters.sqlTableName"
}
}
],
"linkedServices": [
{
"referenceName": "SqlLinkedService",
"type": "LinkedServiceReference"
}
]
}
}
],
"parameters": {
"sqlTableName": {
"type": "String"
},
"url": {
"type": "String"
}
}
}
}

Pipeline parameter values

{
"sqlTableName": "department",
"url": "https://adftes.azurewebsites.net/api/execute/running"
}

Web service endpoint code


[HttpPost]
public HttpResponseMessage Execute(JObject payload)
{
    Trace.TraceInformation("Start Execute");

    JObject result = new JObject();
    result.Add("status", "complete");

    JArray datasets = payload.GetValue("datasets") as JArray;
    result.Add("sinktable", datasets[0]["properties"]["typeProperties"]["tableName"].ToString());

    JArray linkedServices = payload.GetValue("linkedServices") as JArray;
    string connString = linkedServices[0]["properties"]["typeProperties"]["connectionString"].ToString();

    System.Data.SqlClient.SqlConnection sqlConn = new System.Data.SqlClient.SqlConnection(connString);

    result.Add("sinkServer", sqlConn.DataSource);

    Trace.TraceInformation("Stop Execute");

    return this.Request.CreateResponse(HttpStatusCode.OK, result);
}

Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Webhook activity in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


A webhook activity can control the execution of pipelines through your custom code. With the webhook activity,
Data Factory calls an endpoint that you provide and passes it a callback URL. The pipeline run waits for the callback
invocation before it proceeds to the next activity.

IMPORTANT
WebHook activity now allows you to surface error status and custom messages back to activity and pipeline. Set
reportStatusOnCallBack to true, and include StatusCode and Error in callback payload. For more information, see
Additional Notes section.

Syntax
{
"name": "MyWebHookActivity",
"type": "WebHook",
"typeProperties": {
"method": "POST",
"url": "<URLEndpoint>",
"headers": {
"Content-Type": "application/json"
},
"body": {
"key": "value"
},
"timeout": "00:03:00",
"reportStatusOnCallBack": false,
"authentication": {
"type": "ClientCertificate",
"pfx": "****",
"password": "****"
}
}
}

Type properties
name: The name of the webhook activity. (String; required)
type: Must be set to "WebHook". (String; required)
method: The REST API method for the target endpoint. The supported type is "POST". (String; required)
url: The target endpoint and path. (String, or expression with resultType of string; required)
headers: Headers that are sent with the request. Here's an example that sets the language and type on a request: "headers": { "Accept-Language": "en-us", "Content-Type": "application/json" }. (String, or expression with resultType of string; required. A Content-Type header like "headers": { "Content-Type": "application/json" } is required.)
body: Represents the payload that is sent to the endpoint. See Request payload schema for the schema of the request payload. (Valid JSON, or expression with resultType of JSON; required)
authentication: The authentication method used to call the endpoint. Supported types are "Basic" and "ClientCertificate". For more information, see Authentication. If authentication isn't required, exclude this property. (String, or expression with resultType of string; optional)
timeout: How long the activity waits for the callback specified by callBackUri to be invoked. The default value is 10 minutes ("00:10:00"). Values have the TimeSpan format d.hh:mm:ss. (String; optional)
reportStatusOnCallBack (Report status on callback): Lets a user report the failed status of a webhook activity. (Boolean; optional)

Authentication
A webhook activity supports the following authentication types.
None
If authentication isn't required, don't include the authentication property.
Basic
Specify the username and password to use with basic authentication.
"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}

Client certificate
Specify the Base64-encoded contents of a PFX file and a password.

"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}

Managed identity
Use the data factory's managed identity to specify the resource URI for which the access token is requested. To
call the Azure Resource Management API, use https://management.azure.com/ . For more information about how
managed identities work, see the managed identities for Azure resources overview.

"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}

NOTE
If your data factory is configured with a Git repository, you must store your credentials in Azure Key Vault to use basic or
client-certificate authentication. Azure Data Factory doesn't store passwords in Git.

Additional notes
Data Factory passes the additional property callBackUri in the body sent to the URL endpoint. Data Factory
expects this URI to be invoked before the specified timeout value. If the URI isn't invoked, the activity fails with
the status "TimedOut".
The webhook activity fails when the call to the custom endpoint fails. Any error message can be added to the
callback body and used in a later activity.
For every REST API call, the client times out if the endpoint doesn't respond within one minute. This behavior is
standard HTTP best practice. To fix this problem, implement a 202 pattern. In the current case, the endpoint
returns 202 (Accepted) and the client polls.
The one-minute timeout on the request has nothing to do with the activity timeout. The latter is used to wait for
the callback specified by callbackUri .
The body passed back to the callback URI must be valid JSON. Set the Content-Type header to
application/json .

When you use the Repor t status on callback property, you must add the following code to the body when
you make the callback:
{
"Output": {
// output object is used in activity output
"testProp": "testPropValue"
},
"Error": {
// Optional, set it when you want to fail the activity
"ErrorCode": "testErrorCode",
"Message": "error message to show in activity error"
},
"StatusCode": "403" // when status code is >=400, activity is marked as failed
}

Next steps
See the following control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Mapping data flow transformation overview

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Below is a list of the transformations currently supported in mapping data flow. Click on each transformation to
learn its configuration details.

Aggregate (Schema modifier): Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
Alter row (Row modifier): Set insert, delete, update, and upsert policies on rows.
Conditional split (Multiple inputs/outputs): Route rows of data to different streams based on matching conditions.
Derived column (Schema modifier): Generate new columns or modify existing fields using the data flow expression language.
Exists (Multiple inputs/outputs): Check whether your data exists in another source or stream.
Filter (Row modifier): Filter a row based upon a condition.
Flatten (Schema modifier): Take array values inside hierarchical structures such as JSON and unroll them into individual rows.
Join (Multiple inputs/outputs): Combine data from two sources or streams.
Lookup (Multiple inputs/outputs): Reference data from another source.
New branch (Multiple inputs/outputs): Apply multiple sets of operations and transformations against the same data stream.
Parse (Formatter): Parse text columns in your data stream that are strings of JSON, delimited text, or XML formatted text.
Pivot (Schema modifier): An aggregation where one or more grouping columns has its distinct row values transformed into individual columns.
Rank (Schema modifier): Generate an ordered ranking based upon sort conditions.
Select (Schema modifier): Alias columns and stream names, and drop or reorder columns.
Sink: A final destination for your data.
Sort (Row modifier): Sort incoming rows on the current data stream.
Source: A data source for the data flow.
Surrogate key (Schema modifier): Add an incrementing non-business arbitrary key value.
Union (Multiple inputs/outputs): Combine multiple data streams vertically.
Unpivot (Schema modifier): Pivot columns into row values.
Window (Schema modifier): Define window-based aggregations of columns in your data streams.
Aggregate transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Aggregate transformation defines aggregations of columns in your data streams. Using the Expression
Builder, you can define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing
or computed columns.

Group by
Select an existing column or create a new computed column to use as a group by clause for your aggregation.
To use an existing column, select it from the dropdown. To create a new computed column, hover over the clause
and click Computed column . This opens the data flow expression builder. Once you create your computed
column, enter the output column name under the Name as field. If you wish to add an additional group by
clause, hover over an existing clause and click the plus icon.

A group by clause is optional in an Aggregate transformation.

Aggregate columns
Go to the Aggregates tab to build aggregation expressions. You can either overwrite an existing column with
an aggregation, or create a new field with a new name. The aggregation expression is entered in the right-hand
box next to the column name selector. To edit the expression, click on the text box and open the expression
builder. To add more aggregate columns, click on Add above the column list or the plus icon next to an existing
aggregate column. Choose either Add column or Add column pattern . Each aggregation expression must
contain at least one aggregate function.
NOTE
In Debug mode, the expression builder cannot produce data previews with aggregate functions. To view data previews for
aggregate transformations, close the expression builder and view the data via the 'Data Preview' tab.

Column patterns
Use column patterns to apply the same aggregation to a set of columns. This is useful if you wish to persist
many columns from the input schema as they are dropped by default. Use a heuristic such as first() to persist
input columns through the aggregation.

Reconnect rows and columns


Aggregate transformations are similar to SQL aggregate select queries. Columns that aren't included in your
group by clause or aggregate functions won't flow through to the output of your aggregate transformation. If
you wish to include other columns in your aggregated output, do one of the following methods:
Use an aggregate function such as last() or first() to include that additional column.
Rejoin the columns to your output stream using the self join pattern.

Removing duplicate rows


A common use of the aggregate transformation is removing or identifying duplicate entries in source data. This
process is known as deduplication. Based upon a set of group by keys, use a heuristic of your choosing to
determine which duplicate row to keep. Common heuristics are first() , last() , max() , and min() . Use
column patterns to apply the rule to every column except for the group by columns.
In the above example, columns ProductID and Name are being used for grouping. If two rows have the same
values for those two columns, they're considered duplicates. In this aggregate transformation, the values of the
first row matched will be kept and all others will be dropped. Using column pattern syntax, all columns whose
names aren't ProductID and Name are mapped to their existing column name and given the value of the first
matched rows. The output schema is the same as the input schema.
For data validation scenarios, the count() function can be used to count how many duplicates there are.
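
As a minimal data flow script sketch (the stream name SourceData and the transformation name DeduplicateRows are hypothetical), a deduplication that groups on the ProductID and Name columns from the example above and keeps the first matched value of every other column could look like this:

SourceData aggregate(
    groupBy(ProductID, Name),
    each(match(name != 'ProductID' && name != 'Name'), $$ = first($$))
) ~> DeduplicateRows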

Data flow script


Syntax

<incomingStream>
aggregate(
groupBy(
<groupByColumnName> = <groupByExpression1>,
<groupByExpression2>
),
<aggregateColumn1> = <aggregateExpression1>,
<aggregateColumn2> = <aggregateExpression2>,
each(
match(matchExpression),
<metadataColumn1> = <metadataExpression1>,
<metadataColumn2> = <metadataExpression2>
)
) ~> <aggregateTransformationName>

Example
The below example takes an incoming stream MoviesYear and groups rows by column year . The
transformation creates an aggregate column avgrating that evaluates to the average of column Rating . This
aggregate transformation is named AvgComedyRatingsByYear .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below.

MoviesYear aggregate(
groupBy(year),
avgrating = avg(toInteger(Rating))
) ~> AvgComedyRatingByYear

MoviesYear: Derived Column transformation defining the year and title columns
AvgComedyRatingByYear: Aggregate transformation for the average rating of comedies grouped by year
avgrating: Name of the new column being created to hold the aggregated value

MoviesYear aggregate(groupBy(year),
avgrating = avg(toInteger(Rating))) ~> AvgComedyRatingByYear

Next steps
Define window-based aggregation using the Window transformation
Alter row transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows. You can add one-to-
many conditions as expressions. These conditions should be specified in order of priority, as each row will be
marked with the policy corresponding to the first-matching expression. Each of those conditions can result in a
row (or rows) being inserted, updated, deleted, or upserted. Alter Row can produce both DDL & DML actions
against your database.

Alter Row transformations will only operate on database or CosmosDB sinks in your data flow. The actions that
you assign to rows (insert, update, delete, upsert) won't occur during debug sessions. Run an Execute Data Flow
activity in a pipeline to enact the alter row policies on your database tables.

Specify a default row policy


Create an Alter Row transformation and specify a row policy with a condition of true() . Each row that doesn't
match any of the previously defined expressions will be marked for the specified row policy. By default, each row
that doesn't match any conditional expression will be marked for Insert .

NOTE
To mark all rows with one policy, you can create a condition for that policy and specify the condition as true() .
View policies in data preview
Use debug mode to view the results of your alter row policies in the data preview pane. A data preview of an
alter row transformation won't produce DDL or DML actions against your target.

Each alter row policy is represented by an icon that indicates whether an insert, update, upsert, or deleted action
will occur. The top header shows how many rows are affected by each policy in the preview.

Allow alter row policies in sink


For the alter row policies to work, the data stream must write to a database or Cosmos sink. In the Settings tab
in your sink, enable which alter row policies are allowed for that sink.

The default behavior is to only allow inserts. To allow updates, upserts, or deletes, check the box in the sink
corresponding to that condition. If updates, upserts, or deletes are enabled, you must specify which key columns
in the sink to match on.

NOTE
If your inserts, updates, or upserts modify the schema of the target table in the sink, the data flow will fail. To modify the
target schema in your database, choose Recreate table as the table action. This will drop and recreate your table with
the new schema definition.

The sink transformation requires either a single key or a series of keys for unique row identification in your
target database. For SQL sinks, set the keys in the sink settings tab. For CosmosDB, set the partition key in the
settings and also set the CosmosDB system field "id" in your sink mapping. For CosmosDB, it is mandatory to
include the system column "id" for updates, upserts, and deletes.

Merges and upserts with Azure SQL Database and Synapse


ADF Data Flows supports merges against Azure SQL Database and Synapse database pool (data warehouse)
with the upsert option.
However, you may run into scenarios where your target database schema utilizes the identity property of key
columns. ADF requires you to identify the keys that you will use to match the row values for updates and
upserts. But if the target column has the identity property set and you are using the upsert policy, the target
database will not allow you to write to the column. You may also run into errors when you try to upsert against
a distributed table's distribution column.
Here are ways to fix that:
1. Go to the Sink transformation Settings and set "Skip writing key columns". This will tell ADF to not write
the column that you have selected as the key value for your mapping.
2. If that key column is not the column that is causing the issue for identity columns, then you can use the
Sink transformation pre-processing SQL option: SET IDENTITY_INSERT tbl_content ON . Then, turn it off
with the post-processing SQL property: SET IDENTITY_INSERT tbl_content OFF .
3. For both the identity case and the distribution column case, you can switch your logic from Upsert to
using a separate update condition and a separate insert condition using a Conditional Split
transformation. This way, you can set the mapping on the update path to ignore the key column mapping.

Data flow script


Syntax

<incomingStream>
alterRow(
insertIf(<condition>?),
updateIf(<condition>?),
deleteIf(<condition>?),
upsertIf(<condition>?),
) ~> <alterRowTransformationName>

Example
The below example is an alter row transformation named CleanData that takes an incoming stream
SpecifyUpsertConditions and creates three alter row conditions. In the previous transformation, a column
named alterRowCondition is calculated that determines whether or not a row is inserted, updated, or deleted in
the database. If the value of the column has a string value that matches the alter row rule, it is assigned that
policy.
In the Data Factory UX, this transformation looks like the below image:

The data flow script for this transformation is in the snippet below:

SpecifyUpsertConditions alterRow(insertIf(alterRowCondition == 'insert'),


updateIf(alterRowCondition == 'update'),
deleteIf(alterRowCondition == 'delete')) ~> AlterRow
Next steps
After the Alter Row transformation, you may want to sink your data into a destination data store.
Conditional split transformation in mapping data
flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The conditional split transformation routes data rows to different streams based on matching conditions. The
conditional split transformation is similar to a CASE decision structure in a programming language. The
transformation evaluates expressions, and based on the results, directs the data row to the specified stream.

Configuration
The Split on setting determines whether the row of data flows to the first matching stream or every stream it
matches to.
Use the data flow expression builder to enter an expression for the split condition. To add a new condition, click
on the plus icon in an existing row. A default stream can be added as well for rows that don't match any
condition.

Data flow script


Syntax

<incomingStream>
split(
<conditionalExpression1>
<conditionalExpression2>
...
disjoint: {true | false}
) ~> <splitTx>@(stream1, stream2, ..., <defaultStream>)

Example
The below example is a conditional split transformation named SplitByYear that takes in incoming stream
CleanData . This transformation has two split conditions year < 1960 and year > 1980 . disjoint is false
because the data goes to the first matching condition. Every row matching the first condition goes to output
stream moviesBefore1960 . All remaining rows matching the second condition go to output stream
moviesAfter1980. All other rows flow through the default stream AllOtherMovies.
In the Data Factory UX, this transformation looks like the below image:

The data flow script for this transformation is in the snippet below:

CleanData
split(
year < 1960,
year > 1980,
disjoint: false
) ~> SplitByYear@(moviesBefore1960, moviesAfter1980, AllOtherMovies)

Next steps
Common data flow transformations used with conditional split are the join transformation, lookup
transformation, and the select transformation
Derived column transformation in mapping data
flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the derived column transformation to generate new columns in your data flow or to modify existing fields.

Create and update columns


When creating a derived column, you can either generate a new column or update an existing one. In the
Column textbox, enter in the column you are creating. To override an existing column in your schema, you can
use the column dropdown. To build the derived column's expression, click on the Enter expression textbox. You
can either start typing your expression or open up the expression builder to construct your logic.

To add more derived columns, click on Add above the column list or the plus icon next to an existing derived
column. Choose either Add column or Add column pattern .

Column patterns
In cases where your schema is not explicitly defined or if you want to update a set of columns in bulk, you will
want to create column patterns. Column patterns allow you to match columns using rules based upon the
column metadata and create derived columns for each matched column. For more information, learn how to
build column patterns in the derived column transformation.
Building schemas using the expression builder
When using the mapping data flow expression builder, you can create, edit, and manage your derived columns
in the Derived Columns section. All columns that are created or changed in the transformation are listed.
Interactively choose which column or pattern you are editing by clicking on the column name. To add an
additional column select Create new and choose whether you wish to add a single column or a pattern.

When working with complex columns, you can create subcolumns. To do this, click on the plus icon next to any
column and select Add subcolumn . For more information on handling complex types in data flow, see JSON
handling in mapping data flow.

Locals
If you are sharing logic across multiple columns or want to compartmentalize your logic, you can create a local
within a derived column transformation. A local is a set of logic that doesn't get propagated downstream to the
following transformation. Locals can be created within the expression builder by going to Expression
elements and selecting Locals . Create a new one by selecting Create new .
Locals can reference any expression element a derived column including functions, input schema, parameters,
and other locals. When referencing other locals, order does matter as the referenced local needs to be "above"
the current one.

To reference a local in a derived column, either click on the local from the Expression elements view or
reference it with a colon in front of its name. For example, a local called local1 would be referenced by :local1 .
To edit a local definition, hover over it in the expression elements view and click on the pencil icon.
Data flow script
Syntax
<incomingStream>
derive(
<columnName1> = <expression1>,
<columnName2> = <expression2>,
each(
match(matchExpression),
<metadataColumn1> = <metadataExpression1>,
<metadataColumn2> = <metadataExpression2>
)
) ~> <deriveTransformationName>

Example
The below example is a derived column named CleanData that takes an incoming stream MoviesYear and
creates two derived columns. The first derived column replaces column Rating with Rating's value as an integer
type. The second derived column is a pattern that matches each column whose name starts with 'movies'. For
each matched column, it creates a column movie that is equal to the value of the matched column prefixed with
'movie_'.
In the Data Factory UX, this transformation looks like the below image:

The data flow script for this transformation is in the snippet below:

MoviesYear derive(
Rating = toInteger(Rating),
each(
match(startsWith(name,'movies')),
'movie' = 'movie_' + toString($$)
)
) ~> CleanData

Next steps
Learn more about the Mapping Data Flow expression language.
Exists transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The exists transformation is a row filtering transformation that checks whether your data exists in another
source or stream. The output stream includes all rows in the left stream that either exist or don't exist in the right
stream. The exists transformation is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS .

Configuration
1. Choose which data stream you're checking for existence in the Right stream dropdown.
2. Specify whether you're looking for the data to exist or not exist in the Exist type setting.
3. Select whether or not you want a Custom expression .
4. Choose which key columns you want to compare as your exists conditions. By default, data flow looks for
equality between one column in each stream. To compare via a computed value, hover over the column
dropdown and select Computed column .

Multiple exists conditions


To compare multiple columns from each stream, add a new exists condition by clicking the plus icon next to an
existing row. Each additional condition is joined by an "and" statement. Comparing two columns is the same as
the following expression:
source1@column1 == source2@column1 && source1@column2 == source2@column2

Custom expression
To create a free-form expression that contains operators other than "and" and "equals to", select the Custom
expression field. Enter a custom expression via the data flow expression builder by clicking on the blue box.
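
As a sketch (the stream and column names are hypothetical), the following exists transformation uses a custom expression that combines an equality check with a comparison operator, keeping only the Orders rows for which an ArchivedOrders row exists with the same OrderID and an amount at least as large:

Orders, ArchivedOrders exists(
    Orders@OrderID == ArchivedOrders@OrderID && ArchivedOrders@Amount >= Orders@Amount,
    negate: false,
    broadcast: 'auto'
) ~> FullyArchivedOrders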
Broadcast optimization

In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.

Data flow script


Syntax

<leftStream>, <rightStream>
exists(
<conditionalExpression>,
negate: { true | false },
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <existsTransformationName>

Example
The below example is an exists transformation named checkForChanges that takes left stream NameNorm2 and
right stream TypeConversions . The exists condition is the expression
NameNorm2@EmpID == TypeConversions@EmpID && NameNorm2@Region == DimEmployees@Region that returns true if both
the EMPID and Region columns in each stream matches. As we're checking for existence, negate is false. We
aren't enabling fixed broadcasting in the optimize tab, so broadcast has value 'auto' .
In the Data Factory UX, this transformation looks like the below image:

The data flow script for this transformation is in the snippet below:

NameNorm2, TypeConversions
exists(
NameNorm2@EmpID == TypeConversions@EmpID && NameNorm2@Region == DimEmployees@Region,
negate:false,
broadcast: 'auto'
) ~> checkForChanges

Next steps
Similar transformations are Lookup and Join.
Filter transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Filter transformation allows row filtering based upon a condition. The output stream includes all rows that
match the filtering condition. The filter transformation is similar to a WHERE clause in SQL.

Configuration
Use the data flow expression builder to enter an expression for the filter condition. To open the expression
builder, click on the blue box. The filter condition must be of type boolean. For more information on how to
create an expression, see the expression builder documentation.

Data flow script


Syntax

<incomingStream>
filter(
<conditionalExpression>
) ~> <filterTransformationName>

Example
The below example is a filter transformation named FilterBefore1960 that takes in incoming stream CleanData .
The filter condition is the expression year <= 1960 .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:

CleanData
filter(
year <= 1960
) ~> FilterBefore1960

Next steps
Filter out columns with the select transformation
Flatten transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the flatten transformation to take array values inside hierarchical structures such as JSON and unroll them
into individual rows. This process is known as denormalization.

Configuration
The flatten transformation contains the following configuration settings

Unroll by
Select an array to unroll. The output data will have one row per item in each array. If the unroll by array in the
input row is null or empty, there will be one output row with unrolled values as null.
Unroll root
By default, the flatten transformation unrolls an array to the top of the hierarchy it exists in. You can optionally
select an array as your unroll root. The unroll root must be an array of complex objects that either is or contains
the unroll by array. If an unroll root is selected, the output data will contain at least one row per items in the
unroll root. If the input row doesn't have any items in the unroll root, it will be dropped from the output data.
Choosing an unroll root will always output a less than or equal number of rows than the default behavior.
Flatten mapping
Similar to the select transformation, choose the projection of the new structure from incoming fields and the
denormalized array. If a denormalized array is mapped, the output column will be the same data type as the
array. If the unroll by array is an array of complex objects that contains subarrays, mapping an item of that
subarray will output an array.
Refer to the inspect tab and data preview to verify your mapping output.

Rule-based mapping
The flatten transformation supports rule-based mapping allowing you to create dynamic and flexible
transformations that will flatten arrays based on rules and flatten structures based on hierarchy levels.
Matching condition
Enter a pattern matching condition for the column or columns that you wish to flatten using either exact
matching or patterns. Example: like(name,'cust%')
Deep column traversal
Optional setting that tells ADF to handle all subcolumns of a complex object individually instead of handling the
complex object as a whole column.
Hierarchy level
Choose the level of the hierarchy that you would like to expand.
Name matches (regex)
Optionally choose to express your name matching as a regular expression in this box, instead of using the
matching condition above.

Examples
Refer to the following JSON object for the below examples of the flatten transformation

{
"name":"MSFT","location":"Redmond", "satellites": ["Bay Area", "Shanghai"],
"goods": {
"trade":true, "customers":["government", "distributer", "retail"],
"orders":[
{"orderId":1,"orderTotal":123.34,"shipped":{"orderItems":[{"itemName":"Laptop","itemQty":20},
{"itemName":"Charger","itemQty":2}]}},
{"orderId":2,"orderTotal":323.34,"shipped":{"orderItems":[{"itemName":"Mice","itemQty":2},
{"itemName":"Keyboard","itemQty":1}]}}
]}}
{"name":"Company1","location":"Seattle", "satellites": ["New York"],
"goods":{"trade":false, "customers":["store1", "store2"],
"orders":[
{"orderId":4,"orderTotal":123.34,"shipped":{"orderItems":[{"itemName":"Laptop","itemQty":20},
{"itemName":"Charger","itemQty":3}]}},
{"orderId":5,"orderTotal":343.24,"shipped":{"orderItems":[{"itemName":"Chair","itemQty":4},
{"itemName":"Lamp","itemQty":2}]}}
]}}
{"name": "Company2", "location": "Bellevue",
"goods": {"trade": true, "customers":["Bank"], "orders": [{"orderId": 4, "orderTotal": 123.34}]}}
{"name": "Company3", "location": "Kirkland"}

No unroll root with string array


Unroll by: goods.customers
Unroll root: None
Projection:
    name
    customer = goods.customers

Output

{ 'MSFT', 'government'}
{ 'MSFT', 'distributer'}
{ 'MSFT', 'retail'}
{ 'Company1', 'store1'}
{ 'Company1', 'store2'}
{ 'Company2', 'Bank'}
{ 'Company3', null}

No unroll root with complex array


Unroll by: goods.orders.shipped.orderItems
Unroll root: None
Projection:
    name
    orderId = goods.orders.orderId
    itemName = goods.orders.shipped.orderItems.itemName
    itemQty = goods.orders.shipped.orderItems.itemQty
    location = location

Output

{ 'MSFT', 1, 'Laptop', 20, 'Redmond'}


{ 'MSFT', 1, 'Charger', 2, 'Redmond'}
{ 'MSFT', 2, 'Mice', 2, 'Redmond'}
{ 'MSFT', 2, 'Keyboard', 1, 'Redmond'}
{ 'Company1', 4, 'Laptop', 20, 'Seattle'}
{ 'Company1', 4, 'Charger', 3, 'Seattle'}
{ 'Company1', 5, 'Chair', 4, 'Seattle'}
{ 'Company1', 5, 'Lamp', 2, 'Seattle'}
{ 'Company2', 4, null, null, 'Bellevue'}
{ 'Company3', null, null, null, 'Kirkland'}

Same root as unroll array


Unroll by: goods.orders
Unroll root: goods.orders
Projection:
    name
    goods.orders.shipped.orderItems.itemName
    goods.customers
    location

Output

{ 'MSFT', ['Laptop','Charger'], ['government','distributer','retail'], 'Redmond'}


{ 'MSFT', ['Mice', 'Keyboard'], ['government','distributer','retail'], 'Redmond'}
{ 'Company1', ['Laptop','Charger'], ['store1', 'store2'], 'Seattle'}
{ 'Company1', ['Chair', 'Lamp'], ['store1', 'store2'], 'Seattle'}
{ 'Company2', null, ['Bank'], 'Bellevue'}

Unroll root with complex array


Unroll by: goods.orders.shipped.orderItems
Unroll root: goods.orders
Projection:
    name
    orderId = goods.orders.orderId
    itemName = goods.orders.shipped.orderItems.itemName
    itemQty = goods.orders.shipped.orderItems.itemQty
    location = location

Output

{ 'MSFT', 1, 'Laptop', 20, 'Redmond'}


{ 'MSFT', 1, 'Charger', 2, 'Redmond'}
{ 'MSFT', 2, 'Mice', 2, 'Redmond'}
{ 'MSFT', 2, 'Keyboard', 1, 'Redmond'}
{ 'Company1', 4, 'Laptop', 20, 'Seattle'}
{ 'Company1', 4, 'Charger', 3, 'Seattle'}
{ 'Company1', 5, 'Chair', 4, 'Seattle'}
{ 'Company1', 5, 'Lamp', 2, 'Seattle'}
{ 'Company2', 4, null, null, 'Bellevue'}

Data flow script


Syntax

<incomingStream>
foldDown(unroll(<unroll cols>),
mapColumn(
name,
each(<array>(type == '<arrayDataType>')),
each(<array>, match(true())),
location
)) ~> <transformationName>

Example

source foldDown(unroll(goods.orders.shipped.orderItems, goods.orders),


mapColumn(
name,
orderId = goods.orders.orderId,
itemName = goods.orders.shipped.orderItems.itemName,
itemQty = goods.orders.shipped.orderItems.itemQty,
location = location
),
skipDuplicateMapInputs: false,
skipDuplicateMapOutputs: false)

Next steps
Use the Pivot transformation to pivot rows to columns.
Use the Unpivot transformation to pivot columns to rows.
Join transformation in mapping data flow

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the join transformation to combine data from two sources or streams in a mapping data flow. The output
stream will include all columns from both sources matched based on a join condition.

Join types
Mapping data flows currently supports five different join types.
Inner Join
Inner join only outputs rows that have matching values in both tables.
Left Outer
Left outer join returns all rows from the left stream and matched records from the right stream. If a row from
the left stream has no match, the output columns from the right stream are set to NULL. The output will be the
rows returned by an inner join plus the unmatched rows from the left stream.

NOTE
The Spark engine used by data flows will occasionally fail due to possible cartesian products in your join conditions. If this
occurs, you can switch to a custom cross join and manually enter your join condition. This may result in slower
performance in your data flows as the execution engine may need to calculate all rows from both sides of the relationship
and then filter rows.

Right Outer
Right outer join returns all rows from the right stream and matched records from the left stream. If a row from
the right stream has no match, the output columns from the left stream are set to NULL. The output will be the
rows returned by an inner join plus the unmatched rows from the right stream.
Full Outer
Full outer join outputs all columns and rows from both sides with NULL values for columns that aren't matched.
Custom cross join
Cross join outputs the cross product of the two streams based upon a condition. If you're using a condition that
isn't equality, specify a custom expression as your cross join condition. The output stream will be all rows that
meet the join condition.
You can use this join type for non-equi joins and OR conditions.
If you would like to explicitly produce a full cartesian product, use the Derived Column transformation in each of
the two independent streams before the join to create a synthetic key to match on. For example, create a new
column in Derived Column in each stream called SyntheticKey and set it equal to 1 . Then use
a.SyntheticKey == b.SyntheticKey as your custom join expression.
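For illustration only, a minimal sketch of this pattern in data flow script follows. The stream names LeftStream and RightStream, the derived SyntheticKey column, and the output name are examples, not generated output; the join and derive syntax mirrors the examples later on this page.

LeftStream derive(SyntheticKey = 1) ~> LeftWithKey
RightStream derive(SyntheticKey = 1) ~> RightWithKey
LeftWithKey, RightWithKey
    join(
        LeftWithKey@SyntheticKey == RightWithKey@SyntheticKey,
        joinType:'cross',
        broadcast: 'none'
    )~> FullCartesianProduct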
NOTE
Make sure to include at least one column from each side of your left and right relationship in a custom cross join.
Executing cross joins with static values instead of columns from each side results in full scans of the entire dataset, causing
your data flow to perform poorly.

Configuration
1. Choose which data stream you're joining with in the Right stream dropdown.
2. Select your Join type
3. Choose which key columns you want to match on for your join condition. By default, data flow looks for
equality between one column in each stream. To compare via a computed value, hover over the column
dropdown and select Computed column .

Non-equi joins
To use a conditional operator such as not equals (!=) or greater than (>) in your join conditions, change the
operator dropdown between the two columns. Non-equi joins require at least one of the two streams to be
broadcasted using Fixed broadcasting in the Optimize tab.
Optimizing join performance
Unlike merge join in tools like SSIS, the join transformation isn't a mandatory merge join operation. The join
keys don't require sorting. The join operation occurs based on the optimal join operation in Spark, either
broadcast or map-side join.

In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.

Self-Join
To self-join a data stream with itself, alias an existing stream with a select transformation. Create a new branch
by clicking on the plus icon next to a transformation and selecting New branch . Add a select transformation to
alias the original stream. Add a join transformation and choose the original stream as the Left stream and the
select transformation as the Right stream .
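As a sketch of this pattern in data flow script (the SalesData stream, its customerId/managerId/salesAmount columns, and the transformation names are illustrative examples, not part of any sample dataset), the alias and self-join might look like this:

SalesData select(mapColumn(
        customerId,
        managerId,
        salesAmount
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SalesAlias
SalesData, SalesAlias
    join(
        SalesData@managerId == SalesAlias@customerId,
        joinType:'inner',
        broadcast: 'auto'
    )~> SelfJoin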
Testing join conditions
When testing the join transformations with data preview in debug mode, use a small set of known data. When
sampling rows from a large dataset, you can't predict which rows and keys will be read for testing. The result is
non-deterministic, meaning that your join conditions may not return any matches.

Data flow script


Syntax

<leftStream>, <rightStream>
join(
<conditionalExpression>,
joinType: { 'inner' | 'outer' | 'left_outer' | 'right_outer' | 'cross' },
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <joinTransformationName>

Inner join example


The below example is a join transformation named JoinMatchedData that takes left stream TripData and right
stream TripFare . The join condition is the expression
hack_license == { hack_license} && TripData@medallion == TripFare@medallion && vendor_id == { vendor_id} &&
pickup_datetime == { pickup_datetime}
that returns true if the hack_license , medallion , vendor_id , and pickup_datetime columns in each stream
match. The joinType is 'inner' . We're enabling broadcasting in only the left stream so broadcast has value
'left' .

In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:

TripData, TripFare
join(
hack_license == { hack_license}
&& TripData@medallion == TripFare@medallion
&& vendor_id == { vendor_id}
&& pickup_datetime == { pickup_datetime},
joinType:'inner',
broadcast: 'left'
)~> JoinMatchedData

Custom cross join example


The below example is a join transformation named JoiningColumns that takes left stream LeftStream and right
stream RightStream . This transformation takes in two streams and joins together all rows where column
leftstreamcolumn is greater than column rightstreamcolumn . The joinType is 'cross' . Broadcasting isn't
enabled, so broadcast has the value 'none' .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:

LeftStream, RightStream
join(
leftstreamcolumn > rightstreamcolumn,
joinType:'cross',
broadcast: 'none'
)~> JoiningColumns

Next steps
After joining data, create a derived column and sink your data to a destination data store.
Lookup transformation in mapping data flow
5/11/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the lookup transformation to reference data from another source in a data flow stream. The lookup
transformation appends columns from matched data to your source data.
A lookup transformation is similar to a left outer join. All rows from the primary stream will exist in the output
stream with additional columns from the lookup stream.

Configuration

Primary stream: The incoming stream of data. This stream is equivalent to the left side of a join.
Lookup stream: The data that is appended to the primary stream. Which data is added is determined by the
lookup conditions. This stream is equivalent to the right side of a join.
Match multiple rows: If enabled, a row with multiple matches in the primary stream will return multiple rows.
Otherwise, only a single row will be returned based upon the 'Match on' condition.
Match on: Only visible if 'Match multiple rows' is not selected. Choose whether to match on any row, the first
match, or the last match. Any row is recommended as it executes the fastest. If first row or last row is selected,
you'll be required to specify sort conditions.
Lookup conditions: Choose which columns to match on. If the equality condition is met, then the rows will be
considered a match. Hover and select 'Computed column' to extract a value using the data flow expression
language.
All columns from both streams are included in the output data. To drop duplicate or unwanted columns, add a
select transformation after your lookup transformation. Columns can also be dropped or renamed in a sink
transformation.
Non-equi joins
To use a conditional operator such as not equals (!=) or greater than (>) in your lookup conditions, change the
operator dropdown between the two columns. Non-equi joins require at least one of the two streams to be
broadcasted using Fixed broadcasting in the Optimize tab.
Analyzing matched rows
After your lookup transformation, the function isMatch() can be used to see if the lookup matched for
individual rows.

An example of this pattern is using the conditional split transformation to split on the isMatch() function. In the
example above, matching rows go through the top stream and non-matching rows flow through the NoMatch
stream.
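A hedged sketch of that pattern in data flow script is shown below. The Orders and ErrorCodes streams, their errorCode/code columns, and the output stream names are illustrative, and the split step follows the conditional split transformation's script pattern:

Orders, ErrorCodes lookup(errorCode == code,
    multiple: false,
    pickup: 'any',
    broadcast: 'auto')~> LookupErrorMessage
LookupErrorMessage split(
    isMatch(),
    disjoint: false) ~> SplitMatches@(Matched, NoMatch)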

Testing lookup conditions


When testing the lookup transformation with data preview in debug mode, use a small set of known data. When
sampling rows from a large dataset, you can't predict which rows and keys will be read for testing. The result is
non-deterministic, meaning that your join conditions may not return any matches.

Broadcast optimization
In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.

Cached lookup
If you're doing multiple smaller lookups on the same source, a cache sink and cached lookup may be a better choice
than the lookup transformation. Common examples where a cache sink may be better are looking up a max
value on a data store and matching error codes to an error message database. For more information, learn
about cache sinks and cached lookups.

Data flow script


Syntax

<leftStream>, <rightStream>
lookup(
<lookupConditionExpression>,
multiple: { true | false },
pickup: { 'first' | 'last' | 'any' }, ## Only required if false is selected for multiple
{ desc | asc }( <sortColumn>, { true | false }), ## Only required if 'first' or 'last' is selected.
true/false determines whether to put nulls first
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <lookupTransformationName>

Example

The data flow script for the above lookup configuration is in the code snippet below.
SQLProducts, DimProd lookup(ProductID == ProductKey,
multiple: false,
pickup: 'first',
asc(ProductKey, true),
broadcast: 'auto')~> LookupKeys

Next steps
The join and exists transformations both take in multiple stream inputs
Use a conditional split transformation with isMatch() to split rows on matching and non-matching values
Creating a new branch in mapping data flow
4/17/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Add a new branch to do multiple sets of operations and transformations against the same data stream. Adding a
new branch is useful when you want to use the same source for multiple sinks or for self-joining data
together.
A new branch can be added from the transformation list similar to other transformations. New Branch will only
be available as an action when there's an existing transformation following the transformation you're attempting
to branch.

In the below example, the data flow is reading taxi trip data. Output aggregated by both day and vendor is
required. Instead of creating two separate data flows that read from the same source, a new branch can be
added. This way both aggregations can be executed as part of the same data flow.
NOTE
When clicking the plus (+) to add transformations to your graph, you will only see the New Branch option when there are
subsequent transformation blocks. This is because New Branch creates a reference to the existing stream and requires
further upstream processing to operate on. If you do not see the New Branch option, add a Derived Column or other
transformation first, then return to the previous block and you will see New Branch as an option.

Next steps
After branching, you may want to use the data flow transformations
Parse transformation in mapping data flow
5/11/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the Parse transformation to parse columns in your data that are in document form. The current supported
types of embedded documents that can be parsed are JSON, XML, and delimited text.

Configuration
In the parse transformation configuration panel, you will first pick the type of data contained in the columns that
you wish to parse inline. The parse transformation also contains the following configuration settings.

Column
Similar to derived columns and aggregates, this is where you will either modify an existing column by selecting it
from the drop-down picker or type in the name of a new column. ADF will store the parsed source
data in this column. In most cases, you will want to define a new column that parses the incoming embedded
document field.
Expression
Use the expression builder to set the source for your parsing. This can be as simple as just selecting the source
column with the self-contained data that you wish to parse, or you can create complex expressions to parse.
Example expressions
Source string data: chrome|steel|plastic

Expression: (desc1 as string, desc2 as string, desc3 as string)

Source JSON data:


{"ts":1409318650332,"userId":"309","sessionId":1879,"page":"NextSong","auth":"Logged
In","method":"PUT","status":200,"level":"free","itemInSession":2,"registration":1384448}

Expression: (level as string, registration as long)


Source XML data:
<Customers><Customer>122</Customer><CompanyName>Great Lakes Food Market</CompanyName></Customers>

Expression: (Customers as (Customer as integer, CompanyName as string))

Output column type


Here is where you will configure the target output schema from the parsing that will be written into a single
column.
In this example, we have defined parsing of the incoming field "jsonString" which is plain text, but formatted as a
JSON structure. We're going to store the parsed results as JSON in a new column called "json" with this schema:
(trade as boolean, customers as string[])

Refer to the inspect tab and data preview to verify your output is mapped properly.

Examples
source(output(
name as string,
location as string,
satellites as string[],
goods as (trade as boolean, customers as string[], orders as (orderId as string, orderTotal as double,
shipped as (orderItems as (itemName as string, itemQty as string)[]))[])
),
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false,
documentForm: 'documentPerLine') ~> JsonSource
source(output(
movieId as string,
title as string,
genres as string
),
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false) ~> CsvSource
JsonSource derive(jsonString = toString(goods)) ~> StringifyJson
StringifyJson parse(json = jsonString ? (trade as boolean,
customers as string[]),
format: 'json',
documentForm: 'arrayOfDocuments') ~> ParseJson
CsvSource derive(csvString = 'Id|name|year\n\'1\'|\'test1\'|\'1999\'') ~> CsvString
CsvString parse(csv = csvString ? (id as integer,
name as string,
year as string),
format: 'delimited',
columnNamesAsHeader: true,
columnDelimiter: '|',
nullValue: '',
documentForm: 'documentPerLine') ~> ParseCsv
ParseJson select(mapColumn(
jsonString,
json
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> KeepStringAndParsedJson
ParseCsv select(mapColumn(
csvString,
csv
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> KeepStringAndParsedCsv

Data flow script


Syntax
Examples
parse(json = jsonString ? (trade as boolean,
customers as string[]),
format: 'json|XML|delimited',
documentForm: 'singleDocument') ~> ParseJson

parse(csv = csvString ? (id as integer,


name as string,
year as string),
format: 'delimited',
columnNamesAsHeader: true,
columnDelimiter: '|',
nullValue: '',
documentForm: 'documentPerLine') ~> ParseCsv

Next steps
Use the Flatten transformation to pivot rows to columns.
Use the Derived column transformation to pivot columns to rows.
Pivot transformation in mapping data flow
11/2/2020 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the pivot transformation to create multiple columns from the unique row values of a single column. Pivot is
an aggregation transformation where you select group by columns and generate pivot columns using
aggregate functions.

Configuration
The pivot transformation requires three different inputs: group by columns, the pivot key, and how to generate
the pivoted columns
Group by

Select which columns to aggregate the pivoted columns over. The output data will group all rows with the same
group by values into one row. The aggregation done in the pivoted column will occur over each group.
This section is optional. If no group by columns are selected, the entire data stream will be aggregated and only
one row will be outputted.
Pivot key
The pivot key is the column whose row values get pivoted into new columns. By default, the pivot
transformation will create a new column for each unique row value.
In the section labeled Value , you can enter specific row values to be pivoted. Only the row values entered in this
section will be pivoted. Enabling Null value will create a pivoted column for the null values in the column.
Pivoted columns

For each unique pivot key value that becomes a column, generate an aggregated row value for each group. You
can create multiple columns per pivot key. Each pivot column must contain at least one aggregate function.
Column name pattern: Select how to format the column name of each pivot column. The outputted column
name will be a combination of the pivot key value, the column prefix, and optional prefix, suffix, and middle characters.
Column arrangement: If you generate more than one pivot column per pivot key, choose how you want the
columns to be ordered.
Column prefix: If you generate more than one pivot column per pivot key, enter a column prefix for each
column. This setting is optional if you only have one pivoted column.

Help graphic
The below help graphic shows how the different pivot components interact with one another

Pivot metadata
If no values are specified in the pivot key configuration, the pivoted columns will be dynamically generated at
run time. The number of pivoted columns will equal the number of unique pivot key values multiplied by the
number of pivot columns. As this can be a changing number, the UX will not display the column metadata in the
Inspect tab and there will be no column propagation. To transform these columns, use the column pattern
capabilities of mapping data flow.
If specific pivot key values are set, the pivoted columns will appear in the metadata. The column names will be
available to you in the Inspect and Sink mapping.
Generate metadata from drifted columns
Pivot generates new column names dynamically based on row values. You can add these new columns into the
metadata that can be referenced later in your data flow. To do this, use the map drifted quick action in data
preview.

Sinking pivoted columns


Although pivoted columns are dynamic, they can still be written into your destination data store. Enable Allow
schema drift in your sink settings. This will allow you to write columns that are not included in metadata. You
will not see the new dynamic names in your column metadata, but the schema drift option will allow you to land
the data.
Rejoin original fields
The pivot transformation will only project the group by and pivoted columns. If you want your output data to
include other input columns, use a self join pattern.

Data flow script


Syntax

<incomingStreamName>
    pivot(groupBy(<groupByColumnName>),
        pivotBy(<pivotKeyColumn>, [<specifiedColumnName1>,...,<specifiedColumnNameN>]),
        <pivotColumnPrefix> = <pivotedColumnValue>,
        columnNaming: '< prefix >< $N | $V >< middle >< $N | $V >< suffix >',
        lateral: { 'true' | 'false'}
    ) ~> <pivotTransformationName>

Example
The screens shown in the configuration section have the following data flow script:

BasketballPlayerStats pivot(groupBy(Tm),
pivotBy(Pos),
{} = count(),
columnNaming: '$V$N count',
lateral: true) ~> PivotExample

Next steps
Try the unpivot transformation to turn column values into row values.
Rank transformation in mapping data flow
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the rank transformation to generate an ordered ranking based upon sort conditions specified by the user.

Configuration

Case insensitive: If enabled and a sort column is of type string, case is ignored in the ranking.
Dense: If enabled, the rank column will be dense ranked. Each rank count will be a consecutive number and
rank values won't be skipped after a tie.
Rank column: The name of the rank column generated. This column will be of type long.
Sort conditions: Choose which columns you're sorting by and in which order the sort happens. The order
determines sorting priority.
The above configuration takes incoming basketball data and creates a rank column called 'pointsRanking'. The
row with the highest value of the column PTS will have a pointsRanking value of 1.

Data flow script


Syntax

<incomingStream>
    rank(
        desc(<sortColumn1>),
        asc(<sortColumn2>),
        ...,
        caseInsensitive: { true | false },
        dense: { true | false },
        output(<rankColumn> as long)
    ) ~> <rankTransformationName>

Example
The data flow script for the above rank configuration is in the following code snippet.

PruneColumns
rank(
desc(PTS, true),
caseInsensitive: false,
output(pointsRanking as long),
dense: false
) ~> RankByPoints

Next steps
Filter rows based upon the rank values using the filter transformation.
Select transformation in mapping data flow
11/2/2020 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the select transformation to rename, drop, or reorder columns. This transformation doesn't alter row data,
but chooses which columns are propagated downstream.
In a select transformation, users can specify fixed mappings, use patterns to do rule-based mapping, or enable
auto mapping. Fixed and rule-based mappings can both be used within the same select transformation. If a
column doesn't match one of the defined mappings, it will be dropped.

Fixed mapping
If there are fewer than 50 columns defined in your projection, all defined columns will have a fixed mapping by
default. A fixed mapping takes a defined, incoming column and maps it to an exact name.

NOTE
You can't map or rename a drifted column using a fixed mapping

Mapping hierarchical columns


Fixed mappings can be used to map a subcolumn of a hierarchical column to a top-level column. If you have a
defined hierarchy, use the column dropdown to select a subcolumn. The select transformation will create a new
column with the value and data type of the subcolumn.
Rule-based mapping
If you wish to map many columns at once or pass drifted columns downstream, use rule-based mapping to
define your mappings using column patterns. Match based on the name , type , stream , and position of
columns. You can have any combination of fixed and rule-based mappings. By default, all projections with
greater than 50 columns will default to a rule-based mapping that matches on every column and outputs the
inputted name.
To add a rule-based mapping, click Add mapping and select Rule-based mapping .

Each rule-based mapping requires two inputs: the condition on which to match by and what to name each
mapped column. Both values are inputted via the expression builder. In the left expression box, enter your
boolean match condition. In the right expression box, specify what the matched column will be mapped to.

Use $$ syntax to reference the input name of a matched column. Using the above image as an example, say a
user wants to match on all string columns whose names are shorter than six characters. If one incoming column
was named test , the expression $$ + '_short' will rename the column test_short . If that's the only mapping
that exists, all columns that don't meet the condition will be dropped from the outputted data.
Patterns match both drifted and defined columns. To see which defined columns are mapped by a rule, click the
eyeglasses icon next to the rule. Verify your output using data preview.
Regex mapping
If you click the downward chevron icon, you can specify a regex-mapping condition. A regex-mapping condition
matches all column names that match the specified regex condition. This can be used in combination with
standard rule-based mappings.
The above example matches on regex pattern (r) or any column name that contains a lower case r. Similar to
standard rule-based mapping, all matched columns are altered by the condition on the right using $$ syntax.
If you have multiple regex matches in your column name, you can refer to specific matches using $n where 'n'
refers to which match. For example, '$2' refers to the second match within a column name.
Rule -based hierarchies
If your defined projection has a hierarchy, you can use rule-based mapping to map the hierarchy's subcolumns.
Specify a matching condition and the complex column whose subcolumns you wish to map. Every matched
subcolumn will be outputted using the 'Name as' rule specified on the right.

The above example matches on all subcolumns of complex column a . a contains two subcolumns b and c .
The output schema will include two columns b and c as the 'Name as' condition is $$ .
Parameterization
You can parameterize column names using rule-based mapping. Use the keyword name to match incoming
column names against a parameter. For example, if you have a data flow parameter mycolumn , you can create a
rule that matches any column name that is equal to mycolumn . You can rename the matched column to a hard-
coded string such as 'business key' and reference it explicitly. In this example, the matching condition is
name == $mycolumn and the name condition is 'business key'.
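As a sketch only (the MySource stream and the transformation name are illustrative, and $mycolumn is the data flow parameter described above), the resulting rule-based mapping could be expressed in data flow script as:

MySource select(mapColumn(
        each(match(name == $mycolumn), 'business key' = $$)
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> RenameToBusinessKey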

Auto mapping
When adding a select transformation, Auto mapping can be enabled by switching the Auto mapping slider.
With auto mapping, the select transformation maps all incoming columns, excluding duplicates, with the same
name as their input. This will include drifted columns, which means the output data may contain columns not
defined in your schema. For more information on drifted columns, see schema drift.
With auto mapping on, the select transformation will honor the skip duplicate settings and provide a new alias
for the existing columns. Aliasing is useful when doing multiple joins or lookups on the same stream and in self-
join scenarios.

Duplicate columns
By default, the select transformation drops duplicate columns in both the input and output projection. Duplicate
input columns often come from join and lookup transformations where column names are duplicated on each
side of the join. Duplicate output columns can occur if you map two different input columns to the same name.
Choose whether to drop or pass on duplicate columns by toggling the checkbox.

Ordering of columns
The order of mappings determines the order of the output columns. If an input column is mapped multiple
times, only the first mapping will be honored. For any duplicate column dropping, the first match will be kept.

Data flow script


Syntax

<incomingStream>
    select(mapColumn(
        each(<hierarchicalColumn>, match(<matchCondition>), <nameCondition> = $$), ## hierarchical rule-based matching
        <fixedColumn>, ## fixed mapping, no rename
        <renamedFixedColumn> = <fixedColumn>, ## fixed mapping, rename
        each(match(<matchCondition>), <nameCondition> = $$), ## rule-based mapping
        each(patternMatch(<regexMatching>), <nameCondition> = $$) ## regex mapping
    ),
    skipDuplicateMapInputs: { true | false },
    skipDuplicateMapOutputs: { true | false }) ~> <selectTransformationName>

Example
Below is an example of a select mapping and its data flow script:
DerivedColumn1 select(mapColumn(
each(a, match(true())),
movie,
title1 = title,
each(match(name == 'Rating')),
each(patternMatch(`(y)`),
$1 + 'regex' = $$)
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> Select1

Next steps
After using Select to rename, reorder, and alias columns, use the Sink transformation to land your data into a
data store.
Sink transformation in mapping data flow
7/21/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


After you finish transforming your data, write it into a destination store by using the sink transformation. Every
data flow requires at least one sink transformation, but you can write to as many sinks as necessary to complete
your transformation flow. To write to additional sinks, create new streams via new branches and conditional
splits.
Each sink transformation is associated with exactly one Azure Data Factory dataset object or linked service. The
sink transformation determines the shape and location of the data you want to write to.

Inline datasets
When you create a sink transformation, choose whether your sink information is defined inside a dataset object
or within the sink transformation. Most formats are available in only one or the other. To learn how to use a
specific connector, see the appropriate connector document.
When a format is supported for both inline and in a dataset object, there are benefits to both. Dataset objects
are reusable entities that can be used in other data flows and activities such as Copy. These reusable entities are
especially useful when you use a hardened schema. Datasets aren't based in Spark. Occasionally, you might
need to override certain settings or schema projection in the sink transformation.
Inline datasets are recommended when you use flexible schemas, one-off sink instances, or parameterized sinks.
If your sink is heavily parameterized, inline datasets allow you to not create a "dummy" object. Inline datasets
are based in Spark, and their properties are native to data flow.
To use an inline dataset, select the format you want in the Sink type selector. Instead of selecting a sink dataset,
you select the linked service you want to connect to.

Supported sink types


Mapping data flow follows an extract, load, and transform (ELT) approach and works with staging datasets that
are all in Azure. Currently, the following datasets can be used in a source transformation.
CONNECTOR                        FORMAT               DATASET/INLINE

Azure Blob Storage               Avro                 ✓/-
                                 Delimited text       ✓/-
                                 Delta                -/✓
                                 JSON                 ✓/-
                                 ORC                  ✓/✓
                                 Parquet              ✓/-

Azure Cosmos DB (SQL API)                             ✓/-

Azure Data Lake Storage Gen1     Avro                 ✓/-
                                 Delimited text       ✓/-
                                 JSON                 ✓/-
                                 ORC                  ✓/✓
                                 Parquet              ✓/-

Azure Data Lake Storage Gen2     Avro                 ✓/-
                                 Common Data Model    -/✓
                                 Delimited text       ✓/-
                                 Delta                -/✓
                                 JSON                 ✓/-
                                 ORC                  ✓/✓
                                 Parquet              ✓/-

Azure Database for MySQL                              ✓/✓

Azure Database for PostgreSQL                         ✓/✓

Azure SQL Database                                    ✓/✓

Azure SQL Managed Instance                            ✓/-

Azure Synapse Analytics                               ✓/-

Snowflake                                             ✓/✓

SQL Server                                            ✓/✓

Settings specific to these connectors are located on the Settings tab. Information and data flow script examples
on these settings are located in the connector documentation.
Azure Data Factory has access to more than 90 native connectors. To write data to those other sources from
your data flow, use the Copy Activity to load that data from a supported sink.

Sink settings
After you've added a sink, configure via the Sink tab. Here you can pick or create the dataset your sink writes to.
Development values for dataset parameters can be configured in Debug settings. (Debug mode must be turned
on.)
The following video explains a number of different sink options for text-delimited file types.
Schema drift : Schema drift is the ability of Data Factory to natively handle flexible schemas in your data flows
without needing to explicitly define column changes. Enable Allow schema drift to write additional columns
on top of what's defined in the sink data schema.
Validate schema : If validate schema is selected, the data flow will fail if any column of the incoming source
schema isn't found in the source projection, or if the data types don't match. Use this setting to enforce that the
source data meets the contract of your defined projection. It's useful in database source scenarios to signal that
column names or types have changed.

Cache sink
A cache sink is when a data flow writes data into the Spark cache instead of a data store. In mapping data flows,
you can reference this data within the same flow many times using a cache lookup. This is useful when you want
to reference data as part of an expression but don't want to explicitly join the columns to it. Common examples
where a cache sink can help are looking up a max value on a data store and matching error codes to an error
message database.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types,
you don't need to select a dataset or linked service because you aren't writing to an external store.
In the sink settings, you can optionally specify the key columns of the cache sink. These are used as matching
conditions when using the lookup() function in a cache lookup. If you specify key columns, you can't use the
outputs() function in a cache lookup. To learn more about the cache lookup syntax, see cached lookups.

For example, if I specify a single key column of column1 in a cache sink called cacheExample , calling
cacheExample#lookup() takes one parameter that specifies which row in the cache sink to match on. The
function outputs a single complex column with subcolumns for each column mapped.
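For instance, a downstream derived column could reference that cache sink. In this hedged sketch, the MainStream stream, the errorCode key value, and the message subcolumn are illustrative names; only cacheExample and the #lookup() accessor come from the description above:

MainStream derive(errorMessage = cacheExample#lookup(errorCode).message) ~> AddErrorMessage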

NOTE
A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup.
A cache sink must also be the first sink written.

Write to activity output
The cached sink can optionally write your output data to the input of the next pipeline activity. This will allow you
to quickly and easily pass data out of your data flow activity without needing to persist the data in a data store.

Field mapping
Similar to a select transformation, on the Mapping tab of the sink, you can decide which incoming columns will
get written. By default, all input columns, including drifted columns, are mapped. This behavior is known as
automapping.
When you turn off automapping, you can add either fixed column-based mappings or rule-based mappings.
With rule-based mappings, you can write expressions with pattern matching. Fixed mapping maps logical and
physical column names. For more information on rule-based mapping, see Column patterns in mapping data
flow.

Custom sink ordering


By default, data is written to multiple sinks in a nondeterministic order. The execution engine writes data in
parallel as the transformation logic is completed, and the sink ordering might vary each run. To specify an exact
sink ordering, enable Custom sink ordering on the General tab of the data flow. When enabled, sinks are
written sequentially in increasing order.

NOTE
When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in
ordering.
Sink groups
You can group sinks together by applying the same order number for a series of sinks. ADF will treat those sinks
as groups that can execute in parallel. Options for parallel execution will surface in the pipeline data flow activity.

Error row handling


When writing to databases, certain rows of data may fail due to constraints set by the destination. By default, a
data flow run will fail on the first error it gets. In certain connectors, you can choose Continue on error , which
allows your data flow to complete even if individual rows have errors. Currently, this capability is only available
in Azure SQL Database and Synapse. For more information, see error row handling in Azure SQL DB.
Below is a video tutorial on how to use database error row handling automatically in your sink transformation.

Data preview in sink


When fetching a data preview in debug mode, no data will be written to your sink. A snapshot of what the data
looks like will be returned, but nothing will be written to your destination. To test writing data into your sink, run
a pipeline debug from the pipeline canvas.

Data flow script


Example
Below is an example of a sink transformation and its data flow script:
sink(input(
movie as integer,
title as string,
genres as string,
year as integer,
Rating as integer
),
allowSchemaDrift: true,
validateSchema: false,
deletable:false,
insertable:false,
updateable:true,
upsertable:false,
keys:['movie'],
format: 'table',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true,
saveOrder: 1,
errorHandlingOption: 'stopOnFirstError') ~> sink1

Next steps
Now that you've created your data flow, add a data flow activity to your pipeline.
Sort transformation in mapping data flow
4/17/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The sort transformation allows you to sort the incoming rows on the current data stream. You can choose
individual columns and sort them in ascending or descending order.

NOTE
Mapping data flows are executed on spark clusters which distribute data across multiple nodes and partitions. If you
choose to repartition your data in a subsequent transformation, you may lose your sorting due to reshuffling of data. The
best way to maintain sort order in your data flow is to set single partition in the Optimize tab on the transformation and
keep the Sort transformation as close to the Sink as possible.

Configuration

Case insensitive: Whether or not you wish to ignore case when sorting string or text fields
Sort Only Within Partitions: As data flows are run on Spark, each data stream is divided into partitions. This
setting sorts data only within the incoming partitions rather than sorting the entire data stream.
Sort conditions: Choose which columns you are sorting by and in which order the sort happens. The order
determines sorting priority. Choose whether or not nulls will appear at the beginning or end of the data stream.
Computed columns
To modify or extract a column value before applying the sort, hover over the column and select "computed
column". This will open the expression builder to create an expression for the sort operation instead of using a
column value.

Data flow script


Syntax

<incomingStream>
sort(
desc(<sortColumn1>, { true | false }),
asc(<sortColumn2>, { true | false }),
...
) ~> <sortTransformationName>
Example

The data flow script for the above sort configuration is in the code snippet below.

BasketballStats sort(desc(PTS, true),
    asc(Age, true)) ~> Sort1

Next steps
After sorting, you may want to use the Aggregate Transformation
Source transformation in mapping data flow
7/15/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


A source transformation configures your data source for the data flow. When you design data flows, your first
step is always configuring a source transformation. To add a source, select the Add Source box in the data flow
canvas.
Every data flow requires at least one source transformation, but you can add as many sources as necessary to
complete your data transformations. You can join those sources together with a join, lookup, or a union
transformation.
Each source transformation is associated with exactly one dataset or linked service. The dataset defines the
shape and location of the data you want to write to or read from. If you use a file-based dataset, you can use
wildcards and file lists in your source to work with more than one file at a time.

Inline datasets
The first decision you make when you create a source transformation is whether your source information is
defined inside a dataset object or within the source transformation. Most formats are available in only one or
the other. To learn how to use a specific connector, see the appropriate connector document.
When a format is supported for both inline and in a dataset object, there are benefits to both. Dataset objects
are reusable entities that can be used in other data flows and activities such as Copy. These reusable entities are
especially useful when you use a hardened schema. Datasets aren't based in Spark. Occasionally, you might
need to override certain settings or schema projection in the source transformation.
Inline datasets are recommended when you use flexible schemas, one-off source instances, or parameterized
sources. If your source is heavily parameterized, inline datasets allow you to not create a "dummy" object. Inline
datasets are based in Spark, and their properties are native to data flow.
To use an inline dataset, select the format you want in the Source type selector. Instead of selecting a source
dataset, you select the linked service you want to connect to.

Supported source types


Mapping data flow follows an extract, load, and transform (ELT) approach and works with staging datasets that
are all in Azure. Currently, the following datasets can be used in a source transformation.
CONNECTOR                        FORMAT               DATASET/INLINE

Azure Blob Storage               Avro                 ✓/✓
                                 Delimited text       ✓/✓
                                 Delta                ✓/✓
                                 Excel                ✓/✓
                                 JSON                 ✓/✓
                                 ORC                  ✓/✓
                                 Parquet              ✓/✓
                                 XML                  ✓/✓

Azure Cosmos DB (SQL API)                             ✓/-

Azure Data Lake Storage Gen1     Avro                 ✓/✓
                                 Delimited text       ✓/✓
                                 Excel                ✓/✓
                                 JSON                 ✓/✓
                                 ORC                  ✓/✓
                                 Parquet              ✓/✓
                                 XML                  ✓/✓

Azure Data Lake Storage Gen2     Avro                 ✓/✓
                                 Common Data Model    -/✓
                                 Delimited text       ✓/✓
                                 Delta                ✓/✓
                                 Excel                ✓/✓
                                 JSON                 ✓/✓
                                 ORC                  ✓/✓
                                 Parquet              ✓/✓
                                 XML                  ✓/✓

Azure Database for MySQL                              ✓/✓

Azure Database for PostgreSQL                         ✓/✓

Azure SQL Database                                    ✓/✓

Azure SQL Managed Instance                            ✓/✓

Azure Synapse Analytics                               ✓/✓

Hive                                                  -/✓

Snowflake                                             ✓/✓

SQL Server                                            ✓/✓

Settings specific to these connectors are located on the Source options tab. Information and data flow script
examples on these settings are located in the connector documentation.
Azure Data Factory has access to more than 90 native connectors. To include data from those other sources in
your data flow, use the Copy Activity to load that data into one of the supported staging areas.

Source settings
After you've added a source, configure via the Source settings tab. Here you can pick or create the dataset
your source points at. You can also select schema and sampling options for your data.
Development values for dataset parameters can be configured in debug settings. (Debug mode must be turned
on.)

Output stream name : The name of the source transformation.


Source type : Choose whether you want to use an inline dataset or an existing dataset object.
Test connection : Test whether or not the data flow's Spark service can successfully connect to the linked
service used in your source dataset. Debug mode must be on for this feature to be enabled.
Schema drift : Schema drift is the ability of Data Factory to natively handle flexible schemas in your data flows
without needing to explicitly define column changes.
Select the Allow schema drift check box if the source columns will change often. This setting allows all
incoming source fields to flow through the transformations to the sink.
Selecting Infer drifted column types instructs Data Factory to detect and define data types for each
new column discovered. With this feature turned off, all drifted columns will be of type string.
Validate schema: If Validate schema is selected, the data flow will fail to run if the incoming source data
doesn't match the defined schema of the dataset.
Skip line count : The Skip line count field specifies how many lines to ignore at the beginning of the dataset.
Sampling : Enable Sampling to limit the number of rows from your source. Use this setting when you test or
sample data from your source for debugging purposes. This is very useful when executing data flows in debug
mode from a pipeline.
To validate your source is configured correctly, turn on debug mode and fetch a data preview. For more
information, see Debug mode.
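For reference, the settings above surface in the source's data flow script roughly as in the following sketch, which mirrors the source definitions shown elsewhere in this documentation; the column names and stream name are illustrative:

source(output(
        movieId as string,
        title as string,
        genres as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: false) ~> MoviesSource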

NOTE
When debug mode is turned on, the row limit configuration in debug settings will overwrite the sampling setting in the
source during data preview.

Source options
The Source options tab contains settings specific to the connector and format chosen. For more information
and examples, see the relevant connector documentation.

Projection
Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the
source data. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect the
schema defined in a dataset. When your source files aren't strongly typed (for example, flat .csv files rather than
Parquet files), you can define the data types for each field in the source transformation.

If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the
data types. Select Define default format to autodetect the default data formats.
Reset schema resets the projection to what is defined in the referenced dataset.
You can modify the column data types in a downstream derived-column transformation. Use a select
transformation to modify the column names.
Import schema
Select the Import schema button on the Projection tab to use an active debug cluster to create a schema
projection. It's available in every source type. Importing the schema here will override the projection defined in
the dataset. The dataset object won't be changed.
Importing schema is useful in datasets like Avro and Azure Cosmos DB that support complex data structures
that don't require schema definitions to exist in the dataset. For inline datasets, importing schema is the only
way to reference column metadata without schema drift.

Optimize the source transformation


The Optimize tab allows for editing of partition information at each transformation step. In most cases, Use
current partitioning will optimize for the ideal partitioning structure for a source.
If you're reading from an Azure SQL Database source, custom Source partitioning will likely read data the
fastest. Data Factory will read large queries by making connections to your database in parallel. This source
partitioning can be done on a column or by using a query.
For more information on optimization within mapping data flow, see the Optimize tab.

Next steps
Begin building your data flow with a derived-column transformation and a select transformation.
Surrogate key transformation in mapping data flow
11/2/2020 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use the surrogate key transformation to add an incrementing key value to each row of data. This is useful when
designing dimension tables in a star schema analytical data model. In a star schema, each member in your
dimension tables requires a unique key that is a non-business key.

Configuration

Key column: The name of the generated surrogate key column.


Start value: The lowest key value that will be generated.

Increment keys from existing sources


To start your sequence from a value that exists in a source, we recommend to use a cache sink to save that value
and use a derived column transformation to add the two values together. Use a cached lookup to get the output
and append it to the generated key. For more information, learn about cache sinks and cached lookups.

Increment from existing maximum value


To seed the key value with the previous max, there are two techniques that you can use based on where your
source data is.
Database sources
Use a SQL query option to select MAX() from your source. For example,
Select MAX(<surrogateKeyName>) as maxval from <sourceTable> .

File sources
If your previous max value is in a file, use the max() function in the aggregate transformation to get the
previous max value:

In both cases, you will need to write to a cache sink and lookup the value.
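A hedged sketch of that final step follows. The IncomingData stream, the CacheMaxKey cache sink name, and its maxval column are illustrative assumptions; the outputs() accessor follows the cached lookup pattern described in the sink documentation:

IncomingData
    keyGenerate(
        output(surrogateKey as long),
        startAt: 1L
    ) ~> GenerateKeys
GenerateKeys derive(surrogateKey = surrogateKey + CacheMaxKey#outputs()[1].maxval) ~> OffsetKeys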

Data flow script


Syntax
<incomingStream>
keyGenerate(
output(<surrogateColumnName> as long),
startAt: <number>L
) ~> <surrogateKeyTransformationName>

Example

The data flow script for the above surrogate key configuration is in the code snippet below.

AggregateDayStats
keyGenerate(
output(key as long),
startAt: 1L
) ~> SurrogateKey1

Next steps
These examples use the Join and Derived Column transformations.
Union transformation in mapping data flow
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Union will combine multiple data streams into one, with the SQL Union of those streams as the new output
from the Union transformation. All of the schema from each input stream will be combined inside of your data
flow, without needing to have a join key.
You can combine n-number of streams in the settings table by selecting the "+" icon next to each configured
row, including both source data as well as streams from existing transformations in your data flow.
Here is a short video walk-through of the union transformation in ADF's mapping data flow:

In this case, you can combine disparate metadata from multiple sources (in this example, three different source
files) and combine them into a single stream:

To achieve this, add additional rows in the Union Settings by including all source you wish to add. There is no
need for a common lookup or join key:
If you set a Select transformation after your Union, you will be able to rename overlapping fields or fields that
were not named from headerless sources. Click on "Inspect" to see the combined metadata with 132 total
columns in this example from three different sources:

Name and position


When you choose "union by name", each column value will drop into the corresponding column from each
source, with a new concatenated metadata schema.
If you choose "union by position", each column value will drop into the original position from each
corresponding source, resulting in a new combined stream of data where the data from each source is added to
the same stream:
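In data flow script, a union reads roughly like the following sketch. The three stream names are illustrative, and the byName setting is assumed to toggle between union by name (true) and union by position (false):

StoreSales, OnlineSales, PartnerSales union(byName: true)~> CombinedSales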
Next steps
Explore similar transformations including Join and Exists.
Unpivot transformation in mapping data flow
11/2/2020 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Use Unpivot in ADF mapping data flow as a way to turn an unnormalized dataset into a more normalized
version by expanding values from multiple columns in a single record into multiple records with the same
values in a single column.

Ungroup By

First, set the columns that you wish to ungroup by for your unpivot aggregation. Set one or more columns for
ungrouping with the + sign next to the column list.

Unpivot Key
The Unpivot Key is the column that ADF will pivot from column to row. By default, each unique value in the
dataset for this field will pivot to a row. However, you can optionally enter the values from the dataset that you
wish to pivot to row values.

Unpivoted Columns

Lastly, choose the column name for storing the values for unpivoted columns that are transformed into rows.
(Optional) You can drop rows with Null values.
For instance, SumCost is the column name that is chosen in the example shared above.
Setting the Column Arrangement to "Normal" will group together all of the new unpivoted columns from a
single value. Setting the columns arrangement to "Lateral" will group together new unpivoted columns
generated from an existing column.

The final unpivoted data result set shows the column totals now unpivoted into separate row values.

Next steps
Use the Pivot transformation to pivot rows to columns.
Window transformation in mapping data flow
3/5/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Window transformation is where you will define window-based aggregations of columns in your data
streams. In the Expression Builder, you can define different types of aggregations that are based on data or time
windows (SQL OVER clause), such as LEAD, LAG, NTILE, CUMEDIST, RANK, and so on. A new field will be generated in
your output that includes these aggregations. You can also include optional group-by fields.

Over
Set the partitioning of column data for your window transformation. The SQL equivalent is the Partition By in
the Over clause in SQL. If you wish to create a calculation or create an expression to use for the partitioning, you
can do that by hovering over the column name and select "computed column".

Sort
Another part of the Over clause is setting the Order By . This will set the data sort ordering. You can also create
an expression for a calculated value in this column field for sorting.

Range By
Next, set the window frame as Unbounded or Bounded. To set an unbounded window frame, set the slider to
Unbounded on both ends. If you choose a setting between Unbounded and Current Row, then you must set the
Offset start and end values. Both values will be positive integers. You can use either relative numbers or values
from your data.
The window slider has two values to set: the values before the current row and the values after the current row.
The Start and End offset matches the two selectors on the slider.
Window columns
Lastly, use the Expression Builder to define the aggregations you wish to use with the data windows such as
RANK, COUNT, MIN, MAX, DENSE RANK, LEAD, LAG, etc.

The full list of aggregation and analytical functions available for you to use in the ADF Data Flow Expression
Language via the Expression Builder are listed here: https://aka.ms/dataflowexpressions.
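As a hedged sketch only, a window transformation can read roughly as follows in data flow script. The BasketballStats stream and its Tm and PTS columns are illustrative, and the exact keywords generated by the expression builder may differ:

BasketballStats window(over(Tm),
    desc(PTS, true),
    pointsRank = rank(),
    teamPoints = sum(PTS)) ~> WindowExample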

Next steps
If you are looking for a simple group-by aggregation, use the Aggregate transformation
Parameterize linked services in Azure Data Factory
6/8/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can now parameterize a linked service and pass dynamic values at run time. For example, if you want to
connect to different databases on the same logical SQL server, you can now parameterize the database name in
the linked service definition. This prevents you from having to create a linked service for each database on the
logical SQL server. You can parameterize other properties in the linked service definition as well - for example,
User name.
You can use the Data Factory UI in the Azure portal or a programming interface to parameterize linked services.

TIP
We recommend not to parameterize passwords or secrets. Store all secrets in Azure Key Vault instead, and parameterize
the Secret Name.

NOTE
There is an open bug when using "-" in parameter names. We recommend using names without "-" until the bug is resolved.

For a seven-minute introduction and demonstration of this feature, watch the following video:

Supported linked service types


All the linked service types are supported for parameterization.
Natively supported on ADF UI: When authoring a linked service in the UI, Data Factory provides a built-in
parameterization experience for the following types of linked services. In the linked service creation/edit blade, you
can find options to create new parameters and add dynamic content. Refer to Data Factory UI experience.
Amazon Redshift
Amazon S3
Amazon S3 Compatible Storage
Azure Blob Storage
Azure Cosmos DB (SQL API)
Azure Data Lake Storage Gen2
Azure Database for MySQL
Azure Databricks
Azure Key Vault
Azure SQL Database
Azure SQL Managed Instance
Azure Synapse Analytics
Azure Table Storage
Generic HTTP
Generic REST
MySQL
Oracle
Oracle Cloud Storage
SQL Server
Advanced authoring: For other linked service types that are not in above list, you can parameterize the linked
service by editing the JSON on UI:
In linked service creation/edit blade -> expand "Advanced" at the bottom -> check "Specify dynamic contents
in JSON format" checkbox -> specify the linked service JSON payload.
Or, after you create a linked service without parameterization, in Management hub -> Linked services -> find
the specific linked service -> click "Code" (button "{}") to edit the JSON.
Refer to the JSON sample to add parameters section to define parameters and reference the parameter using
@{linkedService().paraName} .

Data Factory UI
JSON
{
    "name": "AzureSqlDatabase",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};User ID=user;Password=fake;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        },
        "connectVia": null,
        "parameters": {
            "DBName": {
                "type": "String"
            }
        }
    }
}
Global parameters in Azure Data Factory
5/28/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Global parameters are constants across a data factory that can be consumed by a pipeline in any expression.
They're useful when you have multiple pipelines with identical parameter names and values. When promoting a
data factory using the continuous integration and deployment process (CI/CD), you can override these
parameters in each environment.

Creating global parameters


To create a global parameter, go to the Global parameters tab in the Manage section. Select New to open the
creation side-nav.

In the side-nav, enter a name, select a data type, and specify the value of your parameter.

After a global parameter is created, you can edit it by clicking the parameter's name. To alter multiple
parameters at once, select Edit all .
Using global parameters in a pipeline
Global parameters can be used in any pipeline expression. If a pipeline is referencing another resource such as a
dataset or data flow, you can pass down the global parameter value via that resource's parameters. Global
parameters are referenced as pipeline().globalParameters.<parameterName> .

Global parameters in CI/CD


There are two ways to integrate global parameters in your continuous integration and deployment solution:
Include global parameters in the ARM template
Deploy global parameters via a PowerShell script
For general use cases, it is recommended to include global parameters in the ARM template. This integrates
natively with the solution outlined in the CI/CD doc. In the case of automatic publishing or a Purview connection,
the PowerShell script method is required. You can find more about the PowerShell script method later. Global
parameters will be added as an ARM template parameter by default as they often change from environment to
environment. You can enable the inclusion of global parameters in the ARM template from the Manage hub.

NOTE
The Include in ARM template configuration is only available in "Git mode". Currently it's disabled in "live mode" or
"Data Factory" mode. If you use automatic publishing or a Purview connection, don't use the Include in ARM template
method; use the PowerShell script method instead.

WARNING
You cannot use '-' in the parameter name. If you do, you'll receive an error code like
"{"code":"BadRequest","message":"ErrorCode=InvalidTemplate,ErrorMessage=The expression
'pipeline().globalParameters.myparam-dbtest-url' is not valid: .....}". You can, however, use '_' in the parameter name.

Adding global parameters to the ARM template adds a factory-level setting that will override other factory-level
settings such as a customer-managed key or git configuration in other environments. If you have these settings
enabled in an elevated environment such as UAT or PROD, it's better to deploy global parameters via a
PowerShell script in the steps highlighted below.
Deploying using PowerShell
The following steps outline how to deploy global parameters via PowerShell. This is useful when your target
factory has a factory-level setting such as customer-managed key.
When you publish a factory or export an ARM template with global parameters, a folder called
globalParameters is created with a file called your-factory-name_GlobalParameters.json. This file is a JSON
object that contains each global parameter type and value in the published factory.
If you're deploying to a new environment such as TEST or PROD, it's recommended to create a copy of this
global parameters file and overwrite the appropriate environment-specific values. When you republish, the
original global parameters file gets overwritten, but the copy for the other environment is untouched.
For example, if you have a factory named 'ADF-DEV' and a global parameter of type string named 'environment'
with a value of 'dev', a file named ADF-DEV_GlobalParameters.json is generated when you publish. If you're
deploying to a test factory named 'ADF-TEST', create a copy of the JSON file (for example, named ADF-TEST_GlobalParameters.json)
and replace the parameter values with the environment-specific values. The parameter 'environment' might now have a value of 'test'.
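
As an illustrative sketch only (the real file is generated for you when you publish, and the exact property names and casing in your generated file may differ), the environment-specific copy for this example might look like:

{
    "environment": {
        "type": "string",
        "value": "test"
    }
}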

Use the below PowerShell script to promote global parameters to additional environments. Add an Azure
PowerShell DevOps task before your ARM Template deployment. In the DevOps task, you must specify the
location of the new parameters file, the target resource group, and the target data factory.

NOTE
To deploy global parameters using PowerShell, you must use at least version 4.4.0 of the Az module.

param
(
    [parameter(Mandatory = $true)] [String] $globalParametersFilePath,
    [parameter(Mandatory = $true)] [String] $resourceGroupName,
    [parameter(Mandatory = $true)] [String] $dataFactoryName
)

Import-Module Az.DataFactory

# Dictionary that will hold the global parameters to apply to the target factory
$newGlobalParameters = New-Object 'system.collections.generic.dictionary[string,Microsoft.Azure.Management.DataFactory.Models.GlobalParameterSpecification]'

Write-Host "Getting global parameters JSON from: " $globalParametersFilePath
$globalParametersJson = Get-Content $globalParametersFilePath

Write-Host "Parsing JSON..."
$globalParametersObject = [Newtonsoft.Json.Linq.JObject]::Parse($globalParametersJson)

# Convert each entry in the file into a GlobalParameterSpecification and add it to the dictionary
foreach ($gp in $globalParametersObject.GetEnumerator()) {
    Write-Host "Adding global parameter:" $gp.Key
    $globalParameterValue = $gp.Value.ToObject([Microsoft.Azure.Management.DataFactory.Models.GlobalParameterSpecification])
    $newGlobalParameters.Add($gp.Key, $globalParameterValue)
}

# Replace the global parameters on the target factory and push the update
$dataFactory = Get-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Name $dataFactoryName
$dataFactory.GlobalParameters = $newGlobalParameters

Write-Host "Updating" $newGlobalParameters.Count "global parameters."

Set-AzDataFactoryV2 -InputObject $dataFactory -Force

Next steps
Learn about Azure Data Factory's continuous integration and deployment process
Learn how to use the control flow expression language
Expressions and functions in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article provides details about expressions and functions supported by Azure Data Factory.

Expressions
JSON values in the definition can be literal or expressions that are evaluated at runtime. For example:

"name": "value"

or

"name": "@pipeline().parameters.password"

Expressions can appear anywhere in a JSON string value and always result in another JSON value. If a JSON
value is an expression, the body of the expression is extracted by removing the at-sign (@). If a literal string is
needed that starts with @, it must be escaped by using @@. The following examples show how expressions are
evaluated.

JSON VALUE          RESULT
"parameters"        The characters 'parameters' are returned.
"parameters[1]"     The characters 'parameters[1]' are returned.
"@@"                A 1-character string that contains '@' is returned.
" @"                A 2-character string that contains ' @' is returned.

Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"

Using string interpolation, the result is always a string. Say I have defined myNumber as 42 and myString as foo:

JSON VALUE                                                         RESULT
"@pipeline().parameters.myString"                                  Returns foo as a string.
"@{pipeline().parameters.myString}"                                Returns foo as a string.
"@pipeline().parameters.myNumber"                                  Returns 42 as a number.
"@{pipeline().parameters.myNumber}"                                Returns 42 as a string.
"Answer is: @{pipeline().parameters.myNumber}"                     Returns the string Answer is: 42.
"@concat('Answer is: ', string(pipeline().parameters.myNumber))"   Returns the string Answer is: 42.
"Answer is: @@{pipeline().parameters.myNumber}"                    Returns the string Answer is: @{pipeline().parameters.myNumber}.

In control flow activities like the ForEach activity, you can provide an array to be iterated over in the items
property and use @item() to refer to a single item of that enumeration. For example, if items is the array
[1, 2, 3], @item() returns 1 in the first iteration, 2 in the second iteration, and 3 in the third iteration. You can also
use an expression like @range(0,10) to iterate ten times, starting with 0 and ending with 9.
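
Here's a minimal sketch of such a ForEach activity; the pipeline array parameter fileNames and the pipeline variable currentFile are hypothetical:

{
    "name": "ForEachFile",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@pipeline().parameters.fileNames",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "SetCurrentFile",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "currentFile",
                    "value": "@item()"
                }
            }
        ]
    }
}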
You can use @activity('activity name') to capture the output of an activity and make decisions. Consider a web activity
called Web1. To place the output of the first activity in the body of the second, the expression generally looks
like @activity('Web1').output or @activity('Web1').output.data, or something similar depending on what the
output of the first activity looks like.

Examples
Complex expression example
The example below references a deep sub-field of an activity's output. To reference a
pipeline parameter that evaluates to a sub-field, use the [] syntax instead of the dot (.) operator (as in the case of subfield1
and subfield2), as part of an activity output.
@activity('*activityName*').output.*subfield1*.*subfield2*[pipeline().parameters.*subfield3*].*subfield4*

Dynamic content editor


The dynamic content editor automatically escapes characters in your content when you finish editing. For example,
the following content in the content editor is a string interpolation with two expression functions.

{
"type": "@{if(equals(1, 2), 'Blob', 'Table' )}",
"name": "@{toUpper('myData')}"
}

The dynamic content editor converts the above content to the expression

"{ \n \"type\": \"@{if(equals(1, 2), 'Blob', 'Table' )}\",\n \"name\": \"@{toUpper('myData')}\"\n}"

The result of this expression is a JSON-format string, shown below.

{
"type": "Table",
"name": "MYDATA"
}

A dataset with a parameter


In the following example, the BlobDataset takes a parameter named path . Its value is used to set a value for the
folderPath property by using the expression: dataset().path .
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@dataset().path"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

A pipeline with a parameter


In the following example, the pipeline takes inputPath and outputPath parameters. The path for the
parameterized blob dataset is set by using values of these parameters. The syntax used here is:
pipeline().parameters.parametername .
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}

Replacing special characters


The dynamic content editor automatically escapes characters like double quotes and backslashes in your content when you
finish editing. This causes trouble if you want to replace a line feed or tab by using \n or \t in the replace() function. You
can either edit your dynamic content in code view to remove the extra \ in the expression, or you can follow the steps below
to replace special characters using the expression language:
1. URL-encode the original string value.
2. Replace the URL-encoded special characters, for example, line feed (%0A), carriage return (%0D), or horizontal tab (%09).
3. URL-decode the result.
For example, for a variable companyName whose value contains a newline character (such as "Contoso-" followed by
"Corporation" on the next line), the expression
@uriComponentToString(replace(uriComponent(variables('companyName')), '%0A', '')) removes the newline
character. A minimal sketch of this applied in an activity follows.
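
Here's how the expression might be applied in a Set Variable activity; the variables companyName and companyNameCleaned are hypothetical, so treat this as an illustrative sketch only:

{
    "name": "RemoveNewline",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "companyNameCleaned",
        "value": "@uriComponentToString(replace(uriComponent(variables('companyName')), '%0A', ''))"
    }
}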

Escaping single quote character


Expression functions use single quotes for string value parameters. Use two single quotes to escape a ' character
in string functions. For example, the expression @concat('Baba', '''s ', 'book store') returns the result below.

Baba's book store

Tutorial
This tutorial walks you through how to pass parameters between a pipeline and activity as well as between the
activities.

Functions
You can call functions within expressions. The following sections provide information about the functions that
can be used in an expression.

String functions
To work with strings, you can use these string functions and also some collection functions. String functions
work only on strings.

STRING FUNCTION   TASK
concat            Combine two or more strings, and return the combined string.
endsWith          Check whether a string ends with the specified substring.
guid              Generate a globally unique identifier (GUID) as a string.
indexOf           Return the starting position for a substring.
lastIndexOf       Return the starting position for the last occurrence of a substring.
replace           Replace a substring with the specified string, and return the updated string.
split             Return an array that contains substrings, separated by commas, from a larger string based on a specified delimiter character in the original string.
startsWith        Check whether a string starts with a specific substring.
substring         Return characters from a string, starting from the specified position.
toLower           Return a string in lowercase format.
toUpper           Return a string in uppercase format.
trim              Remove leading and trailing whitespace from a string, and return the updated string.
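
These string functions compose freely inside a single expression. For instance, here's a minimal sketch of a Set Variable activity (the pipeline parameter fileNamePrefix and the variable blobName are hypothetical) that builds a trimmed, lowercase file name:

{
    "name": "SetBlobName",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "blobName",
        "value": "@concat(toLower(trim(pipeline().parameters.fileNamePrefix)), '.csv')"
    }
}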

Collection functions
To work with collections, generally arrays, strings, and sometimes, dictionaries, you can use these collection
functions.

COLLECTION FUNCTION   TASK
contains              Check whether a collection has a specific item.
empty                 Check whether a collection is empty.
first                 Return the first item from a collection.
intersection          Return a collection that has only the common items across the specified collections.
join                  Return a string that has all the items from an array, separated by the specified character.
last                  Return the last item from a collection.
length                Return the number of items in a string or array.
skip                  Remove items from the front of a collection, and return all the other items.
take                  Return items from the front of a collection.
union                 Return a collection that has all the items from the specified collections.
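
For instance, here's a minimal sketch of a Set Variable activity (the pipeline array parameter inputFolders and the variable firstFolder are hypothetical) that picks the first item from an array:

{
    "name": "SetFirstFolder",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "firstFolder",
        "value": "@first(pipeline().parameters.inputFolders)"
    }
}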

Logical functions
These functions are useful inside conditions; they can be used to evaluate any type of logic.

LOGICAL COMPARISON FUNCTION   TASK
and               Check whether all expressions are true.
equals            Check whether both values are equivalent.
greater           Check whether the first value is greater than the second value.
greaterOrEquals   Check whether the first value is greater than or equal to the second value.
if                Check whether an expression is true or false. Based on the result, return a specified value.
less              Check whether the first value is less than the second value.
lessOrEquals      Check whether the first value is less than or equal to the second value.
not               Check whether an expression is false.
or                Check whether at least one expression is true.
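
For instance, here's a minimal sketch of an If Condition activity expression that combines several of these functions. The pipeline parameters rowCount (assumed to be an integer parameter) and environment are hypothetical, and a real pipeline would list activities inside the true and false branches:

{
    "name": "CheckLoad",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@and(greater(pipeline().parameters.rowCount, 0), equals(pipeline().parameters.environment, 'prod'))",
            "type": "Expression"
        },
        "ifTrueActivities": [],
        "ifFalseActivities": []
    }
}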

Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries

CONVERSION FUNCTION    TASK
array                  Return an array from a single specified input. For multiple inputs, see createArray.
base64                 Return the base64-encoded version for a string.
base64ToBinary         Return the binary version for a base64-encoded string.
base64ToString         Return the string version for a base64-encoded string.
binary                 Return the binary version for an input value.
bool                   Return the Boolean version for an input value.
coalesce               Return the first non-null value from one or more parameters.
createArray            Return an array from multiple inputs.
dataUri                Return the data URI for an input value.
dataUriToBinary        Return the binary version for a data URI.
dataUriToString        Return the string version for a data URI.
decodeBase64           Return the string version for a base64-encoded string.
decodeDataUri          Return the binary version for a data URI.
decodeUriComponent     Return a string that replaces escape characters with decoded versions.
encodeUriComponent     Return a string that replaces URL-unsafe characters with escape characters.
float                  Return a floating point number for an input value.
int                    Return the integer version for a string.
json                   Return the JavaScript Object Notation (JSON) type value or object for a string or XML.
string                 Return the string version for an input value.
uriComponent           Return the URI-encoded version for an input value by replacing URL-unsafe characters with escape characters.
uriComponentToBinary   Return the binary version for a URI-encoded string.
uriComponentToString   Return the string version for a URI-encoded string.
xml                    Return the XML version for a string.
xpath                  Check XML for nodes or values that match an XPath (XML Path Language) expression, and return the matching nodes or values.
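
For instance, here's a minimal sketch of a Set Variable activity that mixes conversion and math functions (the string pipeline parameter batchSizeAsText and the string variable nextBatchSize are hypothetical):

{
    "name": "SetNextBatchSize",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "nextBatchSize",
        "value": "@string(add(int(pipeline().parameters.batchSizeAsText), 1))"
    }
}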

Math functions
These functions can be used for both types of numbers: integers and floats.

MATH FUNCTION   TASK
add       Return the result from adding two numbers.
div       Return the result from dividing two numbers.
max       Return the highest value from a set of numbers or an array.
min       Return the lowest value from a set of numbers or an array.
mod       Return the remainder from dividing two numbers.
mul       Return the product from multiplying two numbers.
rand      Return a random integer from a specified range.
range     Return an integer array that starts from a specified integer.
sub       Return the result from subtracting the second number from the first number.
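
For instance, here's a minimal sketch of a Set Variable activity that uses div and add to estimate a batch count (the integer pipeline parameter totalRows and the string variable batchCount are hypothetical):

{
    "name": "SetBatchCount",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "batchCount",
        "value": "@string(add(div(pipeline().parameters.totalRows, 100), 1))"
    }
}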

Date functions
DATE OR TIME FUNCTION   TASK
addDays            Add a number of days to a timestamp.
addHours           Add a number of hours to a timestamp.
addMinutes         Add a number of minutes to a timestamp.
addSeconds         Add a number of seconds to a timestamp.
addToTime          Add a number of time units to a timestamp. See also getFutureTime.
convertFromUtc     Convert a timestamp from Universal Time Coordinated (UTC) to the target time zone.
convertTimeZone    Convert a timestamp from the source time zone to the target time zone.
convertToUtc       Convert a timestamp from the source time zone to Universal Time Coordinated (UTC).
dayOfMonth         Return the day of the month component from a timestamp.
dayOfWeek          Return the day of the week component from a timestamp.
dayOfYear          Return the day of the year component from a timestamp.
formatDateTime     Return the timestamp as a string in optional format.
getFutureTime      Return the current timestamp plus the specified time units. See also addToTime.
getPastTime        Return the current timestamp minus the specified time units. See also subtractFromTime.
startOfDay         Return the start of the day for a timestamp.
startOfHour        Return the start of the hour for a timestamp.
startOfMonth       Return the start of the month for a timestamp.
subtractFromTime   Subtract a number of time units from a timestamp. See also getPastTime.
ticks              Return the ticks property value for a specified timestamp.
utcNow             Return the current timestamp as a string.
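
For instance, here's a minimal sketch of a dataset reference that writes to a date-based folder for the previous day, reusing the parameterized BlobDataset pattern shown earlier in this article; the folder layout is hypothetical:

{
    "referenceName": "BlobDataset",
    "type": "DatasetReference",
    "parameters": {
        "path": "@concat('output/', formatDateTime(addDays(utcNow(), -1), 'yyyy/MM/dd'))"
    }
}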

Function reference
This section lists all the available functions in alphabetical order.

add
Return the result from adding two numbers.

add(<summand_1>, <summand_2>)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<summand_1>, Yes Integer, Float, or mixed The numbers to add


<summand_2>

RET URN VA L UE TYPE DESC RIP T IO N

<result-sum> Integer or Float The result from adding the specified


numbers

Example
This example adds the specified numbers:

add(1, 1.5)

And returns this result: 2.5

addDays
Add a number of days to a timestamp.

addDays('<timestamp>', <days>, '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<days> Yes Integer The positive or negative


number of days to add
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The timestamp plus the specified


number of days

Example 1
This example adds 10 days to the specified timestamp:

addDays('2018-03-15T13:00:00Z', 10)

And returns this result: "2018-03-25T00:00:0000000Z"

Example 2
This example subtracts five days from the specified timestamp:

addDays('2018-03-15T00:00:00Z', -5)

And returns this result: "2018-03-10T00:00:0000000Z"

addHours
Add a number of hours to a timestamp.

addHours('<timestamp>', <hours>, '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<hours> Yes Integer The positive or negative


number of hours to add
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The timestamp plus the specified


number of hours

Example 1
This example adds 10 hours to the specified timestamp:

addHours('2018-03-15T00:00:00Z', 10)

And returns this result: "2018-03-15T10:00:0000000Z"

Example 2
This example subtracts five hours from the specified timestamp:

addHours('2018-03-15T15:00:00Z', -5)

And returns this result: "2018-03-15T10:00:0000000Z"

addMinutes
Add a number of minutes to a timestamp.

addMinutes('<timestamp>', <minutes>, '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<minutes> Yes Integer The positive or negative


number of minutes to add
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The timestamp plus the specified


number of minutes

Example 1
This example adds 10 minutes to the specified timestamp:

addMinutes('2018-03-15T00:10:00Z', 10)

And returns this result: "2018-03-15T00:20:00.0000000Z"

Example 2
This example subtracts five minutes from the specified timestamp:

addMinutes('2018-03-15T00:20:00Z', -5)

And returns this result: "2018-03-15T00:15:00.0000000Z"

addSeconds
Add a number of seconds to a timestamp.

addSeconds('<timestamp>', <seconds>, '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<seconds> Yes Integer The positive or negative


number of seconds to add
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The timestamp plus the specified


number of seconds

Example 1
This example adds 10 seconds to the specified timestamp:

addSeconds('2018-03-15T00:00:00Z', 10)

And returns this result: "2018-03-15T00:00:10.0000000Z"

Example 2
This example subtracts five seconds from the specified timestamp:

addSeconds('2018-03-15T00:00:30Z', -5)

And returns this result: "2018-03-15T00:00:25.0000000Z"

addToTime
Add a number of time units to a timestamp. See also getFutureTime().

addToTime('<timestamp>', <interval>, '<timeUnit>', '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<interval> Yes Integer The number of specified


time units to add

<timeUnit> Yes String The unit of time to use with


interval: "Second", "Minute",
"Hour", "Day", "Week",
"Month", "Year"
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The timestamp plus the specified


number of time units

Example 1
This example adds one day to the specified timestamp:

addToTime('2018-01-01T00:00:00Z', 1, 'Day')

And returns this result: "2018-01-02T00:00:00.0000000Z"

Example 2
This example adds one day to the specified timestamp:

addToTime('2018-01-01T00:00:00Z', 1, 'Day', 'D')

And returns the result using the optional "D" format: "Tuesday, January 2, 2018"

and
Check whether both expressions are true. Return true when both expressions are true, or return false when at
least one expression is false.

and(<expression1>, <expression2>)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<expression1>, Yes Boolean The expressions to check


<expression2>

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when both expressions are


true. Return false when at least one
expression is false.
RET URN VA L UE TYPE DESC RIP T IO N

Example 1
These examples check whether the specified Boolean values are both true:

and(true, true)
and(false, true)
and(false, false)

And returns these results:


First example: Both expressions are true, so returns true .
Second example: One expression is false, so returns false .
Third example: Both expressions are false, so returns false .

Example 2
These examples check whether the specified expressions are both true:

and(equals(1, 1), equals(2, 2))


and(equals(1, 1), equals(1, 2))
and(equals(1, 2), equals(1, 3))

And returns these results:


First example: Both expressions are true, so returns true .
Second example: One expression is false, so returns false .
Third example: Both expressions are false, so returns false .

array
Return an array from a single specified input. For multiple inputs, see createArray().

array('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string for creating an


array

RET URN VA L UE TYPE DESC RIP T IO N

[<value>] Array An array that contains the single


specified input

Example
This example creates an array from the "hello" string:
array('hello')

And returns this result: ["hello"]

base64
Return the base64-encoded version for a string.

base64('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The input string

RET URN VA L UE TYPE DESC RIP T IO N

<base64-string> String The base64-encoded version for the


input string

Example
This example converts the "hello" string to a base64-encoded string:

base64('hello')

And returns this result: "aGVsbG8="

base64ToBinary
Return the binary version for a base64-encoded string.

base64ToBinary('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The base64-encoded string


to convert

RET URN VA L UE TYPE DESC RIP T IO N

<binary-for-base64-string> String The binary version for the base64-


encoded string

Example
This example converts the "aGVsbG8=" base64-encoded string to a binary string:
base64ToBinary('aGVsbG8=')

And returns this result:


"0110000101000111010101100111001101100010010001110011100000111101"

base64ToString
Return the string version for a base64-encoded string, effectively decoding the base64 string. Use this function
rather than decodeBase64(). Although both functions work the same way, base64ToString() is preferred.

base64ToString('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The base64-encoded string


to decode

RET URN VA L UE TYPE DESC RIP T IO N

<decoded-base64-string> String The string version for a base64-


encoded string

Example
This example converts the "aGVsbG8=" base64-encoded string to just a string:

base64ToString('aGVsbG8=')

And returns this result: "hello"

binary
Return the binary version for a string.

binary('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string to convert

RET URN VA L UE TYPE DESC RIP T IO N

<binary-for-input-value> String The binary version for the specified


string

Example
This example converts the "hello" string to a binary string:

binary('hello')

And returns this result:


"0110100001100101011011000110110001101111"

bool
Return the Boolean version for a value.

bool(<value>)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes Any The value to convert

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean The Boolean version for the specified


value

Example
These examples convert the specified values to Boolean values:

bool(1)
bool(0)

And returns these results:


First example: true
Second example: false

coalesce
Return the first non-null value from one or more parameters. Empty strings, empty arrays, and empty objects
are not null.

coalesce(<object_1>, <object_2>, ...)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<object_1>, <object_2>, ... Yes Any, can mix types One or more items to check
for null
RET URN VA L UE TYPE DESC RIP T IO N

<first-non-null-item> Any The first item or value that is not null.


If all parameters are null, this function
returns null.

Example
These examples return the first non-null value from the specified values, or null when all the values are null:

coalesce(null, true, false)


coalesce(null, 'hello', 'world')
coalesce(null, null, null)

And returns these results:


First example: true
Second example: "hello"
Third example: null

concat
Combine two or more strings, and return the combined string.

concat('<text1>', '<text2>', ...)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<text1>, <text2>, ... Yes String At least two strings to


combine

RET URN VA L UE TYPE DESC RIP T IO N

<text1text2...> String The string created from the combined


input strings

Example
This example combines the strings "Hello" and "World":

concat('Hello', 'World')

And returns this result: "HelloWorld"

contains
Check whether a collection has a specific item. Return true when the item is found, or return false when not
found. This function is case-sensitive.
contains('<collection>', '<value>')
contains([<collection>], '<value>')

Specifically, this function works on these collection types:


A string to find a substring
An array to find a value
A dictionary to find a key

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes String, Array, or Dictionary The collection to check

<value> Yes String, Array, or Dictionary, The item to find


respectively

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the item is found.


Return false when not found.

Example 1
This example checks the string "hello world" for the substring "world" and returns true:

contains('hello world', 'world')

Example 2
This example checks the string "hello world" for the substring "universe" and returns false:

contains('hello world', 'universe')

convertFromUtc
Convert a timestamp from Universal Time Coordinated (UTC) to the target time zone.

convertFromUtc('<timestamp>', '<destinationTimeZone>', '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<destinationTimeZone> Yes String The name for the target


time zone. For time zone
names, see Microsoft Time
Zone Index Values, but you
might have to remove any
punctuation from the time
zone name.

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<converted-timestamp> String The timestamp converted to the target


time zone

Example 1
This example converts a timestamp to the specified time zone:

convertFromUtc('2018-01-01T08:00:00.0000000Z', 'Pacific Standard Time')

And returns this result: "2018-01-01T00:00:00Z"

Example 2
This example converts a timestamp to the specified time zone and format:

convertFromUtc('2018-01-01T08:00:00.0000000Z', 'Pacific Standard Time', 'D')

And returns this result: "Monday, January 1, 2018"

convertTimeZone
Convert a timestamp from the source time zone to the target time zone.

convertTimeZone('<timestamp>', '<sourceTimeZone>', '<destinationTimeZone>', '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<sourceTimeZone> Yes String The name for the source


time zone. For time zone
names, see Microsoft Time
Zone Index Values, but you
might have to remove any
punctuation from the time
zone name.

<destinationTimeZone> Yes String The name for the target


time zone. For time zone
names, see Microsoft Time
Zone Index Values, but you
might have to remove any
punctuation from the time
zone name.

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<converted-timestamp> String The timestamp converted to the target


time zone

Example 1
This example converts the source time zone to the target time zone:

convertTimeZone('2018-01-01T08:00:00.0000000Z', 'UTC', 'Pacific Standard Time')

And returns this result: "2018-01-01T00:00:00.0000000"

Example 2
This example converts a timestamp to the specified time zone and format:

convertTimeZone('2018-01-01T08:00:00.0000000Z', 'UTC', 'Pacific Standard Time', 'D')

And returns this result: "Monday, January 1, 2018"

convertToUtc
Convert a timestamp from the source time zone to Universal Time Coordinated (UTC).

convertToUtc('<timestamp>', '<sourceTimeZone>', '<format>'?)


PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<sourceTimeZone> Yes String The name for the source


time zone. For time zone
names, see Microsoft Time
Zone Index Values, but you
might have to remove any
punctuation from the time
zone name.

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<converted-timestamp> String The timestamp converted to UTC

Example 1
This example converts a timestamp to UTC:

convertToUtc('01/01/2018 00:00:00', 'Pacific Standard Time')

And returns this result: "2018-01-01T08:00:00.0000000Z"

Example 2
This example converts a timestamp to UTC and returns it in the "D" format:

convertToUtc('01/01/2018 00:00:00', 'Pacific Standard Time', 'D')

And returns this result: "Monday, January 1, 2018"

createArray
Return an array from multiple inputs. For single input arrays, see array().

createArray('<object1>', '<object2>', ...)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N


PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<object1>, <object2>, ... Yes Any, but not mixed At least two items to create
the array

RET URN VA L UE TYPE DESC RIP T IO N

[<object1>, <object2>, ...] Array The array created from all the input
items

Example
This example creates an array from these inputs:

createArray('h', 'e', 'l', 'l', 'o')

And returns this result: ["h", "e", "l", "l", "o"]

dataUri
Return a data uniform resource identifier (URI) for a string.

dataUri('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string to convert

RET URN VA L UE TYPE DESC RIP T IO N

<data-uri> String The data URI for the input string

Example
This example creates a data URI for the "hello" string:

dataUri('hello')

And returns this result: "data:text/plain;charset=utf-8;base64,aGVsbG8="

dataUriToBinary
Return the binary version for a data uniform resource identifier (URI). Use this function rather than
decodeDataUri(). Although both functions work the same way, dataUriToBinary() is preferred.

dataUriToBinary('<value>')
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The data URI to convert

RET URN VA L UE TYPE DESC RIP T IO N

<binary-for-data-uri> String The binary version for the data URI

Example
This example creates a binary version for this data URI:

dataUriToBinary('data:text/plain;charset=utf-8;base64,aGVsbG8=')

And returns this result:


"01100100011000010111010001100001001110100111010001100101011110000111010000101111011100000
1101100011000010110100101101110001110110110001101101000011000010111001001110011011001010111
0100001111010111010101110100011001100010110100111000001110110110001001100001011100110110010
10011011000110100001011000110000101000111010101100111001101100010010001110011100000111101"

dataUriToString
Return the string version for a data uniform resource identifier (URI).

dataUriToString('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The data URI to convert

RET URN VA L UE TYPE DESC RIP T IO N

<string-for-data-uri> String The string version for the data URI

Example
This example creates a string for this data URI:

dataUriToString('data:text/plain;charset=utf-8;base64,aGVsbG8=')

And returns this result: "hello"

dayOfMonth
Return the day of the month from a timestamp.

dayOfMonth('<timestamp>')
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

RET URN VA L UE TYPE DESC RIP T IO N

<day-of-month> Integer The day of the month from the


specified timestamp

Example
This example returns the number for the day of the month from this timestamp:

dayOfMonth('2018-03-15T13:27:36Z')

And returns this result: 15

dayOfWeek
Return the day of the week from a timestamp.

dayOfWeek('<timestamp>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

RET URN VA L UE TYPE DESC RIP T IO N

<day-of-week> Integer The day of the week from the specified


timestamp where Sunday is 0, Monday
is 1, and so on

Example
This example returns the number for the day of the week from this timestamp:

dayOfWeek('2018-03-15T13:27:36Z')

And returns this result: 3

dayOfYear
Return the day of the year from a timestamp.
dayOfYear('<timestamp>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

RET URN VA L UE TYPE DESC RIP T IO N

<day-of-year> Integer The day of the year from the specified


timestamp

Example
This example returns the number of the day of the year from this timestamp:

dayOfYear('2018-03-15T13:27:36Z')

And returns this result: 74

decodeBase64
Return the string version for a base64-encoded string, effectively decoding the base64 string. Consider using
base64ToString() rather than decodeBase64() . Although both functions work the same way, base64ToString() is
preferred.

decodeBase64('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The base64-encoded string


to decode

RET URN VA L UE TYPE DESC RIP T IO N

<decoded-base64-string> String The string version for a base64-


encoded string

Example
This example creates a string for a base64-encoded string:

decodeBase64('aGVsbG8=')

And returns this result: "hello"

decodeDataUri
Return the binary version for a data uniform resource identifier (URI). Consider using dataUriToBinary(), rather
than decodeDataUri() . Although both functions work the same way, dataUriToBinary() is preferred.

decodeDataUri('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The data URI string to


decode

RET URN VA L UE TYPE DESC RIP T IO N

<binary-for-data-uri> String The binary version for a data URI


string

Example
This example returns the binary version for this data URI:

decodeDataUri('data:text/plain;charset=utf-8;base64,aGVsbG8=')

And returns this result:


"01100100011000010111010001100001001110100111010001100101011110000111010000101111011100000
1101100011000010110100101101110001110110110001101101000011000010111001001110011011001010111
0100001111010111010101110100011001100010110100111000001110110110001001100001011100110110010
10011011000110100001011000110000101000111010101100111001101100010010001110011100000111101"

decodeUriComponent
Return a string that replaces escape characters with decoded versions.

decodeUriComponent('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string with the escape


characters to decode

RET URN VA L UE TYPE DESC RIP T IO N

<decoded-uri> String The updated string with the decoded


escape characters

Example
This example replaces the escape characters in this string with decoded versions:
decodeUriComponent('http%3A%2F%2Fcontoso.com')

And returns this result: "http://contoso.com"

div
Return the integer result from dividing two numbers. To get the remainder result, see mod().

div(<dividend>, <divisor>)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<dividend> Yes Integer or Float The number to divide by


the divisor

<divisor> Yes Integer or Float The number that divides


the dividend, but cannot be
0

RET URN VA L UE TYPE DESC RIP T IO N

<quotient-result> Integer The integer result from dividing the


first number by the second number

Example
Both examples divide the first number by the second number:

div(10, 5)
div(11, 5)

And return this result: 2

encodeUriComponent
Return a uniform resource identifier (URI) encoded version for a string by replacing URL-unsafe characters with
escape characters. Consider using uriComponent(), rather than encodeUriComponent() . Although both functions
work the same way, uriComponent() is preferred.

encodeUriComponent('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string to convert to


URI-encoded format
RET URN VA L UE TYPE DESC RIP T IO N

<encoded-uri> String The URI-encoded string with escape


characters

Example
This example creates a URI-encoded version for this string:

encodeUriComponent('https://contoso.com')

And returns this result: "https%3A%2F%2Fcontoso.com"

empty
Check whether a collection is empty. Return true when the collection is empty, or return false when not empty.

empty('<collection>')
empty([<collection>])

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes String, Array, or Object The collection to check

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the collection is


empty. Return false when not empty.

Example
These examples check whether the specified collections are empty:

empty('')
empty('abc')

And returns these results:


First example: Passes an empty string, so the function returns true .
Second example: Passes the string "abc", so the function returns false .

endsWith
Check whether a string ends with a specific substring. Return true when the substring is found, or return false
when not found. This function is not case-sensitive.

endsWith('<text>', '<searchText>')
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<text> Yes String The string to check

<searchText> Yes String The ending substring to


find

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the ending substring


is found. Return false when not found.

Example 1
This example checks whether the "hello world" string ends with the "world" string:

endsWith('hello world', 'world')

And returns this result: true

Example 2
This example checks whether the "hello world" string ends with the "universe" string:

endsWith('hello world', 'universe')

And returns this result: false

equals
Check whether both values, expressions, or objects are equivalent. Return true when both are equivalent, or
return false when they're not equivalent.

equals('<object1>', '<object2>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<object1>, <object2> Yes Various The values, expressions, or


objects to compare

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when both are equivalent.


Return false when not equivalent.

Example
These examples check whether the specified inputs are equivalent.
equals(true, 1)
equals('abc', 'abcd')

And returns these results:


First example: Both values are equivalent, so the function returns true .
Second example: Both values aren't equivalent, so the function returns false .

first
Return the first item from a string or array.

first('<collection>')
first([<collection>])

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes String or Array The collection where to find


the first item

RET URN VA L UE TYPE DESC RIP T IO N

<first-collection-item> Any The first item in the collection

Example
These examples find the first item in these collections:

first('hello')
first(createArray(0, 1, 2))

And return these results:


First example: "h"
Second example: 0

float
Convert a string version for a floating-point number to an actual floating point number.

float('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string that has a valid


floating-point number to
convert
RET URN VA L UE TYPE DESC RIP T IO N

<float-value> Float The floating-point number for the


specified string

Example
This example converts the string version of a floating-point number to an actual floating-point number:

float('10.333')

And returns this result: 10.333

formatDateTime
Return a timestamp in the specified format.

formatDateTime('<timestamp>', '<format>'?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<timestamp> Yes String The string that contains the


timestamp

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<reformatted-timestamp> String The updated timestamp in the


specified format

Example
This example converts a timestamp to the specified format:

formatDateTime('03/15/2018 12:00:00', 'yyyy-MM-ddTHH:mm:ss')

And returns this result: "2018-03-15T12:00:00"

getFutureTime
Return the current timestamp plus the specified time units.
getFutureTime(<interval>, <timeUnit>, <format>?)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<interval> Yes Integer The number of specified


time units to add

<timeUnit> Yes String The unit of time to use with


interval: "Second", "Minute",
"Hour", "Day", "Week",
"Month", "Year"

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The current timestamp plus the


specified number of time units

Example 1
Suppose the current timestamp is "2018-03-01T00:00:00.0000000Z". This example adds five days to that
timestamp:

getFutureTime(5, 'Day')

And returns this result: "2018-03-06T00:00:00.0000000Z"

Example 2
Suppose the current timestamp is "2018-03-01T00:00:00.0000000Z". This example adds five days and converts
the result to "D" format:

getFutureTime(5, 'Day', 'D')

And returns this result: "Tuesday, March 6, 2018"

getPastTime
Return the current timestamp minus the specified time units.

getPastTime(<interval>, <timeUnit>, <format>?)


PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<interval> Yes Integer The number of specified


time units to subtract

<timeUnit> Yes String The unit of time to use with


interval: "Second", "Minute",
"Hour", "Day", "Week",
"Month", "Year"

<format> No String Either a single format


specifier or a custom format
pattern. The default format
for the timestamp is "o"
(yyyy-MM-
ddTHH:mm:ss:fffffffK), which
complies with ISO 8601 and
preserves time zone
information.

RET URN VA L UE TYPE DESC RIP T IO N

<updated-timestamp> String The current timestamp minus the


specified number of time units

Example 1
Suppose the current timestamp is "2018-02-01T00:00:00.0000000Z". This example subtracts five days from that
timestamp:

getPastTime(5, 'Day')

And returns this result: "2018-01-27T00:00:00.0000000Z"

Example 2
Suppose the current timestamp is "2018-02-01T00:00:00.0000000Z". This example subtracts five days and
converts the result to "D" format:

getPastTime(5, 'Day', 'D')

And returns this result: "Saturday, January 27, 2018"

greater
Check whether the first value is greater than the second value. Return true when the first value is more, or return
false when less.

greater(<value>, <compareTo>)
greater('<value>', '<compareTo>')
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes Integer, Float, or String The first value to check


whether greater than the
second value

<compareTo> Yes Integer, Float, or String, The comparison value


respectively

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the first value is


greater than the second value. Return
false when the first value is equal to or
less than the second value.

Example
These examples check whether the first value is greater than the second value:

greater(10, 5)
greater('apple', 'banana')

And return these results:


First example: true
Second example: false

greaterOrEquals
Check whether the first value is greater than or equal to the second value. Return true when the first value is
greater or equal, or return false when the first value is less.

greaterOrEquals(<value>, <compareTo>)
greaterOrEquals('<value>', '<compareTo>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes Integer, Float, or String The first value to check


whether greater than or
equal to the second value

<compareTo> Yes Integer, Float, or String, The comparison value


respectively

RET URN VA L UE TYPE DESC RIP T IO N


RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the first value is


greater than or equal to the second
value. Return false when the first value
is less than the second value.

Example
These examples check whether the first value is greater than or equal to the second value:

greaterOrEquals(5, 5)
greaterOrEquals('apple', 'banana')

And return these results:


First example: true
Second example: false

guid
Generate a globally unique identifier (GUID) as a string, for example, "c2ecc88d-88c8-4096-912c-
d6f2e2b138ce":

guid()

Also, you can specify a different format for the GUID other than the default format, "D", which is 32 digits
separated by hyphens.

guid('<format>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<format> No String A single format specifier for


the returned GUID. By
default, the format is "D",
but you can use "N", "D",
"B", "P", or "X".

RET URN VA L UE TYPE DESC RIP T IO N

<GUID-value> String A randomly generated GUID

Example
This example generates the same GUID, but as 32 digits, separated by hyphens, and enclosed in parentheses:

guid('P')
And returns this result: "(c2ecc88d-88c8-4096-912c-d6f2e2b138ce)"

if
Check whether an expression is true or false. Based on the result, return a specified value.

if(<expression>, <valueIfTrue>, <valueIfFalse>)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<expression> Yes Boolean The expression to check

<valueIfTrue> Yes Any The value to return when


the expression is true

<valueIfFalse> Yes Any The value to return when


the expression is false

RET URN VA L UE TYPE DESC RIP T IO N

<specified-return-value> Any The specified value that returns based


on whether the expression is true or
false

Example
This example returns "yes" because the specified expression returns true. Otherwise, the example returns
"no" :

if(equals(1, 1), 'yes', 'no')

indexOf
Return the starting position or index value for a substring. This function is not case-sensitive, and indexes start
with the number 0.

indexOf('<text>', '<searchText>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<text> Yes String The string that has the


substring to find

<searchText> Yes String The substring to find

RET URN VA L UE TYPE DESC RIP T IO N


RET URN VA L UE TYPE DESC RIP T IO N

<index-value> Integer The starting position or index value for


the specified substring.
If the string is not found, return
the number -1.

Example
This example finds the starting index value for the "world" substring in the "hello world" string:

indexOf('hello world', 'world')

And returns this result: 6

int
Return the integer version for a string.

int('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String The string to convert

RET URN VA L UE TYPE DESC RIP T IO N

<integer-result> Integer The integer version for the specified


string

Example
This example creates an integer version for the string "10":

int('10')

And returns this result: 10

json
Return the JavaScript Object Notation (JSON) type value or object for a string or XML.

json('<value>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes String or XML The string or XML to


convert
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

RET URN VA L UE TYPE DESC RIP T IO N

<JSON-result> JSON native type or object The JSON native type value or object
for the specified string or XML. If the
string is null, the function returns an
empty object.

Example 1
This example converts this string to the JSON value:

json('[1, 2, 3]')

And returns this result: [1, 2, 3]

Example 2
This example converts this string to JSON:

json('{"fullName": "Sophia Owen"}')

And returns this result:

{
"fullName": "Sophia Owen"
}

Example 3
This example converts this XML to JSON:

json(xml('<?xml version="1.0"?> <root> <person id="1"> <name>Sophia Owen</name> <occupation>Engineer</occupation> </person> </root>'))

And returns this result:

{
"?xml": { "@version": "1.0" },
"root": {
"person": [ {
"@id": "1",
"name": "Sophia Owen",
"occupation": "Engineer"
} ]
}
}

intersection
Return a collection that has only the common items across the specified collections. To appear in the result, an
item must appear in all the collections passed to this function. If one or more items have the same name, the last
item with that name appears in the result.

intersection([<collection1>], [<collection2>], ...)


intersection('<collection1>', '<collection2>', ...)

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection1>, Yes Array or Object, but not The collections from where
<collection2>, ... both you want only the common
items

RET URN VA L UE TYPE DESC RIP T IO N

<common-items> Array or Object, respectively A collection that has only the common
items across the specified collections

Example
This example finds the common items across these arrays:

intersection(createArray(1, 2, 3), createArray(101, 2, 1, 10), createArray(6, 8, 1, 2))

And returns an array with only these items: [1, 2]

join
Return a string that has all the items from an array, with each item separated by a delimiter.

join([<collection>], '<delimiter>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes Array The array that has the items


to join

<delimiter> Yes String The separator that appears


between each character in
the resulting string

RET URN VA L UE TYPE DESC RIP T IO N

<char1><delimiter><char2> String The resulting string created from all


<delimiter>... the items in the specified array

Example
This example creates a string from all the items in this array with the specified character as the delimiter:
join(createArray('a', 'b', 'c'), '.')

And returns this result: "a.b.c"

last
Return the last item from a collection.

last('<collection>')
last([<collection>])

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes String or Array The collection where to find


the last item

RET URN VA L UE TYPE DESC RIP T IO N

<last-collection-item> String or Array, respectively The last item in the collection

Example
These examples find the last item in these collections:

last('abcd')
last(createArray(0, 1, 2, 3))

And returns these results:


First example: "d"
Second example: 3

lastIndexOf
Return the starting position or index value for the last occurrence of a substring. This function is not case-
sensitive, and indexes start with the number 0.

lastIndexOf('<text>', '<searchText>')

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<text> Yes String The string that has the


substring to find

<searchText> Yes String The substring to find


RET URN VA L UE TYPE DESC RIP T IO N

<ending-index-value> Integer The starting position or index value for


the last occurrence of the specified
substring.
If the string is not found, return
the number -1.

Example
This example finds the starting index value for the last occurrence of the "world" substring in the "hello world"
string:

lastIndexOf('hello world', 'world')

And returns this result: 6

length
Return the number of items in a collection.

length('<collection>')
length([<collection>])

PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<collection> Yes String or Array The collection with the


items to count

RET URN VA L UE TYPE DESC RIP T IO N

<length-or-count> Integer The number of items in the collection

Example
These examples count the number of items in these collections:

length('abcd')
length(createArray(0, 1, 2, 3))

And return this result: 4

less
Check whether the first value is less than the second value. Return true when the first value is less, or return
false when the first value is more.

less(<value>, <compareTo>)
less('<value>', '<compareTo>')
PA RA M ET ER REQ UIRED TYPE DESC RIP T IO N

<value> Yes Integer, Float, or String The first value to check


whether less than the
second value

<compareTo> Yes Integer, Float, or String, The comparison item


respectively

RET URN VA L UE TYPE DESC RIP T IO N

true or false Boolean Return true when the first value is less
than the second value. Return false
when the first value is equal to or
greater than the second value.

Example
These examples check whether the first value is less than the second value.

less(5, 10)
less('banana', 'apple')

And return these results:


First example: true
Second example: false

lessOrEquals
Check whether the first value is less than or equal to the second value. Return true when the first value is less
than or equal, or return false when the first value is more.

lessOrEquals(<value>, <compareTo>)
lessOrEquals('<value>', '<compareTo>')

PARAMETER     REQUIRED   TYPE                                      DESCRIPTION

<value>       Yes        Integer, Float, or String                 The first value to check whether less than or equal to the second value

<compareTo>   Yes        Integer, Float, or String, respectively   The comparison item

RETURN VALUE    TYPE      DESCRIPTION

true or false   Boolean   Return true when the first value is less than or equal to the second value. Return false when the first value is greater than the second value.

Example
These examples check whether the first value is less than or equal to the second value.

lessOrEquals(10, 10)
lessOrEquals('apply', 'apple')

And return these results:


First example: true
Second example: false

max
Return the highest value from a set of numbers or an array.

max(<number1>, <number2>, ...)


max([<number1>, <number2>, ...])

PARAMETER                     REQUIRED   TYPE                              DESCRIPTION

<number1>, <number2>, ...     Yes        Integer, Float, or both           The set of numbers from which you want the highest value

[<number1>, <number2>, ...]   Yes        Array - Integer, Float, or both   The array of numbers from which you want the highest value

RETURN VALUE   TYPE               DESCRIPTION

<max-value>    Integer or Float   The highest value in the specified array or set of numbers

Example
These examples get the highest value from the set of numbers and the array:

max(1, 2, 3)
max(createArray(1, 2, 3))

And return this result: 3

min
Return the lowest value from a set of numbers or an array.

min(<number1>, <number2>, ...)


min([<number1>, <number2>, ...])

PARAMETER                     REQUIRED   TYPE                              DESCRIPTION

<number1>, <number2>, ...     Yes        Integer, Float, or both           The set of numbers from which you want the lowest value

[<number1>, <number2>, ...]   Yes        Array - Integer, Float, or both   The array of numbers from which you want the lowest value

RETURN VALUE   TYPE               DESCRIPTION

<min-value>    Integer or Float   The lowest value in the specified set of numbers or specified array

Example
These examples get the lowest value in the set of numbers and the array:

min(1, 2, 3)
min(createArray(1, 2, 3))

And return this result: 1

mod
Return the remainder from dividing two numbers. To get the integer result, see div().

mod(<dividend>, <divisor>)

PARAMETER    REQUIRED   TYPE               DESCRIPTION

<dividend>   Yes        Integer or Float   The number to divide by the divisor

<divisor>    Yes        Integer or Float   The number that divides the dividend, but cannot be 0

RETURN VALUE      TYPE               DESCRIPTION

<modulo-result>   Integer or Float   The remainder from dividing the first number by the second number

Example
This example returns the remainder from dividing the first number by the second number:

mod(3, 2)

And returns this result: 1

mul
Return the product from multiplying two numbers.

mul(<multiplicand1>, <multiplicand2>)

PARAMETER         REQUIRED   TYPE               DESCRIPTION

<multiplicand1>   Yes        Integer or Float   The number to multiply by multiplicand2

<multiplicand2>   Yes        Integer or Float   The number that multiplies multiplicand1

RETURN VALUE       TYPE               DESCRIPTION

<product-result>   Integer or Float   The product from multiplying the first number by the second number

Example
These examples multiply the first number by the second number:

mul(1, 2)
mul(1.5, 2)

And return these results:


First example: 2
Second example: 3

not
Check whether an expression is false. Return true when the expression is false, or return false when true.

not(<expression>)

PARAMETER      REQUIRED   TYPE      DESCRIPTION

<expression>   Yes        Boolean   The expression to check

RETURN VALUE    TYPE      DESCRIPTION

true or false   Boolean   Return true when the expression is false. Return false when the expression is true.

Example 1
These examples check whether the specified expressions are false:

not(false)
not(true)

And return these results:


First example: The expression is false, so the function returns true .
Second example: The expression is true, so the function returns false .

Example 2
These examples check whether the specified expressions are false:

not(equals(1, 2))
not(equals(1, 1))

And return these results:


First example: The expression is false, so the function returns true .
Second example: The expression is true, so the function returns false .

or
Check whether at least one expression is true. Return true when at least one expression is true, or return false
when both are false.

or(<expression1>, <expression2>)

PARAMETER                      REQUIRED   TYPE      DESCRIPTION

<expression1>, <expression2>   Yes        Boolean   The expressions to check

RETURN VALUE    TYPE      DESCRIPTION

true or false   Boolean   Return true when at least one expression is true. Return false when both expressions are false.

Example 1
These examples check whether at least one expression is true:
or(true, false)
or(false, false)

And return these results:


First example: At least one expression is true, so the function returns true .
Second example: Both expressions are false, so the function returns false .

Example 2
These examples check whether at least one expression is true:

or(equals(1, 1), equals(1, 2))


or(equals(1, 2), equals(1, 3))

And return these results:


First example: At least one expression is true, so the function returns true .
Second example: Both expressions are false, so the function returns false .

rand
Return a random integer from a specified range, which is inclusive only at the starting end.

rand(<minValue>, <maxValue>)

PARAMETER    REQUIRED   TYPE      DESCRIPTION

<minValue>   Yes        Integer   The lowest integer in the range

<maxValue>   Yes        Integer   The integer that follows the highest integer in the range that the function can return

RETURN VALUE      TYPE      DESCRIPTION

<random-result>   Integer   The random integer returned from the specified range

Example
This example gets a random integer from the specified range, excluding the maximum value:

rand(1, 5)

And returns one of these numbers as the result: 1 , 2 , 3 , or 4

range
Return an integer array that starts from a specified integer.
range(<startIndex>, <count>)

PARAMETER      REQUIRED   TYPE      DESCRIPTION

<startIndex>   Yes        Integer   An integer value that starts the array as the first item

<count>        Yes        Integer   The number of integers in the array

RETURN VALUE       TYPE    DESCRIPTION

[<range-result>]   Array   The array with integers starting from the specified index

Example
This example creates an integer array that starts from the specified index and has the specified number of
integers:

range(1, 4)

And returns this result: [1, 2, 3, 4]

replace
Replace a substring with the specified string, and return the result string. This function is case-sensitive.

replace('<text>', '<oldText>', '<newText>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<text>      Yes        String   The string that has the substring to replace

<oldText>   Yes        String   The substring to replace

<newText>   Yes        String   The replacement string

RETURN VALUE     TYPE     DESCRIPTION

<updated-text>   String   The updated string after replacing the substring. If the substring is not found, return the original string.

Example
This example finds the "old" substring in "the old string" and replaces "old" with "new":

replace('the old string', 'old', 'new')

And returns this result: "the new string"

skip
Remove items from the front of a collection, and return all the other items.

skip([<collection>], <count>)

PARAMETER      REQUIRED   TYPE      DESCRIPTION

<collection>   Yes        Array     The collection whose items you want to remove

<count>        Yes        Integer   A positive integer for the number of items to remove at the front

RETURN VALUE             TYPE    DESCRIPTION

[<updated-collection>]   Array   The updated collection after removing the specified items

Example
This example removes one item, the number 0, from the front of the specified array:

skip(createArray(0, 1, 2, 3), 1)

And returns this array with the remaining items: [1,2,3]

split
Return an array that contains substrings, separated by commas, based on the specified delimiter character in the
original string.

split('<text>', '<delimiter>')

PARAMETER     REQUIRED   TYPE     DESCRIPTION

<text>        Yes        String   The string to separate into substrings based on the specified delimiter in the original string

<delimiter>   Yes        String   The character in the original string to use as the delimiter

RETURN VALUE                      TYPE    DESCRIPTION

[<substring1>,<substring2>,...]   Array   An array that contains substrings from the original string, separated by commas

Example
This example creates an array with substrings from the specified string based on the specified character as the
delimiter:

split('a_b_c', '_')

And returns this array as the result: ["a","b","c"]

startOfDay
Return the start of the day for a timestamp.

startOfDay('<timestamp>', '<format>'?)

PARAMETER     REQUIRED   TYPE     DESCRIPTION

<timestamp>   Yes        String   The string that contains the timestamp

<format>      No         String   Either a single format specifier or a custom format pattern. The default format for the timestamp is "o" (yyyy-MM-ddTHH:mm:ss.fffffffK), which complies with ISO 8601 and preserves time zone information.

RETURN VALUE          TYPE     DESCRIPTION

<updated-timestamp>   String   The specified timestamp but starting at the zero-hour mark for the day

Example
This example finds the start of the day for this timestamp:
startOfDay('2018-03-15T13:30:30Z')

And returns this result: "2018-03-15T00:00:00.0000000Z"

startOfHour
Return the start of the hour for a timestamp.

startOfHour('<timestamp>', '<format>'?)

PARAMETER     REQUIRED   TYPE     DESCRIPTION

<timestamp>   Yes        String   The string that contains the timestamp

<format>      No         String   Either a single format specifier or a custom format pattern. The default format for the timestamp is "o" (yyyy-MM-ddTHH:mm:ss.fffffffK), which complies with ISO 8601 and preserves time zone information.

RETURN VALUE          TYPE     DESCRIPTION

<updated-timestamp>   String   The specified timestamp but starting at the zero-minute mark for the hour

Example
This example finds the start of the hour for this timestamp:

startOfHour('2018-03-15T13:30:30Z')

And returns this result: "2018-03-15T13:00:00.0000000Z"

startOfMonth
Return the start of the month for a timestamp.

startOfMonth('<timestamp>', '<format>'?)

PARAMETER     REQUIRED   TYPE     DESCRIPTION

<timestamp>   Yes        String   The string that contains the timestamp

<format>      No         String   Either a single format specifier or a custom format pattern. The default format for the timestamp is "o" (yyyy-MM-ddTHH:mm:ss.fffffffK), which complies with ISO 8601 and preserves time zone information.

RETURN VALUE          TYPE     DESCRIPTION

<updated-timestamp>   String   The specified timestamp but starting on the first day of the month at the zero-hour mark

Example
This example returns the start of the month for this timestamp:

startOfMonth('2018-03-15T13:30:30Z')

And returns this result: "2018-03-01T00:00:00.0000000Z"

startsWith
Check whether a string starts with a specific substring. Return true when the substring is found, or return false
when not found. This function is not case-sensitive.

startsWith('<text>', '<searchText>')

PARAMETER      REQUIRED   TYPE     DESCRIPTION

<text>         Yes        String   The string to check

<searchText>   Yes        String   The starting string to find

RETURN VALUE    TYPE      DESCRIPTION

true or false   Boolean   Return true when the starting substring is found. Return false when not found.

Example 1
This example checks whether the "hello world" string starts with the "hello" substring:
startsWith('hello world', 'hello')

And returns this result: true

Example 2
This example checks whether the "hello world" string starts with the "greetings" substring:

startsWith('hello world', 'greetings')

And returns this result: false

string
Return the string version for a value.

string(<value>)

PARAMETER   REQUIRED   TYPE   DESCRIPTION

<value>     Yes        Any    The value to convert

RETURN VALUE     TYPE     DESCRIPTION

<string-value>   String   The string version for the specified value

Example 1
This example creates the string version for this number:

string(10)

And returns this result: "10"

Example 2
This example creates a string for the specified JSON object and uses the backslash character (\) as an escape
character for the double-quotation mark (").

string( { "name": "Sophie Owen" } )

And returns this result: "{ \"name\": \"Sophie Owen\" }"

sub
Return the result from subtracting the second number from the first number.

sub(<minuend>, <subtrahend>)
PARAMETER      REQUIRED   TYPE               DESCRIPTION

<minuend>      Yes        Integer or Float   The number from which to subtract the subtrahend

<subtrahend>   Yes        Integer or Float   The number to subtract from the minuend

RETURN VALUE   TYPE               DESCRIPTION

<result>       Integer or Float   The result from subtracting the second number from the first number

Example
This example subtracts the second number from the first number:

sub(10.3, .3)

And returns this result: 10

substring
Return characters from a string, starting from the specified position, or index. Index values start with the number
0.

substring('<text>', <startIndex>, <length>)

PARAMETER      REQUIRED   TYPE      DESCRIPTION

<text>         Yes        String    The string whose characters you want

<startIndex>   Yes        Integer   A positive number equal to or greater than 0 that you want to use as the starting position or index value

<length>       Yes        Integer   A positive number of characters that you want in the substring

RETURN VALUE         TYPE     DESCRIPTION

<substring-result>   String   A substring with the specified number of characters, starting at the specified index position in the source string

Example
This example creates a five-character substring from the specified string, starting from the index value 6:

substring('hello world', 6, 5)

And returns this result: "world"

subtractFromTime
Subtract a number of time units from a timestamp. See also getPastTime.

subtractFromTime('<timestamp>', <interval>, '<timeUnit>', '<format>'?)

PARAMETER     REQUIRED   TYPE      DESCRIPTION

<timestamp>   Yes        String    The string that contains the timestamp

<interval>    Yes        Integer   The number of specified time units to subtract

<timeUnit>    Yes        String    The unit of time to use with interval: "Second", "Minute", "Hour", "Day", "Week", "Month", "Year"

<format>      No         String    Either a single format specifier or a custom format pattern. The default format for the timestamp is "o" (yyyy-MM-ddTHH:mm:ss.fffffffK), which complies with ISO 8601 and preserves time zone information.

RETURN VALUE          TYPE     DESCRIPTION

<updated-timestamp>   String   The timestamp minus the specified number of time units

Example 1
This example subtracts one day from this timestamp:

subtractFromTime('2018-01-02T00:00:00Z', 1, 'Day')

And returns this result: "2018-01-01T00:00:00:0000000Z"

Example 2
This example subtracts one day from this timestamp:
subtractFromTime('2018-01-02T00:00:00Z', 1, 'Day', 'D')

And returns this result using the optional "D" format: "Monday, January 1, 2018"

take
Return items from the front of a collection.

take('<collection>', <count>)
take([<collection>], <count>)

PARAMETER      REQUIRED   TYPE              DESCRIPTION

<collection>   Yes        String or Array   The collection whose items you want

<count>        Yes        Integer           A positive integer for the number of items that you want from the front

RETURN VALUE             TYPE                            DESCRIPTION

<subset> or [<subset>]   String or Array, respectively   A string or array that has the specified number of items taken from the front of the original collection

Example
These examples get the specified number of items from the front of these collections:

take('abcde', 3)
take(createArray(0, 1, 2, 3, 4), 3)

And return these results:


First example: "abc"
Second example: [0, 1, 2]

ticks
Return the ticks property value for a specified timestamp. A tick is a 100-nanosecond interval.

ticks('<timestamp>')

PARAMETER     REQUIRED   TYPE     DESCRIPTION

<timestamp>   Yes        String   The string for a timestamp

RETURN VALUE     TYPE      DESCRIPTION

<ticks-number>   Integer   The number of ticks (100-nanosecond intervals) from January 1, 0001 at 00:00:00 up to the specified timestamp
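
Example
The original reference includes no sample call for ticks; the following is a minimal illustrative sketch, assuming the standard .NET tick epoch of January 1, 0001 at 00:00:00 UTC:

ticks('2018-03-15T00:00:00Z')

And returns the number of 100-nanosecond intervals from that epoch up to the specified timestamp, in this case 636566688000000000.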

toLower
Return a string in lowercase format. If a character in the string doesn't have a lowercase version, that character
stays unchanged in the returned string.

toLower('<text>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<text>      Yes        String   The string to return in lowercase format

RETURN VALUE       TYPE     DESCRIPTION

<lowercase-text>   String   The original string in lowercase format

Example
This example converts this string to lowercase:

toLower('Hello World')

And returns this result: "hello world"

toUpper
Return a string in uppercase format. If a character in the string doesn't have an uppercase version, that character
stays unchanged in the returned string.

toUpper('<text>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<text>      Yes        String   The string to return in uppercase format

RETURN VALUE       TYPE     DESCRIPTION

<uppercase-text>   String   The original string in uppercase format

Example
This example converts this string to uppercase:

toUpper('Hello World')

And returns this result: "HELLO WORLD"

trim
Remove leading and trailing whitespace from a string, and return the updated string.

trim('<text>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<text>      Yes        String   The string that has the leading and trailing whitespace to remove

RETURN VALUE    TYPE     DESCRIPTION

<updatedText>   String   An updated version for the original string without leading or trailing whitespace

Example
This example removes the leading and trailing whitespace from the string " Hello World ":

trim(' Hello World ')

And returns this result: "Hello World"

union
Return a collection that has all the items from the specified collections. To appear in the result, an item can
appear in any collection passed to this function. If one or more items have the same name, the last item with
that name appears in the result.

union('<collection1>', '<collection2>', ...)


union([<collection1>], [<collection2>], ...)

PARAMETER                           REQUIRED   TYPE                            DESCRIPTION

<collection1>, <collection2>, ...   Yes        Array or Object, but not both   The collections from which you want all the items

RETURN VALUE          TYPE                            DESCRIPTION

<updatedCollection>   Array or Object, respectively   A collection with all the items from the specified collections - no duplicates

Example
This example gets all the items from these collections:

union(createArray(1, 2, 3), createArray(1, 2, 10, 101))

And returns this result: [1, 2, 3, 10, 101]

uriComponent
Return a uniform resource identifier (URI) encoded version for a string by replacing URL-unsafe characters with
escape characters. Use this function rather than encodeUriComponent(). Although both functions work the same
way, uriComponent() is preferred.

uriComponent('<value>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<value>     Yes        String   The string to convert to URI-encoded format

RETURN VALUE    TYPE     DESCRIPTION

<encoded-uri>   String   The URI-encoded string with escape characters

Example
This example creates a URI-encoded version for this string:

uriComponent('https://contoso.com')

And returns this result: "http%3A%2F%2Fcontoso.com"

uriComponentToBinary
Return the binary version for a uniform resource identifier (URI) component.

uriComponentToBinary('<value>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<value>     Yes        String   The URI-encoded string to convert

RETURN VALUE               TYPE     DESCRIPTION

<binary-for-encoded-uri>   String   The binary version for the URI-encoded string. The binary content is base64-encoded and represented by $content.

Example
This example creates the binary version for this URI-encoded string:

uriComponentToBinary('http%3A%2F%2Fcontoso.com')

And returns this result:

"0010001001101000011101000111010001110000001001010011001101000001001001010011001001000110001001010011001001000110011000110110111101101110011101000110111101110011011011110010111001100011011011110110110100100010"

uriComponentToString
Return the string version for a uniform resource identifier (URI) encoded string, effectively decoding the URI-
encoded string.

uriComponentToString('<value>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<value>     Yes        String   The URI-encoded string to decode

RETURN VALUE    TYPE     DESCRIPTION

<decoded-uri>   String   The decoded version for the URI-encoded string

Example
This example creates the decoded string version for this URI-encoded string:

uriComponentToString('https%3A%2F%2Fcontoso.com')

And returns this result: "https://contoso.com"


utcNow
Return the current timestamp.

utcNow('<format>')

Optionally, you can specify a different format with the <format> parameter.

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<format>    No         String   Either a single format specifier or a custom format pattern. The default format for the timestamp is "o" (yyyy-MM-ddTHH:mm:ss.fffffffK), which complies with ISO 8601 and preserves time zone information.

RETURN VALUE          TYPE     DESCRIPTION

<current-timestamp>   String   The current date and time

Example 1
Suppose today is April 15, 2018 at 1:00:00 PM. This example gets the current timestamp:

utcNow()

And returns this result: "2018-04-15T13:00:00.0000000Z"

Example 2
Suppose today is April 15, 2018 at 1:00:00 PM. This example gets the current timestamp using the optional "D"
format:

utcNow('D')

And returns this result: "Sunday, April 15, 2018"

xml
Return the XML version for a string that contains a JSON object.

xml('<value>')

PARAMETER   REQUIRED   TYPE     DESCRIPTION

<value>     Yes        String   The string with the JSON object to convert. The JSON object must have only one root property, which can't be an array. Use the backslash character (\) as an escape character for the double quotation mark (").

RETURN VALUE    TYPE     DESCRIPTION

<xml-version>   Object   The encoded XML for the specified string or JSON object

Example 1
This example creates the XML version for this string, which contains a JSON object:
xml(json('{ \"name\": \"Sophia Owen\" }'))

And returns this result XML:

<name>Sophia Owen</name>

Example 2
Suppose you have this JSON object:

{
"person": {
"name": "Sophia Owen",
"city": "Seattle"
}
}

This example creates XML for a string that contains this JSON object:
xml(json('{\"person\": {\"name\": \"Sophia Owen\", \"city\": \"Seattle\"}}'))

And returns this result XML:

<person>
<name>Sophia Owen</name>
<city>Seattle</city>
</person>

xpath
Check XML for nodes or values that match an XPath (XML Path Language) expression, and return the matching
nodes or values. An XPath expression, or just "XPath", helps you navigate an XML document structure so that
you can select nodes or compute values in the XML content.

xpath('<xml>', '<xpath>')

PARAMETER   REQUIRED   TYPE   DESCRIPTION

<xml>       Yes        Any    The XML string to search for nodes or values that match an XPath expression value

<xpath>     Yes        Any    The XPath expression used to find matching XML nodes or values

RETURN VALUE                                                     TYPE    DESCRIPTION

<xml-node>                                                       XML     An XML node when only a single node matches the specified XPath expression

<value>                                                          Any     The value from an XML node when only a single value matches the specified XPath expression

[<xml-node1>, <xml-node2>, ...] -or- [<value1>, <value2>, ...]   Array   An array with XML nodes or values that match the specified XPath expression

Example 1
This example finds nodes that match the <count></count> node in the XML passed in through the items parameter and adds those node values with the sum() function:
xpath(xml(parameters('items')), 'sum(/produce/item/count)')

And returns this result: 30

Example 2
For this example, both expressions find nodes that match the <location></location> node, in the specified
arguments, which include XML with a namespace. The expressions use the backslash character (\) as an escape
character for the double quotation mark (").
Expression 1
xpath(xml(body('Http')), '/*[name()=\"file\"]/*[name()=\"location\"]')

Expression 2
xpath(xml(body('Http')), '/*[local-name()=\"file\" and namespace-uri()=\"http://contoso.com\"]/*
[local-name()=\"location\"]')

Here are the arguments:


This XML, which includes the XML document namespace, xmlns="http://contoso.com" :
<?xml version="1.0"?> <file xmlns="http://contoso.com"> <location>Paris</location> </file>

Either XPath expression here:


/*[name()=\"file\"]/*[name()=\"location\"]

/*[local-name()=\"file\" and namespace-uri()=\"http://contoso.com\"]/*[local-


name()=\"location\"]

Here is the result node that matches the <location></location> node:

<location xmlns="http://contoso.com">Paris</location>

Example 3
Using the same arguments as Example 2, this example finds the value in the <location></location> node:
xpath(xml(body('Http')), 'string(/*[name()=\"file\"]/*[name()=\"location\"])')

And returns this result: "Paris"

Next steps
For a list of system variables you can use in expressions, see System variables.
System variables supported by Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes system variables supported by Azure Data Factory. You can use these variables in
expressions when defining Data Factory entities.

Pipeline scope
These system variables can be referenced anywhere in the pipeline JSON.

VARIABLE NAME                              DESCRIPTION

@pipeline().DataFactory Name of the data factory the pipeline run is running in

@pipeline().Pipeline Name of the pipeline

@pipeline().RunId ID of the specific pipeline run

@pipeline().TriggerType The type of trigger that invoked the pipeline (for example,
ScheduleTrigger , BlobEventsTrigger ). For a list of
supported trigger types, see Pipeline execution and triggers
in Azure Data Factory. A trigger type of Manual indicates
that the pipeline was triggered manually.

@pipeline().TriggerId ID of the trigger that invoked the pipeline

@pipeline().TriggerName Name of the trigger that invoked the pipeline

@pipeline().TriggerTime Time of the trigger run that invoked the pipeline. This is the
time at which the trigger actually fired to invoke the
pipeline run, and it may differ slightly from the trigger's
scheduled time.

@pipeline().GroupId ID of the group to which pipeline run belongs.

@pipeline()?.TriggeredByPipelineName Name of the pipeline that triggered the pipeline run. Applicable
when the pipeline run is triggered by an ExecutePipeline
activity. Evaluates to Null when used in other circumstances.
Note the question mark after @pipeline().

@pipeline()?.TriggeredByPipelineRunId Run ID of the pipeline that triggered the pipeline run.
Applicable when the pipeline run is triggered by an
ExecutePipeline activity. Evaluates to Null when used in other
circumstances. Note the question mark after @pipeline().

NOTE
Trigger-related date/time system variables (in both pipeline and trigger scopes) return UTC dates in ISO 8601 format, for
example, 2017-06-01T22:20:00.4061448Z .
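
For illustration only (not from the original article), these variables can be combined with string interpolation, for example to build a run-specific output path; the folder layout here is hypothetical:

"folderPath": "output/@{pipeline().Pipeline}/@{pipeline().RunId}"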
Schedule trigger scope
These system variables can be referenced anywhere in the trigger JSON for triggers of type ScheduleTrigger.

VARIABLE NAME              DESCRIPTION

@trigger().scheduledTime   Time at which the trigger was scheduled to invoke the pipeline run.

@trigger().startTime       Time at which the trigger actually fired to invoke the pipeline run. This may differ slightly from the trigger's scheduled time.
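
As an illustrative sketch (the pipeline parameter name scheduledRunTime is hypothetical), the scheduled time is typically passed to a pipeline parameter in the trigger definition:

"parameters": {
    "scheduledRunTime": "@trigger().scheduledTime"
}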

Tumbling window trigger scope


These system variables can be referenced anywhere in the trigger JSON for triggers of type
TumblingWindowTrigger.

VARIABLE NAME                        DESCRIPTION

@trigger().outputs.windowStartTime   Start of the window associated with the trigger run.

@trigger().outputs.windowEndTime     End of the window associated with the trigger run.

@trigger().scheduledTime             Time at which the trigger was scheduled to invoke the pipeline run.

@trigger().startTime                 Time at which the trigger actually fired to invoke the pipeline run. This may differ slightly from the trigger's scheduled time.
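
As an illustrative sketch (the pipeline parameter names windowStart and windowEnd are hypothetical), the window boundaries are commonly mapped to pipeline parameters in the trigger definition:

"parameters": {
    "windowStart": "@trigger().outputs.windowStartTime",
    "windowEnd": "@trigger().outputs.windowEndTime"
}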

Storage event trigger scope


These system variables can be referenced anywhere in the trigger JSON for triggers of type BlobEventsTrigger.

VARIABLE NAME               DESCRIPTION

@triggerBody().fileName     Name of the file whose creation or deletion caused the trigger to fire.

@triggerBody().folderPath   Path to the folder that contains the file specified by @triggerBody().fileName. The first segment of the folder path is the name of the Azure Blob Storage container.

@trigger().startTime        Time at which the trigger fired to invoke the pipeline run.
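
As an illustrative sketch (the pipeline parameter names sourceFolder and sourceFile are hypothetical), these variables are typically mapped to pipeline parameters in the trigger definition:

"parameters": {
    "sourceFolder": "@triggerBody().folderPath",
    "sourceFile": "@triggerBody().fileName"
}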

Custom event trigger scope


These system variables can be referenced anywhere in the trigger JSON for triggers of type
CustomEventsTrigger.
NOTE
Azure Data Factory expects custom events to be formatted with the Azure Event Grid event schema.

VARIABLE NAME                         DESCRIPTION

@triggerBody().event.eventType        Type of event that triggered the Custom Event Trigger run. The event type is a customer-defined field and can take any string value.

@triggerBody().event.subject          Subject of the custom event that caused the trigger to fire.

@triggerBody().event.data._keyName_   The data field in a custom event is a free-form JSON blob, which customers can use to send messages and data. Use data.keyName to reference each field. For example, @triggerBody().event.data.callback returns the value for the callback field stored under data.

@trigger().startTime                  Time at which the trigger fired to invoke the pipeline run.
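
As an illustrative sketch (the pipeline parameter names are hypothetical, and data.callback is taken from the example above), event metadata can be passed to pipeline parameters in the trigger definition:

"parameters": {
    "eventSubject": "@triggerBody().event.subject",
    "callbackUrl": "@triggerBody().event.data.callback"
}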

Next steps
For information about how these variables are used in expressions, see Expression language & functions.
To use trigger scope system variables in pipeline, see Reference trigger metadata in pipeline
Parameterizing mapping data flows

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Mapping data flows in Azure Data Factory and Azure Synapse Analytics support the use of parameters. Define
parameters inside of your data flow definition and use them throughout your expressions. The parameter values
are set by the calling pipeline via the Execute Data Flow activity. You have three options for setting the values in
the data flow activity expressions:
Use the pipeline control flow expression language to set a dynamic value
Use the data flow expression language to set a dynamic value
Use either expression language to set a static literal value
Use this capability to make your data flows general-purpose, flexible, and reusable. You can parameterize data
flow settings and expressions with these parameters.

Create parameters in a mapping data flow


To add parameters to your data flow, click on the blank portion of the data flow canvas to see the general
properties. In the settings pane, you will see a tab called Parameter . Select New to generate a new parameter.
For each parameter, you must assign a name, select a type, and optionally set a default value.

Use parameters in a mapping data flow


Parameters can be referenced in any data flow expression. Parameters begin with $ and are immutable. You will
find the list of available parameters inside of the Expression Builder under the Parameters tab.

You can quickly add additional parameters by selecting New parameter and specifying the name and type.
Assign parameter values from a pipeline
Once you've created a data flow with parameters, you can execute it from a pipeline with the Execute Data Flow
Activity. After you add the activity to your pipeline canvas, you will be presented with the available data flow
parameters in the activity's Parameters tab.
When assigning parameter values, you can use either the pipeline expression language or the data flow
expression language based on spark types. Each mapping data flow can have any combination of pipeline and
data flow expression parameters.

Pipeline expression parameters


Pipeline expression parameters allow you to reference system variables, functions, pipeline parameters, and
variables similar to other pipeline activities. When you click Pipeline expression , a side-nav will open allowing
you to enter an expression using the expression builder.
When referenced, pipeline parameters are evaluated and then their value is used in the data flow expression
language. The pipeline expression type doesn't need to match the data flow parameter type.
String literals vs expressions
When assigning a pipeline expression parameter of type string, by default quotes will be added and the value
will be evaluated as a literal. To read the parameter value as a data flow expression, check the expression box
next to the parameter.

Say the data flow parameter stringParam references a pipeline parameter with the value upper(column1) :
If expression is checked, $stringParam evaluates to the value of column1 in all uppercase.
If expression is not checked (default behavior), $stringParam evaluates to the literal string 'upper(column1)'.
Passing in timestamps
In the pipeline expression language, System variables such as pipeline().TriggerTime and functions like
utcNow() return timestamps as strings in format 'yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ'. To convert these into
data flow parameters of type timestamp, use string interpolation to include the desired timestamp in a
toTimestamp() function. For example, to convert the pipeline trigger time into a data flow parameter, you can
use toTimestamp(left('@{pipeline().TriggerTime}', 23), 'yyyy-MM-dd\'T\'HH:mm:ss.SSS') .
NOTE
Data Flows can only support up to 3 millisecond digits. The left() function is used to trim off the additional digits.

Pipeline parameter example


Say you have an integer parameter intParam that is referencing a pipeline parameter of type String,
@pipeline.parameters.pipelineParam .

@pipeline.parameters.pipelineParam is assigned a value of abs(1) at runtime.

When $intParam is referenced in an expression such as a derived column, it will evaluate abs(1) and return 1 .

Data flow expression parameters


Selecting Data flow expression opens the data flow expression builder. You will be able to reference
functions, other parameters, and any defined schema column throughout your data flow. This expression will be
evaluated as is when referenced.

NOTE
If you pass in an invalid expression or reference a schema column that doesn't exist in that transformation, the parameter
will evaluate to null.
Passing in a column name as a parameter
A common pattern is to pass in a column name as a parameter value. If the column is defined in the data flow
schema, you can reference it directly as a string expression. If the column isn't defined in the schema, use the
byName() function. Remember to cast the column to its appropriate type with a casting function such as
toString() .

For example, if you wanted to map a string column based upon a parameter columnName , you can add a derived
column transformation equal to toString(byName($columnName)) .

Next steps
Execute data flow activity
Control flow expressions
How to use parameters, expressions and functions
in Azure Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this document, we will primarily focus on learning fundamental concepts with various examples to explore
the ability to create parameterized data pipelines within Azure Data Factory. Parameterization and dynamic
expressions are such notable additions to ADF because they can save a tremendous amount of time and allow
for a much more flexible Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) solution, which will
dramatically reduce the cost of solution maintenance and speed up the implementation of new features into
existing pipelines. These gains are because parameterization minimizes the amount of hard coding and
increases the number of reusable objects and processes in a solution.

Azure data factory UI and parameters


If you are new to Azure data factory parameter usage in ADF user interface, please review Data factory UI for
linked services with parameters and Data factory UI for metadata driven pipeline with parameters for visual
explanation.

Parameter and expression concepts


You can use parameters to pass external values into pipelines, datasets, linked services, and data flows. Once the
parameter has been passed into the resource, it cannot be changed. By parameterizing resources, you can reuse
them with different values each time. Parameters can be used individually or as a part of expressions. JSON
values in the definition can be literal or expressions that are evaluated at runtime.
For example:

"name": "value"

or

"name": "@pipeline().parameters.password"

Expressions can appear anywhere in a JSON string value and always result in another JSON value. Here,
password is a pipeline parameter in the expression. If a JSON value is an expression, the body of the expression
is extracted by removing the at-sign (@). If a literal string is needed that starts with @, it must be escaped by
using @@. The following examples show how expressions are evaluated.

JSO N VA L UE RESULT

"parameters" The characters 'parameters' are returned.

"parameters[1]" The characters 'parameters[1]' are returned.

"@@" A 1 character string that contains '@' is returned.


JSO N VA L UE RESULT

" @" A 2 character string that contains ' @' is returned.

Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"

Using string interpolation, the result is always a string. Say I have defined myNumber as 42 and myString as
foo :

JSO N VA L UE RESULT

"@pipeline().parameters.myString" Returns foo as a string.

"@{pipeline().parameters.myString}" Returns foo as a string.

"@pipeline().parameters.myNumber" Returns 42 as a number.

"@{pipeline().parameters.myNumber}" Returns 42 as a string.

"Answer is: @{pipeline().parameters.myNumber}" Returns the string Answer is: 42 .

"@concat('Answer is: ', Returns the string Answer is: 42


string(pipeline().parameters.myNumber))"

"Answer is: @@{pipeline().parameters.myNumber}" Returns the string


Answer is: @{pipeline().parameters.myNumber} .

Examples of using parameters in expressions


Complex expression example
The example below references a deep sub-field of activity output. To reference a pipeline parameter that
evaluates to a sub-field, use [] syntax instead of the dot (.) operator (as in the case of subfield1
and subfield2):
@activity('*activityName*').output.*subfield1*.*subfield2*[pipeline().parameters.*subfield3*].*subfield4*

Dynamic content editor


Dynamic content editor automatically escapes characters in your content when you finish editing. For example,
the following content in content editor is a string interpolation with two expression functions.

{
"type": "@{if(equals(1, 2), 'Blob', 'Table' )}",
"name": "@{toUpper('myData')}"
}

The dynamic content editor converts the above content to the expression
"{ \n \"type\": \"@{if(equals(1, 2), 'Blob', 'Table' )}\",\n \"name\": \"@{toUpper('myData')}\"\n}" . The
result of this expression is the JSON format string shown below.
{
"type": "Table",
"name": "MYDATA"
}

A dataset with parameters


In the following example, the BlobDataset takes a parameter named path . Its value is used to set a value for the
folderPath property by using the expression: dataset().path .

{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": "@dataset().path"
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}

A pipeline with parameters


In the following example, the pipeline takes inputPath and outputPath parameters. The path for the
parameterized blob dataset is set by using values of these parameters. The syntax used here is:
pipeline().parameters.parametername .
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
}
}
}
}

Calling functions within expressions


You can call functions within expressions. The following sections provide information about the functions that
can be used in an expression.
String functions
To work with strings, you can use these string functions and also some collection functions. String functions
work only on strings.

STRING FUNCTION   TASK

concat            Combine two or more strings, and return the combined string.

endsWith          Check whether a string ends with the specified substring.

guid              Generate a globally unique identifier (GUID) as a string.

indexOf           Return the starting position for a substring.

lastIndexOf       Return the starting position for the last occurrence of a substring.

replace           Replace a substring with the specified string, and return the updated string.

split             Return an array that contains substrings, separated by commas, from a larger string based on a specified delimiter character in the original string.

startsWith        Check whether a string starts with a specific substring.

substring         Return characters from a string, starting from the specified position.

toLower           Return a string in lowercase format.

toUpper           Return a string in uppercase format.

trim              Remove leading and trailing whitespace from a string, and return the updated string.

Collection functions
To work with collections, generally arrays, strings, and sometimes, dictionaries, you can use these collection
functions.

COLLECTION FUNCTION   TASK

contains              Check whether a collection has a specific item.

empty                 Check whether a collection is empty.

first                 Return the first item from a collection.

intersection          Return a collection that has only the common items across the specified collections.

join                  Return a string that has all the items from an array, separated by the specified character.

last                  Return the last item from a collection.

length                Return the number of items in a string or array.

skip                  Remove items from the front of a collection, and return all the other items.

take                  Return items from the front of a collection.

union                 Return a collection that has all the items from the specified collections.

Logical functions
These functions are useful inside conditions; they can be used to evaluate any type of logic.

LOGICAL COMPARISON FUNCTION   TASK

and               Check whether all expressions are true.

equals            Check whether both values are equivalent.

greater           Check whether the first value is greater than the second value.

greaterOrEquals   Check whether the first value is greater than or equal to the second value.

if                Check whether an expression is true or false. Based on the result, return a specified value.

less              Check whether the first value is less than the second value.

lessOrEquals      Check whether the first value is less than or equal to the second value.

not               Check whether an expression is false.

or                Check whether at least one expression is true.

Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries

CONVERSION FUNCTION    TASK

array                  Return an array from a single specified input. For multiple inputs, see createArray.

base64                 Return the base64-encoded version for a string.

base64ToBinary         Return the binary version for a base64-encoded string.

base64ToString         Return the string version for a base64-encoded string.

binary                 Return the binary version for an input value.

bool                   Return the Boolean version for an input value.

coalesce               Return the first non-null value from one or more parameters.

createArray            Return an array from multiple inputs.

dataUri                Return the data URI for an input value.

dataUriToBinary        Return the binary version for a data URI.

dataUriToString        Return the string version for a data URI.

decodeBase64           Return the string version for a base64-encoded string.

decodeDataUri          Return the binary version for a data URI.

decodeUriComponent     Return a string that replaces escape characters with decoded versions.

encodeUriComponent     Return a string that replaces URL-unsafe characters with escape characters.

float                  Return a floating point number for an input value.

int                    Return the integer version for a string.

json                   Return the JavaScript Object Notation (JSON) type value or object for a string or XML.

string                 Return the string version for an input value.

uriComponent           Return the URI-encoded version for an input value by replacing URL-unsafe characters with escape characters.

uriComponentToBinary   Return the binary version for a URI-encoded string.

uriComponentToString   Return the string version for a URI-encoded string.

xml                    Return the XML version for a string.

xpath                  Check XML for nodes or values that match an XPath (XML Path Language) expression, and return the matching nodes or values.

Math functions
These functions can be used for either type of number: integers and floats.

MATH FUNCTION   TASK

add             Return the result from adding two numbers.

div             Return the result from dividing two numbers.

max             Return the highest value from a set of numbers or an array.

min             Return the lowest value from a set of numbers or an array.

mod             Return the remainder from dividing two numbers.

mul             Return the product from multiplying two numbers.

rand            Return a random integer from a specified range.

range           Return an integer array that starts from a specified integer.

sub             Return the result from subtracting the second number from the first number.

Date functions
DATE OR TIME FUNCTION   TASK

addDays                 Add a number of days to a timestamp.

addHours                Add a number of hours to a timestamp.

addMinutes              Add a number of minutes to a timestamp.

addSeconds              Add a number of seconds to a timestamp.

addToTime               Add a number of time units to a timestamp. See also getFutureTime.

convertFromUtc          Convert a timestamp from Universal Time Coordinated (UTC) to the target time zone.

convertTimeZone         Convert a timestamp from the source time zone to the target time zone.

convertToUtc            Convert a timestamp from the source time zone to Universal Time Coordinated (UTC).

dayOfMonth              Return the day of the month component from a timestamp.

dayOfWeek               Return the day of the week component from a timestamp.

dayOfYear               Return the day of the year component from a timestamp.

formatDateTime          Return the timestamp as a string in optional format.

getFutureTime           Return the current timestamp plus the specified time units. See also addToTime.

getPastTime             Return the current timestamp minus the specified time units. See also subtractFromTime.

startOfDay              Return the start of the day for a timestamp.

startOfHour             Return the start of the hour for a timestamp.

startOfMonth            Return the start of the month for a timestamp.

subtractFromTime        Subtract a number of time units from a timestamp. See also getPastTime.

ticks                   Return the ticks property value for a specified timestamp.

utcNow                  Return the current timestamp as a string.

Detailed examples for practice


Detailed Azure data factory copy pipeline with parameters
This Azure Data factory copy pipeline parameter passing tutorial walks you through how to pass parameters
between a pipeline and activity as well as between the activities.
Detailed Mapping data flow pipeline with parameters
Please follow Mapping data flow with parameters for comprehensive example on how to use parameters in data
flow.
Detailed Metadata driven pipeline with parameters
Please follow Metadata driven pipeline with parameters to learn more about how to use parameters to design
metadata driven pipelines. This is a popular use case for parameters.

Next steps
For a list of system variables you can use in expressions, see System variables.
Security considerations for data movement in Azure
Data Factory

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes basic security infrastructure that data movement services in Azure Data Factory use to help
secure your data. Data Factory management resources are built on Azure security infrastructure and use all
possible security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities
that together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is available in only a few regions, the data movement service is available globally to
ensure data compliance, efficiency, and reduced network egress costs.
Azure Data Factory, including Azure Integration Runtime and Self-hosted Integration Runtime, does not store any
temporary data, cache data, or logs except for linked service credentials for cloud data stores, which are
encrypted by using certificates. With Data Factory, you create data-driven workflows to orchestrate movement
of data between supported data stores, and processing of data by using compute services in other regions or in
an on-premises environment. You can also monitor and manage workflows by using SDKs and Azure Monitor.
Data Factory has been certified for:

CSA STAR CERTIFICATION

ISO 20000-1:2011

ISO 22301:2012

ISO 27001:2013

ISO 27017:2015

ISO 27018:2014

ISO 9001:2015

SOC 1, 2, 3

HIPAA BAA

HITRUST

If you're interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center. For the latest list of all Azure Compliance offerings check - https://aka.ms/AzureCompliance.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario : In this scenario, both your source and your destination are publicly accessible through the
internet. These include managed cloud storage services such as Azure Storage, Azure Synapse Analytics,
Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce,
and web protocols such as FTP and OData. Find a complete list of supported data sources in Supported data
stores and formats.
Hybrid scenario : In this scenario, either your source or your destination is behind a firewall or inside an on-
premises corporate network. Or, the data store is in a private network or virtual network (most often the
source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this
scenario.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Cloud scenarios
Securing data store credentials
Store encr ypted credentials in an Azure Data Factor y managed store . Data Factory helps protect
your data store credentials by encrypting them with certificates managed by Microsoft. These certificates are
rotated every two years (which includes certificate renewal and the migration of credentials). For more
information about Azure Storage security, see Azure Storage security overview.
Store credentials in Azure Key Vault . You can also store the data store's credential in Azure Key Vault.
Data Factory retrieves the credential during the execution of an activity. For more information, see Store
credential in Azure Key Vault.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data
Factory and a cloud data store are via secure channel HTTPS or TLS.

NOTE
All connections to Azure SQL Database and Azure Synapse Analytics require encryption (SSL/TLS) while data is in transit
to and from the database. When you're authoring a pipeline by using JSON, add the encryption property and set it to
true in the connection string. For Azure Storage, you can use HTTPS in the connection string.
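
As a minimal sketch (the server, database, and credential placeholders below are illustrative, not from the original article), an Azure SQL Database linked service connection string with encryption enabled might look like this:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>;Password=<password>;Encrypt=True;Connection Timeout=30"
        }
    }
}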

NOTE
To enable encryption in transit while moving data from Oracle follow one of the below options:
1. In Oracle server, go to Oracle Advanced Security (OAS) and configure the encryption settings, which supports Triple-
DES Encryption (3DES) and Advanced Encryption Standard (AES), refer here for details. ADF automatically negotiates
the encryption method to use the one you configure in OAS when establishing connection to Oracle.
2. In ADF, you can add EncryptionMethod=1 in the connection string (in the Linked Service). This will use SSL/TLS as the
encryption method. To use this, you need to disable non-SSL encryption settings in OAS on the Oracle server side to
avoid encryption conflict.

NOTE
TLS version used is 1.2.

Data encryption at rest


Some data stores support encryption of data at rest. We recommend that you enable the data encryption
mechanism for those data stores.
Azure Synapse Analytics
Transparent Data Encryption (TDE) in Azure Synapse Analytics helps protect against the threat of malicious
activity by performing real-time encryption and decryption of your data at rest. This behavior is transparent to
the client. For more information, see Secure a database in Azure Synapse Analytics.
Azure SQL Database
Azure SQL Database also supports transparent data encryption (TDE), which helps protect against the threat of
malicious activity by performing real-time encryption and decryption of the data, without requiring changes to
the application. This behavior is transparent to the client. For more information, see Transparent data encryption
for SQL Database and Data Warehouse.
Azure Data Lake Store
Azure Data Lake Store also provides encryption for data stored in the account. When enabled, Data Lake Store
automatically encrypts data before persisting and decrypts before retrieval, making it transparent to the client
that accesses the data. For more information, see Security in Azure Data Lake Store.
Azure Blob storage and Azure Table storage
Azure Blob storage and Azure Table storage support Storage Service Encryption (SSE), which automatically
encrypts your data before persisting to storage and decrypts before retrieval. For more information, see Azure
Storage Service Encryption for Data at Rest.
Amazon S3
Amazon S3 supports both client and server encryption of data at rest. For more information, see Protecting
Data Using Encryption.
Amazon Redshift
Amazon Redshift supports cluster encryption for data at rest. For more information, see Amazon Redshift
Database Encryption.
Salesforce
Salesforce supports Shield Platform Encryption that allows encryption of all files, attachments, and custom
fields. For more information, see Understanding the Web Server OAuth Authentication Flow.

Hybrid scenarios
Hybrid scenarios require self-hosted integration runtime to be installed in an on-premises network, inside a
virtual network (Azure), or inside a virtual private cloud (Amazon). The self-hosted integration runtime must be
able to access the local data stores. For more information about self-hosted integration runtime, see How to
create and configure self-hosted integration runtime.
The command channel allows communication between data movement services in Data Factory and self-hosted
integration runtime. The communication contains information related to the activity. The data channel is used for
transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials can be stored within data factory or be referenced by data factory during the runtime from
Azure Key Vault. If storing credentials within data factory, it is always stored encrypted on the self-hosted
integration runtime.
Store credentials locally . If you directly use the Set-AzDataFactor yV2LinkedSer vice cmdlet with
the connection strings and credentials inline in the JSON, the linked service is encrypted and stored on
self-hosted integration runtime. In this case the credentials flow through the Azure backend service, which is
extremely secure, to the self-hosted integration machine, where they are finally encrypted and stored. The self-
hosted integration runtime uses Windows DPAPI to encrypt the sensitive data and credential information.
Store credentials in Azure Key Vault . You can also store the data store's credential in Azure Key Vault.
Data Factory retrieves the credential during the execution of an activity. For more information, see Store
credential in Azure Key Vault.
Store credentials locally without flowing the credentials through Azure backend to the self-
hosted integration runtime . If you want to encrypt and store credentials locally on the self-hosted
integration runtime without having to flow the credentials through data factory backend, follow the steps
in Encrypt credentials for on-premises data stores in Azure Data Factory. All connectors support this
option. The self-hosted integration runtime uses Windows DPAPI to encrypt the sensitive data and
credential information.
Use the New-AzDataFactor yV2LinkedSer viceEncr yptedCredential cmdlet to encrypt linked service
credentials and sensitive details in the linked service. You can then use the JSON returned (with the
Encr yptedCredential element in the connection string) to create a linked service by using the Set-
AzDataFactor yV2LinkedSer vice cmdlet.
Ports used when encrypting linked service on self-hosted integration runtime
By default, when remote access from the intranet is enabled, PowerShell uses port 8060 on the machine with the
self-hosted integration runtime for secure communication. If necessary, this port can be changed on the
Settings tab of Integration Runtime Configuration Manager.
Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during
communication with Azure services.
You can also use IPSec VPN or Azure ExpressRoute to further secure the communication channel between your
on-premises network and Azure.
Azure Virtual Network is a logical representation of your network in the cloud. You can connect an on-premises
network to your virtual network by setting up IPSec VPN (site-to-site) or ExpressRoute (private peering).
The following table summarizes the network and self-hosted integration runtime configuration
recommendations based on different combinations of source and destination locations for hybrid data
movement.
SOURCE | DESTINATION | NETWORK CONFIGURATION | INTEGRATION RUNTIME SETUP
On-premises | Virtual machines and cloud services deployed in virtual networks | IPSec VPN (point-to-site or site-to-site) | The self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
On-premises | Virtual machines and cloud services deployed in virtual networks | ExpressRoute (private peering) | The self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
On-premises | Azure-based services that have a public endpoint | ExpressRoute (Microsoft peering) | The self-hosted integration runtime can be installed on-premises or on an Azure virtual machine.
The following diagrams show the use of self-hosted integration runtime for moving data between an on-premises
database and Azure services by using ExpressRoute and IPsec VPN (with Azure Virtual Network):
(Diagram: ExpressRoute)
(Diagram: IPsec VPN)
Firewall configurations and allow list setup for IP addresses

NOTE
You might have to manage ports or set up allow list for domains at the corporate firewall level as required by the
respective data sources. This table only uses Azure SQL Database, Azure Synapse Analytics, and Azure Data Lake Store as
examples.

NOTE
For details about data access strategies through Azure Data Factory, see this article.

Firewall requirements for on-premises/private network


In an enterprise, a corporate firewall runs on the central router of the organization. Windows Firewall runs as a
daemon on the local machine in which the self-hosted integration runtime is installed.
The following table provides outbound port and domain requirements for corporate firewalls:

DOMAIN NAMES | OUTBOUND PORTS | DESCRIPTION
*.servicebus.windows.net | 443 | Required by the self-hosted integration runtime for interactive authoring.
{datafactory}.{region}.datafactory.azure.net or *.frontend.clouddatahub.net | 443 | Required by the self-hosted integration runtime to connect to the Data Factory service. For newly created data factories, find the FQDN from your self-hosted integration runtime key, which is in the format {datafactory}.{region}.datafactory.azure.net. For older data factories, if you don't see the FQDN in your self-hosted integration runtime key, use *.frontend.clouddatahub.net instead.
download.microsoft.com | 443 | Required by the self-hosted integration runtime for downloading the updates. If you have disabled auto-update, you can skip configuring this domain.
*.core.windows.net | 443 | Used by the self-hosted integration runtime to connect to the Azure storage account when you use the staged copy feature.
*.database.windows.net | 1433 | Required only when you copy from or to Azure SQL Database or Azure Synapse Analytics, and optional otherwise. Use the staged-copy feature to copy data to SQL Database or Azure Synapse Analytics without opening port 1433.
*.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token | 443 | Required only when you copy from or to Azure Data Lake Store, and optional otherwise.

The following table provides inbound port requirements for Windows Firewall:

INBOUND PORTS | DESCRIPTION
8060 (TCP) | Required by the PowerShell encryption cmdlet as described in Encrypt credentials for on-premises data stores in Azure Data Factory, and by the credential manager application to securely set credentials for on-premises data stores on the self-hosted integration runtime.

IP configurations and allow list setup in data stores


Some data stores in the cloud also require that you allow the IP address of the machine accessing the store.
Ensure that the IP address of the self-hosted integration runtime machine is allowed or configured in the firewall
appropriately.
The following cloud data stores require that you allow the IP address of the self-hosted integration runtime
machine. Some of these data stores, by default, might not require an allow list.
Azure SQL Database
Azure Synapse Analytics
Azure Data Lake Store
Azure Cosmos DB
Amazon Redshift
Frequently asked questions
Can the self-hosted integration runtime be shared across different data factories?
Yes. More details here.
What are the port requirements for the self-hosted integration runtime to work?
The self-hosted integration runtime makes HTTP-based connections to access the internet. Outbound port
443 must be opened for the self-hosted integration runtime to make this connection. Open inbound port 8060
only at the machine level (not the corporate firewall level) for the credential manager application. If Azure SQL
Database or Azure Synapse Analytics is used as the source or the destination, you need to open port 1433 as
well. For more information, see the Firewall configurations and allow list setup for IP addresses section.

Next steps
For information about Azure Data Factory Copy Activity performance, see Copy Activity performance and tuning
guide.
Data access strategies
5/6/2021 • 4 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


A vital security goal of an organization is to protect its data stores from unauthorized access over the internet,
whether they are on-premises or cloud/SaaS data stores.
Typically, a cloud data store controls access by using the following mechanisms:
Private Link from a virtual network to private-endpoint-enabled data sources
Firewall rules that limit connectivity by IP address
Authentication mechanisms that require users to prove their identity
Authorization mechanisms that restrict users to specific actions and data

TIP
With the introduction of Static IP address range, you can now allow list IP ranges for the particular Azure integration
runtime region to ensure you don’t have to allow all Azure IP addresses in your cloud data stores. This way, you can
restrict the IP addresses that are permitted to access the data stores.

NOTE
These IP address ranges are published for the Azure Integration Runtime and are currently used only for Data Movement,
pipeline, and external activities. Data flows and Azure Integration Runtime configurations that enable Managed Virtual
Network do not use these IP ranges.

This should work in many scenarios, and we do understand that a unique Static IP address per integration
runtime would be desirable, but this wouldn't be possible using Azure Integration Runtime currently, which is
serverless. If necessary, you can always set up a Self-hosted Integration Runtime and use your Static IP with it.

Data access strategies through Azure Data Factory


Private Link - You can create an Azure Integration Runtime within an Azure Data Factory Managed Virtual
Network, and it will leverage private endpoints to securely connect to supported data stores. Traffic between
the Managed Virtual Network and data sources travels the Microsoft backbone network and is not exposed to
the public network.
Trusted Service - Azure Storage (Blob, ADLS Gen2) supports firewall configuration that enables select
trusted Azure platform services to access the storage account securely. Trusted Services enforces Managed
Identity authentication, which ensures no other data factory can connect to this storage unless approved to
do so using its managed identity. You can find more details in this blog. Hence, this option is secure and
recommended.
Unique Static IP - You will need to set up a self-hosted integration runtime to get a static IP for Data Factory
connectors. This mechanism ensures you can block access from all other IP addresses.
Static IP range - You can use the Azure Integration Runtime's IP addresses to allow list them in your storage
(say S3, Salesforce, etc.). It restricts the IP addresses that can connect to the data stores, but it also relies on
authentication and authorization rules.
Service Tag - A service tag represents a group of IP address prefixes from a given Azure service (like Azure
Data Factory). Microsoft manages the address prefixes encompassed by the service tag and automatically
updates the service tag as addresses change, minimizing the complexity of frequent updates to network
security rules. It is useful when filtering data access on IaaS-hosted data stores in a virtual network (a sketch
follows this list).
Allow Azure Services - Some services let you allow all Azure services to connect to them if you choose
this option.
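As an illustration of the Service Tag option (a sketch only, not the only valid configuration), the following Azure PowerShell commands add an NSG rule that allows inbound traffic from the Data Factory service tag to an IaaS-hosted SQL Server on port 1433. The rule name, priority, NSG name, and the regional tag DataFactory.EastUS are assumptions for the example.

# Sketch: allow inbound traffic from the Azure Data Factory service tag
# to port 1433 on an IaaS-hosted SQL Server behind an existing NSG.
$nsg = Get-AzNetworkSecurityGroup -Name "<nsg name>" -ResourceGroupName "<resource group>"
$nsg | Add-AzNetworkSecurityRuleConfig -Name "Allow-ADF-ServiceTag" `
    -Access Allow -Protocol Tcp -Direction Inbound -Priority 300 `
    -SourceAddressPrefix "DataFactory.EastUS" -SourcePortRange "*" `
    -DestinationAddressPrefix "VirtualNetwork" -DestinationPortRange 1433 |
  Set-AzNetworkSecurityGroup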
For more information about supported network security mechanisms on data stores in Azure Integration
Runtime and Self-hosted Integration Runtime, see the following two tables.
Azure Integration Runtime

(The Private Link, Trusted Service, Static IP Range, Service Tags, and Allow Azure Services columns indicate the supported network security mechanisms on each data store.)

DATA STORE TYPE | DATA STORES | PRIVATE LINK | TRUSTED SERVICE | STATIC IP RANGE | SERVICE TAGS | ALLOW AZURE SERVICES
Azure PaaS data stores | Azure Cosmos DB | Yes | - | Yes | - | Yes
Azure PaaS data stores | Azure Data Explorer | - | - | Yes* | Yes* | -
Azure PaaS data stores | Azure Data Lake Gen1 | - | - | Yes | - | Yes
Azure PaaS data stores | Azure Database for MariaDB, MySQL, PostgreSQL | - | - | Yes | - | Yes
Azure PaaS data stores | Azure File Storage | Yes | - | Yes | - | -
Azure PaaS data stores | Azure Storage (Blob, ADLS Gen2) | Yes | Yes (MSI auth only) | Yes | - | -
Azure PaaS data stores | Azure SQL DB, Azure Synapse Analytics, SQL MI | Yes (Azure SQL DB/DW only) | - | Yes | - | Yes
Azure PaaS data stores | Azure Key Vault (for fetching secrets/connection strings) | Yes | Yes | Yes | - | -
Other PaaS/SaaS data stores | AWS S3, Salesforce, Google Cloud Storage, etc. | - | - | Yes | - | -
Azure IaaS | SQL Server, Oracle, etc. | - | - | Yes | Yes | -
On-premises IaaS | SQL Server, Oracle, etc. | - | - | Yes | - | -

*Applicable only when Azure Data Explorer is virtual network injected, and the IP range can be applied on an NSG/firewall.
Self-hosted Integration Runtime (in VNet/on-premises)

DATA STORE TYPE | DATA STORES | STATIC IP | TRUSTED SERVICES
Azure PaaS data stores | Azure Cosmos DB | Yes | -
Azure PaaS data stores | Azure Data Explorer | - | -
Azure PaaS data stores | Azure Data Lake Gen1 | Yes | -
Azure PaaS data stores | Azure Database for MariaDB, MySQL, PostgreSQL | Yes | -
Azure PaaS data stores | Azure File Storage | Yes | -
Azure PaaS data stores | Azure Storage (Blob, ADLS Gen2) | Yes | Yes (MSI auth only)
Azure PaaS data stores | Azure SQL DB, Azure Synapse Analytics, SQL MI | Yes | -
Azure PaaS data stores | Azure Key Vault (for fetching secrets/connection strings) | Yes | Yes
Other PaaS/SaaS data stores | AWS S3, Salesforce, Google Cloud Storage, etc. | Yes | -
Azure IaaS | SQL Server, Oracle, etc. | Yes | -
On-premises IaaS | SQL Server, Oracle, etc. | Yes | -

Next steps
For more information, see the following related articles:
Supported data stores
Azure Key Vault ‘Trusted Services’
Azure Storage ‘Trusted Microsoft Services’
Managed identity for Data Factory
Azure Integration Runtime IP addresses
5/6/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The IP addresses that Azure Integration Runtime uses depend on the region where your Azure integration
runtime is located. All Azure integration runtimes that are in the same region use the same IP address ranges.

IMPORTANT
Data flows and Azure Integration Runtime which enable Managed Virtual Network don't support the use of fixed IP
ranges.
You can use these IP ranges for Data Movement, Pipeline and External activities executions. These IP ranges can be used
for filtering in data stores/ Network Security Group (NSG)/ Firewalls for inbound access from Azure Integration runtime.

Azure Integration Runtime IP addresses: Specific regions


Allow traffic from the IP addresses listed for the Azure Integration Runtime in the specific Azure region where
your resources are located. You can get an IP range list of service tags from the service tags IP range download
link. For example, if the Azure region is AustraliaEast, you can get an IP range list from
DataFactory.AustraliaEast.
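If you prefer to retrieve these ranges programmatically instead of downloading the file, a minimal Azure PowerShell sketch using the Service Tag Discovery API (Az.Network module) follows; the region value is only an example.

# Sketch: list the Data Factory IP ranges for a given region via service tags.
$serviceTags = Get-AzNetworkServiceTag -Location "australiaeast"
$adfTag = $serviceTags.Values | Where-Object { $_.Name -eq "DataFactory.AustraliaEast" }
$adfTag.Properties.AddressPrefixes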

Known issue with Azure Storage


When connecting to an Azure Storage account, IP network rules have no effect on requests originating from
the Azure integration runtime in the same region as the storage account. For more details, refer to
this article.
Instead, we suggest using trusted services while connecting to Azure Storage.

Next steps
Security considerations for data movement in Azure Data Factory
Store credential in Azure Key Vault
5/6/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can store credentials for data stores and computes in an Azure Key Vault. Azure Data Factory retrieves the
credentials when executing an activity that uses the data store/compute.
Currently, all activity types except custom activity support this feature. For connector configuration specifically,
check the "linked service properties" section in each connector topic for details.

Prerequisites
This feature relies on the data factory managed identity. Learn how it works from Managed identity for Data
Factory and make sure your data factory has an associated one.

Steps
To reference a credential stored in Azure Key Vault, you need to:
1. Retrieve the data factory managed identity by copying the value of "Managed Identity Object ID" generated
along with your factory. If you use the ADF authoring UI, the managed identity object ID is shown on the
Azure Key Vault linked service creation window; you can also retrieve it from the Azure portal. See Retrieve
data factory managed identity.
2. Grant the managed identity access to your Azure Key Vault. In your key vault -> Access policies ->
Add Access Policy, search for this managed identity and grant it Get permission in the Secret permissions
dropdown. This allows the designated factory to access secrets in the key vault (see the PowerShell sketch
after these steps).
3. Create a linked service pointing to your Azure Key Vault. Refer to Azure Key Vault linked service.
4. Create the data store linked service, and inside it reference the corresponding secret stored in the key
vault. Refer to Reference secret stored in key vault.
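As a minimal PowerShell sketch of step 2 (the vault name and object ID are placeholders), you can grant the factory's managed identity the Get permission on secrets like this:

# Sketch: grant the data factory managed identity "Get" on secrets in the key vault.
Set-AzKeyVaultAccessPolicy -VaultName "<key vault name>" `
    -ObjectId "<managed identity object ID>" `
    -PermissionsToSecrets Get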

Azure Key Vault linked service


The following properties are supported for Azure Key Vault linked service:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AzureKeyVault. | Yes
baseUrl | Specify the Azure Key Vault URL. | Yes

Using authoring UI:
Select Connections -> Linked Services -> New. In New linked service, search for and select "Azure Key
Vault".
Select the provisioned Azure Key Vault where your credentials are stored. You can use Test Connection to make
sure your Azure Key Vault connection is valid.
JSON example:

{
"name": "AzureKeyVaultLinkedService",
"properties": {
"type": "AzureKeyVault",
"typeProperties": {
"baseUrl": "https://<azureKeyVaultName>.vault.azure.net"
}
}
}
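If you author with PowerShell instead of the UI, a minimal sketch follows. It assumes the JSON definition above is saved locally as AzureKeyVaultLinkedService.json; the factory and resource group names are placeholders.

# Sketch: create the Azure Key Vault linked service from the JSON definition above.
Set-AzDataFactoryV2LinkedService -DataFactoryName "<data factory name>" `
    -ResourceGroupName "<resource group name>" `
    -Name "AzureKeyVaultLinkedService" `
    -DefinitionFile ".\AzureKeyVaultLinkedService.json"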

Reference secret stored in key vault


The following properties are supported when you configure a field in linked service referencing a key vault
secret:

PROPERTY | DESCRIPTION | REQUIRED
type | The type property of the field must be set to: AzureKeyVaultSecret. | Yes
secretName | The name of the secret in Azure Key Vault. | Yes
secretVersion | The version of the secret in Azure Key Vault. If not specified, it always uses the latest version of the secret. If specified, then it sticks to the given version. | No
store | Refers to an Azure Key Vault linked service that you use to store the credential. | Yes

Using authoring UI:


Select Azure Key Vault for secret fields while creating the connection to your data store/compute. Select the
provisioned Azure Key Vault Linked Service and provide the Secret name . You can optionally provide a secret
version as well.

TIP
For connectors that use a connection string in the linked service (like SQL Server, Blob storage, and so on), you can choose
either to store only the secret field (for example, the password) in Azure Key Vault, or to store the entire connection string
in Azure Key Vault. You can find both options on the UI.
JSON example: (see the "password" section)
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "<>",
"organizationName": "<>",
"authenticationType": "<>",
"username": "<>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
}
}
}

Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Use Azure Key Vault secrets in pipeline activities
4/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can store credentials or secret values in an Azure Key Vault and use them during pipeline execution to pass
to your activities.

Prerequisites
This feature relies on the data factory managed identity. Learn how it works from Managed identity for Data
Factory and make sure your data factory has one associated.

Steps
1. Open the properties of your data factory and copy the Managed Identity Application ID value.

2. Open the key vault access policies and add the managed identity permissions to Get and List secrets.
Click Add , then click Save .
3. Navigate to your Key Vault secret and copy the Secret Identifier.

Make a note of your secret URI that you want to get during your data factory pipeline run.
4. In your Data Factory pipeline, add a new Web activity and configure it as follows.

PROPERTY | VALUE
Secure Output | True
URL | [Your secret URI value]?api-version=7.0
Method | GET
Authentication | MSI
Resource | https://vault.azure.net

IMPORTANT
You must add ?api-version=7.0 to the end of your secret URI.

CAUTION

Set the Secure Output option to true to prevent the secret value from being logged in plain text. Any
further activities that consume this value should have their Secure Input option set to true.
5. To use the value in another activity, use the following code expression @activity('Web1').output.value .

Next steps
To learn how to use Azure Key Vault to store credentials for data stores and computes, see Store credentials in
Azure Key Vault
Encrypt credentials for on-premises data stores in
Azure Data Factory
5/28/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can encrypt and store credentials for your on-premises data stores (linked services with sensitive
information) on a machine with self-hosted integration runtime.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

You pass a JSON definition file with credentials to the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet to produce an output JSON definition
file with the encrypted credentials. Then, use the updated JSON definition to create the linked services.

Author SQL Server linked service


Create a JSON file named SqlServerLinkedService.json in any folder with the following content:
Replace <servername>, <databasename>, <username>, and <password> with values for your SQL Server before
saving the file. Also, replace <integration runtime name> with the name of your integration runtime.

{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
},
"name": "SqlServerLinkedService"
}
}

Encrypt credentials
To encrypt the sensitive data from the JSON payload on an on-premises self-hosted integration runtime, run
New-AzDataFactoryV2LinkedServiceEncryptedCredential and pass in the JSON payload. This cmdlet
ensures the credentials are encrypted using DPAPI and stored locally on the self-hosted integration runtime
node. The output payload containing the encrypted reference to the credential can be redirected to another
JSON file (in this case 'encryptedSQLServerLinkedService.json').
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName
$ResourceGroupName -Name "SqlServerLinkedService" -DefinitionFile ".\SQLServerLinkedService.json" >
encryptedSQLServerLinkedService.json

Use the JSON with encrypted credentials


Now, use the output JSON file from the previous command containing the encrypted credential to set up the
SqlServerLinkedService.

Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $ResourceGroupName -


Name "EncryptedSqlServerLinkedService" -DefinitionFile ".\encryptedSqlServerLinkedService.json"

Next steps
For information about security considerations for data movement, see Data movement security considerations.
Managed identity for Data Factory
5/28/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article helps you understand what a managed identity is for Data Factory (formerly known as Managed
Service Identity/MSI) and how it works.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Overview
When creating a data factory, a managed identity can be created along with factory creation. The managed
identity is a managed application registered in Azure Active Directory that represents this specific data factory.
Managed identity for Data Factory is used by the following features:
Store credential in Azure Key Vault, in which case data factory managed identity is used for Azure Key Vault
authentication.
Access data stores or computes using managed identity authentication, including Azure Blob storage, Azure
Data Explorer, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, Azure SQL
Managed Instance, Azure Synapse Analytics, REST, Databricks activity, Web activity, and more. Check the
connector and activity articles for details.

Generate managed identity


Managed identity for Data Factory is generated as follows:
When creating a data factory through the Azure portal or PowerShell, a managed identity is always created
automatically.
When creating a data factory through the SDK, a managed identity is created only if you specify "Identity =
new FactoryIdentity()" in the factory object for creation. See the example in .NET quickstart - create data factory.
When creating a data factory through the REST API, a managed identity is created only if you specify the
"identity" section in the request body. See the example in REST quickstart - create data factory.
If you find that your data factory doesn't have an associated managed identity after following the Retrieve
managed identity instructions, you can explicitly generate one by updating the data factory with an identity
initiator programmatically:
Generate managed identity using PowerShell
Generate managed identity using REST API
Generate managed identity using an Azure Resource Manager template
Generate managed identity using SDK
NOTE
Managed identity cannot be modified. Updating a data factory that already has a managed identity won't have any
impact; the managed identity is kept unchanged.
If you update a data factory that already has a managed identity without specifying the "identity" parameter in the
factory object or without specifying the "identity" section in the REST request body, you will get an error.
When you delete a data factory, the associated managed identity is deleted along with it.

Generate managed identity using PowerShell


Call the Set-AzDataFactoryV2 command, and you'll see the "Identity" fields being newly generated:

PS C:\WINDOWS\system32> Set-AzDataFactoryV2 -ResourceGroupName <resourceGroupName> -Name <dataFactoryName> -


Location <region>

DataFactoryName : ADFV2DemoFactory
DataFactoryId :
/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/ADFV2De
moFactory
ResourceGroupName : <resourceGroupName>
Location : East US
Tags : {}
Identity : Microsoft.Azure.Management.DataFactory.Models.FactoryIdentity
ProvisioningState : Succeeded

Generate managed identity using REST API


Call below API with "identity" section in the request body:

PATCH
https://management.azure.com/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.D
ataFactory/factories/<data factory name>?api-version=2018-06-01

Request body : add "identity": { "type": "SystemAssigned" }.

{
"name": "<dataFactoryName>",
"location": "<region>",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}

Response : managed identity is created automatically, and "identity" section is populated accordingly.
{
"name": "<dataFactoryName>",
"tags": {},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-26T04:10:01.1135678Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc",
"tenantId": "72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factorie
s/ADFV2DemoFactory",
"type": "Microsoft.DataFactory/factories",
"location": "<region>"
}

Generate managed identity using an Azure Resource Manager template


Template : add "identity": { "type": "SystemAssigned" }.

{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"resources": [{
"name": "<dataFactoryName>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "<region>",
"identity": {
"type": "SystemAssigned"
}
}]
}

Generate managed identity using SDK


Call the data factory create_or_update function with Identity=new FactoryIdentity(). Sample code using .NET:

Factory dataFactory = new Factory


{
Location = <region>,
Identity = new FactoryIdentity()
};
client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, dataFactory);

Retrieve managed identity


You can retrieve the managed identity from Azure portal or programmatically. The following sections show
some samples.

TIP
If you don't see the managed identity, generate managed identity by updating your factory.

Retrieve managed identity using Azure portal


You can find the managed identity information from Azure portal -> your data factory -> Properties.
Managed Identity Object ID
Managed Identity Tenant
The managed identity information will also show up when you create a linked service that supports managed
identity authentication, like Azure Blob, Azure Data Lake Storage, Azure Key Vault, etc.
When granting permission, in Azure resource's Access Control (IAM) tab -> Add role assignment -> Assign
access to -> select Data Factory under System assigned managed identity -> select by factory name; or in
general, you can use object ID or data factory name (as managed identity name) to find this identity. If you need
to get managed identity's application ID, you can use PowerShell.
Retrieve managed identity using PowerShell
The managed identity principal ID and tenant ID will be returned when you get a specific data factory as follows.
Use the PrincipalId to grant access:

PS C:\WINDOWS\system32> (Get-AzDataFactoryV2 -ResourceGroupName <resourceGroupName> -Name


<dataFactoryName>).Identity

PrincipalId TenantId
----------- --------
765ad4ab-XXXX-XXXX-XXXX-51ed985819dc 72f988bf-XXXX-XXXX-XXXX-2d7cd011db47

You can get the application ID by copying above principal ID, then running below Azure Active Directory
command with principal ID as parameter.

PS C:\WINDOWS\system32> Get-AzADServicePrincipal -ObjectId 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc

ServicePrincipalNames : {76f668b3-XXXX-XXXX-XXXX-1b3348c75e02,
https://identity.azure.net/P86P8g6nt1QxfPJx22om8MOooMf/Ag0Qf/nnREppHkU=}
ApplicationId : 76f668b3-XXXX-XXXX-XXXX-1b3348c75e02
DisplayName : ADFV2DemoFactory
Id : 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc
Type : ServicePrincipal

Retrieve managed identity using REST API


The managed identity principal ID and tenant ID will be returned when you get a specific data factory as follows.
Call below API in the request:

GET
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Mic
rosoft.DataFactory/factories/{factoryName}?api-version=2018-06-01

Response : You will get response like shown in below example. The "identity" section is populated accordingly.
{
"name":"<dataFactoryName>",
"identity":{
"type":"SystemAssigned",
"principalId":"554cff9e-XXXX-XXXX-XXXX-90c7d9ff2ead",
"tenantId":"72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},

"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>",
"type":"Microsoft.DataFactory/factories",
"properties":{
"provisioningState":"Succeeded",
"createTime":"2020-02-12T02:22:50.2384387Z",
"version":"2018-06-01",
"factoryStatistics":{
"totalResourceCount":0,
"maxAllowedResourceCount":0,
"factorySizeInGbUnits":0,
"maxAllowedFactorySizeInGbUnits":0
}
},
"eTag":"\"03006b40-XXXX-XXXX-XXXX-5e43617a0000\"",
"location":"<region>",
"tags":{

}
}

TIP
To retrieve the managed identity from an ARM template, add an outputs section in the ARM JSON:

{
"outputs":{
"managedIdentityObjectId":{
"type":"string",
"value":"[reference(resourceId('Microsoft.DataFactory/factories',
parameters('<dataFactoryName>')), '2018-06-01', 'Full').identity.principalId]"
}
}
}

Next steps
See the following topics that introduce when and how to use data factory managed identity:
Store credential in Azure Key Vault
Copy data from/to Azure Data Lake Store using managed identities for Azure resources authentication
See Managed Identities for Azure Resources Overview for more background on managed identities for Azure
resources, which data factory managed identity is based upon.
Encrypt Azure Data Factory with customer-
managed keys
4/2/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory encrypts data at rest, including entity definitions and any data cached while runs are in
progress. By default, data is encrypted with a randomly generated Microsoft-managed key that is uniquely
assigned to your data factory. For extra security guarantees, you can now enable Bring Your Own Key (BYOK)
with customer-managed keys feature in Azure Data Factory. When you specify a customer-managed key, Data
Factory uses both the factory system key and the CMK to encrypt customer data. Missing either key results in
denial of access to the data and the factory.
Azure Key Vault is required to store customer-managed keys. You can either create your own keys and store
them in a key vault, or you can use the Azure Key Vault APIs to generate keys. Key vault and Data Factory must
be in the same Azure Active Directory (Azure AD) tenant and in the same region, but they may be in different
subscriptions. For more information about Azure Key Vault, see What is Azure Key Vault?

About customer-managed keys


The following diagram shows how Data Factory uses Azure Active Directory and Azure Key Vault to make
requests using the customer-managed key:

The following list explains the numbered steps in the diagram:


1. An Azure Key Vault admin grants permissions to encryption keys to the managed identity that's associated
with the Data Factory
2. A Data Factory admin enables customer-managed key feature in the factory
3. Data Factory uses the managed identity that's associated with the factory to authenticate access to Azure Key
Vault via Azure Active Directory
4. Data Factory wraps the factory encryption key with the customer key in Azure Key Vault
5. For read/write operations, Data Factory sends requests to Azure Key Vault to unwrap the account encryption
key to perform encryption and decryption operations
There are two ways of adding Customer Managed Key encryption to data factories. One is during factory
creation time in Azure portal, and the other is post factory creation, in Data Factory UI.

Prerequisites - configure Azure Key Vault and generate keys


Enable Soft Delete and Do Not Purge on Azure Key Vault
Using customer-managed keys with Data Factory requires two properties to be set on the Key Vault, Soft
Delete and Do Not Purge . These properties can be enabled using either PowerShell or Azure CLI on a new or
existing key vault. To learn how to enable these properties on an existing key vault, see Azure Key Vault recovery
management with soft delete and purge protection
If you are creating a new Azure Key Vault, you can enable Soft Delete and Do Not Purge at creation time through
the Azure portal, or with PowerShell as sketched below.
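For example, a hedged Azure PowerShell sketch that creates a new key vault (soft delete is enabled by default on newly created vaults) and then turns on purge protection; the names are placeholders.

# Sketch: create a key vault and enable purge protection (Az.KeyVault module).
New-AzKeyVault -Name "<key vault name>" -ResourceGroupName "<resource group>" -Location "<region>"
Update-AzKeyVault -VaultName "<key vault name>" -ResourceGroupName "<resource group>" -EnablePurgeProtection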

Grant Data Factory access to Azure Key Vault


Make sure Azure Key Vault and Azure Data Factory are in the same Azure Active Directory (Azure AD) tenant and
in the same region. From Azure Key Vault access control, grant the data factory the following permissions: Get,
Unwrap Key, and Wrap Key. These permissions are required to enable customer-managed keys in Data Factory
(a PowerShell sketch follows).
If you want to add customer-managed key encryption after factory creation in the Data Factory UI, ensure the
data factory's managed service identity (MSI) has the three permissions to Key Vault.
If you want to add customer-managed key encryption during factory creation time in the Azure portal, ensure
the user-assigned managed identity (UA-MI) has the three permissions to Key Vault.
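A minimal PowerShell sketch for granting those three key permissions follows; the vault name is a placeholder, and the object ID is that of the identity Data Factory will use (the factory's MSI or the user-assigned managed identity).

# Sketch: grant Get, Unwrap Key, and Wrap Key on the key vault to the identity
# used by Data Factory for customer-managed key encryption.
Set-AzKeyVaultAccessPolicy -VaultName "<key vault name>" `
    -ObjectId "<identity object ID>" `
    -PermissionsToKeys Get, UnwrapKey, WrapKey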
Generate or upload customer-managed key to Azure Key Vault
You can either create your own keys and store them in a key vault. Or you can use the Azure Key Vault APIs to
generate keys. Only 2048-bit RSA keys are supported with Data Factory encryption. For more information, see
About keys, secrets, and certificates.
Enable customer-managed keys
Post factory creation in Data Factory UI
This section walks through the process to add customer managed key encryption in Data Factory UI, after
factory is created.

NOTE
A customer-managed key can only be configured on an empty data Factory. The data factory can't contain any resources
such as linked services, pipelines and data flows. It is recommended to enable customer-managed key right after factory
creation.

IMPORTANT
This approach does not work with managed virtual network enabled factories. Consider the alternative route if you
want to encrypt such factories.

1. Make sure that data factory's Managed Service Identity (MSI) has Get, Unwrap Key and Wrap Key
permissions to Key Vault.
2. Ensure the Data Factory is empty. The data factory can't contain any resources such as linked services,
pipelines, and data flows. For now, deploying customer-managed key to a non-empty factory will result in
an error.
3. To locate the key URI in the Azure portal, navigate to Azure Key Vault, and select the Keys setting. Select
the wanted key, then select the key to view its versions. Select a key version to view the settings
4. Copy the value of the Key Identifier field, which provides the URI

5. Launch the Azure Data Factory portal, and using the navigation bar on the left, jump to the Data Factory
management portal
6. Click the Customer managed key icon
7. Enter the URI for the customer-managed key that you copied before
8. Click Save, and customer-managed key encryption is enabled for Data Factory
During factory creation in Azure portal
This section walks through steps to add customer managed key encryption in Azure portal, during factory
deployment.
To encrypt the factory, Data Factory first needs to retrieve the customer-managed key from Key Vault. Because
factory deployment is still in progress, a Managed Service Identity (MSI) isn't available yet to authenticate with
Key Vault. To use this approach, you need to assign a user-assigned managed identity (UA-MI) to the data
factory; Data Factory then assumes the roles granted to the UA-MI to authenticate with Key Vault.
To learn more about user-assigned managed identity, see Managed identity types and Role assignment for user
assigned managed identity.
1. Make sure that User-assigned Managed Identity (UA-MI) has Get, Unwrap Key and Wrap Key permissions
to Key Vault
2. Under Advanced tab, check the box for Enable encryption using a customer managed key

3. Provide the url for the customer managed key stored in Key Vault
4. Select an appropriate user assigned managed identity to authenticate with Key Vault
5. Continue with factory deployment

Update Key Version


When you create a new version of a key, update data factory to use the new version. Follow similar steps as
described in section Data Factory UI, including:
1. Locate the URI for the new key version through Azure Key Vault Portal
2. Navigate to Customer-managed key setting
3. Replace and paste in the URI for the new key
4. Click Save and Data Factory will now encrypt with the new key version

Use a Different Key


To change key used for Data Factory encryption, you have to manually update the settings in Data Factory.
Follow similar steps as described in section Data Factory UI, including:
1. Locate the URI for the new key through Azure Key Vault Portal
2. Navigate to the Customer managed key setting
3. Replace and paste in the URI for the new key
4. Click Save and Data Factory will now encrypt with the new key

Disable Customer-managed Keys


By design, once the customer-managed key feature is enabled, you can't remove the extra security step. Data
Factory will always expect a customer-provided key to encrypt the factory and its data.

Customer managed key and continuous integration and continuous


deployment
By default, CMK configuration is not included in the factory Azure Resource Manager (ARM) template. To include
the customer managed key encryption settings in ARM template for continuous integration (CI/CD):
1. Ensure the factory is in Git mode
2. Navigate to management portal - customer managed key section
3. Check Include in ARM template option
The following settings will be added in ARM template. These properties can be parameterized in Continuous
Integration and Delivery pipelines by editing the Azure Resource Manager parameters configuration

NOTE
Adding the encryption setting to the ARM templates adds a factory-level setting that will override other factory level
settings, such as git configurations, in other environments. If you have these settings enabled in an elevated environment
such as UAT or PROD, please refer to Global Parameters in CI/CD.

Next steps
Go through the tutorials to learn about using Data Factory in more scenarios.
Azure Data Factory Managed Virtual Network
(preview)
7/20/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article will explain Managed Virtual Network and Managed Private endpoints in Azure Data Factory.

Managed virtual network


When you create an Azure Integration Runtime (IR) within Azure Data Factory Managed Virtual Network (VNET),
the integration runtime will be provisioned with the managed Virtual Network and will leverage private
endpoints to securely connect to supported data stores.
Creating an Azure IR within a managed Virtual Network ensures that the data integration process is isolated and
secure.
Benefits of using Managed Virtual Network:
With a Managed Virtual Network, you can offload the burden of managing the Virtual Network to Azure
Data Factory. You don't need to create a subnet for Azure Integration Runtime that could eventually use many
private IPs from your Virtual Network and would require prior network infrastructure planning.
It does not require deep Azure networking knowledge to do data integrations securely. Instead getting
started with secure ETL is much simplified for data engineers.
Managed Virtual Network along with Managed private endpoints protects against data exfiltration.

IMPORTANT
Currently, the managed Virtual Network is only supported in the same region as Azure Data Factory region.

NOTE
As Azure Data Factory managed Virtual Network is still in public preview, there is no SLA guarantee.

NOTE
Existing public Azure integration runtime can't switch to Azure integration runtime in Azure Data Factory managed virtual
network and vice versa.
Managed private endpoints
Managed private endpoints are private endpoints created in the Azure Data Factory Managed Virtual Network
establishing a private link to Azure resources. Azure Data Factory manages these private endpoints on your
behalf.

Azure Data Factory supports private links. Private link enables you to access Azure (PaaS) services (such as
Azure Storage, Azure Cosmos DB, Azure Synapse Analytics).
When you use a private link, traffic between your data stores and managed Virtual Network traverses entirely
over the Microsoft backbone network. Private Link protects against data exfiltration risks. You establish a private
link to a resource by creating a private endpoint.
Private endpoint uses a private IP address in the managed Virtual Network to effectively bring the service into it.
Private endpoints are mapped to a specific resource in Azure and not the entire service. Customers can limit
connectivity to a specific resource approved by their organization. Learn more about private links and private
endpoints.
NOTE
It's recommended that you create Managed private endpoints to connect to all your Azure data sources.

WARNING
If a PaaS data store (Blob, ADLS Gen2, Azure Synapse Analytics) has a private endpoint already created against it, and
even if it allows access from all networks, ADF would only be able to access it using a managed private endpoint. If a
private endpoint does not already exist, you must create one in such scenarios.

A private endpoint connection is created in a "Pending" state when you create a managed private endpoint in
Azure Data Factory. An approval workflow is initiated. The private link resource owner is responsible to approve
or reject the connection.

If the owner approves the connection, the private link is established. Otherwise, the private link won't be
established. In either case, the Managed private endpoint will be updated with the status of the connection.
Only a Managed private endpoint in an approved state can send traffic to a given private link resource.

Interactive Authoring
Interactive authoring capabilities are used for functionalities like test connection, browse folder list and table list,
get schema, and preview data. You can enable interactive authoring when creating or editing an Azure
Integration Runtime that is in an ADF-managed virtual network. The backend service will pre-allocate compute
for interactive authoring functionalities. Otherwise, the compute is allocated every time an interactive
operation is performed, which takes more time. The Time To Live (TTL) for interactive authoring is 60 minutes,
which means it is automatically disabled 60 minutes after the last interactive authoring operation.
Activity execution time using managed virtual network
By design, an Azure integration runtime in a managed virtual network has a longer queue time than a public
Azure integration runtime, because one compute node is not reserved per data factory. Each activity therefore
needs a warm-up to start, and the time is spent primarily on the virtual network join rather than on the Azure
integration runtime itself. For non-copy activities, including pipeline activities and external activities, there is a
60-minute Time To Live (TTL) when you trigger them for the first time. Within the TTL, the queue time is shorter
because the node is already warmed up.

NOTE
Copy activity doesn't have TTL support yet.

Create managed virtual network via Azure PowerShell


$subscriptionId = ""
$resourceGroupName = ""
$factoryName = ""
$managedPrivateEndpointName = ""
$integrationRuntimeName = ""
$apiVersion = "2018-06-01"
$privateLinkResourceId = ""

$vnetResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/managedVirtualNetworks/default"
$privateEndpointResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/managedVirtualNetworks/default/managedprivateendpoints/${managedPrivateEndpointName}"
$integrationRuntimeResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/integrationRuntimes/${integrationRuntimeName}"

# Create managed Virtual Network resource


New-AzResource -ApiVersion "${apiVersion}" -ResourceId "${vnetResourceId}" -Properties @{}

# Create managed private endpoint resource


New-AzResource -ApiVersion "${apiVersion}" -ResourceId "${privateEndpointResourceId}" -Properties @{
privateLinkResourceId = "${privateLinkResourceId}"
groupId = "blob"
}

# Create integration runtime resource enabled with VNET


New-AzResource -ApiVersion "${apiVersion}" -ResourceId "${integrationRuntimeResourceId}" -Properties @{
type = "Managed"
typeProperties = @{
computeProperties = @{
location = "AutoResolve"
dataFlowProperties = @{
computeType = "General"
coreCount = 8
timeToLive = 0
}
}
}
managedVirtualNetwork = @{
type = "ManagedVirtualNetworkReference"
referenceName = "default"
}
}

Limitations and known issues


Supported Data Sources
The following data sources have native Private Endpoint support and can be connected through Private Link from
the ADF Managed Virtual Network.
Azure Blob Storage (not including Storage account V1)
Azure Table Storage (not including Storage account V1)
Azure Files (not including Storage account V1)
Azure Data Lake Gen2
Azure SQL Database (not including Azure SQL Managed Instance)
Azure Synapse Analytics
Azure CosmosDB SQL
Azure Key Vault
Azure Private Link Service
Azure Search
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
Azure Machine Learning

NOTE
You still can access all data sources that are supported by Data Factory through public network.

NOTE
Because Azure SQL Managed Instance doesn't support native Private Endpoint right now, you can access it from
managed Virtual Network using Private Linked Service and Load Balancer. Please see How to access SQL Managed
Instance from Data Factory Managed VNET using Private Endpoint.

On-premises data sources
To access on-premises data sources from the managed Virtual Network using a Private Endpoint, see the
tutorial How to access on premises SQL Server from Data Factory Managed VNET using Private Endpoint.
Azure Data Factory Managed Virtual Network is available in the following Azure regions:
Australia East
Australia Southeast
Brazil South
Canada Central
Canada East
Central India
Central US
China East2
China North2
East Asia
East US
East US2
France Central
Germany West Central
Japan East
Japan West
Korea Central
North Central US
North Europe
Norway East
South Africa North
South Central US
South East Asia
Switzerland North
UAE North
US Gov Arizona
US Gov Texas
US Gov Virginia
UK South
UK West
West Central US
West Europe
West US
West US2
Outbound communications through public endpoint from ADF Managed Virtual Network
All ports are opened for outbound communications.
Connecting to Azure Storage and Azure Data Lake Gen2 through their public endpoints is not supported from
the ADF Managed Virtual Network.
Linked Service creation of Azure Key Vault
When you create a Linked Service for Azure Key Vault, there is no Azure Integration Runtime reference. So
you can't create Private Endpoint during Linked Service creation of Azure Key Vault. But when you create
Linked Service for data stores which references Azure Key Vault Linked Service and this Linked Service
references Azure Integration Runtime with Managed Virtual Network enabled, then you are able to create a
Private Endpoint for the Azure Key Vault Linked Service during the creation.
Test connection operation for Linked Service of Azure Key Vault only validates the URL format, but doesn't
do any network operation.
The column Using private endpoint is always shown as blank even if you create Private Endpoint for
Azure Key Vault.

Next steps
Tutorial: Build a copy pipeline using managed Virtual Network and private endpoints
Tutorial: Build mapping dataflow pipeline using managed Virtual Network and private endpoints
Azure Private Link for Azure Data Factory
6/18/2021 • 9 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


By using Azure Private Link, you can connect to various platforms as a service (PaaS) deployments in Azure via a
private endpoint. A private endpoint is a private IP address within a specific virtual network and subnet. For a list
of PaaS deployments that support Private Link functionality, see Private Link documentation.

Secure communication between customer networks and Azure Data


Factory
You can set up an Azure virtual network as a logical representation of your network in the cloud. Doing so
provides the following benefits:
You help protect your Azure resources from attacks in public networks.
You let the networks and Data Factory securely communicate with each other.
You can also connect an on-premises network to your virtual network by setting up an Internet Protocol security
(IPsec) VPN (site-to-site) connection or an Azure ExpressRoute (private peering) connection.
You can also install a self-hosted integration runtime on an on-premises machine or a virtual machine in the
virtual network. Doing so lets you:
Run copy activities between a cloud data store and a data store in a private network.
Dispatch transform activities against compute resources in an on-premises network or an Azure virtual
network.
Several communication channels are required between Azure Data Factory and the customer virtual network, as
shown in the following table:

DOMAIN | PORT | DESCRIPTION
adf.azure.com | 443 | Control plane, required by Data Factory authoring and monitoring.
*.{region}.datafactory.azure.net | 443 | Required by the self-hosted integration runtime to connect to the Data Factory service.
*.servicebus.windows.net | 443 | Required by the self-hosted integration runtime for interactive authoring.
download.microsoft.com | 443 | Required by the self-hosted integration runtime for downloading the updates.

With the support of Private Link for Azure Data Factory, you can:
Create a private endpoint in your virtual network.
Enable the private connection to a specific data factory instance.
The communications to Azure Data Factory service go through Private Link and help provide secure private
connectivity.
Enabling the Private Link service for each of the preceding communication channels offers the following
functionality:
Supported:
You can author and monitor the data factory in your virtual network, even if you block all outbound
communications.
The command communications between the self-hosted integration runtime and the Azure Data
Factory service can be performed securely in a private network environment. The traffic between the
self-hosted integration runtime and the Azure Data Factory service goes through Private Link.
Not currently supported:
Interactive authoring that uses a self-hosted integration runtime, such as test connection, browse
folder list and table list, get schema, and preview data, going through Private Link.
The new version of the self-hosted integration runtime, which can be automatically downloaded from
Microsoft Download Center if you enable Auto-Update, is not supported at this time.

NOTE
For functionality that's not currently supported, you still need to configure the previously mentioned domain and
port in the virtual network or your corporate firewall.

NOTE
Connecting to Azure Data Factory via private endpoint is only applicable to self-hosted integration runtime in data
factory. It's not supported in Synapse.

WARNING
If you enable Private Link in Azure Data Factory and block public access at the same time, make sure when you create a
linked service, your credentials are stored in an Azure key vault. Otherwise, the credentials won't work.
DNS changes for private endpoints
When you create a private endpoint, the DNS CNAME resource record for the Data Factory is updated to an alias
in a subdomain with the prefix 'privatelink'. By default, we also create a private DNS zone, corresponding to the
'privatelink' subdomain, with the DNS A resource records for the private endpoints.
When you resolve the data factory endpoint URL from outside the VNet with the private endpoint, it resolves to
the public endpoint of the data factory service. When resolved from the VNet hosting the private endpoint, the
data factory endpoint URL resolves to the private endpoint's IP address.
For example, the DNS resource records for a data factory named 'DataFactoryA', when resolved from outside the
VNet hosting the private endpoint, will be:

NAME | TYPE | VALUE
DataFactoryA.{region}.datafactory.azure.net | CNAME | DataFactoryA.{region}.privatelink.datafactory.azure.net
DataFactoryA.{region}.privatelink.datafactory.azure.net | CNAME | < data factory service public endpoint >
< data factory service public endpoint > | A | < data factory service public IP address >

The DNS resource records for DataFactoryA, when resolved in the VNet hosting the private endpoint, will be:

NAME | TYPE | VALUE
DataFactoryA.{region}.datafactory.azure.net | CNAME | DataFactoryA.{region}.privatelink.datafactory.azure.net
DataFactoryA.{region}.privatelink.datafactory.azure.net | A | < private endpoint IP address >

If you are using a custom DNS server on your network, clients must be able to resolve the FQDN for the Data
Factory endpoint to the private endpoint IP address. You should configure your DNS server to delegate your
private link subdomain to the private DNS zone for the VNet, or configure the A records for
'DataFactoryA.{region}.privatelink.datafactory.azure.net' with the private endpoint IP address.
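As one illustration of that A record configuration (a sketch under assumed placeholder names and IP, using the Az.PrivateDns module; it is not the only valid setup):

# Sketch: create the privatelink DNS zone, link it to the VNet, and add an
# A record that points the data factory FQDN at the private endpoint IP.
New-AzPrivateDnsZone -ResourceGroupName "<resource group>" -Name "privatelink.datafactory.azure.net"

New-AzPrivateDnsVirtualNetworkLink -ResourceGroupName "<resource group>" `
    -ZoneName "privatelink.datafactory.azure.net" -Name "adf-dns-link" `
    -VirtualNetworkId "<virtual network resource ID>"

New-AzPrivateDnsRecordSet -ResourceGroupName "<resource group>" `
    -ZoneName "privatelink.datafactory.azure.net" `
    -Name "DataFactoryA.<region>" -RecordType A -Ttl 3600 `
    -PrivateDnsRecords (New-AzPrivateDnsRecordConfig -Ipv4Address "<private endpoint IP address>")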
For more information on configuring your own DNS server to support private endpoints, refer to the following
articles:
Name resolution for resources in Azure virtual networks
DNS configuration for private endpoints

Set up a private endpoint link for Azure Data Factory


In this section you will set up a private endpoint link for Azure Data Factory.
You can choose whether to connect your Self-Hosted Integration Runtime (SHIR) to Azure Data Factory via
public endpoint or private endpoint during the data factory creation step, shown here:
You can change the selection anytime after creation from the data factory portal page on the Networking blade.
After you enable private endpoints there, you must also add a private endpoint to the data factory.
A private endpoint requires a virtual network and subnet for the link, and a virtual machine within the subnet,
which will be used to run the Self-Hosted Integration Runtime (SHIR), connecting via the private endpoint link.
Create the virtual network
If you do not have an existing virtual network to use with your private endpoint link, you must create one and assign a subnet.
1. Sign into the Azure portal at https://portal.azure.com.
2. On the upper-left side of the screen, select Create a resource > Networking > Virtual network, or search for Virtual network in the search box.
3. In Create virtual network, enter or select this information in the Basics tab:

SETTING | VALUE
Project Details
Subscription | Select your Azure subscription
Resource Group | Select a resource group for your virtual network
Instance details
Name | Enter a name for your virtual network
Region | IMPORTANT: Select the same region your private endpoint will use

4. Select the IP Addresses tab or select the Next: IP Addresses button at the bottom of the page.
5. In the IP Addresses tab, enter this information:

SETTING | VALUE
IPv4 address space | Enter 10.1.0.0/16

6. Under Subnet name , select the word default .


7. In Edit subnet , enter this information:

SETTING | VALUE
Subnet name | Enter a name for your subnet
Subnet address range | Enter 10.1.0.0/24

8. Select Save .
9. Select the Review + create tab or select the Review + create button.
10. Select Create .
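If you prefer to script this step, the portal settings above map directly onto a virtual network definition. The following is a minimal sketch using the Python SDK, assuming the azure-identity and azure-mgmt-network packages and hypothetical resource names (exact method names may vary by package version):

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"
client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# Mirrors the portal steps: a 10.1.0.0/16 address space with a 10.1.0.0/24 subnet.
poller = client.virtual_networks.begin_create_or_update(
    "<resource-group>",
    "vnet-shir",
    {
        "location": "<region-of-your-private-endpoint>",
        "address_space": {"address_prefixes": ["10.1.0.0/16"]},
        "subnets": [{"name": "shir-subnet", "address_prefix": "10.1.0.0/24"}],
    },
)
vnet = poller.result()
print(vnet.name, vnet.subnets[0].name)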
Create a virtual machine for the Self-Hosted Integration Runtime (SHIR )
You must also create or assign an existing virtual machine to run the Self-Hosted Integration Runtime in the new
subnet created above.
1. On the upper-left side of the portal, select Create a resource > Compute > Virtual machine, or search for Virtual machine in the search box.
2. In Create a virtual machine, type or select the values in the Basics tab:

SETTING | VALUE
Project Details
Subscription | Select your Azure subscription
Resource Group | Select a resource group
Instance details
Virtual machine name | Enter a name for the virtual machine
Region | Select the region used above for your virtual network
Availability Options | Select No infrastructure redundancy required
Image | Select Windows Server 2019 Datacenter - Gen1 (or any other Windows image that supports the Self-Hosted Integration Runtime)
Azure Spot instance | Select No
Size | Choose VM size or take default setting
Administrator account
Username | Enter a username
Password | Enter a password
Confirm password | Reenter password

3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:

SETTING | VALUE
Network interface
Virtual network | Select the virtual network created above.
Subnet | Select the subnet created above.
Public IP | Select None.
NIC network security group | Basic
Public inbound ports | Select None.

5. Select Review + create .


6. Review the settings, and then select Create .

NOTE
Azure provides an ephemeral IP for Azure virtual machines that aren't assigned a public IP address or that are in the backend pool of an internal Basic Azure Load Balancer. The ephemeral IP mechanism provides an outbound IP address that isn't configurable.
The ephemeral IP is disabled when a public IP address is assigned to the virtual machine, or when the virtual machine is placed in the backend pool of a Standard Load Balancer, with or without outbound rules. If an Azure Virtual Network NAT gateway resource is assigned to the subnet of the virtual machine, the ephemeral IP is disabled.
For more information on outbound connections in Azure, see Using Source Network Address Translation (SNAT) for
outbound connections.

Create the private endpoint


Finally, you must create the private endpoint in your data factory.
1. On the Azure portal page for your data factory, select the Networking blade and the Private endpoint
connections tab, and then select + Private endpoint .
2. In the Basics tab of Create a private endpoint , enter, or select this information:

SETTING | VALUE
Project details
Subscription | Select your subscription
Resource group | Select a resource group
Instance details
Name | Enter a name for your endpoint
Region | Select the region of the virtual network created above

3. Select the Resource tab or the Next: Resource button at the bottom of the page.
4. In Resource , enter or select this information:

SETTING | VALUE
Connection method | Select Connect to an Azure resource in my directory
Subscription | Select your subscription
Resource type | Select Microsoft.Datafactory/factories
Resource | Select your data factory
Target sub-resource | If you want to use the private endpoint for command communications between the self-hosted integration runtime and the Azure Data Factory service, select datafactory as Target sub-resource. If you want to use the private endpoint for authoring and monitoring the data factory in your virtual network, select portal as Target sub-resource.

5. Select the Configuration tab or the Next: Configuration button at the bottom of the screen.
6. In Configuration , enter or select this information:

SETTING | VALUE
Networking
Virtual network | Select the virtual network created above.
Subnet | Select the subnet created above.
Private DNS integration
Integrate with private DNS zone | Leave the default of Yes.
Subscription | Select your subscription.
Private DNS zones | Leave the default of (New) privatelink.datafactory.azure.net.

7. Select Review + create .


8. Select Create .

NOTE
Disabling public network access is applicable only to the self-hosted integration runtime, not to Azure Integration Runtime
and SQL Server Integration Services (SSIS) Integration Runtime.

NOTE
You can still access the Azure Data Factory portal through a public network after you create a private endpoint for the portal.
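The same private endpoint can also be created programmatically. The following is a minimal sketch with the azure-mgmt-network Python package, using hypothetical resource IDs; the group ID corresponds to the target sub-resource chosen above (datafactory for command communications, portal for authoring and monitoring), and private DNS zone integration still needs to be configured separately:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"
client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

factory_id = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.DataFactory/factories/<factory-name>")
subnet_id = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
             "/providers/Microsoft.Network/virtualNetworks/vnet-shir/subnets/shir-subnet")

poller = client.private_endpoints.begin_create_or_update(
    "<resource-group>",
    "adf-private-endpoint",
    {
        "location": "<region-of-the-virtual-network>",
        "subnet": {"id": subnet_id},
        "private_link_service_connections": [
            {
                "name": "adf-pe-connection",
                "private_link_service_id": factory_id,
                # Use "portal" instead to reach the authoring and monitoring UI.
                "group_ids": ["dataFactory"],
            }
        ],
    },
)
print(poller.result().provisioning_state)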

Next steps
Create a data factory by using the Azure Data Factory UI
Introduction to Azure Data Factory
Visual authoring in Azure Data Factory
Visually monitor Azure Data Factory
4/22/2021 • 5 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Once you've created and published a pipeline in Azure Data Factory, you can associate it with a trigger or
manually kick off an ad hoc run. You can monitor all of your pipeline runs natively in the Azure Data Factory user
experience. To open the monitoring experience, select the Monitor & Manage tile in the data factory blade of
the Azure portal. If you're already in the ADF UX, click on the Monitor icon on the left sidebar.
By default, all data factory runs are displayed in the browser's local time zone. If you change the time zone, all the date/time fields snap to the time zone that you selected.

Monitor pipeline runs


The default monitoring view is a list of triggered pipeline runs in the selected time period. You can change the time range and filter by status, pipeline name, or annotation. Hover over a specific pipeline run to get run-specific actions such as rerun and the consumption report.

The pipeline run grid contains the following columns:

COLUMN NAME | DESCRIPTION
Pipeline Name | Name of the pipeline
Run Start | Start date and time for the pipeline run (MM/DD/YYYY, HH:MM:SS AM/PM)
Run End | End date and time for the pipeline run (MM/DD/YYYY, HH:MM:SS AM/PM)
Duration | Run duration (HH:MM:SS)
Triggered By | The name of the trigger that started the pipeline
Status | Failed, Succeeded, In Progress, Canceled, or Queued
Annotations | Filterable tags associated with a pipeline
Parameters | Parameters for the pipeline run (name/value pairs)
Error | If the pipeline failed, the run error
Run ID | ID of the pipeline run

You need to manually select the Refresh button to refresh the list of pipeline and activity runs. Autorefresh is
currently not supported.
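If you need the same run data outside the monitoring UI, for example to build your own refresh loop, you can fetch it programmatically. The following is a minimal sketch with the azure-mgmt-datafactory Python package, using hypothetical resource names:

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Query pipeline runs updated in the last 24 hours, similar to the default list view.
filter_params = {
    "last_updated_after": datetime.utcnow() - timedelta(days=1),
    "last_updated_before": datetime.utcnow(),
}
response = client.pipeline_runs.query_by_factory("<resource-group>", "<factory-name>", filter_params)
for run in response.value:
    print(run.pipeline_name, run.status, run.run_start, run.run_end, run.run_id)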

To view the results of a debug run, select the Debug tab.

Monitor activity runs


To get a detailed view of the individual activity runs of a specific pipeline run, click on the pipeline name.
The list view shows activity runs that correspond to each pipeline run. Hover over the specific activity run to get
run-specific information such as the JSON input, JSON output, and detailed activity-specific monitoring
experiences.

COLUMN NAME | DESCRIPTION
Activity Name | Name of the activity inside the pipeline
Activity Type | Type of the activity, such as Copy, ExecuteDataFlow, or AzureMLExecutePipeline
Actions | Icons that allow you to see JSON input information, JSON output information, or detailed activity-specific monitoring experiences
Run Start | Start date and time for the activity run (MM/DD/YYYY, HH:MM:SS AM/PM)
Duration | Run duration (HH:MM:SS)
Status | Failed, Succeeded, In Progress, or Canceled
Integration Runtime | Which Integration Runtime the activity was run on
User Properties | User-defined properties of the activity
Error | If the activity failed, the run error
Run ID | ID of the activity run

If an activity failed, you can see the detailed error message by clicking on the icon in the error column.

Promote user properties to monitor


Promote any pipeline activity property as a user property so that it becomes an entity that you monitor. For
example, you can promote the Source and Destination properties of the copy activity in your pipeline as user
properties.

NOTE
You can only promote up to five pipeline activity properties as user properties.

After you create the user properties, you can monitor them in the monitoring list views.
If the source for the copy activity is a table name, you can monitor the source table name as a column in the list
view for activity runs.

Rerun pipelines and activities


To rerun a pipeline that has previously run, from the start, hover over the specific pipeline run and select Rerun. If you select multiple pipelines, you can use the Rerun button to run them all.

If you wish to rerun starting at a specific point, you can do so from the activity runs view. Select the activity you
wish to start from and select Rerun from activity .
Rerun from failed activity
If an activity fails, times out, or is canceled, you can rerun the pipeline from that failed activity by selecting
Rerun from failed activity .

View rerun history


You can view the rerun history for all the pipeline runs in the list view.

You can also view rerun history for a particular pipeline run.
Monitor consumption
You can see the resources consumed by a pipeline run by clicking the consumption icon next to the run.

Clicking the icon opens a consumption report of resources used by that pipeline run.
You can plug these values into the Azure pricing calculator to estimate the cost of the pipeline run. For more
information on Azure Data Factory pricing, see Understanding pricing.

NOTE
The values returned by the pricing calculator are an estimate. They don't reflect the exact amount you will be billed by Azure Data Factory.

Gantt views
A Gantt chart is a view that allows you to see the run history over a time range. By switching to a Gantt view, you will see all pipeline runs grouped by name, displayed as bars whose length reflects how long each run took. You can also group by annotations/tags that you've created on your pipeline. The Gantt view is also available at the activity run level.
The length of the bar indicates the duration of the pipeline run. You can also select the bar to see more details.

Alerts
You can raise alerts on supported metrics in Data Factory. Select Monitor > Alerts & metrics on the Data Factory monitoring page to get started.

For a seven-minute introduction and demonstration of this feature, watch the following video:

Create alerts
1. Select New alert rule to create a new alert.
2. Specify the rule name and select the alert severity.

3. Select the alert criteria.


You can create alerts on various metrics, including those for ADF entity count/size,
activity/pipeline/trigger runs, Integration Runtime (IR) CPU utilization/memory/node count/queue, as
well as for SSIS package executions and SSIS IR start/stop operations.
4. Configure the alert logic. You can create an alert for the selected metric for all pipelines and
corresponding activities. You can also select a particular activity type, activity name, pipeline name, or
failure type.
5. Configure email, SMS, push, and voice notifications for the alert. Create an action group, or choose an
existing one, for the alert notifications.
6. Create the alert rule.
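Before you configure the alert criteria, it can help to list exactly which metric names the factory emits, so that you know what to select in the alert rule. The following is a minimal sketch with the azure-monitor-query Python package, using a hypothetical factory resource ID; the attribute names printed here are an assumption and may differ slightly between package versions:

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

factory_id = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.DataFactory/factories/<factory-name>")

# List the metric definitions available for alerting on this data factory,
# for example PipelineFailedRuns or TriggerSucceededRuns.
for definition in client.list_metric_definitions(factory_id):
    print(definition.name, definition.unit)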
Next steps
To learn about monitoring and managing pipelines, see the Monitor and manage pipelines programmatically
article.
Monitor and Alert Data Factory by using Azure
Monitor
4/22/2021 • 27 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Cloud applications are complex and have many moving parts. Monitors provide data to help ensure that your
applications stay up and running in a healthy state. Monitors also help you avoid potential problems and
troubleshoot past ones. You can use monitoring data to gain deep insights about your applications. This
knowledge helps you improve application performance and maintainability. It also helps you automate actions
that otherwise require manual intervention.
Azure Monitor provides base-level infrastructure metrics and logs for most Azure services. Azure diagnostic
logs are emitted by a resource and provide rich, frequent data about the operation of that resource. Azure Data
Factory (ADF) can write diagnostic logs in Azure Monitor. For a seven-minute introduction and demonstration of
this feature, watch the following video:

For more information, see Azure Monitor overview.

Keeping Azure Data Factory metrics and pipeline-run data


Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a
longer time. With Monitor, you can route diagnostic logs for analysis to multiple different targets.
Storage Account : Save your diagnostic logs to a storage account for auditing or manual inspection. You can
use the diagnostic settings to specify the retention time in days.
Event Hub : Stream the logs to Azure Event Hubs. The logs become input to a partner service/custom
analytics solution like Power BI.
Log Analytics : Analyze the logs with Log Analytics. The Data Factory integration with Azure Monitor is
useful in the following scenarios:
You want to write complex queries on a rich set of metrics that are published by Data Factory to
Monitor. You can create custom alerts on these queries via Monitor.
You want to monitor across data factories. You can route data from multiple data factories to a single
Monitor workspace.
You can also use a storage account or event-hub namespace that isn't in the subscription of the resource that
emits logs. The user who configures the setting must have appropriate Azure role-based access control (Azure
RBAC) access to both subscriptions.

Configure diagnostic settings and workspace


Create or add diagnostic settings for your data factory.
1. In the portal, go to Monitor. Select Settings > Diagnostic settings .
2. Select the data factory for which you want to set a diagnostic setting.
3. If no settings exist on the selected data factory, you're prompted to create a setting. Select Turn on
diagnostics .

If there are existing settings on the data factory, you see a list of settings already configured on the data
factory. Select Add diagnostic setting .
4. Give your setting a name, select Send to Log Analytics , and then select a workspace from Log
Analytics Workspace .
In Azure-Diagnostics mode, diagnostic logs flow into the AzureDiagnostics table.
In Resource-Specific mode, diagnostic logs from Azure Data Factory flow into the following tables:
ADFActivityRun
ADFPipelineRun
ADFTriggerRun
ADFSSISIntegrationRuntimeLogs
ADFSSISPackageEventMessageContext
ADFSSISPackageEventMessages
ADFSSISPackageExecutableStatistics
ADFSSISPackageExecutionComponentPhases
ADFSSISPackageExecutionDataStatistics
You can select various logs relevant to your workloads to send to Log Analytics tables. For
example, if you don't use SQL Server Integration Services (SSIS) at all, you need not select any
SSIS logs. If you want to log SSIS Integration Runtime (IR) start/stop/maintenance operations, you
can select SSIS IR logs. If you invoke SSIS package executions via T-SQL on SQL Server
Management Studio (SSMS), SQL Server Agent, or other designated tools, you can select SSIS
package logs. If you invoke SSIS package executions via Execute SSIS Package activities in ADF
pipelines, you can select all logs.
If you select AllMetrics, various ADF metrics will be made available for you to monitor or raise
alerts on, including the metrics for ADF activity, pipeline, and trigger runs, as well as for SSIS IR
operations and SSIS package executions.

NOTE
Because an Azure log table can't have more than 500 columns, we highly recommend that you select Resource-Specific mode. For more information, see the AzureDiagnostics Logs reference.

5. Select Save .
After a few moments, the new setting appears in your list of settings for this data factory. Diagnostic logs are
streamed to that workspace as soon as new event data is generated. Up to 15 minutes might elapse between
when an event is emitted and when it appears in Log Analytics.

Install Azure Data Factory Analytics solution from Azure Marketplace


This solution provides you with a summary of the overall health of your Data Factory, with options to drill into details and to troubleshoot unexpected behavior patterns. With rich, out-of-the-box views, you can get insights into key processing, including:
An at-a-glance summary of data factory pipeline, activity, and trigger runs
The ability to drill into data factory activity runs by type
A summary of the top data factory pipeline and activity errors
1. Go to Azure Marketplace, choose the Analytics filter, and search for Azure Data Factory Analytics (Preview).
2. Review the details of Azure Data Factory Analytics (Preview).

3. Select Create and then create or select the Log Analytics Workspace .
Monitor Data Factory metrics
Installing this solution creates a default set of views inside the workbooks section of the chosen Log Analytics
workspace. As a result, the following metrics become enabled:
ADF Runs - 1) Pipeline Runs by Data Factory
ADF Runs - 2) Activity Runs by Data Factory
ADF Runs - 3) Trigger Runs by Data Factory
ADF Errors - 1) Top 10 Pipeline Errors by Data Factory
ADF Errors - 2) Top 10 Activity Runs by Data Factory
ADF Errors - 3) Top 10 Trigger Errors by Data Factory
ADF Statistics - 1) Activity Runs by Type
ADF Statistics - 2) Trigger Runs by Type
ADF Statistics - 3) Max Pipeline Runs Duration

You can visualize the preceding metrics, look at the queries behind these metrics, edit the queries, create alerts,
and take other actions.
NOTE
Azure Data Factory Analytics (Preview) sends diagnostic logs to Resource-specific destination tables. You can write queries
against the following tables: ADFPipelineRun, ADFTriggerRun, and ADFActivityRun.
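For example, once the logs flow into the resource-specific tables, a query against ADFPipelineRun can surface the pipelines that fail most often. The following is a minimal sketch with the azure-monitor-query Python package, assuming a hypothetical workspace ID and that the table and column names follow the resource-specific schema described in this article:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
ADFPipelineRun
| where Status == "Failed"
| summarize FailedRuns = count() by PipelineName
| top 10 by FailedRuns desc
"""

response = client.query_workspace("<log-analytics-workspace-id>", query, timespan=timedelta(days=7))
for table in response.tables:
    for row in table.rows:
        print(row)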

Data Factory Metrics


With Monitor, you can gain visibility into the performance and health of your Azure workloads. The most
important type of Monitor data is the metric, which is also called the performance counter. Metrics are emitted
by most Azure resources. Monitor provides several ways to configure and consume these metrics for
monitoring and troubleshooting.
Here are some of the metrics emitted by Azure Data Factory version 2:

METRIC | METRIC DISPLAY NAME | UNIT | AGGREGATION TYPE | DESCRIPTION
ActivityCancelledRuns | Cancelled activity runs metrics | Count | Total | The total number of activity runs that were cancelled within a minute window.
ActivityFailedRuns | Failed activity runs metrics | Count | Total | The total number of activity runs that failed within a minute window.
ActivitySucceededRuns | Succeeded activity runs metrics | Count | Total | The total number of activity runs that succeeded within a minute window.
PipelineCancelledRuns | Cancelled pipeline runs metrics | Count | Total | The total number of pipeline runs that were cancelled within a minute window.
PipelineFailedRuns | Failed pipeline runs metrics | Count | Total | The total number of pipeline runs that failed within a minute window.
PipelineSucceededRuns | Succeeded pipeline runs metrics | Count | Total | The total number of pipeline runs that succeeded within a minute window.
TriggerCancelledRuns | Cancelled trigger runs metrics | Count | Total | The total number of trigger runs that were cancelled within a minute window.
TriggerFailedRuns | Failed trigger runs metrics | Count | Total | The total number of trigger runs that failed within a minute window.
TriggerSucceededRuns | Succeeded trigger runs metrics | Count | Total | The total number of trigger runs that succeeded within a minute window.
SSISIntegrationRuntimeStartCancelled | Cancelled SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that were cancelled within a minute window.
SSISIntegrationRuntimeStartFailed | Failed SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that failed within a minute window.
SSISIntegrationRuntimeStartSucceeded | Succeeded SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that succeeded within a minute window.
SSISIntegrationRuntimeStopStuck | Stuck SSIS integration runtime stop metrics | Count | Total | The total number of SSIS integration runtime stops that were stuck within a minute window.
SSISIntegrationRuntimeStopSucceeded | Succeeded SSIS integration runtime stop metrics | Count | Total | The total number of SSIS integration runtime stops that succeeded within a minute window.
SSISPackageExecutionCancelled | Cancelled SSIS package execution metrics | Count | Total | The total number of SSIS package executions that were cancelled within a minute window.
SSISPackageExecutionFailed | Failed SSIS package execution metrics | Count | Total | The total number of SSIS package executions that failed within a minute window.
SSISPackageExecutionSucceeded | Succeeded SSIS package execution metrics | Count | Total | The total number of SSIS package executions that succeeded within a minute window.

To access the metrics, complete the instructions in Azure Monitor data platform.
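As an illustration of consuming these metrics outside the portal, the following minimal Python sketch uses the azure-monitor-query package to pull the PipelineFailedRuns and PipelineSucceededRuns totals for a hypothetical factory; the resource ID placeholder must be replaced:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

factory_id = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.DataFactory/factories/<factory-name>")

# Pull one day of per-minute totals for two of the pipeline run metrics.
result = client.query_resource(
    factory_id,
    metric_names=["PipelineFailedRuns", "PipelineSucceededRuns"],
    timespan=timedelta(days=1),
    aggregations=["Total"],
)
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)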

NOTE
Only events from completed, triggered activity and pipeline runs are emitted. In progress and debug runs are not
emitted. On the other hand, events from all SSIS package executions are emitted, including those that are completed and
in progress, regardless of their invocation methods. For example, you can invoke package executions on Azure-enabled
SQL Server Data Tools (SSDT), via T-SQL on SSMS, SQL Server Agent, or other designated tools, and as triggered or
debug runs of Execute SSIS Package activities in ADF pipelines.

Data Factory Alerts


Sign in to the Azure portal and select Monitor > Alerts to create alerts.
Create Alerts
1. Select + New Alert rule to create a new alert.

2. Define the alert condition.

NOTE
Make sure to select All in the Filter by resource type drop-down list.
3. Define the alert details.

4. Define the action group.


Set up diagnostic logs via the Azure Monitor REST API
Diagnostic settings
Use diagnostic settings to configure diagnostic logs for non-compute resources. The settings for a resource
control have the following features:
They specify where diagnostic logs are sent. Examples include an Azure storage account, an Azure event hub,
or Monitor logs.
They specify which log categories are sent.
They specify how long each log category should be kept in a storage account.
A retention of zero days means logs are kept forever. Otherwise, the value can be any number of days from 1
through 2,147,483,647.
If retention policies are set but storing logs in a storage account is disabled, the retention policies have no
effect. For example, this condition can happen when only Event Hubs or Monitor logs options are selected.
Retention policies are applied per day. The boundary between days occurs at midnight Coordinated Universal
Time (UTC). At the end of a day, logs from days that are beyond the retention policy are deleted. For example,
if you have a retention policy of one day, at the beginning of today the logs from before yesterday are
deleted.
Enable diagnostic logs via the Azure Monitor REST API
Create or update a diagnostics setting in the Monitor REST API
Request
PUT
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-
version={api-version}

Headers

Replace {api-version} with 2016-09-01 .


Replace {resource-id} with the ID of the resource for which you want to edit diagnostic settings. For more
information, see Using Resource groups to manage your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to the JSON web token that you got from Azure Active Directory (Azure AD). For
more information, see Authenticating requests.
Body

{
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<stor
ageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.EventHub/namespaces/<eventHub
Name>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspace
s/<LogAnalyticsName>",
"metrics": [
],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"location": ""
}

PROPERTY | TYPE | DESCRIPTION
storageAccountId | String | The resource ID of the storage account to which you want to send diagnostic logs.
serviceBusRuleId | String | The service-bus rule ID of the service-bus namespace in which you want to have Event Hubs created for streaming diagnostic logs. The rule ID has the format {service bus resource ID}/authorizationrules/{key name}.
workspaceId | String | The workspace ID of the workspace where the logs will be saved.
metrics | Parameter values of the pipeline run to be passed to the invoked pipeline | A JSON object that maps parameter names to argument values.
logs | Complex Type | The name of a diagnostic-log category for a resource type. To get the list of diagnostic-log categories for a resource, perform a GET diagnostic-settings operation.
category | String | An array of log categories and their retention policies.
timeGrain | String | The granularity of metrics, which are captured in ISO 8601 duration format. The property value must be PT1M, which specifies one minute.
enabled | Boolean | Specifies whether collection of the metric or log category is enabled for this resource.
retentionPolicy | Complex Type | Describes the retention policy for a metric or log category. This property is used for storage accounts only.
days | Int | The number of days to keep the metrics or logs. If the property value is 0, the logs are kept forever. This property is used for storage accounts only.

Response

200 OK.

{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.Storage/storageAccounts/<sto
rageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.EventHub/namespaces/<eventHu
bName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.OperationalInsights/workspac
es/<LogAnalyticsName>",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}

Get information about diagnostics settings in the Monitor REST API


Request

GET
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-
version={api-version}

Headers

Replace {api-version} with 2016-09-01 .


Replace {resource-id} with the ID of the resource for which you want to edit diagnostic settings. For more
information, see Using Resource groups to manage your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON web token that you got from Azure AD. For more information, see
Authenticating requests.
Response

200 OK.
{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.Storage/storageAccounts/azmonlogs",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.EventHub/namespaces/shloeventhub/auth
orizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/ADF/providers/Microsoft.OperationalInsights/workspaces/mihaipie",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}

For more information, see Diagnostic Settings.
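As a sketch of automating the calls above from Python, the following uses the requests package with an Azure AD token from azure-identity; the resource ID and workspace ID placeholders are hypothetical and must be replaced with your own values:

import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

resource_id = ("/subscriptions/<subID>/resourceGroups/<resourceGroupName>"
               "/providers/Microsoft.DataFactory/factories/<factoryName>")
url = (f"https://management.azure.com{resource_id}"
       "/providers/microsoft.insights/diagnosticSettings/service?api-version=2016-09-01")

body = {
    "properties": {
        "workspaceId": ("/subscriptions/<subID>/resourceGroups/<resourceGroupName>"
                        "/providers/Microsoft.OperationalInsights/workspaces/<LogAnalyticsName>"),
        "metrics": [],
        "logs": [
            {"category": "PipelineRuns", "enabled": True,
             "retentionPolicy": {"enabled": False, "days": 0}},
            {"category": "TriggerRuns", "enabled": True,
             "retentionPolicy": {"enabled": False, "days": 0}},
            {"category": "ActivityRuns", "enabled": True,
             "retentionPolicy": {"enabled": False, "days": 0}},
        ],
    },
    "location": "",
}

# Create or update the diagnostic setting, then read it back.
print(requests.put(url, json=body, headers=headers).status_code)
print(requests.get(url, headers=headers).json())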

Schema of logs and events


Monitor schema
Activity-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"activityRunId":"",
"pipelineRunId":"",
"resourceId":"",
"category":"ActivityRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"activityName":"",
"start":"",
"end":"",
"properties":
{
"Input": "{
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}",
"Output": "{"dataRead":121,"dataWritten":121,"copyDuration":5,
"throughput":0.0236328132,"errors":[]}",
"Error": "{
"errorCode": "null",
"message": "null",
"failureType": "null",
"target": "CopyBlobtoBlob"
}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
activityRunId | String | The ID of the activity run. | 3a171e1f-b36e-4b80-8a54-5625394f4354
pipelineRunId | String | The ID of the pipeline run. | 9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to ActivityRuns. | ActivityRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the activity with its status. If the activity is the start heartbeat, the property value is MyActivity -. If the activity is the end heartbeat, the property value is MyActivity - Succeeded. | MyActivity - Succeeded
pipelineName | String | The name of the pipeline. | MyPipeline
activityName | String | The name of the activity. | MyActivity
start | String | The start time of the activity runs in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
end | String | The end time of the activity runs in timespan UTC format. If the diagnostic log shows that an activity has started but not yet ended, the property value is 1601-01-01T00:00:00Z. | 2017-06-26T20:55:29.5007959Z

Pipeline-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"runId":"",
"resourceId":"",
"category":"PipelineRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"start":"",
"end":"",
"status":"",
"properties":
{
"Parameters": {
"<parameter1Name>": "<parameter1Value>"
},
"SystemParameters": {
"ExecutionStart": "",
"TriggerId": "",
"SubscriptionId": ""
}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
runId | String | The ID of the pipeline run. | 9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to PipelineRuns. | PipelineRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the pipeline along with its status. After the pipeline run is finished, the property value is Pipeline - Succeeded. | MyPipeline - Succeeded
pipelineName | String | The name of the pipeline. | MyPipeline
start | String | The start time of the activity runs in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
end | String | The end time of the activity runs in timespan UTC format. If the diagnostic log shows an activity has started but not yet ended, the property value is 1601-01-01T00:00:00Z. | 2017-06-26T20:55:29.5007959Z
status | String | The final status of the pipeline run. Possible property values are Succeeded and Failed. | Succeeded

Trigger-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"triggerId":"",
"resourceId":"",
"category":"TriggerRuns",
"level":"Informational",
"operationName":"",
"triggerName":"",
"triggerType":"",
"triggerEvent":"",
"start":"",
"status":"",
"properties":
{
"Parameters": {
"TriggerTime": "",
"ScheduleTime": ""
},
"SystemParameters": {}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
triggerId | String | The ID of the trigger run. | 08587023010602533858661257311
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to PipelineRuns. | PipelineRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the trigger with its final status, which indicates whether the trigger successfully fired. If the heartbeat was successful, the property value is MyTrigger - Succeeded. | MyTrigger - Succeeded
triggerName | String | The name of the trigger. | MyTrigger
triggerType | String | The type of the trigger. Possible property values are Manual Trigger and Schedule Trigger. | ScheduleTrigger
triggerEvent | String | The event of the trigger. | ScheduleTime - 2017-07-06T01:50:25Z
start | String | The start time of the trigger firing in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
status | String | The final status showing whether the trigger successfully fired. Possible property values are Succeeded and Failed. | Succeeded

SSIS integration runtime log attributes


Here are the log attributes of SSIS IR start/stop/maintenance operations.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"resultType": "",
"properties": {
"message": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | The name of your SSIS IR operation | Start/Stop/Maintenance
category | String | The category of diagnostic logs | SSISIntegrationRuntimeLogs
correlationId | String | The unique ID for tracking a particular operation | f13b159b-515f-4885-9dfa-a664e949f785Deprovision0059035558
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
resultType | String | The result of your SSIS IR operation | Started/InProgress/Succeeded/Failed
message | String | The output message of your SSIS IR operation | The stopping of your SSIS integration runtime has succeeded.
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS event message context log attributes


Here are the log attributes of conditions related to event messages that are generated by SSIS package executions on your SSIS IR. They convey information similar to the SSIS catalog (SSISDB) event message context table or view, which shows run-time values of many SSIS package properties. They're generated when you select the Basic/Verbose logging level, and they're useful for debugging and compliance checking.
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"operationId": "",
"contextDepth": "",
"packagePath": "",
"contextType": "",
"contextSourceName": "",
"contextSourceId": "",
"propertyName": "",
"propertyValue": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageEventMessageContext | mysqlmissisir-SSISPackageEventMessageContext
category | String | The category of diagnostic logs | SSISPackageEventMessageContext
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
operationId | String | The unique ID for tracking a particular operation in SSISDB | 1 (1 signifies operations related to packages not stored in SSISDB/invoked via T-SQL)
contextDepth | String | The depth of your event message context | 0 (0 signifies the context before package execution starts, 1 signifies the context when an error occurs, and it increases as the context is further from the error)
packagePath | String | The path of package object as your event message context source | \Package
contextType | String | The type of package object as your event message context source | 60 (see more context types)
contextSourceName | String | The name of package object as your event message context source | MyPackage
contextSourceId | String | The unique ID of package object as your event message context source | {E2CF27FB-EA48-41E9-AF6F-3FE938B4ADE1}
propertyName | String | The name of package property for your event message context source | DelayValidation
propertyValue | String | The value of package property for your event message context source | False
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS event messages log attributes


Here are the log attributes of event messages that are generated by SSIS package executions on your SSIS IR.
They convey similar information as SSISDB event messages table or view that shows the detailed text/metadata
of event messages. They're generated at any logging level except None .
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"operationId": "",
"messageTime": "",
"messageType": "",
"messageSourceType": "",
"message": "",
"packageName": "",
"eventName": "",
"messageSourceName": "",
"messageSourceId": "",
"subcomponentName": "",
"packagePath": "",
"executionPath": "",
"threadId": ""
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageEventMessages | mysqlmissisir-SSISPackageEventMessages
category | String | The category of diagnostic logs | SSISPackageEventMessages
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
operationId | String | The unique ID for tracking a particular operation in SSISDB | 1 (1 signifies operations related to packages not stored in SSISDB/invoked via T-SQL)
messageTime | String | The time when your event message is created in UTC format | 2017-06-28T21:00:27.3534352Z
messageType | String | The type of your event message | 70 (see more message types)
messageSourceType | String | The type of your event message source | 20 (see more message source types)
message | String | The text of your event message | MyPackage:Validation has started.
packageName | String | The name of your executed package file | MyPackage.dtsx
eventName | String | The name of related run-time event | OnPreValidate
messageSourceName | String | The name of package component as your event message source | Data Flow Task
messageSourceId | String | The unique ID of package component as your event message source | {1a45a5a4-3df9-4f02-b818-ebf583829ad2}
subcomponentName | String | The name of data flow component as your event message source | SSIS.Pipeline
packagePath | String | The path of package object as your event message source | \Package\Data Flow Task
executionPath | String | The full path from parent package to executed component (This path also captures component iterations) | \Transformation\Data Flow Task
threadId | String | The unique ID of thread executed when your event message is logged | {1a45a5a4-3df9-4f02-b818-ebf583829ad2}

SSIS executable statistics log attributes


Here are the log attributes of executable statistics that are generated by SSIS package executions on your SSIS
IR, where executables are containers or tasks in the control flow of packages. They convey similar information as
SSISDB executable statistics table or view that shows a row for each running executable, including its iterations.
They're generated at any logging level except None and useful for identifying task-level bottlenecks/failures.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"executionPath": "",
"startTime": "",
"endTime": "",
"executionDuration": "",
"executionResult": "",
"executionValue": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutableStatistics | mysqlmissisir-SSISPackageExecutableStatistics
category | String | The category of diagnostic logs | SSISPackageExecutableStatistics
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
executionPath | String | The full path from parent package to executed component (This path also captures component iterations) | \Transformation\Data Flow Task
startTime | String | The time when executable enters pre-execute phase in UTC format | 2017-06-28T21:00:27.3534352Z
endTime | String | The time when executable enters post-execute phase in UTC format | 2017-06-28T21:00:27.3534352Z
executionDuration | String | The running time of executable in milliseconds | 1,125
executionResult | String | The result of running executable | 0 (0 signifies success, 1 signifies failure, 2 signifies completion, and 3 signifies cancelation)
executionValue | String | The user-defined value returned by running executable | 1
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS execution component phases log attributes


Here are the log attributes of run-time statistics for data flow components that are generated by SSIS package
executions on your SSIS IR. They convey similar information as SSISDB execution component phases table or
view that shows the time spent by data flow components in all their execution phases. They're generated when
you select Performance/Verbose logging level and useful for capturing data flow execution statistics.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"packageName": "",
"taskName": "",
"subcomponentName": "",
"phase": "",
"startTime": "",
"endTime": "",
"executionPath": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutionComponentPhases | mysqlmissisir-SSISPackageExecutionComponentPhases
category | String | The category of diagnostic logs | SSISPackageExecutionComponentPhases
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
packageName | String | The name of your executed package file | MyPackage.dtsx
taskName | String | The name of executed data flow task | Data Flow Task
subcomponentName | String | The name of data flow component | Derived Column
phase | String | The name of execution phase | AcquireConnections
startTime | String | The time when execution phase starts in UTC format | 2017-06-28T21:00:27.3534352Z
endTime | String | The time when execution phase ends in UTC format | 2017-06-28T21:00:27.3534352Z
executionPath | String | The path of execution for data flow task | \Transformation\Data Flow Task
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS execution data statistics log attributes


Here are the log attributes of data movements through each leg of data flow pipelines, from upstream to
downstream components, that are generated by SSIS package executions on your SSIS IR. They convey similar
information as SSISDB execution data statistics table or view that shows row counts of data moved through data
flow tasks. They're generated when you select Verbose logging level and useful for computing data flow
throughput.
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"packageName": "",
"taskName": "",
"dataflowPathIdString": "",
"dataflowPathName": "",
"sourceComponentName": "",
"destinationComponentName": "",
"rowsSent": "",
"createdTime": "",
"executionPath": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE
time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutionDataStatistics | mysqlmissisir-SSISPackageExecutionDataStatistics
category | String | The category of diagnostic logs | SSISPackageExecutionDataStatistics
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
packageName | String | The name of your executed package file | MyPackage.dtsx
taskName | String | The name of executed data flow task | Data Flow Task
dataflowPathIdString | String | The unique ID for tracking data flow path | Paths[SQLDB Table3.ADO NET Source Output]
dataflowPathName | String | The name of data flow path | ADO NET Source Output
sourceComponentName | String | The name of data flow component that sends data | SQLDB Table3
destinationComponentName | String | The name of data flow component that receives data | Derived Column
rowsSent | String | The number of rows sent by source component | 500
createdTime | String | The time when row values are obtained in UTC format | 2017-06-28T21:00:27.3534352Z
executionPath | String | The path of execution for data flow task | \Transformation\Data Flow Task
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

Log Analytics schema


Log Analytics inherits the schema from Monitor with the following exceptions:
The first letter in each column name is capitalized. For example, the column name "correlationId" in
Monitor is "CorrelationId" in Log Analytics.
There's no "Level" column.
The dynamic "properties" column is preserved as the following dynamic JSON blob type.
AZURE MONITOR COLUMN | LOG ANALYTICS COLUMN | TYPE
$.properties.UserProperties | UserProperties | Dynamic
$.properties.Annotations | Annotations | Dynamic
$.properties.Input | Input | Dynamic
$.properties.Output | Output | Dynamic
$.properties.Error.errorCode | ErrorCode | int
$.properties.Error.message | ErrorMessage | string
$.properties.Error | Error | Dynamic
$.properties.Predecessors | Predecessors | Dynamic
$.properties.Parameters | Parameters | Dynamic
$.properties.SystemParameters | SystemParameters | Dynamic
$.properties.Tags | Tags | Dynamic

Monitor SSIS operations with Azure Monitor


To lift & shift your SSIS workloads, you can provision SSIS IR in ADF that supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
Once provisioned, you can check SSIS IR operational status using Azure PowerShell or on the Monitor hub of
ADF portal. With Project Deployment Model, SSIS package execution logs are stored in SSISDB internal tables or
views, so you can query, analyze, and visually present them using designated tools like SSMS. With Package
Deployment Model, SSIS package execution logs can be stored in file system or Azure Files as CSV files that you
still need to parse and process using other designated tools before you can query, analyze, and visually present
them.
Now with Azure Monitor integration, you can query, analyze, and visually present all metrics and logs generated
from SSIS IR operations and SSIS package executions on Azure portal. Additionally, you can also raise alerts on
them.
Configure diagnostic settings and workspace for SSIS operations
To send all metrics and logs generated from SSIS IR operations and SSIS package executions to Azure Monitor,
you need to configure diagnostics settings and workspace for your ADF.
SSIS operational metrics
SSIS operational metrics are performance counters or numerical values that describe the status of SSIS IR start
and stop operations, as well as SSIS package executions at a particular point in time. They're part of ADF metrics
in Azure Monitor.
When you configure diagnostic settings and workspace for your ADF on Azure Monitor, selecting the AllMetrics
check box will make SSIS operational metrics available for interactive analysis using Azure Metrics Explorer,
presentation on Azure dashboard, and near-real time alerts.

SSIS operational alerts


To raise alerts on SSIS operational metrics from the ADF portal, select the Alerts & metrics page of the ADF Monitor hub and follow the step-by-step instructions provided.

To raise alerts on SSIS operational metrics from the Azure portal, select the Alerts page of the Azure Monitor hub and follow the step-by-step instructions provided.

SSIS operational logs


SSIS operational logs are events generated by SSIS IR operations and SSIS package executions that provide
enough context on any identified issues and are useful for root cause analysis.
When you configure diagnostic settings and workspace for your ADF on Azure Monitor, you can select the
relevant SSIS operational logs and send them to Log Analytics that's based on Azure Data Explorer. In there,
they'll be made available for analysis using rich query language, presentation on Azure dashboard, and near-
real time alerts.

The schemas and content of SSIS package execution logs in Azure Monitor and Log Analytics are similar to the
schemas of SSISDB internal tables or views.

AZURE MONITOR LOG CATEGORIES | LOG ANALYTICS TABLES | SSISDB INTERNAL TABLES/VIEWS
SSISIntegrationRuntimeLogs | ADFSSISIntegrationRuntimeLogs |
SSISPackageEventMessageContext | ADFSSISPackageEventMessageContext | [internal].[event_message_context]
SSISPackageEventMessages | ADFSSISPackageEventMessages | [internal].[event_messages]
SSISPackageExecutableStatistics | ADFSSISPackageExecutableStatistics | [internal].[executable_statistics]
SSISPackageExecutionComponentPhases | ADFSSISPackageExecutionComponentPhases | [internal].[execution_component_phases]
SSISPackageExecutionDataStatistics | ADFSSISPackageExecutionDataStatistics | [internal].[execution_data_statistics]

For more info on SSIS operational log attributes/properties, see Azure Monitor and Log Analytics schemas for
ADF.
Your selected SSIS package execution logs are always sent to Log Analytics regardless of their invocation
methods. For example, you can invoke package executions on Azure-enabled SSDT, via T-SQL on SSMS, SQL
Server Agent, or other designated tools, and as triggered or debug runs of Execute SSIS Package activities in
ADF pipelines.
When querying SSIS IR operation logs in Log Analytics, you can use the OperationName and ResultType properties, which are set to Start/Stop/Maintenance and Started/InProgress/Succeeded/Failed, respectively.

When querying SSIS package execution logs in Log Analytics, you can join them using the OperationId/ExecutionId/CorrelationId properties. OperationId/ExecutionId are always set to 1 for all operations/executions related to packages not stored in SSISDB/invoked via T-SQL.
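For example, here is a minimal sketch of joining SSIS package event messages with executable statistics in Log Analytics, using the azure-monitor-query Python package and assuming that the column names follow the Log Analytics capitalization rules described earlier; the workspace ID is a placeholder:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Correlate event messages with executable statistics for recent SSIS package runs.
query = """
ADFSSISPackageEventMessages
| join kind=inner (ADFSSISPackageExecutableStatistics) on CorrelationId
| project TimeGenerated, PackageName, Message, ExecutionDuration, ExecutionResult
| take 50
"""

response = client.query_workspace("<log-analytics-workspace-id>", query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)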

Next steps
Monitor and manage pipelines programmatically
Monitor and Alert Data Factory by using Azure
Monitor
4/22/2021 • 27 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Cloud applications are complex and have many moving parts. Monitors provide data to help ensure that your
applications stay up and running in a healthy state. Monitors also help you avoid potential problems and
troubleshoot past ones. You can use monitoring data to gain deep insights about your applications. This
knowledge helps you improve application performance and maintainability. It also helps you automate actions
that otherwise require manual intervention.
Azure Monitor provides base-level infrastructure metrics and logs for most Azure services. Azure diagnostic
logs are emitted by a resource and provide rich, frequent data about the operation of that resource. Azure Data
Factory (ADF) can write diagnostic logs in Azure Monitor. For a seven-minute introduction and demonstration of
this feature, watch the following video:

For more information, see Azure Monitor overview.

Keeping Azure Data Factory metrics and pipeline-run data


Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a
longer time. With Monitor, you can route diagnostic logs for analysis to multiple different targets.
Storage Account : Save your diagnostic logs to a storage account for auditing or manual inspection. You can
use the diagnostic settings to specify the retention time in days.
Event Hub : Stream the logs to Azure Event Hubs. The logs become input to a partner service/custom
analytics solution like Power BI.
Log Analytics : Analyze the logs with Log Analytics. The Data Factory integration with Azure Monitor is
useful in the following scenarios:
You want to write complex queries on a rich set of metrics that are published by Data Factory to
Monitor. You can create custom alerts on these queries via Monitor.
You want to monitor across data factories. You can route data from multiple data factories to a single
Monitor workspace.
You can also use a storage account or event-hub namespace that isn't in the subscription of the resource that
emits logs. The user who configures the setting must have appropriate Azure role-based access control (Azure
RBAC) access to both subscriptions.

Configure diagnostic settings and workspace


Create or add diagnostic settings for your data factory.
1. In the portal, go to Monitor. Select Settings > Diagnostic settings .
2. Select the data factory for which you want to set a diagnostic setting.
3. If no settings exist on the selected data factory, you're prompted to create a setting. Select Turn on
diagnostics .

If there are existing settings on the data factory, you see a list of settings already configured on the data
factory. Select Add diagnostic setting .
4. Give your setting a name, select Send to Log Analytics , and then select a workspace from Log
Analytics Workspace .
In Azure-Diagnostics mode, diagnostic logs flow into the AzureDiagnostics table.
In Resource-Specific mode, diagnostic logs from Azure Data Factory flow into the following tables:
ADFActivityRun
ADFPipelineRun
ADFTriggerRun
ADFSSISIntegrationRuntimeLogs
ADFSSISPackageEventMessageContext
ADFSSISPackageEventMessages
ADFSSISPackageExecutableStatistics
ADFSSISPackageExecutionComponentPhases
ADFSSISPackageExecutionDataStatistics
You can select various logs relevant to your workloads to send to Log Analytics tables. For
example, if you don't use SQL Server Integration Services (SSIS) at all, you need not select any
SSIS logs. If you want to log SSIS Integration Runtime (IR) start/stop/maintenance operations, you
can select SSIS IR logs. If you invoke SSIS package executions via T-SQL on SQL Server
Management Studio (SSMS), SQL Server Agent, or other designated tools, you can select SSIS
package logs. If you invoke SSIS package executions via Execute SSIS Package activities in ADF
pipelines, you can select all logs.
If you select AllMetrics, various ADF metrics will be made available for you to monitor or raise
alerts on, including the metrics for ADF activity, pipeline, and trigger runs, as well as for SSIS IR
operations and SSIS package executions.

NOTE
Because an Azure log table can't have more than 500 columns, we highly recommend that you select Resource-
Specific mode. For more information, see AzureDiagnostics Logs reference.

5. Select Save .
After a few moments, the new setting appears in your list of settings for this data factory. Diagnostic logs are
streamed to that workspace as soon as new event data is generated. Up to 15 minutes might elapse between
when an event is emitted and when it appears in Log Analytics.
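If you prefer to script this configuration, here's a minimal Azure PowerShell sketch using the Az.Monitor module; the setting name, resource ID, and workspace ID are placeholders, and the category list mirrors the non-SSIS tables listed above:

# Placeholder resource IDs; replace with your data factory and Log Analytics workspace.
$dataFactoryId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<factoryName>"
$workspaceId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.OperationalInsights/workspaces/<workspaceName>"

# Send pipeline, trigger, and activity run logs plus all metrics to the Log Analytics workspace.
Set-AzDiagnosticSetting -Name "adf-diagnostics" `
    -ResourceId $dataFactoryId `
    -WorkspaceId $workspaceId `
    -Category PipelineRuns,TriggerRuns,ActivityRuns `
    -MetricCategory AllMetrics `
    -Enabled $true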

Install Azure Data Factory Analytics solution from Azure Marketplace


This solution provides you with a summary of the overall health of your Data Factory, with options to drill into details
and to troubleshoot unexpected behavior patterns. With rich, out-of-the-box views you can get insights into key
processing, including:
At-a-glance summary of data factory pipeline, activity, and trigger runs
Ability to drill into data factory activity runs by type
Summary of data factory top pipeline and activity errors
1. Go to Azure Marketplace, choose Analytics filter, and search for Azure Data Factory Analytics
(Preview)

2. Details about Azure Data Factory Analytics (Preview)

3. Select Create and then create or select the Log Analytics Workspace .
Monitor Data Factory metrics
Installing this solution creates a default set of views inside the workbooks section of the chosen Log Analytics
workspace. As a result, the following metrics become enabled:
ADF Runs - 1) Pipeline Runs by Data Factory
ADF Runs - 2) Activity Runs by Data Factory
ADF Runs - 3) Trigger Runs by Data Factory
ADF Errors - 1) Top 10 Pipeline Errors by Data Factory
ADF Errors - 2) Top 10 Activity Runs by Data Factory
ADF Errors - 3) Top 10 Trigger Errors by Data Factory
ADF Statistics - 1) Activity Runs by Type
ADF Statistics - 2) Trigger Runs by Type
ADF Statistics - 3) Max Pipeline Runs Duration

You can visualize the preceding metrics, look at the queries behind these metrics, edit the queries, create alerts,
and take other actions.
NOTE
Azure Data Factory Analytics (Preview) sends diagnostic logs to Resource-specific destination tables. You can write queries
against the following tables: ADFPipelineRun, ADFTriggerRun, and ADFActivityRun.

Data Factory Metrics


With Monitor, you can gain visibility into the performance and health of your Azure workloads. The most
important type of Monitor data is the metric, which is also called the performance counter. Metrics are emitted
by most Azure resources. Monitor provides several ways to configure and consume these metrics for
monitoring and troubleshooting.
Here are some of the metrics emitted by Azure Data Factory version 2:

METRIC NAME | METRIC DISPLAY NAME | UNIT | AGGREGATION TYPE | DESCRIPTION

ActivityCancelledRuns | Cancelled activity runs metrics | Count | Total | The total number of activity runs that were cancelled within a minute window.
ActivityFailedRuns | Failed activity runs metrics | Count | Total | The total number of activity runs that failed within a minute window.
ActivitySucceededRuns | Succeeded activity runs metrics | Count | Total | The total number of activity runs that succeeded within a minute window.
PipelineCancelledRuns | Cancelled pipeline runs metrics | Count | Total | The total number of pipeline runs that were cancelled within a minute window.
PipelineFailedRuns | Failed pipeline runs metrics | Count | Total | The total number of pipeline runs that failed within a minute window.
PipelineSucceededRuns | Succeeded pipeline runs metrics | Count | Total | The total number of pipeline runs that succeeded within a minute window.
TriggerCancelledRuns | Cancelled trigger runs metrics | Count | Total | The total number of trigger runs that were cancelled within a minute window.
TriggerFailedRuns | Failed trigger runs metrics | Count | Total | The total number of trigger runs that failed within a minute window.
TriggerSucceededRuns | Succeeded trigger runs metrics | Count | Total | The total number of trigger runs that succeeded within a minute window.
SSISIntegrationRuntimeStartCancelled | Cancelled SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that were cancelled within a minute window.
SSISIntegrationRuntimeStartFailed | Failed SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that failed within a minute window.
SSISIntegrationRuntimeStartSucceeded | Succeeded SSIS integration runtime start metrics | Count | Total | The total number of SSIS integration runtime starts that succeeded within a minute window.
SSISIntegrationRuntimeStopStuck | Stuck SSIS integration runtime stop metrics | Count | Total | The total number of SSIS integration runtime stops that were stuck within a minute window.
SSISIntegrationRuntimeStopSucceeded | Succeeded SSIS integration runtime stop metrics | Count | Total | The total number of SSIS integration runtime stops that succeeded within a minute window.
SSISPackageExecutionCancelled | Cancelled SSIS package execution metrics | Count | Total | The total number of SSIS package executions that were cancelled within a minute window.
SSISPackageExecutionFailed | Failed SSIS package execution metrics | Count | Total | The total number of SSIS package executions that failed within a minute window.
SSISPackageExecutionSucceeded | Succeeded SSIS package execution metrics | Count | Total | The total number of SSIS package executions that succeeded within a minute window.

To access the metrics, complete the instructions in Azure Monitor data platform.

NOTE
Only events from completed, triggered activity and pipeline runs are emitted. In progress and debug runs are not
emitted. On the other hand, events from all SSIS package executions are emitted, including those that are completed and
in progress, regardless of their invocation methods. For example, you can invoke package executions on Azure-enabled
SQL Server Data Tools (SSDT), via T-SQL on SSMS, SQL Server Agent, or other designated tools, and as triggered or
debug runs of Execute SSIS Package activities in ADF pipelines.
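
As a quick sketch, one way to pull one of these metrics programmatically is the Az.Monitor Get-AzMetric cmdlet; the resource ID below is a placeholder:

# Placeholder resource ID for your data factory.
$dataFactoryId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<factoryName>"

# Total succeeded pipeline runs per one-minute window over the last hour.
Get-AzMetric -ResourceId $dataFactoryId `
    -MetricName "PipelineSucceededRuns" `
    -TimeGrain 00:01:00 `
    -AggregationType Total `
    -StartTime (Get-Date).AddHours(-1) `
    -EndTime (Get-Date)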

Data Factory Alerts


Sign in to the Azure portal and select Monitor > Alerts to create alerts.
Create Alerts
1. Select + New Alert rule to create a new alert.

2. Define the alert condition.

NOTE
Make sure to select All in the Filter by resource type drop-down list.
3. Define the alert details.

4. Define the action group.
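
The same kind of alert can also be scripted. The following is a hedged sketch using the Az.Monitor metric alert cmdlets; the alert name, threshold, and resource IDs are placeholders:

# Placeholder IDs; replace with your data factory and action group resource IDs.
$dataFactoryId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<factoryName>"
$actionGroupId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/microsoft.insights/actionGroups/<actionGroupName>"

# Fire when any pipeline run fails within the evaluation window.
$condition = New-AzMetricAlertRuleV2Criteria -MetricName "PipelineFailedRuns" `
    -TimeAggregation Total -Operator GreaterThan -Threshold 0

Add-AzMetricAlertRuleV2 -Name "ADF pipeline failure alert" `
    -ResourceGroupName "<rgName>" `
    -TargetResourceId $dataFactoryId `
    -WindowSize 00:05:00 `
    -Frequency 00:05:00 `
    -Condition $condition `
    -ActionGroupId $actionGroupId `
    -Severity 2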


Set up diagnostic logs via the Azure Monitor REST API
Diagnostic settings
Use diagnostic settings to configure diagnostic logs for non-compute resources. The settings for a resource
control have the following features:
They specify where diagnostic logs are sent. Examples include an Azure storage account, an Azure event hub,
or Monitor logs.
They specify which log categories are sent.
They specify how long each log category should be kept in a storage account.
A retention of zero days means logs are kept forever. Otherwise, the value can be any number of days from 1
through 2,147,483,647.
If retention policies are set but storing logs in a storage account is disabled, the retention policies have no
effect. For example, this condition can happen when only Event Hubs or Monitor logs options are selected.
Retention policies are applied per day. The boundary between days occurs at midnight Coordinated Universal
Time (UTC). At the end of a day, logs from days that are beyond the retention policy are deleted. For example,
if you have a retention policy of one day, at the beginning of today the logs from before yesterday are
deleted.
Enable diagnostic logs via the Azure Monitor REST API
Create or update a diagnostics setting in the Monitor REST API
Request

PUT
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}

Headers

Replace {api-version} with 2016-09-01 .


Replace {resource-id} with the ID of the resource for which you want to edit diagnostic settings. For more
information, see Using Resource groups to manage your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to the JSON web token that you got from Azure Active Directory (Azure AD). For
more information, see Authenticating requests.
Body

{
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<stor
ageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.EventHub/namespaces/<eventHub
Name>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspace
s/<LogAnalyticsName>",
"metrics": [
],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"location": ""
}

PROPERTY | TYPE | DESCRIPTION

storageAccountId | String | The resource ID of the storage account to which you want to send diagnostic logs.
serviceBusRuleId | String | The service-bus rule ID of the service-bus namespace in which you want to have Event Hubs created for streaming diagnostic logs. The rule ID has the format {service bus resource ID}/authorizationrules/{key name}.
workspaceId | String | The workspace ID of the workspace where the logs will be saved.
metrics | Parameter values of the pipeline run to be passed to the invoked pipeline | A JSON object that maps parameter names to argument values.
logs | Complex Type | The name of a diagnostic-log category for a resource type. To get the list of diagnostic-log categories for a resource, perform a GET diagnostic-settings operation.
category | String | An array of log categories and their retention policies.
timeGrain | String | The granularity of metrics, which are captured in ISO 8601 duration format. The property value must be PT1M, which specifies one minute.
enabled | Boolean | Specifies whether collection of the metric or log category is enabled for this resource.
retentionPolicy | Complex Type | Describes the retention policy for a metric or log category. This property is used for storage accounts only.
days | Int | The number of days to keep the metrics or logs. If the property value is 0, the logs are kept forever. This property is used for storage accounts only.

Response

200 OK.

{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.Storage/storageAccounts/<sto
rageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.EventHub/namespaces/<eventHu
bName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.OperationalInsights/workspac
es/<LogAnalyticsName>",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}
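
As an illustration, the PUT request above can be submitted from PowerShell with Invoke-RestMethod; the resource ID, token, and body file are assumptions for this sketch:

# Placeholder values; replace with your data factory resource ID and a valid Azure AD bearer token.
$resourceId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<factoryName>"
$apiVersion = "2016-09-01"
$token = "<Azure AD JSON web token>"
$uri = "https://management.azure.com$resourceId/providers/microsoft.insights/diagnosticSettings/service?api-version=$apiVersion"

# The request body shown above, saved locally as body.json (hypothetical file name).
$body = Get-Content -Raw -Path ".\body.json"

Invoke-RestMethod -Method Put -Uri $uri `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body $body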

Get information about diagnostics settings in the Monitor REST API


Request

GET
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-version={api-version}

Headers

Replace {api-version} with 2016-09-01 .


Replace {resource-id} with the ID of the resource for which you want to edit diagnostic settings. For more
information, see Using Resource groups to manage your Azure resources.
Set the Content-Type header to application/json .
Set the authorization header to a JSON web token that you got from Azure AD. For more information, see
Authenticating requests.
Response

200 OK.
{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.Storage/storageAccounts/azmonlogs",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.EventHub/namespaces/shloeventhub/auth
orizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/ADF/providers/Microsoft.OperationalInsights/workspaces/mihaipie",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}

For more information, see Diagnostic Settings.
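
If you're already working in Azure PowerShell, the same information can be retrieved without hand-crafting the REST call; a brief sketch (the resource ID is a placeholder):

# Placeholder data factory resource ID.
$dataFactoryId = "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<factoryName>"

# List the diagnostic settings configured on the data factory.
Get-AzDiagnosticSetting -ResourceId $dataFactoryId | ConvertTo-Json -Depth 10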

Schema of logs and events


Monitor schema
Activity-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"activityRunId":"",
"pipelineRunId":"",
"resourceId":"",
"category":"ActivityRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"activityName":"",
"start":"",
"end":"",
"properties":
{
"Input": "{
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}",
"Output": "{"dataRead":121,"dataWritten":121,"copyDuration":5,
"throughput":0.0236328132,"errors":[]}",
"Error": "{
"errorCode": "null",
"message": "null",
"failureType": "null",
"target": "CopyBlobtoBlob"
}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
activityRunId | String | The ID of the activity run. | 3a171e1f-b36e-4b80-8a54-5625394f4354
pipelineRunId | String | The ID of the pipeline run. | 9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to ActivityRuns. | ActivityRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the activity with its status. If the activity is the start heartbeat, the property value is MyActivity -. If the activity is the end heartbeat, the property value is MyActivity - Succeeded. | MyActivity - Succeeded
pipelineName | String | The name of the pipeline. | MyPipeline
activityName | String | The name of the activity. | MyActivity
start | String | The start time of the activity runs in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
end | String | The end time of the activity runs in timespan UTC format. If the diagnostic log shows that an activity has started but not yet ended, the property value is 1601-01-01T00:00:00Z. | 2017-06-26T20:55:29.5007959Z

Pipeline-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"runId":"",
"resourceId":"",
"category":"PipelineRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"start":"",
"end":"",
"status":"",
"properties":
{
"Parameters": {
"<parameter1Name>": "<parameter1Value>"
},
"SystemParameters": {
"ExecutionStart": "",
"TriggerId": "",
"SubscriptionId": ""
}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
runId | String | The ID of the pipeline run. | 9f6069d6-e522-4608-9f99-21807bfc3c70
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to PipelineRuns. | PipelineRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the pipeline along with its status. After the pipeline run is finished, the property value is Pipeline - Succeeded. | MyPipeline - Succeeded.
pipelineName | String | The name of the pipeline. | MyPipeline
start | String | The start time of the activity runs in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
end | String | The end time of the activity runs in timespan UTC format. If the diagnostic log shows an activity has started but not yet ended, the property value is 1601-01-01T00:00:00Z. | 2017-06-26T20:55:29.5007959Z
status | String | The final status of the pipeline run. Possible property values are Succeeded and Failed. | Succeeded

Trigger-run log attributes

{
"Level": "",
"correlationId":"",
"time":"",
"triggerId":"",
"resourceId":"",
"category":"TriggerRuns",
"level":"Informational",
"operationName":"",
"triggerName":"",
"triggerType":"",
"triggerEvent":"",
"start":"",
"status":"",
"properties":
{
"Parameters": {
"TriggerTime": "",
"ScheduleTime": ""
},
"SystemParameters": {}
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

Level | String | The level of the diagnostic logs. For activity-run logs, set the property value to 4. | 4
correlationId | String | The unique ID for tracking a particular request. | 319dc6b4-f348-405e-b8d7-aafc77b73e77
time | String | The time of the event in the timespan UTC format YYYY-MM-DDTHH:MM:SS.00000Z. | 2017-06-28T21:00:27.3534352Z
triggerId | String | The ID of the trigger run. | 08587023010602533858661257311
resourceId | String | The ID associated with the data-factory resource. | /SUBSCRIPTIONS/<subID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICROSOFT.DATA
category | String | The category of the diagnostic logs. Set the property value to TriggerRuns. | TriggerRuns
level | String | The level of the diagnostic logs. Set the property value to Informational. | Informational
operationName | String | The name of the trigger with its final status, which indicates whether the trigger successfully fired. If the heartbeat was successful, the property value is MyTrigger - Succeeded. | MyTrigger - Succeeded
triggerName | String | The name of the trigger. | MyTrigger
triggerType | String | The type of the trigger. Possible property values are Manual Trigger and Schedule Trigger. | ScheduleTrigger
triggerEvent | String | The event of the trigger. | ScheduleTime - 2017-07-06T01:50:25Z
start | String | The start time of the trigger firing in timespan UTC format. | 2017-06-26T20:55:29.5007959Z
status | String | The final status showing whether the trigger successfully fired. Possible property values are Succeeded and Failed. | Succeeded

SSIS integration runtime log attributes


Here are the log attributes of SSIS IR start/stop/maintenance operations.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"resultType": "",
"properties": {
"message": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | The name of your SSIS IR operation | Start/Stop/Maintenance
category | String | The category of diagnostic logs | SSISIntegrationRuntimeLogs
correlationId | String | The unique ID for tracking a particular operation | f13b159b-515f-4885-9dfa-a664e949f785Deprovision0059035558
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
resultType | String | The result of your SSIS IR operation | Started/InProgress/Succeeded/Failed
message | String | The output message of your SSIS IR operation | The stopping of your SSIS integration runtime has succeeded.
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS event message context log attributes


Here are the log attributes of conditions related to event messages that are generated by SSIS package
executions on your SSIS IR. They convey similar information as SSIS catalog (SSISDB) event message context
table or view that shows run-time values of many SSIS package properties. They're generated when you select
Basic/Verbose logging level and are useful for debugging/compliance checking.
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"operationId": "",
"contextDepth": "",
"packagePath": "",
"contextType": "",
"contextSourceName": "",
"contextSourceId": "",
"propertyName": "",
"propertyValue": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageEventMessageContext | mysqlmissisir-SSISPackageEventMessageContext
category | String | The category of diagnostic logs | SSISPackageEventMessageContext
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
operationId | String | The unique ID for tracking a particular operation in SSISDB | 1 (1 signifies operations related to packages not stored in SSISDB/invoked via T-SQL)
contextDepth | String | The depth of your event message context | 0 (0 signifies the context before package execution starts, 1 signifies the context when an error occurs, and it increases as the context is further from the error)
packagePath | String | The path of package object as your event message context source | \Package
contextType | String | The type of package object as your event message context source | 60 (see more context types)
contextSourceName | String | The name of package object as your event message context source | MyPackage
contextSourceId | String | The unique ID of package object as your event message context source | {E2CF27FB-EA48-41E9-AF6F-3FE938B4ADE1}
propertyName | String | The name of package property for your event message context source | DelayValidation
propertyValue | String | The value of package property for your event message context source | False
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS event messages log attributes


Here are the log attributes of event messages that are generated by SSIS package executions on your SSIS IR.
They convey similar information as SSISDB event messages table or view that shows the detailed text/metadata
of event messages. They're generated at any logging level except None .
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"operationId": "",
"messageTime": "",
"messageType": "",
"messageSourceType": "",
"message": "",
"packageName": "",
"eventName": "",
"messageSourceName": "",
"messageSourceId": "",
"subcomponentName": "",
"packagePath": "",
"executionPath": "",
"threadId": ""
}
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageEventMessages | mysqlmissisir-SSISPackageEventMessages
category | String | The category of diagnostic logs | SSISPackageEventMessages
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
operationId | String | The unique ID for tracking a particular operation in SSISDB | 1 (1 signifies operations related to packages not stored in SSISDB/invoked via T-SQL)
messageTime | String | The time when your event message is created in UTC format | 2017-06-28T21:00:27.3534352Z
messageType | String | The type of your event message | 70 (see more message types)
messageSourceType | String | The type of your event message source | 20 (see more message source types)
message | String | The text of your event message | MyPackage:Validation has started.
packageName | String | The name of your executed package file | MyPackage.dtsx
eventName | String | The name of related run-time event | OnPreValidate
messageSourceName | String | The name of package component as your event message source | Data Flow Task
messageSourceId | String | The unique ID of package component as your event message source | {1a45a5a4-3df9-4f02-b818-ebf583829ad2}
subcomponentName | String | The name of data flow component as your event message source | SSIS.Pipeline
packagePath | String | The path of package object as your event message source | \Package\Data Flow Task
executionPath | String | The full path from parent package to executed component (This path also captures component iterations) | \Transformation\Data Flow Task
threadId | String | The unique ID of thread executed when your event message is logged | {1a45a5a4-3df9-4f02-b818-ebf583829ad2}

SSIS executable statistics log attributes


Here are the log attributes of executable statistics that are generated by SSIS package executions on your SSIS
IR, where executables are containers or tasks in the control flow of packages. They convey similar information as
SSISDB executable statistics table or view that shows a row for each running executable, including its iterations.
They're generated at any logging level except None and are useful for identifying task-level bottlenecks/failures.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"executionPath": "",
"startTime": "",
"endTime": "",
"executionDuration": "",
"executionResult": "",
"executionValue": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutableStatistics | mysqlmissisir-SSISPackageExecutableStatistics
category | String | The category of diagnostic logs | SSISPackageExecutableStatistics
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
executionPath | String | The full path from parent package to executed component (This path also captures component iterations) | \Transformation\Data Flow Task
startTime | String | The time when executable enters pre-execute phase in UTC format | 2017-06-28T21:00:27.3534352Z
endTime | String | The time when executable enters post-execute phase in UTC format | 2017-06-28T21:00:27.3534352Z
executionDuration | String | The running time of executable in milliseconds | 1,125
executionResult | String | The result of running executable | 0 (0 signifies success, 1 signifies failure, 2 signifies completion, and 3 signifies cancelation)
executionValue | String | The user-defined value returned by running executable | 1
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS execution component phases log attributes


Here are the log attributes of run-time statistics for data flow components that are generated by SSIS package
executions on your SSIS IR. They convey similar information as SSISDB execution component phases table or
view that shows the time spent by data flow components in all their execution phases. They're generated when
you select the Performance/Verbose logging level and are useful for capturing data flow execution statistics.

{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"packageName": "",
"taskName": "",
"subcomponentName": "",
"phase": "",
"startTime": "",
"endTime": "",
"executionPath": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutionComponentPhases | mysqlmissisir-SSISPackageExecutionComponentPhases
category | String | The category of diagnostic logs | SSISPackageExecutionComponentPhases
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
packageName | String | The name of your executed package file | MyPackage.dtsx
taskName | String | The name of executed data flow task | Data Flow Task
subcomponentName | String | The name of data flow component | Derived Column
phase | String | The name of execution phase | AcquireConnections
startTime | String | The time when execution phase starts in UTC format | 2017-06-28T21:00:27.3534352Z
endTime | String | The time when execution phase ends in UTC format | 2017-06-28T21:00:27.3534352Z
executionPath | String | The path of execution for data flow task | \Transformation\Data Flow Task
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

SSIS execution data statistics log attributes


Here are the log attributes of data movements through each leg of data flow pipelines, from upstream to
downstream components, that are generated by SSIS package executions on your SSIS IR. They convey similar
information as SSISDB execution data statistics table or view that shows row counts of data moved through data
flow tasks. They're generated when you select the Verbose logging level and are useful for computing data flow
throughput.
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"packageName": "",
"taskName": "",
"dataflowPathIdString": "",
"dataflowPathName": "",
"sourceComponentName": "",
"destinationComponentName": "",
"rowsSent": "",
"createdTime": "",
"executionPath": ""
},
"resourceId": ""
}

PROPERTY | TYPE | DESCRIPTION | EXAMPLE

time | String | The time of event in UTC format: YYYY-MM-DDTHH:MM:SS.00000Z | 2017-06-28T21:00:27.3534352Z
operationName | String | This is set to YourSSISIRName-SSISPackageExecutionDataStatistics | mysqlmissisir-SSISPackageExecutionDataStatistics
category | String | The category of diagnostic logs | SSISPackageExecutionDataStatistics
correlationId | String | The unique ID for tracking a particular operation | e55700df-4caf-4e7c-bfb8-78ac7d2f28a0
dataFactoryName | String | The name of your ADF | MyADFv2
integrationRuntimeName | String | The name of your SSIS IR | MySSISIR
level | String | The level of diagnostic logs | Informational
executionId | String | The unique ID for tracking a particular execution in SSISDB | 1 (1 signifies executions related to packages not stored in SSISDB/invoked via T-SQL)
packageName | String | The name of your executed package file | MyPackage.dtsx
taskName | String | The name of executed data flow task | Data Flow Task
dataflowPathIdString | String | The unique ID for tracking data flow path | Paths[SQLDB Table3.ADO NET Source Output]
dataflowPathName | String | The name of data flow path | ADO NET Source Output
sourceComponentName | String | The name of data flow component that sends data | SQLDB Table3
destinationComponentName | String | The name of data flow component that receives data | Derived Column
rowsSent | String | The number of rows sent by source component | 500
createdTime | String | The time when row values are obtained in UTC format | 2017-06-28T21:00:27.3534352Z
executionPath | String | The path of execution for data flow task | \Transformation\Data Flow Task
resourceId | String | The unique ID of your ADF resource | /SUBSCRIPTIONS/<subscriptionID>/RESOURCEGROUPS/<resourceGroupName>/PROVIDERS/MICRO

Log Analytics schema


Log Analytics inherits the schema from Monitor with the following exceptions:
The first letter in each column name is capitalized. For example, the column name "correlationId" in
Monitor is "CorrelationId" in Log Analytics.
There's no "Level" column.
The dynamic "properties" column is preserved as the following dynamic JSON blob type.
AZURE MONITOR COLUMN | LOG ANALYTICS COLUMN | TYPE

$.properties.UserProperties | UserProperties | Dynamic
$.properties.Annotations | Annotations | Dynamic
$.properties.Input | Input | Dynamic
$.properties.Output | Output | Dynamic
$.properties.Error.errorCode | ErrorCode | int
$.properties.Error.message | ErrorMessage | string
$.properties.Error | Error | Dynamic
$.properties.Predecessors | Predecessors | Dynamic
$.properties.Parameters | Parameters | Dynamic
$.properties.SystemParameters | SystemParameters | Dynamic
$.properties.Tags | Tags | Dynamic
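
For instance, here's a small sketch of querying the resource-specific ADFPipelineRun table from PowerShell; the workspace ID, time range, and column names are assumptions based on the schema above:

$workspaceId = "<your Log Analytics workspace ID>"  # assumption: the workspace GUID, not the full resource ID

# Count pipeline runs by final status and pipeline name over the last 24 hours.
$query = @'
ADFPipelineRun
| where TimeGenerated > ago(24h)
| summarize count() by Status, PipelineName
'@

(Invoke-AzOperationalInsightsQuery -WorkspaceId $workspaceId -Query $query).Results | Format-Table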

Monitor SSIS operations with Azure Monitor


To lift & shift your SSIS workloads, you can provision SSIS IR in ADF that supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
Once provisioned, you can check SSIS IR operational status using Azure PowerShell or on the Monitor hub of
ADF portal. With Project Deployment Model, SSIS package execution logs are stored in SSISDB internal tables or
views, so you can query, analyze, and visually present them using designated tools like SSMS. With Package
Deployment Model, SSIS package execution logs can be stored in file system or Azure Files as CSV files that you
still need to parse and process using other designated tools before you can query, analyze, and visually present
them.
Now with Azure Monitor integration, you can query, analyze, and visually present all metrics and logs generated
from SSIS IR operations and SSIS package executions on Azure portal. Additionally, you can also raise alerts on
them.
Configure diagnostic settings and workspace for SSIS operations
To send all metrics and logs generated from SSIS IR operations and SSIS package executions to Azure Monitor,
you need to configure diagnostics settings and workspace for your ADF.
SSIS operational metrics
SSIS operational metrics are performance counters or numerical values that describe the status of SSIS IR start
and stop operations, as well as SSIS package executions at a particular point in time. They're part of ADF metrics
in Azure Monitor.
When you configure diagnostic settings and workspace for your ADF on Azure Monitor, selecting the AllMetrics
check box will make SSIS operational metrics available for interactive analysis using Azure Metrics Explorer,
presentation on Azure dashboard, and near-real time alerts.

SSIS operational alerts


To raise alerts on SSIS operational metrics from ADF portal, select the Alerts & metrics page of ADF Monitor
hub and follow the step-by-step instructions provided.

To raise alerts on SSIS operational metrics from Azure portal, select the Alerts page of Azure Monitor hub and
follow the step-by-step instructions provided.

SSIS operational logs


SSIS operational logs are events generated by SSIS IR operations and SSIS package executions that provide
enough context on any identified issues and are useful for root cause analysis.
When you configure diagnostic settings and workspace for your ADF on Azure Monitor, you can select the
relevant SSIS operational logs and send them to Log Analytics that's based on Azure Data Explorer. In there,
they'll be made available for analysis using rich query language, presentation on Azure dashboard, and near-
real time alerts.

The schemas and content of SSIS package execution logs in Azure Monitor and Log Analytics are similar to the
schemas of SSISDB internal tables or views.

AZURE MONITOR LOG CATEGORIES | LOG ANALYTICS TABLES | SSISDB INTERNAL TABLES/VIEWS

SSISIntegrationRuntimeLogs | ADFSSISIntegrationRuntimeLogs |
SSISPackageEventMessageContext | ADFSSISPackageEventMessageContext | [internal].[event_message_context]
SSISPackageEventMessages | ADFSSISPackageEventMessages | [internal].[event_messages]
SSISPackageExecutableStatistics | ADFSSISPackageExecutableStatistics | [internal].[executable_statistics]
SSISPackageExecutionComponentPhases | ADFSSISPackageExecutionComponentPhases | [internal].[execution_component_phases]
SSISPackageExecutionDataStatistics | ADFSSISPackageExecutionDataStatistics | [internal].[execution_data_statistics]

For more info on SSIS operational log attributes/properties, see Azure Monitor and Log Analytics schemas for
ADF.
Your selected SSIS package execution logs are always sent to Log Analytics regardless of their invocation
methods. For example, you can invoke package executions on Azure-enabled SSDT, via T-SQL on SSMS, SQL
Server Agent, or other designated tools, and as triggered or debug runs of Execute SSIS Package activities in
ADF pipelines.
When querying SSIS IR operation logs in Log Analytics, you can use the OperationName and ResultType
properties, which are set to Start/Stop/Maintenance and Started/InProgress/Succeeded/Failed, respectively.

When querying SSIS package execution logs in Log Analytics, you can join them using the
OperationId/ExecutionId/CorrelationId properties. OperationId/ExecutionId are always set to 1 for all
operations/executions related to packages not stored in SSISDB/invoked via T-SQL.

Next steps
Monitor and manage pipelines programmatically
Programmatically monitor an Azure data factory
7/6/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to monitor a pipeline in a data factory by using different software development kits
(SDKs).

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Data range
Data Factory only stores pipeline run data for 45 days. When you query programmatically for data about Data
Factory pipeline runs - for example, with the PowerShell command Get-AzDataFactoryV2PipelineRun - there are
no maximum dates for the optional LastUpdatedAfter and LastUpdatedBefore parameters. But if you query for
data for the past year, for example, you won't get an error but only pipeline run data from the last 45 days.
If you want to keep pipeline run data for more than 45 days, set up your own diagnostic logging with Azure
Monitor.
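
For example, here's a hedged sketch of such a query with Azure PowerShell; the resource group, factory name, and 30-day window are placeholders, and only runs from the last 45 days actually come back:

# Placeholder names; replace with your resource group and data factory.
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<rgName>" `
    -DataFactoryName "<factoryName>" `
    -LastUpdatedAfter (Get-Date).AddDays(-30) `
    -LastUpdatedBefore (Get-Date)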

Pipeline run information


For pipeline run properties, refer to the PipelineRun API reference. A pipeline run has different statuses during its
lifecycle; the possible values of the run status are listed below:
Queued
InProgress
Succeeded
Failed
Canceling
Cancelled

.NET
For a complete walk-through of creating and monitoring a pipeline using .NET SDK, see Create a data factory
and pipeline using .NET.
1. Add the following code to continuously check the status of the pipeline run until it finishes copying the
data.
// Monitor the pipeline run
Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress" || pipelineRun.Status == "Queued")
System.Threading.Thread.Sleep(15000);
else
break;
}

2. Add the following code that retrieves copy activity run details, for example, the size of the data
read/written.

// Check the copy activity run details


Console.WriteLine("Checking copy activity run details...");

RunFilterParameters filterParams = new RunFilterParameters(


DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
ActivityRunsQueryResponse queryResponse = client.ActivityRuns.QueryByPipelineRun(
resourceGroup, dataFactoryName, runResponse.RunId, filterParams);
if (pipelineRun.Status == "Succeeded")
Console.WriteLine(queryResponse.Value.First().Output);
else
Console.WriteLine(queryResponse.Value.First().Error);
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();

For complete documentation on .NET SDK, see Data Factory .NET SDK reference.

Python
For a complete walk-through of creating and monitoring a pipeline using Python SDK, see Create a data factory
and pipeline using Python.
To monitor the pipeline run, add the following code:

# Monitor the pipeline run


time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(
rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
filter_params = RunFilterParameters(
last_updated_after=datetime.now() - timedelta(1), last_updated_before=datetime.now() + timedelta(1))
query_response = adf_client.activity_runs.query_by_pipeline_run(
rg_name, df_name, pipeline_run.run_id, filter_params)
print_activity_run_details(query_response.value[0])

For complete documentation on Python SDK, see Data Factory Python SDK reference.

REST API
For a complete walk-through of creating and monitoring a pipeline using REST API, see Create a data factory
and pipeline using REST API.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
$request = "https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Microsoft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"

if ( ($response.Status -eq "InProgress") -or ($response.Status -eq "Queued") ) {


Start-Sleep -Seconds 15
}
else {
$response | ConvertTo-Json
break
}
}

2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.

$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}/queryActivityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader
$response | ConvertTo-Json

For complete documentation on REST API, see Data Factory REST API reference.

PowerShell
For a complete walk-through of creating and monitoring a pipeline using PowerShell, see Create a data factory
and pipeline using PowerShell.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.

while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId

if ($run) {
if ( ($run.Status -ne "InProgress") -and ($run.Status -ne "Queued") ) {
Write-Output ("Pipeline run finished. The status is: " + $run.Status)
$run
break
}
Write-Output ("Pipeline is running...status: " + $run.Status)
}

Start-Sleep -Seconds 30
}

2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result

Write-Host "Activity 'Output' section:" -foregroundcolor "Yellow"


$result.Output -join "`r`n"

Write-Host "\nActivity 'Error' section:" -foregroundcolor "Yellow"


$result.Error -join "`r`n"

For complete documentation on PowerShell cmdlets, see Data Factory PowerShell cmdlet reference.

Next steps
See Monitor pipelines using Azure Monitor article to learn about using Azure Monitor to monitor Data Factory
pipelines.
Monitor an integration runtime in Azure Data
Factory
5/28/2021 • 14 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Integration runtime is the compute infrastructure used by Azure Data Factory (ADF) to provide various data
integration capabilities across different network environments. There are three types of integration runtimes
offered by Data Factory:
Azure integration runtime
Self-hosted integration runtime
Azure-SQL Server Integration Services (SSIS) integration runtime

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

To get the status of an instance of integration runtime (IR), run the following PowerShell command:

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName MyDataFactory -ResourceGroupName MyResourceGroup -Name MyAzureIR -Status

The cmdlet returns different information for different types of integration runtime. This article explains the
properties and statuses for each type of integration runtime.

Azure integration runtime


The compute resource for an Azure integration runtime is fully managed elastically in Azure. The following table
provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime command:
Properties
The following table provides descriptions of properties returned by the cmdlet for an Azure integration runtime:

PROPERTY | DESCRIPTION

Name | Name of the Azure integration runtime.
State | Status of the Azure integration runtime.
Location | Location of the Azure integration runtime. For details about location of an Azure integration runtime, see Introduction to integration runtime.
DataFactoryName | Name of the data factory that the Azure integration runtime belongs to.
ResourceGroupName | Name of the resource group that the data factory belongs to.
Description | Description of the integration runtime.

Status
The following table provides possible statuses of an Azure integration runtime:

STATUS | COMMENTS/SCENARIOS

Online | The Azure integration runtime is online and ready to be used.
Offline | The Azure integration runtime is offline due to an internal error.
Self-hosted integration runtime
This section provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime
cmdlet.

NOTE
The returned properties and status contain information about overall self-hosted integration runtime and each node in
the runtime.

Properties
The following table provides descriptions of monitoring Properties for each node :

PROPERTY | DESCRIPTION

Name | Name of the self-hosted integration runtime and nodes associated with it. Node is an on-premises Windows machine that has the self-hosted integration runtime installed on it.
Status | The status of the overall self-hosted integration runtime and each node. Example: Online/Offline/Limited/etc. For information about these statuses, see the next section.
Version | The version of self-hosted integration runtime and each node. The version of the self-hosted integration runtime is determined based on version of majority of nodes in the group. If there are nodes with different versions in the self-hosted integration runtime setup, only the nodes with the same version number as the logical self-hosted integration runtime function properly. Others are in the limited mode and need to be manually updated (only in case auto-update fails).
Available memory | Available memory on a self-hosted integration runtime node. This value is a near real-time snapshot.
CPU utilization | CPU utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.
Networking (In/Out) | Network utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.
Concurrent Jobs (Running/Limit) | Running: Number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit: Limit signifies the maximum concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios, when activities are timing out even when CPU, memory, or network is under-utilized. This capability is also available with a single-node self-hosted integration runtime.
Role | There are two types of roles in a multi-node self-hosted integration runtime: dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from cloud services and dispatch them to different worker nodes. The dispatcher node is also a worker node.

Some settings of the properties make more sense when there are two or more nodes in the self-hosted
integration runtime (that is, in a scale out scenario).
Concurrent jobs limit
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this
value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the
more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs
limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you
run a maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48
concurrent jobs (that is, 4 x 12). We recommend that you increase the concurrent jobs limit only when you see
low resource usage with the default values on each node.
You can override the calculated default value in the Azure portal. Select Author > Connections > Integration
Runtimes > Edit > Nodes > Modify concurrent job value per node. You can also use the Update-AzDataFactoryV2IntegrationRuntimeNode PowerShell cmdlet.
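A sketch of the equivalent PowerShell call, assuming the Az.DataFactory module; all names and the limit value are placeholders:

# Placeholder names; replace with your resource group, factory, integration runtime, and node names.
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName "<rgName>" `
    -DataFactoryName "<factoryName>" `
    -IntegrationRuntimeName "<selfHostedIrName>" `
    -Name "<nodeName>" `
    -ConcurrentJobsLimit 12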
Status (per node )
The following table provides possible statuses of a self-hosted integration runtime node:

Online: Node is connected to the Data Factory service.

Offline: Node is offline.

Upgrading: The node is being auto-updated.

Limited: Due to a connectivity issue. It may be due to an HTTP port 8060 issue, a service bus connectivity issue, or a credential sync issue.

Inactive: Node is in a configuration different from the configuration of the other majority nodes.

A node can be inactive when it cannot connect to other nodes.


Status (overall self-hosted integration runtime)
The following table provides possible statuses of a self-hosted integration runtime. This status depends on
statuses of all nodes that belong to the runtime.

Need Registration: No node is registered to this self-hosted integration runtime yet.

Online: All nodes are online.

Offline: No node is online.

Limited: Not all nodes in this self-hosted integration runtime are in a healthy state. This status is a warning that some nodes might be down. This status could be due to a credential sync issue on the dispatcher/worker node.

Use the Get-AzDataFactoryV2IntegrationRuntimeMetric cmdlet to fetch the JSON payload containing the
detailed self-hosted integration runtime properties and their snapshot values at the time the cmdlet runs.

Get-AzDataFactoryV2IntegrationRuntimeMetric -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName | ConvertTo-Json

Sample output (assumes that there are two nodes associated with this self-hosted integration runtime):
{
"IntegrationRuntimeName": "<Name of your integration runtime>",
"ResourceGroupName": "<Resource Group Name>",
"DataFactoryName": "<Data Factory Name>",
"Nodes": [
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
},
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
}

]
}

Azure-SSIS integration runtime


Azure-SSIS IR is a fully managed cluster of Azure virtual machines (VMs or nodes) dedicated to run your SSIS
packages. You can invoke SSIS package executions on Azure-SSIS IR using various methods, for example via
Azure-enabled SQL Server Data Tools (SSDT), AzureDTExec command line utility, T-SQL on SQL Server
Management Studio (SSMS)/SQL Server Agent, and Execute SSIS Package activities in ADF pipelines. Azure-SSIS
IR doesn't run any other ADF activities. Once provisioned, you can monitor its overall/node-specific properties
and statuses via Azure PowerShell, Azure portal, and Azure Monitor.
Monitor the Azure-SSIS integration runtime with Azure PowerShell
Use the following Azure PowerShell cmdlet to monitor the overall/node-specific properties and statuses of
Azure-SSIS IR.

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Status

Properties
The following table provides descriptions of properties returned by the above cmdlet for an Azure-SSIS IR.

CreateTime: The UTC time when your Azure-SSIS IR was created.

Nodes: The allocated/available nodes of your Azure-SSIS IR with node-specific statuses (starting/available/recycling/unavailable) and actionable errors.

OtherErrors: The non-node-specific actionable errors on your Azure-SSIS IR.

LastOperation: The result of the last start/stop operation on your Azure-SSIS IR with actionable error(s) if it failed.

State: The overall status (initial/starting/started/stopping/stopped) of your Azure-SSIS IR.

Location: The location of your Azure-SSIS IR.

NodeSize: The size of each node in your Azure-SSIS IR.

NodeCount: The number of nodes in your Azure-SSIS IR.



MaxParallelExecutionsPerNode: The maximum number of parallel executions per node in your Azure-SSIS IR.

CatalogServerEndpoint: The endpoint of your existing Azure SQL Database server or managed instance to host the SSIS catalog (SSISDB).

CatalogAdminUserName: The admin username for your existing Azure SQL Database server or managed instance. ADF uses this information to prepare and manage SSISDB on your behalf.

CatalogAdminPassword: The admin password for your existing Azure SQL Database server or managed instance.

CatalogPricingTier: The pricing tier for SSISDB hosted by Azure SQL Database server. Not applicable to Azure SQL Managed Instance hosting SSISDB.

VNetId: The virtual network resource ID for your Azure-SSIS IR to join.

Subnet: The subnet name for your Azure-SSIS IR to join.

ID: The resource ID of your Azure-SSIS IR.

Type: The IR type (Managed/Self-Hosted) of your Azure-SSIS IR.

ResourceGroupName: The name of your Azure resource group, in which your ADF and Azure-SSIS IR were created.

DataFactoryName: The name of your ADF.

Name: The name of your Azure-SSIS IR.

Description: The description of your Azure-SSIS IR.

Status (per Azure-SSIS IR node)


The following table provides possible statuses of an Azure-SSIS IR node:

Starting: This node is being prepared.

Available: This node is ready for you to deploy/execute SSIS packages.

Recycling: This node is being repaired/restarting.

Unavailable: This node isn't ready for you to deploy/execute SSIS packages and has actionable errors/issues that you could resolve.

Status (overall Azure-SSIS IR)


The following table provides possible overall statuses of an Azure-SSIS IR. The overall status in turn depends on
the combined statuses of all nodes that belong to the Azure-SSIS IR.

Initial: The nodes of your Azure-SSIS IR haven't been allocated/prepared.

Starting: The nodes of your Azure-SSIS IR are being allocated/prepared and billing has started.

Started: The nodes of your Azure-SSIS IR have been allocated/prepared and they are ready for you to deploy/execute SSIS packages.

Stopping: The nodes of your Azure-SSIS IR are being released.



Stopped: The nodes of your Azure-SSIS IR have been released and billing has stopped.

Monitor the Azure-SSIS integration runtime in Azure portal


To monitor your Azure-SSIS IR in Azure portal, go to the Integration runtimes page of Monitor hub on ADF
UI, where you can see all of your integration runtimes.

Next, select the name of your Azure-SSIS IR to open its monitoring page, where you can see its overall/node-
specific properties and statuses. On this page, depending on how you configure the general, deployment, and
advanced settings of your Azure-SSIS IR, you'll find various informational/functional tiles.
The TYPE and REGION informational tiles show the type and region of your Azure-SSIS IR, respectively.
The NODE SIZE informational tile shows the SKU (SSIS edition_VM tier_VM series), number of CPU cores, and
size of RAM per node for your Azure-SSIS IR.
The RUNNING / REQUESTED NODE(S) informational tile compares the number of nodes currently running
to the total number of nodes previously requested for your Azure-SSIS IR.
The DUAL STANDBY PAIR / ROLE informational tile shows the name of your dual standby Azure-SSIS IR pair
that works in sync with Azure SQL Database/Managed Instance failover group for business continuity and
disaster recovery (BCDR) and the current primary/secondary role of your Azure-SSIS IR. When SSISDB failover
occurs, your primary and secondary Azure-SSIS IRs will swap roles (see Configuring your Azure-SSIS IR for
BCDR).
The functional tiles are described in more detail below.

STATUS tile
On the STATUS tile of your Azure-SSIS IR monitoring page, you can see its overall status, for example Running
or Stopped. Selecting the Running status pops up a window with a live Stop button to stop your Azure-SSIS IR.
Selecting the Stopped status pops up a window with a live Start button to start your Azure-SSIS IR. The pop-up
window also has an Execute SSIS package button to auto-generate an ADF pipeline with Execute SSIS
Package activity that runs on your Azure-SSIS IR (see Running SSIS packages as Execute SSIS Package activities
in ADF pipelines) and a Resource ID text box, from which you can copy your Azure-SSIS IR resource ID (
/subscriptions/YourAzureSubscripton/resourcegroups/YourResourceGroup/providers/Microsoft.DataFactory/factories/YourADF/integrationruntimes/YourAzur
). The suffix of your Azure-SSIS IR resource ID that contains your ADF and Azure-SSIS IR names forms a cluster
ID that can be used to purchase additional premium/licensed SSIS components from independent software
vendors (ISVs) and bind them to your Azure-SSIS IR (see Installing premium/licensed components on your
Azure-SSIS IR).
SSISDB SERVER ENDPOINT tile
If you use Project Deployment Model where packages are stored in SSISDB hosted by your Azure SQL Database
server or managed instance, you'll see the SSISDB SERVER ENDPOINT tile on your Azure-SSIS IR monitoring
page (see Configuring your Azure-SSIS IR deployment settings). On this tile, you can select a link designating
your Azure SQL Database server or managed instance to pop up a window, where you can copy the server
endpoint from a text box and use it when connecting from SSMS to deploy, configure, run, and manage your
packages. On the pop-up window, you can also select the See your Azure SQL Database or managed
instance settings link to reconfigure/resize your SSISDB in Azure portal.

PROXY / STAGING tile


If you download, install, and configure Self-Hosted IR (SHIR) as a proxy for your Azure-SSIS IR to access data on
premises, you'll see the PROXY / STAGING tile on your Azure-SSIS IR monitoring page (see Configuring SHIR
as a proxy for your Azure-SSIS IR). On this tile, you can select a link designating your SHIR to open its
monitoring page. You can also select another link designating your Azure Blob Storage for staging to
reconfigure its linked service.
VALIDATE VNET / SUBNET tile
If you join your Azure-SSIS IR to a VNet, you'll see the VALIDATE VNET / SUBNET tile on your Azure-SSIS IR
monitoring page (see Joining your Azure-SSIS IR to a VNet). On this tile, you can select a link designating your
VNet and subnet to pop up a window, where you can copy your VNet resource ID (
/subscriptions/YourAzureSubscripton/resourceGroups/YourResourceGroup/providers/Microsoft.Network/virtualNetworks/YourARMVNet
) and subnet name from text boxes, as well as validate your VNet and subnet configurations to ensure that the
required inbound/outbound network traffic and the management of your Azure-SSIS IR aren't obstructed.
DIAGNOSE CONNECTIVITY tile
On the DIAGNOSE CONNECTIVITY tile of your Azure-SSIS IR monitoring page, you can select the Test
connection link to pop up a window, where you can check the connections between your Azure-SSIS IR and
relevant package/configuration/data stores, as well as management services, via their fully qualified domain
name (FQDN)/IP address and designated port (see Testing connections from your Azure-SSIS IR).

STATIC PUBLIC IP ADDRESSES tile


If you bring your own static public IP addresses for Azure-SSIS IR, you'll see the STATIC PUBLIC IP
ADDRESSES tile on your Azure-SSIS IR monitoring page (see Bringing your own static public IP addresses for
Azure-SSIS IR). On this tile, you can select links designating your first/second static public IP addresses for
Azure-SSIS IR to pop up a window, where you can copy their resource ID (
/subscriptions/YourAzureSubscripton/resourceGroups/YourResourceGroup/providers/Microsoft.Network/publicIPAddresses/YourPublicIPAddress
) from a text box. On the pop-up window, you can also select the See your first/second static public IP
address settings link to manage your first/second static public IP address in Azure portal.

PACKAGE STORES tile


If you use Package Deployment Model where packages are stored in file system/Azure Files/SQL Server
database (MSDB) hosted by your Azure SQL Managed Instance and managed via Azure-SSIS IR package stores,
you'll see the PACKAGE STORES tile on your Azure-SSIS IR monitoring page (see Configuring your Azure-SSIS
IR deployment settings). On this tile, you can select a link designating the number of package stores attached to
your Azure-SSIS IR to pop up a window, where you can reconfigure the relevant linked services for your Azure-
SSIS IR package stores on top of file system/Azure Files/MSDB hosted by your Azure SQL Managed Instance.

ERROR(S) tile
If there are issues with the starting/stopping/maintenance/upgrade of your Azure-SSIS IR, you'll see an
additional ERROR(S) tile on your Azure-SSIS IR monitoring page. On this tile, you can select a link designating
the number of errors generated by your Azure-SSIS IR to pop up a window, where you can see those errors in
more detail and copy them to find the recommended solutions in our troubleshooting guide (see
Troubleshooting your Azure-SSIS IR).

Monitor the Azure-SSIS integration runtime with Azure Monitor


To monitor your Azure-SSIS IR with Azure Monitor, see Monitoring SSIS operations with Azure Monitor.
More info about the Azure-SSIS integration runtime
See the following articles to learn more about Azure-SSIS integration runtime:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general, including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create your Azure-
SSIS IR and use Azure SQL Database to host the SSIS catalog (SSISDB).
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Managed Instance to host SSISDB.
Manage an Azure-SSIS IR. This article shows you how to start, stop, or delete your Azure-SSIS IR. It also
shows you how to scale it out by adding more nodes.
Join an Azure-SSIS IR to a virtual network. This article provides instructions on joining your Azure-SSIS IR to
a virtual network.

Next steps
See the following articles for monitoring pipelines in different ways:
Quickstart: create a data factory.
Use Azure Monitor to monitor Data Factory pipelines
Monitor an integration runtime in Azure Data Factory
5/28/2021 • 14 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Integration runtime is the compute infrastructure used by Azure Data Factory (ADF) to provide various data
integration capabilities across different network environments. There are three types of integration runtimes
offered by Data Factory:
Azure integration runtime
Self-hosted integration runtime
Azure-SQL Server Integration Services (SSIS) integration runtime

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

To get the status of an instance of integration runtime (IR), run the following PowerShell command:

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName MyDataFactory -ResourceGroupName MyResourceGroup -Name MyAzureIR -Status

The cmdlet returns different information for different types of integration runtime. This article explains the
properties and statuses for each type of integration runtime.

Azure integration runtime


The compute resource for an Azure integration runtime is fully managed elastically in Azure. The following table
provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime command:
Properties
The following table provides descriptions of properties returned by the cmdlet for an Azure integration runtime:

Name: Name of the Azure integration runtime.

State: Status of the Azure integration runtime.

Location: Location of the Azure integration runtime. For details about the location of an Azure integration runtime, see Introduction to integration runtime.

DataFactoryName: Name of the data factory that the Azure integration runtime belongs to.

ResourceGroupName: Name of the resource group that the data factory belongs to.

Description: Description of the integration runtime.

Status
The following table provides possible statuses of an Azure integration runtime:

Online: The Azure integration runtime is online and ready to be used.

Offline: The Azure integration runtime is offline due to an internal error.
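As a minimal sketch (the names are placeholders, and this assumes the returned status object exposes the State property listed above), you can read the status directly in PowerShell:

# Placeholder names; substitute your own factory, resource group, and integration runtime.
$azureIR = Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName "MyDataFactory" -ResourceGroupName "MyResourceGroup" -Name "MyAzureIR" -Status
$azureIR.State   # Expected to return a value such as Online or Offline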
Self-hosted integration runtime
This section provides descriptions for properties returned by the Get-AzDataFactoryV2IntegrationRuntime
cmdlet.

NOTE
The returned properties and status contain information about overall self-hosted integration runtime and each node in
the runtime.

Properties
The following table provides descriptions of monitoring properties for each node:

Name: Name of the self-hosted integration runtime and the nodes associated with it. A node is an on-premises Windows machine that has the self-hosted integration runtime installed on it.

Status: The status of the overall self-hosted integration runtime and each node. Example: Online/Offline/Limited/etc. For information about these statuses, see the next section.

Version: The version of the self-hosted integration runtime and each node. The version of the self-hosted integration runtime is determined based on the version of the majority of nodes in the group. If there are nodes with different versions in the self-hosted integration runtime setup, only the nodes with the same version number as the logical self-hosted integration runtime function properly. Others are in limited mode and need to be manually updated (only in case auto-update fails).

Available memory: Available memory on a self-hosted integration runtime node. This value is a near real-time snapshot.

CPU utilization: CPU utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.

Networking (In/Out): Network utilization of a self-hosted integration runtime node. This value is a near real-time snapshot.

Concurrent Jobs (Running/Limit): Running is the number of jobs or tasks running on each node; this value is a near real-time snapshot. Limit is the maximum number of concurrent jobs for each node; this value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios, when activities time out even while CPU, memory, or network is underutilized. This capability is also available with a single-node self-hosted integration runtime.

Role: There are two types of roles in a multi-node self-hosted integration runtime: dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from cloud services and dispatch them to the different worker nodes. The dispatcher node is also a worker node.

Some settings of the properties make more sense when there are two or more nodes in the self-hosted
integration runtime (that is, in a scale out scenario).
Concurrent jobs limit
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this
value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the
more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs
limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you
run a maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48
concurrent jobs (that is, 4 x 12). We recommend that you increase the concurrent jobs limit only when you see
low resource usage with the default values on each node.
You can override the calculated default value in the Azure portal. Select Author > Connections > Integration
Runtimes > Edit > Nodes > Modify concurrent job value per node. You can also use the
Update-AzDataFactoryV2IntegrationRuntimeNode PowerShell cmdlet.
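To see the effective total across a multi-node setup, a minimal sketch (variable names are placeholders, and this assumes the Nodes collection exposes the ConcurrentJobsLimit values shown in the sample output later in this section) is:

# Sum the per-node limits reported by the metrics cmdlet.
$metrics = Get-AzDataFactoryV2IntegrationRuntimeMetric -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName
($metrics.Nodes | Measure-Object -Property ConcurrentJobsLimit -Sum).Sum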
Status (per node)
The following table provides possible statuses of a self-hosted integration runtime node:

Online: Node is connected to the Data Factory service.

Offline: Node is offline.

Upgrading: The node is being auto-updated.

Limited: Due to a connectivity issue. It may be due to an HTTP port 8060 issue, a service bus connectivity issue, or a credential sync issue.

Inactive: Node is in a configuration different from the configuration of the other majority nodes.

A node can be inactive when it cannot connect to other nodes.


Status (overall self-hosted integration runtime)
The following table provides possible statuses of a self-hosted integration runtime. This status depends on
statuses of all nodes that belong to the runtime.

Need Registration: No node is registered to this self-hosted integration runtime yet.

Online: All nodes are online.

Offline: No node is online.

Limited: Not all nodes in this self-hosted integration runtime are in a healthy state. This status is a warning that some nodes might be down. This status could be due to a credential sync issue on the dispatcher/worker node.

Use the Get-AzDataFactoryV2IntegrationRuntimeMetric cmdlet to fetch the JSON payload containing the
detailed self-hosted integration runtime properties and their snapshot values at the time the cmdlet runs.

Get-AzDataFactoryV2IntegrationRuntimeMetric -Name $integrationRuntimeName -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName | ConvertTo-Json

Sample output (assumes that there are two nodes associated with this self-hosted integration runtime):
{
"IntegrationRuntimeName": "<Name of your integration runtime>",
"ResourceGroupName": "<Resource Group Name>",
"DataFactoryName": "<Data Factory Name>",
"Nodes": [
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
},
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
}

]
}

Azure-SSIS integration runtime


Azure-SSIS IR is a fully managed cluster of Azure virtual machines (VMs or nodes) dedicated to run your SSIS
packages. You can invoke SSIS package executions on Azure-SSIS IR using various methods, for example via
Azure-enabled SQL Server Data Tools (SSDT), AzureDTExec command line utility, T-SQL on SQL Server
Management Studio (SSMS)/SQL Server Agent, and Execute SSIS Package activities in ADF pipelines. Azure-SSIS
IR doesn't run any other ADF activities. Once provisioned, you can monitor its overall/node-specific properties
and statuses via Azure PowerShell, Azure portal, and Azure Monitor.
Monitor the Azure-SSIS integration runtime with Azure PowerShell
Use the following Azure PowerShell cmdlet to monitor the overall/node-specific properties and statuses of
Azure-SSIS IR.

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Status

Properties
The following table provides descriptions of properties returned by the above cmdlet for an Azure-SSIS IR.

CreateTime: The UTC time when your Azure-SSIS IR was created.

Nodes: The allocated/available nodes of your Azure-SSIS IR with node-specific statuses (starting/available/recycling/unavailable) and actionable errors.

OtherErrors: The non-node-specific actionable errors on your Azure-SSIS IR.

LastOperation: The result of the last start/stop operation on your Azure-SSIS IR with actionable error(s) if it failed.

State: The overall status (initial/starting/started/stopping/stopped) of your Azure-SSIS IR.

Location: The location of your Azure-SSIS IR.

NodeSize: The size of each node in your Azure-SSIS IR.

NodeCount: The number of nodes in your Azure-SSIS IR.



MaxParallelExecutionsPerNode: The maximum number of parallel executions per node in your Azure-SSIS IR.

CatalogServerEndpoint: The endpoint of your existing Azure SQL Database server or managed instance to host the SSIS catalog (SSISDB).

CatalogAdminUserName: The admin username for your existing Azure SQL Database server or managed instance. ADF uses this information to prepare and manage SSISDB on your behalf.

CatalogAdminPassword: The admin password for your existing Azure SQL Database server or managed instance.

CatalogPricingTier: The pricing tier for SSISDB hosted by Azure SQL Database server. Not applicable to Azure SQL Managed Instance hosting SSISDB.

VNetId: The virtual network resource ID for your Azure-SSIS IR to join.

Subnet: The subnet name for your Azure-SSIS IR to join.

ID: The resource ID of your Azure-SSIS IR.

Type: The IR type (Managed/Self-Hosted) of your Azure-SSIS IR.

ResourceGroupName: The name of your Azure resource group, in which your ADF and Azure-SSIS IR were created.

DataFactoryName: The name of your ADF.

Name: The name of your Azure-SSIS IR.

Description: The description of your Azure-SSIS IR.

Status (per Azure-SSIS IR node)


The following table provides possible statuses of an Azure-SSIS IR node:

Starting: This node is being prepared.

Available: This node is ready for you to deploy/execute SSIS packages.

Recycling: This node is being repaired/restarting.

Unavailable: This node isn't ready for you to deploy/execute SSIS packages and has actionable errors/issues that you could resolve.

Status (overall Azure-SSIS IR)


The following table provides possible overall statuses of an Azure-SSIS IR. The overall status in turn depends on
the combined statuses of all nodes that belong to the Azure-SSIS IR.

Initial: The nodes of your Azure-SSIS IR haven't been allocated/prepared.

Starting: The nodes of your Azure-SSIS IR are being allocated/prepared and billing has started.

Started: The nodes of your Azure-SSIS IR have been allocated/prepared and they are ready for you to deploy/execute SSIS packages.

Stopping: The nodes of your Azure-SSIS IR are being released.



Stopped: The nodes of your Azure-SSIS IR have been released and billing has stopped.

Monitor the Azure-SSIS integration runtime in Azure portal


To monitor your Azure-SSIS IR in Azure portal, go to the Integration runtimes page of Monitor hub on ADF
UI, where you can see all of your integration runtimes.

Next, select the name of your Azure-SSIS IR to open its monitoring page, where you can see its overall/node-
specific properties and statuses. On this page, depending on how you configure the general, deployment, and
advanced settings of your Azure-SSIS IR, you'll find various informational/functional tiles.
The TYPE and REGION informational tiles show the type and region of your Azure-SSIS IR, respectively.
The NODE SIZE informational tile shows the SKU (SSIS edition_VM tier_VM series), number of CPU cores, and
size of RAM per node for your Azure-SSIS IR.
The RUNNING / REQUESTED NODE(S) informational tile compares the number of nodes currently running
to the total number of nodes previously requested for your Azure-SSIS IR.
The DUAL STANDBY PAIR / ROLE informational tile shows the name of your dual standby Azure-SSIS IR pair
that works in sync with Azure SQL Database/Managed Instance failover group for business continuity and
disaster recovery (BCDR) and the current primary/secondary role of your Azure-SSIS IR. When SSISDB failover
occurs, your primary and secondary Azure-SSIS IRs will swap roles (see Configuring your Azure-SSIS IR for
BCDR).
The functional tiles are described in more detail below.

STATUS tile
On the STATUS tile of your Azure-SSIS IR monitoring page, you can see its overall status, for example Running
or Stopped. Selecting the Running status pops up a window with a live Stop button to stop your Azure-SSIS IR.
Selecting the Stopped status pops up a window with a live Start button to start your Azure-SSIS IR. The pop-up
window also has an Execute SSIS package button to auto-generate an ADF pipeline with Execute SSIS
Package activity that runs on your Azure-SSIS IR (see Running SSIS packages as Execute SSIS Package activities
in ADF pipelines) and a Resource ID text box, from which you can copy your Azure-SSIS IR resource ID (
/subscriptions/YourAzureSubscripton/resourcegroups/YourResourceGroup/providers/Microsoft.DataFactory/factories/YourADF/integrationruntimes/YourAzur
). The suffix of your Azure-SSIS IR resource ID that contains your ADF and Azure-SSIS IR names forms a cluster
ID that can be used to purchase additional premium/licensed SSIS components from independent software
vendors (ISVs) and bind them to your Azure-SSIS IR (see Installing premium/licensed components on your
Azure-SSIS IR).
SSISDB SERVER ENDPOINT tile
If you use Project Deployment Model where packages are stored in SSISDB hosted by your Azure SQL Database
server or managed instance, you'll see the SSISDB SERVER ENDPOINT tile on your Azure-SSIS IR monitoring
page (see Configuring your Azure-SSIS IR deployment settings). On this tile, you can select a link designating
your Azure SQL Database server or managed instance to pop up a window, where you can copy the server
endpoint from a text box and use it when connecting from SSMS to deploy, configure, run, and manage your
packages. On the pop-up window, you can also select the See your Azure SQL Database or managed
instance settings link to reconfigure/resize your SSISDB in Azure portal.

PROXY / STAGING tile


If you download, install, and configure Self-Hosted IR (SHIR) as a proxy for your Azure-SSIS IR to access data on
premises, you'll see the PROXY / STAGING tile on your Azure-SSIS IR monitoring page (see Configuring SHIR
as a proxy for your Azure-SSIS IR). On this tile, you can select a link designating your SHIR to open its
monitoring page. You can also select another link designating your Azure Blob Storage for staging to
reconfigure its linked service.
VALIDATE VNET / SUBNET tile
If you join your Azure-SSIS IR to a VNet, you'll see the VALIDATE VNET / SUBNET tile on your Azure-SSIS IR
monitoring page (see Joining your Azure-SSIS IR to a VNet). On this tile, you can select a link designating your
VNet and subnet to pop up a window, where you can copy your VNet resource ID (
/subscriptions/YourAzureSubscripton/resourceGroups/YourResourceGroup/providers/Microsoft.Network/virtualNetworks/YourARMVNet
) and subnet name from text boxes, as well as validate your VNet and subnet configurations to ensure that the
required inbound/outbound network traffic and the management of your Azure-SSIS IR aren't obstructed.
DIAGNOSE CONNECTIVITY tile
On the DIAGNOSE CONNECTIVITY tile of your Azure-SSIS IR monitoring page, you can select the Test
connection link to pop up a window, where you can check the connections between your Azure-SSIS IR and
relevant package/configuration/data stores, as well as management services, via their fully qualified domain
name (FQDN)/IP address and designated port (see Testing connections from your Azure-SSIS IR).

STATIC PUBLIC IP ADDRESSES tile


If you bring your own static public IP addresses for Azure-SSIS IR, you'll see the STATIC PUBLIC IP
ADDRESSES tile on your Azure-SSIS IR monitoring page (see Bringing your own static public IP addresses for
Azure-SSIS IR). On this tile, you can select links designating your first/second static public IP addresses for
Azure-SSIS IR to pop up a window, where you can copy their resource ID (
/subscriptions/YourAzureSubscripton/resourceGroups/YourResourceGroup/providers/Microsoft.Network/publicIPAddresses/YourPublicIPAddress
) from a text box. On the pop-up window, you can also select the See your first/second static public IP
address settings link to manage your first/second static public IP address in Azure portal.

PACKAGE STORES tile


If you use Package Deployment Model where packages are stored in file system/Azure Files/SQL Server
database (MSDB) hosted by your Azure SQL Managed Instance and managed via Azure-SSIS IR package stores,
you'll see the PACKAGE STORES tile on your Azure-SSIS IR monitoring page (see Configuring your Azure-SSIS
IR deployment settings). On this tile, you can select a link designating the number of package stores attached to
your Azure-SSIS IR to pop up a window, where you can reconfigure the relevant linked services for your Azure-
SSIS IR package stores on top of file system/Azure Files/MSDB hosted by your Azure SQL Managed Instance.

ERROR(S) tile
If there are issues with the starting/stopping/maintenance/upgrade of your Azure-SSIS IR, you'll see an
additional ERROR(S) tile on your Azure-SSIS IR monitoring page. On this tile, you can select a link designating
the number of errors generated by your Azure-SSIS IR to pop up a window, where you can see those errors in
more detail and copy them to find the recommended solutions in our troubleshooting guide (see
Troubleshooting your Azure-SSIS IR).

Monitor the Azure-SSIS integration runtime with Azure Monitor


To monitor your Azure-SSIS IR with Azure Monitor, see Monitoring SSIS operations with Azure Monitor.
More info about the Azure-SSIS integration runtime
See the following articles to learn more about Azure-SSIS integration runtime:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general, including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create your Azure-
SSIS IR and use Azure SQL Database to host the SSIS catalog (SSISDB).
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Managed Instance to host SSISDB.
Manage an Azure-SSIS IR. This article shows you how to start, stop, or delete your Azure-SSIS IR. It also
shows you how to scale it out by adding more nodes.
Join an Azure-SSIS IR to a virtual network. This article provides instructions on joining your Azure-SSIS IR to
a virtual network.

Next steps
See the following articles for monitoring pipelines in different ways:
Quickstart: create a data factory.
Use Azure Monitor to monitor Data Factory pipelines
Reconfigure the Azure-SSIS integration runtime
3/5/2021 • 3 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to reconfigure an existing Azure-SSIS integration runtime. To create an Azure-SSIS
integration runtime (IR) in Azure Data Factory, see Create an Azure-SSIS integration runtime.

Data Factory UI
You can use Data Factory UI to stop, edit/reconfigure, or delete an Azure-SSIS IR.
1. Open Data Factory UI by selecting the Author & Monitor tile on the home page of your data factory.
2. Select the Manage hub below Home , Edit , and Monitor hubs to show the Connections pane.
To reconfigure an Azure-SSIS IR
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .

You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. Editing/deleting your
Azure-SSIS IR can only be done when it's stopped.

Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

After you provision and start an instance of Azure-SSIS integration runtime, you can reconfigure it by running a
sequence of Stop - Set - Start PowerShell cmdlets consecutively. For example, the following PowerShell
script changes the number of nodes allocated for the Azure-SSIS integration runtime instance to five.
Reconfigure an Azure-SSIS IR
1. First, stop the Azure-SSIS integration runtime by using the Stop-AzDataFactoryV2IntegrationRuntime cmdlet. This command releases all of its nodes and stops billing.

Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName

2. Next, reconfigure the Azure-SSIS IR by using the Set-AzDataFactoryV2IntegrationRuntime cmdlet. The following sample command scales out an Azure-SSIS integration runtime to five nodes.

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -NodeCount 5

3. Then, start the Azure-SSIS integration runtime by using the Start-AzDataFactoryV2IntegrationRuntime cmdlet. This command allocates all of its nodes for running SSIS packages.

Start-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName

Delete an Azure-SSIS IR
1. First, list all existing Azure-SSIS IRs under your data factory.

Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Status

2. Next, stop all existing Azure-SSIS IRs in your data factory.

Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Force

3. Next, remove all existing Azure-SSIS IRs in your data factory one by one.

Remove-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName -ResourceGroupName $ResourceGroupName -Force

4. Finally, remove your data factory.

Remove-AzDataFactoryV2 -Name $DataFactoryName -ResourceGroupName $ResourceGroupName -Force

5. If you had created a new resource group, remove the resource group.

Remove-AzResourceGroup -Name $ResourceGroupName -Force

Next steps
For more information about Azure-SSIS runtime, see the following topics:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses Azure SQL Database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Managed Instance and joining the IR to a virtual network.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Copy or clone a data factory in Azure Data Factory
4/22/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to copy or clone a data factory in Azure Data Factory.

Use cases for cloning a data factory


Here are some of the circumstances in which you may find it useful to copy or clone a data factory:
Move Data Factory to a new region. If you want to move your Data Factory to a different region, the
best way is to create a copy in the targeted region, and delete the existing one.
Renaming Data Factory. Azure doesn't support renaming resources. If you want to rename a data
factory, you can clone the data factory with a different name, and delete the existing one.
Debugging changes when the debug features aren't sufficient. In most scenarios, you can use Debug. In
others, testing out changes in a cloned sandbox environment makes more sense. For instance, how your
parameterized ETL pipelines would behave when a trigger fires upon file arrival versus over Tumbling
time window, may not be easily testable through Debug alone. In these cases, you may want to clone a
sandbox environment for experimenting. Since Azure Data Factory charges primarily by the number of
runs, a second factory doesn't lead to any additional charges.

How to clone a data factory


1. As a prerequisite, first you need to create your target data factory from the Azure portal.
2. If you are in GIT mode:
a. Every time you publish from the portal, the factory's Resource Manager template is saved into GIT in
the adf_publish branch
b. Connect the new factory to the same repository and build from adf_publish branch. Resources, such
as pipelines, datasets, and triggers, will carry through
3. If you are in Live mode:
a. Data Factory UI lets you export the entire payload of your data factory into a Resource Manager
template file and a parameter file. They can be accessed from the ARM template \ Export Resource
Manager template button in the portal.
b. You may make appropriate changes to the parameter file and swap in new values for the new factory.
c. Next, you can deploy it via standard Resource Manager template deployment methods (see the sketch after this list).
4. If you have a SelfHosted IntegrationRuntime in your source factory, you need to precreate it with the
same name in the target factory. If you want to share the SelfHosted Integration Runtime between
different factories, you can use the pattern published here on sharing SelfHosted IR.
5. For security reasons, the generated Resource Manager template won't contain any secret information, for
example passwords for linked services. Hence, you need to provide the credentials as deployment
parameters. If manually inputting credential isn't desirable for your settings, please consider retrieving
the connection strings and passwords from Azure Key Vault instead. See more
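As a minimal PowerShell sketch of step 3c (the resource group and file names are placeholders for your target resource group and the exported template and parameter files; they are not taken from this article):

# Placeholder names; point these at the files exported from the source factory.
New-AzResourceGroupDeployment -ResourceGroupName "TargetFactoryResourceGroup" -TemplateFile ".\arm_template.json" -TemplateParameterFile ".\arm_template_parameters.json"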

Next steps
Review the guidance for creating a data factory in the Azure portal in Create a data factory by using the Azure
Data Factory UI.
How to create and configure Azure Integration Runtime
7/2/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data
integration capabilities across different network environments. For more information about IR, see Integration
runtime.
Azure IR provides fully managed compute to natively perform data movement and dispatch data
transformation activities to compute services like HDInsight. It is hosted in the Azure environment and supports
connecting to resources in a public network environment with publicly accessible endpoints.
This document introduces how you can create and configure Azure Integration Runtime.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Default Azure IR
By default, each data factory has an Azure IR in the backend that supports operations on cloud data stores and
compute services in the public network. The location of that Azure IR is auto-resolve. If the connectVia property is not
specified in the linked service definition, the default Azure IR is used. You only need to explicitly create an Azure
IR when you want to explicitly define the location of the IR, or if you want to virtually group the
activity executions on different IRs for management purposes.

Create Azure IR
To create and set up an Azure IR, you can use the following procedures.
Create an Azure IR via Azure PowerShell
Integration Runtime can be created using the Set-AzDataFactoryV2IntegrationRuntime PowerShell cmdlet.
To create an Azure IR, you specify the name, location, and type to the command. Here is a sample command to
create an Azure IR with location set to "West Europe":

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName "SampleV2DataFactory1" -Name "MySampleAzureIR" -


ResourceGroupName "ADFV2SampleRG" -Type Managed -Location "West Europe"

For Azure IR, the type must be set to Managed . You do not need to specify compute details because it is fully
managed elastically in cloud. Specify compute details like node size and node count when you would like to
create Azure-SSIS IR. For more information, see Create and Configure Azure-SSIS IR.
You can configure an existing Azure IR to change its location using the Set-AzDataFactoryV2IntegrationRuntime
PowerShell cmdlet. For more information about the location of an Azure IR, see Introduction to integration
runtime.
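For example, a minimal sketch that updates the sample IR above to a different location (the location value is only an illustration, not a recommendation):

# Reuses the sample names from above; "North Europe" is only an example location.
Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName "SampleV2DataFactory1" -Name "MySampleAzureIR" -ResourceGroupName "ADFV2SampleRG" -Type Managed -Location "North Europe"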
Create an Azure IR via Azure Data Factory UI
Use the following steps to create an Azure IR using Azure Data Factory UI.
1. On the home page of Azure Data Factory UI, select the Manage tab from the leftmost pane.

2. Select Integration runtimes on the left pane, and then select +New .

3. On the Integration runtime setup page, select Azure, Self-Hosted , and then select Continue .
4. On the following page, select Azure to create an Azure IR, and then select Continue .
5. Enter a name for your Azure IR, and select Create .
6. You'll see a pop-up notification when the creation completes. On the Integration runtimes page, make
sure that you see the newly created IR in the list.

Use Azure IR
Once an Azure IR is created, you can reference it in your Linked Service definition. Below is a sample of how you
can reference the Azure Integration Runtime created above from an Azure Storage Linked Service:
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myaccountname;AccountKey=..."
},
"connectVia": {
"referenceName": "MySampleAzureIR",
"type": "IntegrationRuntimeReference"
}
}
}

Next steps
See the following articles on how to create other types of integration runtimes:
Create self-hosted integration runtime
Create Azure-SSIS integration runtime
Create and configure a self-hosted integration runtime
7/7/2021 • 23 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data-
integration capabilities across different network environments. For details about IR, see Integration runtime
overview.
A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private
network. It also can dispatch transform activities against compute resources in an on-premises network or an
Azure virtual network. The installation of a self-hosted integration runtime needs an on-premises machine or a
virtual machine inside a private network.
This article describes how you can create and configure a self-hosted IR.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Considerations for using a self-hosted IR


You can use a single self-hosted integration runtime for multiple on-premises data sources. You can also
share it with another data factory within the same Azure Active Directory (Azure AD) tenant. For more
information, see Sharing a self-hosted integration runtime.
You can install only one instance of a self-hosted integration runtime on any single machine. If you have two
data factories that need to access on-premises data sources, either use the self-hosted IR sharing feature to
share the self-hosted IR, or install the self-hosted IR on two on-premises computers, one for each data
factory.
The self-hosted integration runtime doesn't need to be on the same machine as the data source. However,
having the self-hosted integration runtime close to the data source reduces the time for the self-hosted
integration runtime to connect to the data source. We recommend that you install the self-hosted integration
runtime on a machine that differs from the one that hosts the on-premises data source. When the self-hosted
integration runtime and data source are on different machines, the self-hosted integration runtime doesn't
compete with the data source for resources.
You can have multiple self-hosted integration runtimes on different machines that connect to the same on-
premises data source. For example, if you have two self-hosted integration runtimes that serve two data
factories, the same on-premises data source can be registered with both data factories.
Use a self-hosted integration runtime to support data integration within an Azure virtual network.
Treat your data source as an on-premises data source that is behind a firewall, even when you use Azure
ExpressRoute. Use the self-hosted integration runtime to connect the service to the data source.
Use the self-hosted integration runtime even if the data store is in the cloud on an Azure Infrastructure as a
Service (IaaS) virtual machine.
Tasks might fail in a self-hosted integration runtime that you installed on a Windows server for which FIPS-
compliant encryption is enabled. To work around this problem, you have two options: store credentials/secret
values in an Azure Key Vault or disable FIPS-compliant encryption on the server. To disable FIPS-compliant
encryption, change the following registry subkey's value from 1 (enabled) to 0 (disabled):
HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled (a sketch of this registry change follows this list). If you use the self-hosted
integration runtime as a proxy for the SSIS integration runtime, FIPS-compliant encryption can be enabled and
will be used when moving data from on premises to Azure Blob Storage as a staging area.
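A minimal PowerShell sketch of that registry change (run it elevated on the self-hosted integration runtime machine; it assumes the registry path quoted in the list above):

# Set the FIPS policy value to 0 (disabled) at the subkey mentioned above.
Set-ItemProperty -Path "HKLM:\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy" -Name "Enabled" -Value 0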

Command flow and data flow


When you move data between on-premises and the cloud, the activity uses a self-hosted integration runtime to
transfer the data between an on-premises data source and the cloud.
Here is a high-level summary of the data-flow steps for copying with a self-hosted IR:

1. A data developer first creates a self-hosted integration runtime within an Azure data factory by using the
Azure portal or the PowerShell cmdlet. Then the data developer creates a linked service for an on-
premises data store, specifying the self-hosted integration runtime instance that the service should use to
connect to data stores.
2. The self-hosted integration runtime node encrypts the credentials by using Windows Data Protection
Application Programming Interface (DPAPI) and saves the credentials locally. If multiple nodes are set for
high availability, the credentials are further synchronized across other nodes. Each node encrypts the
credentials by using DPAPI and stores them locally. Credential synchronization is transparent to the data
developer and is handled by the self-hosted IR.
3. Azure Data Factory communicates with the self-hosted integration runtime to schedule and manage jobs.
Communication is via a control channel that uses a shared Azure Relay connection. When an activity job
needs to be run, Data Factory queues the request along with any credential information. It does so in case
credentials aren't already stored on the self-hosted integration runtime. The self-hosted integration
runtime starts the job after it polls the queue.
4. The self-hosted integration runtime copies data between an on-premises store and cloud storage. The
direction of the copy depends on how the copy activity is configured in the data pipeline. For this step, the
self-hosted integration runtime directly communicates with cloud-based storage services like Azure Blob
storage over a secure HTTPS channel.

Prerequisites
The supported versions of Windows are:
Windows 8.1
Windows 10
Windows Server 2012
Windows Server 2012 R2
Windows Server 2016
Windows Server 2019
Installation of the self-hosted integration runtime on a domain controller isn't supported.
Self-hosted integration runtime requires a 64-bit Operating System with .NET Framework 4.7.2 or above. See
.NET Framework System Requirements for details.
The recommended minimum configuration for the self-hosted integration runtime machine is a 2-GHz
processor with 4 cores, 8 GB of RAM, and 80 GB of available hard drive space. For the details of system
requirements, see Download.
If the host machine hibernates, the self-hosted integration runtime doesn't respond to data requests.
Configure an appropriate power plan on the computer before you install the self-hosted integration runtime.
If the machine is configured to hibernate, the self-hosted integration runtime installer prompts with a
message.
You must be an administrator on the machine to successfully install and configure the self-hosted integration
runtime.
Copy-activity runs happen with a specific frequency. Processor and RAM usage on the machine follows the
same pattern with peak and idle times. Resource usage also depends heavily on the amount of data that is
moved. When multiple copy jobs are in progress, you see resource usage go up during peak times.
Tasks might fail during extraction of data in Parquet, ORC, or Avro formats. For more on Parquet, see Parquet
format in Azure Data Factory. File creation runs on the self-hosted integration machine. To work as expected,
file creation requires the following prerequisites:
Visual C++ 2010 Redistributable Package (x64)
Java Runtime (JRE) version 8 from a JRE provider such as Adopt OpenJDK. Ensure that the JAVA_HOME
environment variable is set to the JRE folder (and not just the JDK folder); a sketch for setting it follows this list.
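A minimal PowerShell sketch for setting JAVA_HOME machine-wide (the JRE path shown is a placeholder for wherever your JRE is actually installed):

# Hypothetical JRE path; point this at your real JRE folder.
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Java\jre8", "Machine")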

NOTE
If you are running in government cloud, please review Connect to government cloud.

Setting up a self-hosted integration runtime


To create and set up a self-hosted integration runtime, use the following procedures.
Create a self-hosted IR via Azure PowerShell
1. You can use Azure PowerShell for this task. Here is an example:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName
$dataFactoryName -Name $selfHostedIntegrationRuntimeName -Type SelfHosted -Description "selfhosted IR
description"

2. Download and install the self-hosted integration runtime on a local machine.


3. Retrieve the authentication key and register the self-hosted integration runtime with the key. Here is a
PowerShell example:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName

NOTE
To run PowerShell commands in Azure Government, see Connect to Azure Government with PowerShell.
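Step 3 returns the authentication keys for the integration runtime; a minimal sketch (the node name is a placeholder, and the dmgcmd.exe path is the default installer location described later in this article) that registers an already-installed local node with one of those keys is:

# Placeholder node name; dmgcmd.exe ships with the self-hosted integration runtime installer.
$key = Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name $selfHostedIntegrationRuntimeName
& "C:\Program Files\Microsoft Integration Runtime\4.0\Shared\dmgcmd.exe" -RegisterNewNode $key.AuthKey1 "MySelfHostedNode"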

Create a self-hosted IR via Azure Data Factory UI


Use the following steps to create a self-hosted IR using Azure Data Factory UI.
1. On the home page of Azure Data Factory UI, select the Manage tab from the leftmost pane.

2. Select Integration runtimes on the left pane, and then select +New .

3. On the Integration runtime setup page, select Azure, Self-Hosted , and then select Continue .
4. On the following page, select Self-Hosted to create a Self-Hosted IR, and then select Continue .

5. Enter a name for your IR, and select Create .


6. On the Integration runtime setup page, select the link under Option 1 to open the express setup on
your computer. Or follow the steps under Option 2 to set up manually. The following instructions are
based on manual setup:
a. Copy and paste the authentication key. Select Download and install integration runtime .
b. Download the self-hosted integration runtime on a local Windows machine. Run the installer.
c. On the Register Integration Runtime (Self-hosted) page, paste the key you saved earlier, and
select Register .
d. On the New Integration Runtime (Self-hosted) Node page, select Finish .
7. After the self-hosted integration runtime is registered successfully, you see the following window:

Set up a self-hosted IR on an Azure VM via an Azure Resource Manager template


You can automate self-hosted IR setup on an Azure virtual machine by using the Create self host IR template.
The template provides an easy way to have a fully functional self-hosted IR inside an Azure virtual network. The
IR has high-availability and scalability features, as long as you set the node count to 2 or higher.
Set up an existing self-hosted IR via local PowerShell
You can use a command line to set up or manage an existing self-hosted IR. This usage can especially help to
automate the installation and registration of self-hosted IR nodes.
Dmgcmd.exe is included in the self-hosted installer. It's typically located in the C:\Program Files\Microsoft
Integration Runtime\4.0\Shared\ folder. This application supports various parameters and can be invoked via a
command line using batch scripts for automation.
Use the application as follows:

dmgcmd ACTION args...

Here are details of the application's actions and arguments:

ACTION | ARGS | DESCRIPTION
-rn, -RegisterNewNode | "<AuthenticationKey>" ["<NodeName>"] | Register a self-hosted integration runtime node with the specified authentication key and node name.
-era, -EnableRemoteAccess | "<port>" ["<thumbprint>"] | Enable remote access on the current node to set up a high-availability cluster. Or enable setting credentials directly against the self-hosted IR without going through Azure Data Factory. You do the latter by using the New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet from a remote machine in the same network.
-erac, -EnableRemoteAccessInContainer | "<port>" ["<thumbprint>"] | Enable remote access to the current node when the node runs in a container.
-dra, -DisableRemoteAccess | | Disable remote access to the current node. Remote access is needed for multinode setup. The New-AzDataFactoryV2LinkedServiceEncryptedCredential PowerShell cmdlet still works even when remote access is disabled. This behavior is true as long as the cmdlet is executed on the same machine as the self-hosted IR node.
-k, -Key | "<AuthenticationKey>" | Overwrite or update the previous authentication key. Be careful with this action. Your previous self-hosted IR node can go offline if the key is of a new integration runtime.
-gbf, -GenerateBackupFile | "<filePath>" "<password>" | Generate a backup file for the current node. The backup file includes the node key and data-store credentials.
-ibf, -ImportBackupFile | "<filePath>" "<password>" | Restore the node from a backup file.
-r, -Restart | | Restart the self-hosted integration runtime host service.
-s, -Start | | Start the self-hosted integration runtime host service.
-t, -Stop | | Stop the self-hosted integration runtime host service.
-sus, -StartUpgradeService | | Start the self-hosted integration runtime upgrade service.
-tus, -StopUpgradeService | | Stop the self-hosted integration runtime upgrade service.
-tonau, -TurnOnAutoUpdate | | Turn on the self-hosted integration runtime auto-update.
-toffau, -TurnOffAutoUpdate | | Turn off the self-hosted integration runtime auto-update.
-ssa, -SwitchServiceAccount | "<domain\user>" ["<password>"] | Set DIAHostService to run as a new account. Use the empty password "" for system accounts and virtual accounts.
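For example, the following PowerShell sketch (with a placeholder authentication key and node name) shows how you might invoke dmgcmd to register a node and then enable remote access for a high-availability setup; adjust the path and values for your environment.

# Path to dmgcmd.exe in the default installation folder.
$dmgcmd = "C:\Program Files\Microsoft Integration Runtime\4.0\Shared\dmgcmd.exe"

# Register this machine as a self-hosted IR node (key and node name are placeholders).
& $dmgcmd -RegisterNewNode "<AuthenticationKey>" "Node1"

# Enable remote access on port 8060 so the node can participate in a high-availability cluster.
& $dmgcmd -EnableRemoteAccess "8060"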

Install and register a self-hosted IR from Microsoft Download Center


1. Go to the Microsoft integration runtime download page.
2. Select Download , select the 64-bit version, and select Next . The 32-bit version isn't supported.
3. Run the MSI file directly, or save it to your hard drive and run it.
4. On the Welcome window, select a language and select Next .
5. Accept the Microsoft Software License Terms and select Next .
6. Select folder to install the self-hosted integration runtime, and select Next .
7. On the Ready to install page, select Install .
8. Select Finish to complete installation.
9. Get the authentication key by using PowerShell. Here's a PowerShell example for retrieving the
authentication key:

Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntime

10. On the Register Integration Runtime (Self-hosted) window of Microsoft Integration Runtime
Configuration Manager running on your machine, take the following steps:
a. Paste the authentication key in the text area.
b. Optionally, select Show authentication key to see the key text.
c. Select Register .
NOTE
Release Notes are available on the same Microsoft integration runtime download page.

Service account for Self-hosted integration runtime


The default logon service account of the self-hosted integration runtime is NT SERVICE\DIAHostService. You
can see it in Services -> Integration Runtime Service -> Properties -> Log on.

Make sure the account has the Log on as a service permission. Otherwise the self-hosted integration runtime
can't start successfully. You can check the permission in Local Security Policy -> Security Settings ->
Local Policies -> User Rights Assignment -> Log on as a service.
Notification area icons and notifications
If you move your cursor over the icon or message in the notification area, you can see details about the state of
the self-hosted integration runtime.

High availability and scalability


You can associate a self-hosted integration runtime with multiple on-premises machines or virtual machines in
Azure. These machines are called nodes. You can have up to four nodes associated with a self-hosted integration
runtime. The benefits of associating multiple nodes (on-premises machines with the self-hosted integration
runtime installed) with a single logical integration runtime are:
Higher availability of the self-hosted integration runtime so that it's no longer the single point of failure in
your big data solution or cloud data integration with Data Factory. This availability helps ensure continuity
when you use up to four nodes.
Improved performance and throughput during data movement between on-premises and cloud data stores.
Get more information on performance comparisons.
You can associate multiple nodes by installing the self-hosted integration runtime software from Download
Center. Then, register it by using either of the authentication keys that were obtained from the New-
AzDataFactoryV2IntegrationRuntimeKey cmdlet, as described in the tutorial.

NOTE
You don't need to create a new self-hosted integration runtime to associate each node. You can install the self-hosted
integration runtime on another machine and register it by using the same authentication key.
NOTE
Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is
enabled on the first node. To do so, select Microsoft Integration Runtime Configuration Manager > Settings >
Remote access to intranet .

Scale considerations
Scale out
When processor usage is high and available memory is low on the self-hosted IR, add a new node to help scale
out the load across machines. If activities fail because they time out or the self-hosted IR node is offline, it helps
if you add a node to the gateway.
Scale up
When the processor and available RAM aren't well utilized, but the execution of concurrent jobs reaches a node's
limits, scale up by increasing the number of concurrent jobs that a node can run. You might also want to scale up
when activities time out because the self-hosted IR is overloaded. You can increase the maximum capacity
(concurrent jobs limit) for a node in Microsoft Integration Runtime Configuration Manager, or programmatically
as shown below:
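As a sketch only, assuming the Az.DataFactory PowerShell module and hypothetical resource and node names, the concurrent jobs limit for a node can also be raised from PowerShell:

# Raise the maximum number of concurrent jobs that the node "Node1" can run (all names are placeholders).
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -IntegrationRuntimeName "mySelfHostedIR" `
    -Name "Node1" `
    -ConcurrentJobsLimit 8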

TLS/SSL certificate requirements


Here are the requirements for the TLS/SSL certificate that you use to secure communication between
integration runtime nodes:
The certificate must be a publicly trusted X509 v3 certificate. We recommend that you use certificates that
are issued by a public partner certification authority (CA).
Each integration runtime node must trust this certificate.
We don't recommend Subject Alternative Name (SAN) certificates because only the last SAN item is used. All
other SAN items are ignored. For example, if you have a SAN certificate whose SANs are
node1.domain.contoso.com and node2.domain.contoso.com , you can use this certificate only on a
machine whose fully qualified domain name (FQDN) is node2.domain.contoso.com .
The certificate can use any key size supported by Windows Server 2012 R2 for TLS/SSL certificates.
Certificates that use CNG keys aren't supported.
NOTE
This certificate is used:
To encrypt ports on a self-hosted IR node.
For node-to-node communication for state synchronization, which includes credentials synchronization of linked
services across nodes.
When a PowerShell cmdlet is used for linked-service credential settings from within a local network.
We suggest you use this certificate if your private network environment is not secure or if you want to secure the
communication between nodes within your private network.
Data movement in transit from a self-hosted IR to other data stores always happens within an encrypted channel,
regardless of whether or not this certificate is set.

Credential Sync
If you don't store credentials or secret values in an Azure Key Vault, the credentials or secret values are stored
on the machines where your self-hosted integration runtime is located. Each node keeps a copy of the credentials
with a certain version. For all nodes to work together, the version number must be the same for all nodes.

Proxy server considerations


If your corporate network environment uses a proxy server to access the internet, configure the self-hosted
integration runtime to use appropriate proxy settings. You can set the proxy during the initial registration phase.

When configured, the self-hosted integration runtime uses the proxy server to connect to the cloud service's
source and destination (which use the HTTP or HTTPS protocol). This is why you select Change link during
initial setup.
There are three configuration options:
Do not use proxy : The self-hosted integration runtime doesn't explicitly use any proxy to connect to cloud
services.
Use system proxy : The self-hosted integration runtime uses the proxy setting that is configured in
diahost.exe.config and diawp.exe.config. If these files specify no proxy configuration, the self-hosted
integration runtime connects to the cloud service directly without going through a proxy.
Use custom proxy : Configure the HTTP proxy setting to use for the self-hosted integration runtime, instead
of using configurations in diahost.exe.config and diawp.exe.config. Address and Port values are required.
User Name and Password values are optional, depending on your proxy's authentication setting. All
settings are encrypted with Windows DPAPI on the self-hosted integration runtime and stored locally on the
machine.
The integration runtime host service restarts automatically after you save the updated proxy settings.
After you register the self-hosted integration runtime, if you want to view or update proxy settings, use
Microsoft Integration Runtime Configuration Manager.
1. Open Microsoft Integration Runtime Configuration Manager .
2. Select the Settings tab.
3. Under HTTP Proxy , select the Change link to open the Set HTTP Proxy dialog box.
4. Select Next . You then see a warning that asks for your permission to save the proxy setting and restart the
integration runtime host service.
You can use the configuration manager tool to view and update the HTTP proxy.
NOTE
If you set up a proxy server with NTLM authentication, the integration runtime host service runs under the domain
account. If you later change the password for the domain account, remember to update the configuration settings for the
service and restart the service. Because of this requirement, we suggest that you access the proxy server by using a
dedicated domain account that doesn't require you to update the password frequently.

Configure proxy server settings


If you select the Use system proxy option for the HTTP proxy, the self-hosted integration runtime uses the
proxy settings in diahost.exe.config and diawp.exe.config. When these files specify no proxy, the self-hosted
integration runtime connects to the cloud service directly without going through a proxy. The following
procedure provides instructions for updating the diahost.exe.config file:
1. In File Explorer, make a safe copy of C:\Program Files\Microsoft Integration
Runtime\4.0\Shared\diahost.exe.config as a backup of the original file.
2. Open Notepad running as administrator.
3. In Notepad, open the text file C:\Program Files\Microsoft Integration
Runtime\4.0\Shared\diahost.exe.config.
4. Find the default system.net tag as shown in the following code:

<system.net>
<defaultProxy useDefaultCredentials="true" />
</system.net>

You can then add proxy server details as shown in the following example:
<system.net>
<defaultProxy enabled="true">
<proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
</defaultProxy>
</system.net>

The proxy tag allows additional properties to specify required settings like scriptLocation . See <proxy>
Element (Network Settings) for syntax.

<proxy autoDetect="true|false|unspecified" bypassonlocal="true|false|unspecified"
       proxyaddress="uriString" scriptLocation="uriString" usesystemdefault="true|false|unspecified" />

5. Save the configuration file in its original location. Then restart the self-hosted integration runtime host
service, which picks up the changes.
To restart the service, use the services applet from Control Panel. Or from Integration Runtime
Configuration Manager, select the Stop Service button, and then select Start Service.
If the service doesn't start, you likely added incorrect XML tag syntax in the application configuration file
that you edited.
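If you prefer the command line, a minimal sketch for restarting the host service from an elevated PowerShell session on the self-hosted IR machine looks like this (DIAHostService is the default service name mentioned earlier in this article):

# Restart the self-hosted integration runtime host service so it picks up the new proxy settings.
Restart-Service -Name "DIAHostService"

# Confirm that the service is running again.
Get-Service -Name "DIAHostService"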

IMPORTANT
Don't forget to update both diahost.exe.config and diawp.exe.config.

You also need to make sure that Microsoft Azure is in your company's allowlist. You can download the list of
valid Azure IP addresses from Microsoft Download Center.
Possible symptoms for issues related to the firewall and proxy server
If you see error messages like the following ones, the likely reason is improper configuration of the firewall or
proxy server. Such configuration prevents the self-hosted integration runtime from connecting to Data Factory
to authenticate itself. To ensure that your firewall and proxy server are properly configured, refer to the previous
section.
When you try to register the self-hosted integration runtime, you receive the following error message:
"Failed to register this Integration Runtime node! Confirm that the Authentication key is valid and the
integration service host service is running on this machine."
When you open Integration Runtime Configuration Manager, you see a status of Disconnected or
Connecting. When you view Windows event logs, under Event Viewer > Application and Services
Logs > Microsoft Integration Runtime, you see error messages like this one:

Unable to connect to the remote server


A component of Integration Runtime has become unresponsive and restarts automatically. Component
name: Integration Runtime (Self-hosted).

Enable remote access from an intranet


If you use PowerShell to encrypt credentials from a networked machine other than where you installed the self-
hosted integration runtime, you can enable the Remote Access from Intranet option. If you run PowerShell to
encrypt credentials on the machine where you installed the self-hosted integration runtime, you can't enable
Remote Access from Intranet .
Enable Remote Access from Intranet before you add another node for high availability and scalability.
When you run the self-hosted integration runtime setup version 3.3 or later, by default the self-hosted
integration runtime installer disables Remote Access from Intranet on the self-hosted integration runtime
machine.
When you use a firewall from a partner or others, you can manually open port 8060 or the user-configured
port. If you have a firewall problem while setting up the self-hosted integration runtime, use the following
command to install the self-hosted integration runtime without configuring the firewall:

msiexec /q /i IntegrationRuntime.msi NOFIREWALL=1

If you choose not to open port 8060 on the self-hosted integration runtime machine, use mechanisms other
than the Setting Credentials application to configure data-store credentials. For example, you can use the
New-AzDataFactoryV2LinkedServiceEncryptedCredential PowerShell cmdlet, as shown below.
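The following is a minimal sketch of that cmdlet; the resource names and the definition file path are hypothetical, and the JSON file is assumed to contain the linked service definition whose credentials should be encrypted by the self-hosted IR.

# Encrypt linked-service credentials on the self-hosted IR without using the Setting Credentials application.
New-AzDataFactoryV2LinkedServiceEncryptedCredential -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -IntegrationRuntimeName "mySelfHostedIR" `
    -DefinitionFile ".\onPremSqlLinkedService.json"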

Ports and firewalls


There are two firewalls to consider:
The corporate firewall that runs on the central router of the organization
The Windows firewall that is configured as a daemon on the local machine where the self-hosted integration
runtime is installed

At the corporate firewall level, you need to configure the following domains and outbound ports:

DOMAIN NAMES | OUTBOUND PORTS | DESCRIPTION
Public Cloud: *.servicebus.windows.net; Azure Government: *.servicebus.usgovcloudapi.net; China: *.servicebus.chinacloudapi.cn | 443 | Required by the self-hosted integration runtime for interactive authoring.
Public Cloud: {datafactory}.{region}.datafactory.azure.net or *.frontend.clouddatahub.net; Azure Government: {datafactory}.{region}.datafactory.azure.us; China: {datafactory}.{region}.datafactory.azure.cn | 443 | Required by the self-hosted integration runtime to connect to the Data Factory service. For a newly created Data Factory in public cloud, find the FQDN from your self-hosted integration runtime key, which is in the format {datafactory}.{region}.datafactory.azure.net. For an old data factory, if you don't see the FQDN in your self-hosted integration runtime key, use *.frontend.clouddatahub.net instead.
download.microsoft.com | 443 | Required by the self-hosted integration runtime for downloading the updates. If you have disabled auto-update, you can skip configuring this domain.
Key Vault URL | 443 | Required by Azure Key Vault if you store the credential in Key Vault.
At the Windows firewall level or machine level, these outbound ports are normally enabled. If they aren't, you
can configure the domains and ports on a self-hosted integration runtime machine.
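To verify that the required outbound connectivity is in place, you can run a quick check such as the following from the self-hosted integration runtime machine (a sketch only; replace the placeholder with the Data Factory FQDN from your self-hosted IR key):

# Check outbound HTTPS connectivity to one of the domains listed in the table above.
Test-NetConnection -ComputerName "download.microsoft.com" -Port 443

# Check connectivity to your Data Factory endpoint, for example {datafactory}.{region}.datafactory.azure.net.
Test-NetConnection -ComputerName "<yourDataFactoryFqdn>" -Port 443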

NOTE
Because Azure Relay doesn't currently support service tags, you have to use the AzureCloud or Internet service tag
in NSG rules for the communication to Azure Relay. For the communication to Azure Data Factory, you can use the
DataFactoryManagement service tag in the NSG rule setup.

Based on your source and sinks, you might need to allow additional domains and outbound ports in your
corporate firewall or Windows firewall.

DOMAIN NAMES | OUTBOUND PORTS | DESCRIPTION
*.core.windows.net | 443 | Used by the self-hosted integration runtime to connect to the Azure storage account when you use the staged copy feature.
*.database.windows.net | 1433 | Required only when you copy from or to Azure SQL Database or Azure Synapse Analytics and optional otherwise. Use the staged-copy feature to copy data to SQL Database or Azure Synapse Analytics without opening port 1433.
*.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token | 443 | Required only when you copy from or to Azure Data Lake Store and optional otherwise.

For some cloud databases, such as Azure SQL Database and Azure Data Lake, you might need to allow IP
addresses of self-hosted integration runtime machines on their firewall configuration.
Get URL of Azure Relay
One required domain and port that need to be put in the allowlist of your firewall is for the communication to
Azure Relay. The self-hosted integration runtime uses it for interactive authoring such as test connection, browse
folder list and table list, get schema, and preview data. If you don't want to allow *.servicebus.windows.net and
would like to have more specific URLs, then you can see all the FQDNs that are required by your self-hosted
integration runtime from the ADF portal. Follow these steps:
1. Go to ADF portal and select your self-hosted integration runtime.
2. In Edit page, select Nodes .
3. Select View Service URLs to get all FQDNs.

4. You can add these FQDNs in the allowlist of firewall rules.

NOTE
For the details related to Azure Relay connections protocol, see Azure Relay Hybrid Connections protocol.

Copy data from a source to a sink


Ensure that you properly enable firewall rules on the corporate firewall, the Windows firewall of the self-hosted
integration runtime machine, and the data store itself. Enabling these rules lets the self-hosted integration
runtime successfully connect to both source and sink. Enable rules for each data store that is involved in the
copy operation.
For example, to copy from an on-premises data store to a SQL Database sink or an Azure Synapse Analytics
sink, take the following steps:
1. Allow outbound TCP communication on port 1433 for both the Windows firewall and the corporate firewall.
2. Configure the firewall settings of the SQL Database to add the IP address of the self-hosted integration
runtime machine to the list of allowed IP addresses.
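For step 2, a minimal sketch using PowerShell might look like the following; the resource names and IP address are hypothetical, and you would substitute the public IP address of your self-hosted integration runtime machine.

# Allow the self-hosted IR machine's IP address through the logical SQL server firewall.
New-AzSqlServerFirewallRule -ResourceGroupName "myResourceGroup" `
    -ServerName "mysqlserver" `
    -FirewallRuleName "AllowSelfHostedIR" `
    -StartIpAddress "203.0.113.25" `
    -EndIpAddress "203.0.113.25"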

NOTE
If your firewall doesn't allow outbound port 1433, the self-hosted integration runtime can't access the SQL database
directly. In this case, you can use a staged copy to SQL Database and Azure Synapse Analytics. In this scenario, you
require only HTTPS (port 443) for the data movement.

Installation best practices


You can install the self-hosted integration runtime by downloading a Managed Identity setup package from
Microsoft Download Center. See the article Move data between on-premises and cloud for step-by-step
instructions.
Configure a power plan on the host machine for the self-hosted integration runtime so that the machine
doesn't hibernate; if the host machine hibernates, the self-hosted integration runtime goes offline (see the
sample commands after this list).
Regularly back up the credentials associated with the self-hosted integration runtime.
To automate self-hosted IR setup operations, refer to Set up an existing self hosted IR via PowerShell.
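As a sketch of the power-plan recommendation above, the following commands, run from an elevated PowerShell session on the host machine, disable hibernation and sleep on AC power; adjust them to your organization's power policy.

# Disable hibernation on the self-hosted IR host machine.
powercfg /hibernate off

# Never sleep or hibernate while on AC power (a value of 0 means "never").
powercfg /change standby-timeout-ac 0
powercfg /change hibernate-timeout-ac 0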

Next steps
For step-by-step instructions, see Tutorial: Copy on-premises data to cloud.
Self-hosted integration runtime auto-update and
expire notification
7/15/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to enable auto-update of the self-hosted integration runtime to the latest version and
how ADF manages the versions of the self-hosted integration runtime.

Self-hosted Integration Runtime Auto-update


Generally, when you install a self-hosted integration runtime on your local machine or an Azure VM, you have
two options to manage its version: auto-update or manual maintenance. Typically, ADF releases two new versions
of the self-hosted integration runtime every month, which include new features, bug fixes, and enhancements. So
we recommend that you update to the newer version to get the latest features and enhancements.
The most convenient way is to enable auto-update when you create or edit the self-hosted integration runtime.
The self-hosted integration runtime is then automatically updated to the newer version. You can also schedule the
update for the time slot that suits you best.

You can check the last update datetime in your self-hosted integration runtime client.
You can also use PowerShell to check the auto-update version, as shown in the following example.
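The following sketch (with hypothetical resource names) retrieves the runtime status, whose output includes version-related properties; exact property names can vary by Az.DataFactory module version.

# Retrieve the status of the self-hosted integration runtime, including version and auto-update details.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "mySelfHostedIR" `
    -Status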

NOTE
If you have multiple self-hosted integration runtime nodes, there is no downtime during auto-update. The auto-update
happens on one node first while the others are working on tasks. When the first node finishes the update, it takes
over the remaining tasks while the other nodes update. If you only have one self-hosted integration runtime node, it
has some downtime during the auto-update.

Auto-update version vs latest version


To ensure the stability of the self-hosted integration runtime, although we release two versions each month, we only
push one version per month. So sometimes the auto-update version is the version that precedes the actual latest
version. If you want to get the latest version, you can go to the download center.
The self-hosted integration runtime Auto update page in the ADF portal shows the newer version if the current
version is old. When your self-hosted integration runtime is online, this version is the auto-update version and it
automatically updates your self-hosted integration runtime at the scheduled time. But if your self-hosted
integration runtime is offline, the page only shows the latest version.

Self-hosted Integration Runtime Expire Notification


If you want to manually control which version of the self-hosted integration runtime you run, you can disable
auto-update and install versions manually. Each version of the self-hosted integration runtime expires one year
after release. The expiration message is shown in the ADF portal and in the self-hosted integration runtime client
90 days before expiration.

Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Create a shared self-hosted integration runtime in
Azure Data Factory
7/21/2021 • 6 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This guide shows you how to create a shared self-hosted integration runtime in Azure Data Factory. Then you
can use the shared self-hosted integration runtime in another data factory.

Create a shared self-hosted integration runtime in Azure Data Factory


You can reuse an existing self-hosted integration runtime infrastructure that you already set up in a data factory.
This reuse lets you create a linked self-hosted integration runtime in a different data factory by referencing an
existing shared self-hosted IR.
To see an introduction and demonstration of this feature, watch the following 12-minute video:

Terminology
Shared IR : An original self-hosted IR that runs on a physical infrastructure.
Linked IR : An IR that references another shared IR. The linked IR is a logical IR and uses the infrastructure of
another shared self-hosted IR.

Create a shared self-hosted IR using Azure Data Factory UI


To create a shared self-hosted IR using the Azure Data Factory UI, take the following steps:
1. In the self-hosted IR to be shared, select Grant permission to another Data factory and, on the
"Integration runtime setup" page, select the data factory in which you want to create the linked IR.

2. Note and copy the "Resource ID" of the self-hosted IR to be shared.
3. In the data factory to which the permissions were granted, create a new self-hosted IR (linked) and enter
the resource ID.
Create a shared self-hosted IR using Azure PowerShell
To create a shared self-hosted IR using Azure PowerShell, take the following steps:
1. Create a data factory.
2. Create a self-hosted integration runtime.
3. Share the self-hosted integration runtime with other data factories.
4. Create a linked integration runtime.
5. Revoke the sharing.
Prerequisites

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure PowerShell . Follow the instructions in Install Azure PowerShell on Windows with PowerShellGet.
You use PowerShell to run a script to create a self-hosted integration runtime that can be shared with
other data factories.

NOTE
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on Products
available by region.

Create a data factory


1. Launch the Windows PowerShell Integrated Scripting Environment (ISE).
2. Create variables. Copy and paste the following script. Replace the variables, such as SubscriptionName
and ResourceGroupName , with actual values:

# If input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$".
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
$DataFactoryLocation = "EastUS"

# Shared self-hosted integration runtime information. This is a Data Factory compute resource for running any activities.
# Data factory name. Must be globally unique.
$SharedDataFactoryName = "[Shared Data factory name]"
$SharedIntegrationRuntimeName = "[Shared Integration Runtime Name]"
$SharedIntegrationRuntimeDescription = "[Description for Shared Integration Runtime]"

# Linked integration runtime information. This is a Data Factory compute resource for running any activities.
# Data factory name. Must be globally unique.
$LinkedDataFactoryName = "[Linked Data factory name]"
$LinkedIntegrationRuntimeName = "[Linked Integration Runtime Name]"
$LinkedIntegrationRuntimeDescription = "[Description for Linked Integration Runtime]"

3. Sign in and select a subscription. Add the following code to the script to sign in and select your Azure
subscription:

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

4. Create a resource group and a data factory.

NOTE
This step is optional. If you already have a data factory, skip this step.

Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a
logical container into which Azure resources are deployed and managed as a group. The following
example creates a resource group with the name and in the location specified by the variables you defined earlier:

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Run the following command to create a data factory:


Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $SharedDataFactoryName

Create a self-hosted integration runtime

NOTE
This step is optional. If you already have the self-hosted integration runtime that you want to share with other data
factories, skip this step.

Run the following command to create a self-hosted integration runtime:

$SharedIR = Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-Type SelfHosted `
-Description $SharedIntegrationRuntimeDescription

Get the integration runtime authentication key and register a node


Run the following command to get the authentication key for the self-hosted integration runtime:

Get-AzDataFactoryV2IntegrationRuntimeKey `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName

The response contains the authentication key for this self-hosted integration runtime. You use this key when you
register the integration runtime node.
Install and register the self-hosted integration runtime
1. Download the self-hosted integration runtime installer from Azure Data Factory Integration Runtime.
2. Run the installer to install the self-hosted integration on a local computer.
3. Register the new self-hosted integration with the authentication key that you retrieved in a previous step.
Share the self-hosted integration runtime with another data factory
Create another data factory

NOTE
This step is optional. If you already have the data factory that you want to share with, skip this step. But in order to add
or remove role assignments for the other data factory, you must have Microsoft.Authorization/roleAssignments/write
and Microsoft.Authorization/roleAssignments/delete permissions, such as the User Access Administrator or Owner role.

$factory = Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Location $DataFactoryLocation `
    -Name $LinkedDataFactoryName

Grant permission
Grant permission to the data factory that needs to access the self-hosted integration runtime you created and
registered.
IMPORTANT
Do not skip this step!

# $factory.Identity.PrincipalId is the managed identity of the data factory with which the IR needs to be shared.
New-AzRoleAssignment `
    -ObjectId $factory.Identity.PrincipalId `
    -RoleDefinitionName 'Contributor' `
    -Scope $SharedIR.Id

Create a linked self-hosted integration runtime


Run the following command to create a linked self-hosted integration runtime:

Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $LinkedDataFactoryName `
-Name $LinkedIntegrationRuntimeName `
-Type SelfHosted `
-SharedIntegrationRuntimeResourceId $SharedIR.Id `
-Description $LinkedIntegrationRuntimeDescription

Now you can use this linked integration runtime in any linked service. The linked integration runtime uses the
shared integration runtime to run activities.
Revoke integration runtime sharing from a data factory
To revoke the access of a data factory from the shared integration runtime, run the following command:

Remove-AzRoleAssignment `
-ObjectId $factory.Identity.PrincipalId `
-RoleDefinitionName 'Contributor' `
-Scope $SharedIR.Id

To remove the existing linked integration runtime, run the following command against the shared integration
runtime:

Remove-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-LinkedDataFactoryName $LinkedDataFactoryName

Monitoring
Shared IR
Linked IR

Known limitations of self-hosted IR sharing


The data factory in which a linked IR is created must have a managed identity. By default, the data
factories created in the Azure portal or PowerShell cmdlets have an implicitly created Managed Identity.
But when a data factory is created through an Azure Resource Manager template or SDK, you must set
the Identity property explicitly. This setting ensures that Resource Manager creates a data factory that
contains a Managed Identity.
The Data Factory .NET SDK that supports this feature must be version 1.1.0 or later.
To grant permission, you need the Owner role or the inherited Owner role in the data factory where the
shared IR exists.
The sharing feature works only for data factories within the same Azure AD tenant.
For Azure AD guest users, the search functionality in the UI, which lists all data factories by using a search
keyword, doesn't work. But as long as the guest user is the owner of the data factory, you can share the IR
without the search functionality. For the Managed Identity of the data factory that needs to share the IR,
enter that Managed Identity in the Assign Permission box and select Add in the Data Factory UI.

NOTE
This feature is available only in Data Factory V2.

Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Automating self-hosted integration runtime
installation using local PowerShell scripts
5/6/2021 • 2 minutes to read • Edit Online

To automate installation of Self-hosted Integration Runtime on local machines (other than Azure VMs where we
can leverage the Resource Manager template instead), you can use local PowerShell scripts. This article
introduces two scripts you can use.

Prerequisites
Launch PowerShell on your local machine. To run the scripts, you need to choose Run as Administrator .
Download the self-hosted integration runtime software. Copy the path where the downloaded file is.
You also need an authentication key to register the self-hosted integration runtime.
For automating manual updates, you need to have a pre-configured self-hosted integration runtime.

Scripts introduction
NOTE
These scripts are created using the documented command line utility in the self-hosted integration runtime. If needed one
can customize these scripts accordingly to cater to their automation needs. The scripts need to be applied per node, so
make sure to run it across all nodes in case of high availability setup (2 or more nodes).

For automating setup: Install and register a new self-hosted integration runtime node using
InstallGatewayOnLocalMachine.ps1 - The script can be used to install a self-hosted integration runtime
node and register it with an authentication key. The script accepts two arguments: the first specifies the
location of the self-hosted integration runtime installer on a local disk, and the second specifies the
authentication key (for registering the self-hosted IR node).
For automating manual updates: Update the self-hosted IR node with a specific version or to the latest
version using script-update-gateway.ps1 - This is also supported in case you have turned off auto-update
or want more control over updates. The script can be used to update the self-hosted integration runtime
node to the latest version or to a specified higher version (downgrade doesn't work). It accepts an argument
for specifying the version number (example: -version 3.13.6942.1). When no version is specified, it always
updates the self-hosted IR to the latest version found in the downloads.

NOTE
Only the last 3 versions can be specified. Ideally this is used to update an existing node to the latest version. It
assumes that you have a registered self-hosted IR.

Usage examples
For automating setup
1. Download the self-hosted IR from here.
2. Specify the path where the above downloaded SHIR MSI (installation file) is. For example, if the path is
C:\Users\username\Downloads\IntegrationRuntime_4.7.7368.1.msi, then you can use below PowerShell
command-line example for this task:

PS C:\windows\system32> C:\Users\username\Desktop\InstallGatewayOnLocalMachine.ps1 `
    -path "C:\Users\username\Downloads\IntegrationRuntime_4.7.7368.1.msi" -authKey "[key]"

NOTE
Replace [key] with the authentication key to register your IR. Replace "username" with your user name. Specify the
location of the "InstallGatewayOnLocalMachine.ps1" file when running the script. In this example we stored it on
Desktop.

3. If there is a pre-installed self-hosted IR on your machine, the script automatically uninstalls it and then
configures a new one. A confirmation window pops up.

4. When the installation and key registration complete, you'll see the Succeed to install gateway and Succeed
to register gateway results in your local PowerShell.

For automating manual updates


This script is used to update/install + register latest self-hosted integration runtime. The script run performs the
following steps:
1. Check current self-hosted IR version
2. Get latest version or specified version from argument
3. If there is newer version than current version:
download self-hosted IR msi
upgrade it
You can follow the command-line examples below to use this script:
Download and install latest gateway:

PS C:\windows\system32> C:\Users\username\Desktop\script-update-gateway.ps1

Download and install gateway of specified version:

PS C:\windows\system32> C:\Users\username\Desktop\script-update-gateway.ps1 -version 3.13.6942.1

If your current version is already the latest one, you'll see following result, suggesting no update is
required.
How to run Self-Hosted Integration Runtime in
Windows container
5/25/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article explains how to run the self-hosted integration runtime in a Windows container. Azure Data Factory
provides official Windows container support for the self-hosted integration runtime. You can download the
Docker build source code and combine the build and run process in your own continuous delivery
pipeline.

Prerequisites
Windows container requirements
Docker Version 2.3 and later
Self-Hosted Integration Runtime Version 5.2.7713.1 and later

Get started
1. Install Docker and enable Windows Container
2. Download the source code from https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-
Windows-Container
3. Download the latest version of the self-hosted integration runtime into the 'SHIR' folder
4. Open your folder in the shell:

cd"yourFolderPath"

5. Build the windows docker image:

docker build . -t "yourDockerImageName"

6. Run docker container:

docker run -d -e NODE_NAME="irNodeName" -e AUTH_KEY="IR_AUTHENTICATION_KEY" -e ENABLE_HA=true -e HA_PORT=8060 "yourDockerImageName"

NOTE
AUTH_KEY is mandatory for this command. NODE_NAME, ENABLE_HA and HA_PORT are optional. If you don't set the
value, the command will use default values. The default value of ENABLE_HA is false and HA_PORT is 8060.

Container health check


After a 120-second startup period, the health checker runs periodically every 30 seconds. It provides the IR
health status to the container engine.
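As a rough sketch, you can query the reported health state with standard Docker commands from PowerShell; the container name is a placeholder.

# List running containers and their reported health status.
docker ps

# Inspect the health state of the self-hosted IR container (replace with your container name).
docker inspect --format "{{.State.Health.Status}}" "yourContainerName"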
Limitations
Currently, the following features aren't supported when running the self-hosted integration runtime in a Windows
container:
HTTP proxy
Encrypted Node-node communication with TLS/SSL certificate
Generate and import backup
Daemon service
Auto update
Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Create an Azure-SSIS integration runtime in Azure
Data Factory
7/20/2021 • 40 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article provides steps for provisioning an Azure-SQL Server Integration Services (SSIS) integration runtime
(IR) in Azure Data Factory (ADF). An Azure-SSIS IR supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
After an Azure-SSIS IR is provisioned, you can use familiar tools to deploy and run your packages in Azure.
These tools are already Azure-enabled and include SQL Server Data Tools (SSDT), SQL Server Management
Studio (SSMS), and command-line utilities like dtutil and AzureDTExec.
The Provisioning Azure-SSIS IR tutorial shows how to create an Azure-SSIS IR via the Azure portal or the Data
Factory app. The tutorial also shows how to optionally use an Azure SQL Database server or managed instance
to host SSISDB. This article expands on the tutorial and describes how to do these optional tasks:
Use an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a managed
instance with private endpoint to host SSISDB. As a prerequisite, you need to configure virtual network
permissions and settings for your Azure-SSIS IR to join a virtual network.
Use Azure Active Directory (Azure AD) authentication with the specified system/user-assigned managed
identity for your data factory to connect to an Azure SQL Database server or managed instance. As a
prerequisite, you need to add the specified system/user-assigned managed identity for your data factory
as a database user who can create an SSISDB instance.
Join your Azure-SSIS IR to a virtual network, or configure a self-hosted IR as proxy for your Azure-SSIS IR
to access data on-premises.
This article shows how to provision an Azure-SSIS IR by using the Azure portal, Azure PowerShell, and an Azure
Resource Manager template.
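As a quick orientation before the prerequisites, the following is a minimal, illustrative PowerShell sketch of provisioning a basic Azure-SSIS IR without SSISDB or virtual network options; all names and sizing values are hypothetical, and the PowerShell walkthrough covered by this article includes the complete set of options.

# Create a basic Azure-SSIS IR (Type Managed) with hypothetical names and sizing values.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "myAzureSsisIR" `
    -Type Managed `
    -Location "EastUS" `
    -NodeSize "Standard_D8_v3" `
    -NodeCount 2 `
    -Edition "Standard" `
    -MaxParallelExecutionsPerNode 2 `
    -LicenseType "LicenseIncluded"

# Start the Azure-SSIS IR; provisioning can take a while to complete.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "myAzureSsisIR" `
    -Force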

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Azure subscription . If you don't already have a subscription, you can create a free trial account.
Azure SQL Database server or SQL Managed Instance (optional). If you don't already have a
database server or managed instance, create one in the Azure portal before you get started. Data Factory
will in turn create an SSISDB instance on this database server.
We recommend that you create the database server or managed instance in the same Azure region as
the integration runtime. This configuration lets the integration runtime write execution logs into SSISDB
without crossing Azure regions.
Keep these points in mind:
The SSISDB instance can be created on your behalf as a single database, as part of an elastic pool,
or in a managed instance. It can be accessible in a public network or by joining a virtual network.
For guidance in choosing between SQL Database and SQL Managed Instance to host SSISDB, see
the Compare SQL Database and SQL Managed Instance section in this article.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a SQL managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Join an Azure-SSIS IR to a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a SQL managed instance with private endpoint to host SSISDB. For
more information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see
New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure AD authentication with the specified system/user-assigned managed
identity for your data factory. For the latter, you need to add the specified system/user-assigned
managed identity for your data factory into an Azure AD group with access permissions to the
database server. For more information, see Enable Azure AD authentication for an Azure-SSIS IR.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
Azure Resource Manager virtual network (optional). You must have an Azure Resource Manager
virtual network if at least one of the following conditions is true:
You're hosting SSISDB on an Azure SQL Database server with IP firewall rules/virtual network
service endpoints or a managed instance with private endpoint.
You want to connect to on-premises data stores from SSIS packages running on your Azure-SSIS
IR without configuring a self-hosted IR.
Azure PowerShell (optional) . Follow the instructions in How to install and configure Azure PowerShell,
if you want to run a PowerShell script to provision your Azure-SSIS IR.
Regional support
For a list of Azure regions in which Data Factory and an Azure-SSIS IR are available, see Data Factory and SSIS IR
availability by region.
Comparison of SQL Database and SQL Managed Instance
The following table compares certain features of an Azure SQL Database server and SQL Managed Instance as
they relate to the Azure-SSIS IR:
FEATURE | SQL DATABASE | SQL MANAGED INSTANCE
Scheduling | The SQL Server Agent is not available. See Schedule a package execution in a Data Factory pipeline. | The Managed Instance Agent is available.
Authentication | You can create an SSISDB instance with a contained database user who represents any Azure AD group with the managed identity of your data factory as a member in the db_owner role. See Enable Azure AD authentication to create an SSISDB in Azure SQL Database server. | You can create an SSISDB instance with a contained database user who represents the managed identity of your data factory. See Enable Azure AD authentication to create an SSISDB in Azure SQL Managed Instance.
Service tier | When you create an Azure-SSIS IR with your Azure SQL Database server, you can select the service tier for SSISDB. There are multiple service tiers. | When you create an Azure-SSIS IR with your managed instance, you can't select the service tier for SSISDB. All databases in your managed instance share the same resource allocated to that instance.
Virtual network | Your Azure-SSIS IR can join an Azure Resource Manager virtual network if you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints. | Your Azure-SSIS IR can join an Azure Resource Manager virtual network if you use a managed instance with private endpoint. The virtual network is required when you don't enable a public endpoint for your managed instance. If you join your Azure-SSIS IR to the same virtual network as your managed instance, make sure that your Azure-SSIS IR is in a different subnet from your managed instance. If you join your Azure-SSIS IR to a different virtual network from your managed instance, we recommend either a virtual network peering or a network-to-network connection. See Connect your application to an Azure SQL Database Managed Instance.
Distributed transactions | This feature is supported through elastic transactions. Microsoft Distributed Transaction Coordinator (MSDTC) transactions are not supported. If your SSIS packages use MSDTC to coordinate distributed transactions, consider migrating to elastic transactions for Azure SQL Database. For more information, see Distributed transactions across cloud databases. | Not supported.

Use the Azure portal to create an integration runtime


In this section, you use the Azure portal, specifically the Data Factory user interface (UI) or app, to create an
Azure-SSIS IR.
Create a data factory
To create your data factory via the Azure portal, follow the step-by-step instructions in Create a data factory via
the UI. Select Pin to dashboard while doing so, to allow quick access after its creation.
After your data factory is created, open its overview page in the Azure portal. Select the Author & Monitor tile
to open its Let's get started page on a separate tab. There, you can continue to create your Azure-SSIS IR.
Provision an Azure -SSIS integration runtime
On the home page, select the Configure SSIS tile to open the Integration runtime setup pane.

The Integration runtime setup pane has three pages where you successively configure general, deployment,
and advanced settings.
General settings page
On the General settings page of Integration runtime setup pane, complete the following steps.
1. For Name , enter the name of your integration runtime.
2. For Description , enter the description of your integration runtime.
3. For Location , select the location of your integration runtime. Only supported locations are displayed. We
recommend that you select the same location of your database server to host SSISDB.
4. For Node Size , select the size of the node in your integration runtime cluster. Only supported node sizes
are displayed. Select a large node size (scale up) if you want to run many compute-intensive or memory-
intensive packages.

NOTE
If you require compute isolation, please select the Standard_E64i_v3 node size. This node size represents
isolated virtual machines that consume their entire physical host and provide the necessary level of isolation
required by certain workloads, such as the US Department of Defense's Impact Level 5 (IL5) workloads.

5. For Node Number , select the number of nodes in your integration runtime cluster. Only supported node
numbers are displayed. Select a large cluster with many nodes (scale out) if you want to run many
packages in parallel.
6. For Edition/License , select the SQL Server edition for your integration runtime: Standard or Enterprise.
Select Enterprise if you want to use advanced features on your integration runtime.
7. For Save Money , select the Azure Hybrid Benefit option for your integration runtime: Yes or No . Select
Yes if you want to bring your own SQL Server license with Software Assurance to benefit from cost
savings with hybrid use.
8. Select Continue .
Deployment settings page
On the Deployment settings page of Integration runtime setup pane, you have the options to create
SSISDB and or Azure-SSIS IR package stores.
Creating SSISDB

On the Deployment settings page of Integration runtime setup pane, if you want to deploy your packages
into SSISDB (Project Deployment Model), select the Create SSIS catalog (SSISDB) hosted by Azure SQL
Database server/Managed Instance to store your projects/packages/environments/execution logs
check box. Alternatively, if you want to deploy your packages into file system, Azure Files, or SQL Server
database (MSDB) hosted by Azure SQL Managed Instance (Package Deployment Model), no need to create
SSISDB nor select the check box.
Regardless of your deployment model, if you want to use SQL Server Agent hosted by Azure SQL Managed
Instance to orchestrate/schedule your package executions, it's enabled by SSISDB, so select the check box
anyway. For more information, see Schedule SSIS package executions via Azure SQL Managed Instance Agent.
If you select the check box, complete the following steps to bring your own database server to host SSISDB that
we'll create and manage on your behalf.
1. For Subscription , select the Azure subscription that has your database server to host SSISDB.
2. For Location , select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
3. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, the SSISDB instance can be created on your behalf as a single
database, as part of an elastic pool, or in a managed instance. It can be accessible in a public network or
by joining a virtual network. For guidance in choosing the type of database server to host SSISDB, see
Compare SQL Database and SQL Managed Instance.
If you select an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a
managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual network. For more
information, see Join an Azure-SSIS IR to a virtual network.
4. Select either the Use AAD authentication with the system managed identity for Data Factory or
Use AAD authentication with a user-assigned managed identity for Data Factory check box to
choose Azure AD authentication method for Azure-SSIS IR to access your database server that hosts
SSISDB. Don't select any of the check boxes to choose SQL authentication method instead.
If you select any of the check boxes, you'll need to add the specified system/user-assigned managed
identity for your data factory into an Azure AD group with access permissions to your database server. If
you select the Use AAD authentication with a user-assigned managed identity for Data Factory
check box, you can then select any existing credentials created using your specified user-assigned
managed identities or create new ones. For more information, see Enable Azure AD authentication for an
Azure-SSIS IR.
5. For Admin Username , enter the SQL authentication username for your database server that hosts
SSISDB.
6. For Admin Password , enter the SQL authentication password for your database server that hosts
SSISDB.
7. Select the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover check
box to configure a dual standby Azure SSIS IR pair that works in sync with Azure SQL Database/Managed
Instance failover group for business continuity and disaster recovery (BCDR).
If you select the check box, enter a name to identify your pair of primary and secondary Azure-SSIS IRs in
the Dual standby pair name text box. You need to enter the same pair name when creating your
primary and secondary Azure-SSIS IRs.
For more information, see Configure your Azure-SSIS IR for BCDR.
8. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB.
Select the Basic, Standard, or Premium tier, or select an elastic pool name.
Select Test connection when applicable, and if it's successful, select Continue .

NOTE
If you use Azure SQL Database server to host SSISDB, your data will be stored in geo-redundant storage for backups by
default. If you don't want your data to be replicated in other regions, please follow the instructions to Configure backup
storage redundancy by using PowerShell.

Creating Azure-SSIS IR package stores

On the Deployment settings page of Integration runtime setup pane, if you want to manage your
packages that are deployed into MSDB, file system, or Azure Files (Package Deployment Model) with Azure-SSIS
IR package stores, select the Create package stores to manage your packages that are deployed into
file system/Azure Files/SQL Server database (MSDB) hosted by Azure SQL Managed Instance check
box.
The Azure-SSIS IR package store allows you to import, export, delete, and run packages, and to monitor or stop running
packages via SSMS, similar to the legacy SSIS package store. For more information, see Manage SSIS packages
with Azure-SSIS IR package stores.
If you select this check box, you can add multiple package stores to your Azure-SSIS IR by selecting New .
Conversely, one package store can be shared by multiple Azure-SSIS IRs.
On the Add package store pane, complete the following steps.
1. For Package store name , enter the name of your package store.
2. For Package store linked service , select your existing linked service that stores the access information
for the file system/Azure Files/Azure SQL Managed Instance where your packages are deployed, or create a
new one by selecting New . On the New linked service pane, complete the following steps (a PowerShell
sketch of an equivalent linked service definition follows this list).

NOTE
You can use either Azure File Storage or File System linked services to access Azure Files. If you use the Azure
File Storage linked service, the Azure-SSIS IR package store currently supports only the Basic authentication
method (not Account key or SAS URI ).
a. For Name , enter the name of your linked service.
b. For Description , enter the description of your linked service.
c. For Type , select Azure File Storage , Azure SQL Managed Instance , or File System .
d. You can ignore Connect via integration runtime , since we always use your Azure-SSIS IR to
fetch the access information for package stores.
e. If you select Azure File Storage , for Authentication method , select Basic , and then complete
the following steps.
a. For Account selection method , select From Azure subscription or Enter manually .
b. If you select From Azure subscription , select the relevant Azure subscription , Storage
account name , and File share .
c. If you select Enter manually , enter
\\<storage account name>.file.core.windows.net\<file share name> for Host ,
Azure\<storage account name> for Username , and <storage account key> for Password or
select your Azure Key Vault where it's stored as a secret.
f. If you select Azure SQL Managed Instance , complete the following steps.
a. Select Connection string or your Azure Key Vault where it's stored as a secret.
b. If you select Connection string , complete the following steps.
a. For Account selection method , if you choose From Azure subscription , select
the relevant Azure subscription , Server name , Endpoint type and Database
name . If you choose Enter manually , complete the following steps.
a. For Fully qualified domain name , enter
<server name>.<dns prefix>.database.windows.net or
<server name>.public.<dns prefix>.database.windows.net,3342 as the private
or public endpoint of your Azure SQL Managed Instance, respectively. If you
enter the private endpoint, Test connection isn't applicable, since ADF UI
can't reach it.
b. For Database name , enter msdb .
b. For Authentication type , select SQL Authentication , Managed Identity ,
Service Principal , or User-Assigned Managed Identity .
If you select SQL Authentication , enter the relevant Username and
Password or select your Azure Key Vault where it's stored as a secret.
If you select Managed Identity , grant the system managed identity for your
ADF access to your Azure SQL Managed Instance.
If you select Service Principal , enter the relevant Service principal ID and
Service principal key or select your Azure Key Vault where it's stored as a
secret.
If you select User-Assigned Managed Identity , grant the specified user-
assigned managed identity for your ADF access to your Azure SQL Managed
Instance. You can then select any existing credentials created using your
specified user-assigned managed identities or create new ones.
g. If you select File system , enter the UNC path of the folder where your packages are deployed for
Host , as well as the relevant Username and Password , or select your Azure Key Vault where it's
stored as a secret.
h. Select Test connection when applicable and if it's successful, select Create .
3. Your added package stores will appear on the Deployment settings page. To remove them, select their
check boxes, and then select Delete .
Select Test connection when applicable and if it's successful, select Continue .
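As an alternative to the New linked service pane, you can define the same kind of linked service in JSON and register it with Azure PowerShell. The following is a minimal sketch for an Azure File Storage linked service with Basic authentication; the account, share, and key values are placeholders, and the JSON shape is an assumption based on the host/username/password settings described above.

# A sketch only: define and register an Azure File Storage linked service that can back a package store.
$azureFilesLinkedService = @'
{
    "name": "AzureFilesPackageStore",
    "properties": {
        "type": "AzureFileStorage",
        "typeProperties": {
            "host": "\\\\<storage account name>.file.core.windows.net\\<file share name>",
            "userId": "AZURE\\<storage account name>",
            "password": {
                "type": "SecureString",
                "value": "<storage account key>"
            }
        }
    }
}
'@
Set-Content -Path .\AzureFilesPackageStore.json -Value $azureFilesLinkedService

Set-AzDataFactoryV2LinkedService -ResourceGroupName "<your resource group name>" `
    -DataFactoryName "<your data factory name>" `
    -Name "AzureFilesPackageStore" `
    -DefinitionFile .\AzureFilesPackageStore.json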
Advanced settings page
On the Advanced settings page of Integration runtime setup pane, complete the following steps.
1. For Maximum Parallel Executions Per Node , select the maximum number of packages to run
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number if you want to use more than one core to run a single large package that's
compute or memory intensive. Select a high number if you want to run one or more small packages in a
single core.
2. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box to choose whether you want to add
standard/express custom setups on your Azure-SSIS IR. For more information, see Custom setup for an
Azure-SSIS IR.
If you select the check box, complete the following steps.
a. For Custom setup container SAS URI , enter the SAS URI of your container where you store
scripts and associated files for standard custom setups (a PowerShell sketch for generating such a
SAS URI follows this list).
b. For Express custom setup , select New to open the Add express custom setup panel and then
select any types under the Express custom setup type dropdown menu, e.g. Run cmdkey
command , Add environment variable , Install licensed component , etc.
If you select the Install licensed component type, you can then select any integrated
components from our ISV partners under the Component name dropdown menu and if
required, enter the product license key/upload the product license file that you purchased from
them into the License key /License file box.
Your added express custom setups will appear on the Advanced settings page. To remove them,
you can select their check boxes and then select Delete .
3. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to create
certain network resources, and optionally bring your own static public IP addresses check box
to choose whether you want to join your integration runtime to a virtual network.
Select it if you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
(that is, you have on-premises data sources or destinations in your SSIS packages) without configuring a
self-hosted IR. For more information, see Join Azure-SSIS IR to a virtual network.
If you select the check box, complete the following steps.

a. For Subscription , select the Azure subscription that has your virtual network.
b. For Location , the same location as your integration runtime is selected.
c. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
d. For VNet Name , select the name of your virtual network. It should be the same one used for your
Azure SQL Database server with virtual network service endpoints or managed instance with
private endpoint to host SSISDB. Or it should be the same one connected to your on-premises
network. Otherwise, it can be any virtual network to bring your own static public IP addresses for
Azure-SSIS IR.
e. For Subnet Name , select the name of subnet for your virtual network. It should be the same one
used for your Azure SQL Database server with virtual network service endpoints to host SSISDB.
Or it should be a different subnet from the one used for your managed instance with private
endpoint to host SSISDB. Otherwise, it can be any subnet to bring your own static public IP
addresses for Azure-SSIS IR.
f. Select the Bring static public IP addresses for your Azure-SSIS Integration Runtime
check box to choose whether you want to bring your own static public IP addresses for Azure-SSIS
IR, so you can allow them on the firewall for your data sources.
If you select the check box, complete the following steps.
a. For First static public IP address , select the first static public IP address that meets the
requirements for your Azure-SSIS IR. If you don't have any, click Create new link to create
static public IP addresses on Azure portal and then click the refresh button here, so you can
select them.
b. For Second static public IP address , select the second static public IP address that meets
the requirements for your Azure-SSIS IR. If you don't have any, click Create new link to
create static public IP addresses on Azure portal and then click the refresh button here, so
you can select them.
4. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS Integration
Runtime check box to choose whether you want to configure a self-hosted IR as proxy for your Azure-
SSIS IR. For more information, see Set up a self-hosted IR as proxy.
If you select the check box, complete the following steps.
a. For Self-Hosted Integration Runtime , select your existing self-hosted IR as a proxy for Azure-
SSIS IR.
b. For Staging Storage Linked Ser vice , select your existing Azure Blob storage linked service or
create a new one for staging.
c. For Staging Path , specify a blob container in your selected Azure Blob storage account or leave it
empty to use a default one for staging.
5. Select VNet Validation > Continue .
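If you need to generate the custom setup container SAS URI mentioned in step 2 above, here's a minimal Azure PowerShell sketch. The storage account, key, and container names are placeholders, and the permission set and expiry shown are assumptions; check Custom setup for an Azure-SSIS IR for the exact requirements of your scenario.

# A sketch only: generate a SAS URI for the blob container that holds your custom setup script and associated files.
$ctx = New-AzStorageContext -StorageAccountName "<your storage account name>" -StorageAccountKey "<your storage account key>"
$setupContainerSasUri = New-AzStorageContainerSASToken -Name "<your custom setup container>" `
    -Context $ctx -Permission rwdl -ExpiryTime (Get-Date).AddYears(1) -FullUri

# Paste the resulting URI into the Custom setup container SAS URI box.
$setupContainerSasUri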
In the Summary section, review all provisioning settings, bookmark the recommended documentation links,
and select Finish to start the creation of your integration runtime.
NOTE
Excluding any custom setup time, this process should finish within 5 minutes. But it might take 20-30 minutes for the
Azure-SSIS IR to join a virtual network.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB. It also configures
permissions and settings for your virtual network, if specified, and joins your Azure-SSIS IR to the virtual network.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.

Connections pane
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .

You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. Editing/deleting your
Azure-SSIS IR can only be done when it's stopped.
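The same monitor, stop, and start operations are also available from Azure PowerShell. A minimal sketch with placeholder names:

# Retrieve detailed status (nodes, errors, and so on) for your Azure-SSIS IR.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<your resource group name>" `
    -DataFactoryName "<your data factory name>" `
    -Name "<your Azure-SSIS IR name>" `
    -Status

# Stop the Azure-SSIS IR when you're not running packages to avoid unnecessary charges.
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<your resource group name>" `
    -DataFactoryName "<your data factory name>" `
    -Name "<your Azure-SSIS IR name>" `
    -Force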
Azure SSIS integration runtimes in the portal
1. In the Azure Data Factory UI, switch to the Manage tab and then switch to the Integration runtimes
tab on the Connections pane to view existing integration runtimes in your data factory.

2. Select New to create a new Azure-SSIS IR and open the Integration runtime setup pane.
3. In the Integration runtime setup pane, select the Lift-and-shift existing SSIS packages to
execute in Azure tile, and then select Continue .

4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure SSIS integration runtime
section.

Use Azure PowerShell to create an integration runtime


In this section, you use Azure PowerShell to create an Azure-SSIS IR.
Create variables
Copy and paste the following script. Specify values for the variables.

### Azure Data Factory info


# If your input contains a PSH special character like "$", precede it with the escape character "`" - for
example, "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
# Data factory name - must be globally unique
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime info - This is a Data Factory compute resource for running SSIS packages.
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, whereas Enterprise lets you use advanced features on
your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, whereas BasePrice lets you bring
your own on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid
Benefit option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported. For other nodes, up to (2 x
number of cores) are currently supported.
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "
[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22i
s.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.Integra
tionService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom
setup without script
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use an Azure SQL Database
server with IP firewall rules/virtual network service endpoints or a managed instance with private endpoint
to host SSISDB, or if you require access to on-premises data without configuring a self-hosted IR. We
recommend an Azure Resource Manager virtual network, because classic virtual networks will be deprecated
soon.
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Use the same subnet as the one used for your
Azure SQL Database server with virtual network service endpoints, or a different subnet from the one used
for your managed instance with a private endpoint
# Public IP address info: OPTIONAL to provide two standard static public IP addresses with DNS name under
the same subscription and in the same region as your virtual network
$FirstPublicIP = "[your first public IP address resource ID or leave it empty]"
$SecondPublicIP = "[your second public IP address resource ID or leave it empty]"

### SSISDB info


$SSISDBServerEndpoint = "[your Azure SQL Database server name.database.windows.net or managed instance
name.DNS prefix.database.windows.net or managed instance name.public.DNS prefix.database.windows.net,3342 or
leave it empty if you do not use SSISDB]" # WARNING: If you use SSISDB, ensure that there's no existing
SSISDB on your database server, so we can prepare and manage one on your behalf
# Authentication info: SQL or Azure AD
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for Azure
AD authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for Azure
AD authentication]"
# For the basic pricing tier, specify "Basic," not "B." For standard, premium, and elastic pool tiers,
specify "S0," "S1," "S2," "S3," etc. See https://docs.microsoft.com/azure/sql-database/sql-database-
resource-limits-database-server.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for Azure SQL Database server or leave it empty for managed instance]"

### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
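If you plan to set $FirstPublicIP and $SecondPublicIP, they expect the resource IDs of two standard, static public IP addresses with DNS names, created under the same subscription and in the same region as your virtual network. A minimal sketch with hypothetical names and a placeholder resource group follows; adjust it to your own naming and grouping.

# A sketch only: create two standard static public IP addresses with DNS name labels and capture their resource IDs.
$pip1 = New-AzPublicIpAddress -Name "ssisir-pip1" -ResourceGroupName $ResourceGroupName `
    -Location $AzureSSISLocation -Sku Standard -AllocationMethod Static -DomainNameLabel "ssisir-pip1-dns"
$pip2 = New-AzPublicIpAddress -Name "ssisir-pip2" -ResourceGroupName $ResourceGroupName `
    -Location $AzureSSISLocation -Sku Standard -AllocationMethod Static -DomainNameLabel "ssisir-pip2-dns"
$FirstPublicIP = $pip1.Id
$SecondPublicIP = $pip2.Id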
Sign in and select a subscription
Add the following script to sign in and select your Azure subscription.

Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Validate the connection to database server


Add the following script to validate your Azure SQL Database server or managed instance.

# Validate only if you use SSISDB and you don't use virtual network or Azure AD authentication
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
if([string]::IsNullOrEmpty($VnetId) -and [string]::IsNullOrEmpty($SubnetName))
{
if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you
want to proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}
}
}

Configure the virtual network


Add the following script to automatically configure virtual network permissions and settings for your Azure-
SSIS integration runtime to join.
# Make sure to run this script against the subscription to which the virtual network belongs
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine
Contributor" -Scope $VnetId
}
}

Create a resource group


Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a logical
container into which Azure resources are deployed and managed as a group.
If your resource group already exists, don't copy this code to your script.

New-AzResourceGroup -Location $DataFactoryLocation -Name $ResourceGroupName

Create a data factory


Run the following command to create a data factory.

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
                    -Location $DataFactoryLocation `
                    -Name $DataFactoryName

Create an integration runtime


Run the following commands to create an Azure-SSIS integration runtime that runs SSIS packages in Azure.
If you don't use SSISDB, you can omit the CatalogServerEndpoint , CatalogPricingTier , and
CatalogAdminCredential parameters.

If you don't use an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a
managed instance with private endpoint to host SSISDB, or require access to on-premises data, you can omit the
VNetId and Subnet parameters or pass empty values for them. You can also omit them if you configure a self-
hosted IR as proxy for your Azure-SSIS IR to access data on-premises. Otherwise, you can't omit them and must
pass valid values from your virtual network configuration. For more information, see Join an Azure-SSIS IR to a
virtual network.
If you use managed instance to host SSISDB, you can omit the CatalogPricingTier parameter or pass an empty
value for it. Otherwise, you can't omit it and must pass a valid value from the list of supported pricing tiers for
Azure SQL Database. For more information, see SQL Database resource limits.
If you use Azure AD authentication with the specified system/user-assigned managed identity for your data
factory to connect to the database server, you can omit the CatalogAdminCredential parameter. But you must
add the specified system/user-assigned managed identity for your data factory into an Azure AD group with
access permissions to the database server. For more information, see Enable Azure AD authentication for an
Azure-SSIS IR. Otherwise, you can't omit it and must pass a valid object formed from your server admin
username and password for SQL authentication.

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
-VnetId $VnetId `
-Subnet $SubnetName

# Add the CatalogServerEndpoint, CatalogPricingTier, and CatalogAdminCredential parameters if you use SSISDB
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogPricingTier $SSISDBPricingTier

    if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword)) # Add the CatalogAdminCredential parameter if you don't
use Azure AD authentication
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogAdminCredential $serverCreds
}
}

# Add custom setup parameters if you use standard/express custom setups


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
if(![string]::IsNullOrEmpty($ExpressCustomSetup))
{
if($ExpressCustomSetup -eq "RunCmdkey")
{
$addCmdkeyArgument = "YourFileShareServerName or YourAzureStorageAccountName.file.core.windows.net"
$userCmdkeyArgument = "YourDomainName\YourUsername or azure\YourAzureStorageAccountName"
$passCmdkeyArgument = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourPassword or YourAccessKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.CmdkeySetup($addCmdkeyArgument,
$userCmdkeyArgument, $passCmdkeyArgument)
}
if($ExpressCustomSetup -eq "SetEnvironmentVariable")
{
$variableName = "YourVariableName"
$variableValue = "YourVariableValue"
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.EnvironmentVariableSetup($variableName, $variableValue)
}
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
{
$moduleVersion = "YourAzModuleVersion"
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.AzPowerShellSetup($moduleVersion)
}
if($ExpressCustomSetup -eq "SentryOne.TaskFactory")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.SQLPhonetics.NET")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.HEDDA.IO")
{
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup)
}
if($ExpressCustomSetup -eq "KingswaySoft.IntegrationToolkit")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "KingswaySoft.ProductivityPack")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "Theobald.XtractIS")
{
$jsonData = Get-Content -Raw -Path YourLicenseFile.json
$jsonData = $jsonData -replace '\s',''
$jsonData = $jsonData.replace('"','\"')
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString($jsonData)
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "AecorSoft.IntegrationService")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Standard")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Extended")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
# Create an array of one or more express custom setups
$setups = New-Object System.Collections.ArrayList
$setups.Add($setup)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-ExpressCustomSetup $setups
}

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and !
[string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}

# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
$publicIPs = @($FirstPublicIP, $SecondPublicIP)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-PublicIPs $publicIPs
}

Start the integration runtime


Run the following commands to start the Azure-SSIS integration runtime.

write-host("##### Starting #####")


Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")
NOTE
Excluding any custom setup time, this process should finish within 5 minutes. But it might take 20-30 minutes for the
Azure-SSIS IR to join a virtual network.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB. It also configures
permissions and settings for your virtual network, if specified, and joins your Azure-SSIS IR to the virtual network.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.

Full script
Here's the full script that creates an Azure-SSIS integration runtime.

### Azure Data Factory info


# If your input contains a PSH special character like "$", precede it with the escape character "`" - for
example, "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
# Data factory name - must be globally unique
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime info - This is a Data Factory compute resource for running SSIS packages.
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, whereas Enterprise lets you use advanced features on
your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, whereas BasePrice lets you bring
your own on-premises SQL Server license with Software Assurance to earn cost savings from the Azure Hybrid
Benefit option
# For a Standard_D1_v2 node, up to four parallel executions per node are supported. For other nodes, up to
(2 x number of cores) are currently supported.
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "
[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22i
s.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.Integra
tionService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom
setup without script
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use an Azure SQL Database
server with IP firewall rules/virtual network service endpoints or a managed instance with private endpoint
to host SSISDB, or if you require access to on-premises data without configuring a self-hosted IR. We
recommend an Azure Resource Manager virtual network, because classic virtual networks will be deprecated
soon.
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Use the same subnet as the one used for your
Azure SQL Database server with virtual network service endpoints, or a different subnet from the one used
for your managed instance with a private endpoint
# Public IP address info: OPTIONAL to provide two standard static public IP addresses with DNS name under
the same subscription and in the same region as your virtual network
$FirstPublicIP = "[your first public IP address resource ID or leave it empty]"
$SecondPublicIP = "[your second public IP address resource ID or leave it empty]"

### SSISDB info


$SSISDBServerEndpoint = "[your Azure SQL Database server name.database.windows.net or managed instance
name.DNS prefix.database.windows.net or managed instance name.public.DNS prefix.database.windows.net,3342 or
leave it empty if you do not use SSISDB]" # WARNING: If you use SSISDB, ensure that there's no existing
SSISDB on your database server, so we can prepare and manage one on your behalf
# Authentication info: SQL or Azure AD
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for Azure
AD authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for Azure
AD authentication]"
# For the basic pricing tier, specify "Basic," not "B." For standard, premium, and elastic pool tiers,
specify "S0," "S1," "S2," "S3," etc. See https://docs.microsoft.com/azure/sql-database/sql-database-
resource-limits-database-server.
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for Azure SQL Database server or leave it empty for managed instance]"

### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access

### Sign in and select a subscription


Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

### Validate the connection to the database server


# Validate only if you use SSISDB and don't use a virtual network or Azure AD authentication
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
if([string]::IsNullOrEmpty($VnetId) -and [string]::IsNullOrEmpty($SubnetName))
{
if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword))
{
$SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" +
$SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
Try
{
$sqlConnection.Open();
}
Catch [System.Data.SqlClient.SqlException]
{
Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
Write-Warning "Please make sure the server you specified has already been created. Do you
want to proceed? [Y/N]"
$yn = Read-Host
if(!($yn -ieq "Y"))
{
Return;
}
}
}
}
}

### Configure a virtual network


# Make sure to run this script against the subscription to which the virtual network belongs
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
# Register to the Azure Batch resource provider
$BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
$BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
while(!(Get-AzResourceProvider -ProviderNamespace
"Microsoft.Batch").RegistrationState.Contains("Registered"))
{
Start-Sleep -s 10
}
if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
{
# Assign the VM contributor role to Microsoft.Batch
New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine
Contributor" -Scope $VnetId
}
}

### Create a data factory


Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
-Location $DataFactoryLocation `
-Name $DataFactoryName

### Create an integration runtime


Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Description $AzureSSISDescription `
-Type Managed `
-Location $AzureSSISLocation `
-NodeSize $AzureSSISNodeSize `
-NodeCount $AzureSSISNodeNumber `
-Edition $AzureSSISEdition `
-LicenseType $AzureSSISLicenseType `
-MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
-VnetId $VnetId `
-Subnet $SubnetName

# Add CatalogServerEndpoint, CatalogPricingTier, and CatalogAdminCredential parameters if you use SSISDB


if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogServerEndpoint $SSISDBServerEndpoint `
-CatalogPricingTier $SSISDBPricingTier

    if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and !
[string]::IsNullOrEmpty($SSISDBServerAdminPassword)) # Add the CatalogAdminCredential parameter if you don't
use Azure AD authentication
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-CatalogAdminCredential $serverCreds
}
}

# Add custom setup parameters if you use standard/express custom setups


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
if(![string]::IsNullOrEmpty($ExpressCustomSetup))
{
if($ExpressCustomSetup -eq "RunCmdkey")
{
$addCmdkeyArgument = "YourFileShareServerName or YourAzureStorageAccountName.file.core.windows.net"
$userCmdkeyArgument = "YourDomainName\YourUsername or azure\YourAzureStorageAccountName"
$passCmdkeyArgument = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourPassword or YourAccessKey")
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.CmdkeySetup($addCmdkeyArgument,
$userCmdkeyArgument, $passCmdkeyArgument)
}
if($ExpressCustomSetup -eq "SetEnvironmentVariable")
{
$variableName = "YourVariableName"
$variableValue = "YourVariableValue"
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.EnvironmentVariableSetup($variableName, $variableValue)
}
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
{
$moduleVersion = "YourAzModuleVersion"
$setup = New-Object Microsoft.Azure.Management.DataFactory.Models.AzPowerShellSetup($moduleVersion)
}
if($ExpressCustomSetup -eq "SentryOne.TaskFactory")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.SQLPhonetics.NET")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.HEDDA.IO")
{
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup)
}
if($ExpressCustomSetup -eq "KingswaySoft.IntegrationToolkit")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "KingswaySoft.ProductivityPack")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "Theobald.XtractIS")
{
$jsonData = Get-Content -Raw -Path YourLicenseFile.json
$jsonData = $jsonData -replace '\s',''
$jsonData = $jsonData.replace('"','\"')
$licenseKey = New-Object Microsoft.Azure.Management.DataFactory.Models.SecureString($jsonData)
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "AecorSoft.IntegrationService")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Standard")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Extended")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
# Create an array of one or more express custom setups
$setups = New-Object System.Collections.ArrayList
$setups.Add($setup)

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-ExpressCustomSetup $setups
}

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and !
[string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}

# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
$publicIPs = @($FirstPublicIP, $SecondPublicIP)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-PublicIPs $publicIPs
}

### Start the integration runtime


write-host("##### Starting #####")
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

write-host("##### Completed #####")


write-host("If any cmdlet is unsuccessful, please consider using -Debug option for diagnostics.")

Use an Azure Resource Manager template to create an integration runtime
In this section, you use an Azure Resource Manager template to create the Azure-SSIS integration runtime.
Here's a sample walkthrough:
1. Create a JSON file with the following Azure Resource Manager template. Replace values in the angle
brackets (placeholders) with your own values.

{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {},
"variables": {},
"resources": [{
"name": "<Specify a name for your data factory>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "East US",
"properties": {},
"resources": [{
"type": "integrationruntimes",
"name": "<Specify a name for your Azure-SSIS IR>",
"dependsOn": [ "<The name of the data factory you specified at the beginning>" ],
"apiVersion": "2018-06-01",
"properties": {
"type": "Managed",
"typeProperties": {
"computeProperties": {
"location": "East US",
"nodeSize": "Standard_D8_v3",
"numberOfNodes": 1,
"maxParallelExecutionsPerNode": 8
},
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "<Azure SQL Database server
name>.database.windows.net",
"catalogAdminUserName": "<Azure SQL Database server admin username>",
"catalogAdminPassword": {
"type": "SecureString",
"value": "<Azure SQL Database server admin password>"
},
"catalogPricingTier": "Basic"
}
}
}
}
}]
}]
}

2. To deploy the Azure Resource Manager template, run the New-AzResourceGroupDeployment command as
shown in the following example. In the example, ADFTutorialResourceGroup is the name of your resource
group. ADFTutorialARM.json is the file that contains the JSON definition for your data factory and the
Azure-SSIS IR.

New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile ADFTutorialARM.json

This command creates your data factory and Azure-SSIS IR in it, but it doesn't start the IR.
3. To start your Azure-SSIS IR, run the Start-AzDataFactoryV2IntegrationRuntime command:
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<Resource Group Name>" `
-DataFactoryName "<Data Factory Name>" `
-Name "<Azure SSIS IR Name>" `
-Force

NOTE
Excluding any custom setup time, this process should finish within 5 minutes. But it might take 20-30 minutes for the
Azure-SSIS IR to join a virtual network.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB. It also configures
permissions and settings for your virtual network, if specified, and joins your Azure-SSIS IR to the virtual network.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.

Deploy SSIS packages


If you use SSISDB, you can deploy your packages into it and run them on your Azure-SSIS IR by using the
Azure-enabled SSDT or SSMS tools. These tools connect to your database server via its server endpoint:
For an Azure SQL Database server, the server endpoint format is <server name>.database.windows.net .
For a managed instance with private endpoint, the server endpoint format is
<server name>.<dns prefix>.database.windows.net .
For a managed instance with public endpoint, the server endpoint format is
<server name>.public.<dns prefix>.database.windows.net,3342 .

If you don't use SSISDB, you can deploy your packages into the file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using the dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
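For example, a package on your local file system can be copied into an Azure Files package store with dtutil, assuming the machine has the SSIS tooling installed and you first cache the file share credentials with cmdkey. All names below are placeholders, so treat this as a sketch rather than the exact commands for your environment.

# Cache the Azure Files credentials so the UNC path is reachable.
cmdkey /add:<storage account name>.file.core.windows.net /user:azure\<storage account name> /pass:<storage account key>

# Copy a local package to the file share used as your package store.
dtutil /FILE "C:\SSIS\MyPackage.dtsx" /COPY FILE;"\\<storage account name>.file.core.windows.net\<file share name>\MyPackage.dtsx"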
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
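As a rough illustration, a pipeline with an Execute SSIS Package activity can also be defined in JSON and registered with Azure PowerShell. The sketch below assumes a package deployed to SSISDB and uses placeholder names; treat the exact property shape as an approximation and refer to Invoke SSIS package execution as a first-class Data Factory activity for the authoritative schema.

# A sketch only: register and run a pipeline that executes an SSIS package on your Azure-SSIS IR.
$pipelineJson = @'
{
    "name": "RunSsisPackagePipeline",
    "properties": {
        "activities": [
            {
                "name": "ExecuteMyPackage",
                "type": "ExecuteSSISPackage",
                "typeProperties": {
                    "connectVia": {
                        "referenceName": "<your Azure-SSIS IR name>",
                        "type": "IntegrationRuntimeReference"
                    },
                    "packageLocation": {
                        "type": "SSISDB",
                        "packagePath": "<folder name>/<project name>/<package name>.dtsx"
                    }
                }
            }
        ]
    }
}
'@
Set-Content -Path .\RunSsisPackagePipeline.json -Value $pipelineJson

Set-AzDataFactoryV2Pipeline -ResourceGroupName "<your resource group name>" `
    -DataFactoryName "<your data factory name>" `
    -Name "RunSsisPackagePipeline" `
    -DefinitionFile .\RunSsisPackagePipeline.json

Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "<your resource group name>" `
    -DataFactoryName "<your data factory name>" `
    -PipelineName "RunSsisPackagePipeline"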

Next steps
See other Azure-SSIS IR topics in this documentation:
Azure-SSIS integration runtime. This article provides information about integration runtimes in general,
including Azure-SSIS IR.
Monitor an Azure-SSIS IR. This article shows you how to retrieve and understand information about your
Azure-SSIS IR.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or delete your Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding more nodes.
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Connect to on-premises data sources with Windows authentication
Schedule package executions in Azure
Execute SSIS packages in Azure from SSDT

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes the feature of Azure-enabled SQL Server Integration Services (SSIS) projects on SQL
Server Data Tools (SSDT). It allows you to assess the cloud compatibility of your SSIS packages and run them on
Azure-SSIS Integration Runtime (IR) in Azure Data Factory (ADF). You can use this feature to test your existing
packages before you lift & shift/migrate them to Azure or to develop new packages to run in Azure.
With this feature, you can attach a newly created/existing Azure-SSIS IR to SSIS projects and then execute your
packages on it. We support running packages to be deployed into SSIS catalog (SSISDB) hosted by your Azure
SQL Database server or managed instance in Project Deployment Model. We also support running packages to
be deployed into file system/Azure Files/SQL Server database (MSDB) hosted by your Azure SQL managed
instance in Package Deployment Model.

Prerequisites
To use this feature, please download and install the latest SSDT with SSIS Projects extension for Visual Studio
(VS) from here. Alternatively, you can also download and install the latest SSDT as a standalone installer from
here.

Azure-enable SSIS projects


Creating new Azure-enabled SSIS projects
On SSDT, you can create new Azure-enabled SSIS projects using the Integration Services Project (Azure-
Enabled) template.
After the Azure-enabled project is created, you will be prompted to connect to SSIS in Azure Data Factory.

If you want to connect to your Azure-SSIS IR right away, see Connecting to Azure-SSIS IR for more details. You
can also connect later by right-clicking on your project node in the Solution Explorer window of SSDT to pop up
a menu. Next, select the Connect to SSIS in Azure Data Factory item in the SSIS in Azure Data Factory
submenu.
Azure-enabling existing SSIS projects
For existing SSIS projects, you can Azure-enable them by following these steps:
1. Right-click on your project node in the Solution Explorer window of SSDT to pop up a menu. Next, select
the Azure-Enabled Project item in the SSIS in Azure Data Factory submenu to launch the Azure-
Enabled Project Wizard .

2. On the Select Visual Studio Configuration page, select your existing VS configuration to apply
package execution settings in Azure. You can also create a new one if you haven't done so already; see
Creating a new VS configuration. We recommend that you have at least two different VS configurations
for package executions in the local and cloud environments, so you can Azure-enable your project against
the cloud configuration. In this way, if you've parameterized your project or packages, you can assign
different values to your project or package parameters at run-time based on the different execution
environments (either on your local machine or in Azure). For example, see Switching package execution
environments.
3. Azure-enabling your existing SSIS projects requires you to set their target server version to be the latest
one supported by Azure-SSIS IR. Azure-SSIS IR is currently based on SQL Server 2017 . Please ensure
that your packages don't contain additional components that are unsupported on SQL Server 2017.
Please also ensure that all compatible additional components have also been installed on your Azure-
SSIS IR via custom setups; see Customizing your Azure-SSIS IR. Select the Next button to continue.
4. See Connecting to Azure-SSIS IR to complete connecting your project to Azure-SSIS IR.

Connect Azure-enabled projects to SSIS in Azure Data Factory


By connecting your Azure-enabled projects to SSIS in ADF, you can upload your packages into Azure Files and
run them on Azure-SSIS IR. You can do so by following these steps:
1. On the SSIS in ADF Introduction page, review the introduction and select the Next button to continue.
2. On the Select SSIS IR in ADF page, select your existing ADF and Azure-SSIS IR to run packages. You
can also create new ones if you don't have any.
To select your existing Azure-SSIS IR, select the relevant Azure subscription and ADF first.
If you select your existing ADF that doesn't have any Azure-SSIS IR, select the Create SSIS IR button
to create a new one on ADF portal. Once created, you can return to this page to select your new
Azure-SSIS IR.
If you select your existing Azure subscription that doesn't have any ADF, select the Create SSIS IR
button to launch the Integration Runtime Creation Wizard . On the wizard, you can enter your
designated location and prefix for us to automatically create a new Azure Resource Group, Data
Factory, and SSIS IR on your behalf, named in the following pattern: YourPrefix-RG/DF/IR-
YourCreationTime . Once created, you can return to this page to select your new ADF and Azure-SSIS
IR.
3. On the Select Azure Storage page, select your existing Azure Storage account to upload packages into
Azure Files. You can also create a new one if you don't have any.
To select your existing Azure Storage account, select the relevant Azure subscription first.
If you select the same Azure subscription as your Azure-SSIS IR that doesn't have any Azure Storage
account, select the Create Azure Storage button. We'll automatically create a new one on your
behalf in the same location as your Azure-SSIS IR, named by combining a prefix of your Azure-SSIS IR
name and its creation date. Once created, you can return to this page to select your new Azure Storage
account.
If you select a different Azure subscription that doesn't have any Azure Storage account, select the
Create Azure Storage button to create a new one on Azure portal. Once created, you can return to
this page to select your new Azure Storage account.
4. Select the Connect button to complete connecting your project to Azure-SSIS IR. We'll display your
selected Azure-SSIS IR and Azure Storage account under the Linked Azure Resources node in Solution
Explorer window of SSDT. We'll also regularly refresh and display the status of your Azure-SSIS IR there.
You can manage your Azure-SSIS IR by right-clicking on its node to pop up a menu and then selecting the
Start\Stop\Manage item that takes you to the ADF portal to do so.

Assess SSIS project\packages for executions in Azure


Assessing single or multiple packages
Before executing your packages in Azure, you can assess them to surface any potential cloud compatibility
issues. These include migration blockers and additional information that you should be aware of.
You have the option to assess single packages one by one or all packages under your project at the same
time.
On the Assessment Report window of SSDT, you can find all potential cloud compatibility issues that
are surfaced, each with its own detailed description and recommendation. You can also export the
assessment report into a CSV file that can be shared with anyone who should mitigate these issues.

Suppressing assessment rules


Once you're sure that some potential cloud compatibility issues aren't applicable or have been properly
mitigated in your packages, you can suppress the relevant assessment rules that surface them. This will reduce
the noise in your subsequent assessment reports.
Select the Configure Assessment Rule Suppression link in the Assessment Report window of SSDT to
pop up the Assessment Rule Suppression Settings window, where you can select the assessment
rules to suppress.

Alternatively, right-click on your project node in the Solution Explorer window of SSDT to pop up a menu.
Select the Azure-Enabled Settings item in the SSIS in Azure Data Factory submenu to pop up a
window containing your project property pages. Select the Suppressed Assessment Rule IDs
property in Azure-Enabled Settings section. Finally, select its ellipsis (...) button to pop up the
Assessment Rule Suppression Settings window, where you can select the assessment rules to
suppress.
Execute SSIS packages in Azure
Configuring Azure -enabled settings
Before executing your packages in Azure, you can configure your Azure-enabled settings for them. For example,
you can enable Windows authentication on your Azure-SSIS IR to access on-premises/cloud data stores by
following these steps:
1. Right-click on your project node in the Solution Explorer window of SSDT to pop up a menu. Next, select
the Azure-Enabled Settings item in the SSIS in Azure Data Factory submenu to pop up a window
containing your project property pages.

2. Select the Enable Windows Authentication property in Azure-Enabled Settings section and then
select True in its dropdown menu. Next, select the Windows Authentication Credentials property
and then select its ellipsis (...) button to pop up the Windows Authentication Credentials window.

3. Enter your Windows authentication credentials. For example, to access Azure Files, you can enter Azure ,
YourStorageAccountName , and YourStorageAccountKey for Domain , Username , and Password ,
respectively.

Starting package executions


After connecting your Azure-enabled projects to SSIS in ADF, assessing their cloud compatibility, and mitigating
potential issues, you can execute/test your packages on Azure-SSIS IR.
Select the Start button in the SSDT toolbar to drop down a menu. Next, select the Execute in Azure item.

Alternatively, right-click on your package node in the Solution Explorer window of SSDT to pop up a
menu. Next, select the Execute Package in Azure item.

NOTE
Executing your packages in Azure requires a running Azure-SSIS IR, so if your Azure-SSIS IR is stopped, a
dialog window will pop up to start it. Excluding any custom setup time, this process should complete within 5
minutes, but could take approximately 20-30 minutes for an Azure-SSIS IR that joins a virtual network. After executing your
packages in Azure, you can stop your Azure-SSIS IR to manage its running cost by right-clicking on its node in the
Solution Explorer window of SSDT to pop up a menu and then selecting the Start\Stop\Manage item that takes you to
the ADF portal to do so.
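If you prefer to script the stop and start of your Azure-SSIS IR instead of using the ADF portal, the Az.DataFactory PowerShell module provides Start-AzDataFactoryV2IntegrationRuntime and Stop-AzDataFactoryV2IntegrationRuntime. The following is a minimal sketch; the resource group, data factory, and IR names are placeholders you'd replace with your own.

# Minimal sketch: stop the Azure-SSIS IR after test runs to pause billing, then
# start it again before the next run. All names below are placeholders.
Connect-AzAccount
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -Force
# Later, before executing packages in Azure again:
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -Force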

Using Execute Package Task


If your packages contain Execute Package Tasks that refer to child packages stored on local file systems, follow
these additional steps (a scripted upload sketch follows the list):
1. Upload the child packages into Azure Files under the same Azure Storage account connected to your
projects and get their new Universal Naming Convention (UNC) path, e.g.
\\YourStorageAccountName.file.core.windows.net\ssdtexecution\YourChildPackage1.dtsx

2. Replace the file path of those child packages in the File Connection Manager of Execute Package Tasks
with their new UNC path
If your local machine running SSDT can't access the new UNC path, you can enter it on the Properties
panel of File Connection Manager.
Alternatively, you can use a variable for the file path to assign the right value at run-time.
If your packages contain Execute Package Tasks that refer to child packages in the same project, no additional
step is necessary.
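As a hedged illustration of the upload in step 1, the following Azure PowerShell sketch copies a child package into the Azure file share used by Azure-enabled SSDT executions; the storage account name, key, and file names are placeholders.

# Minimal sketch: upload a child package to the Azure file share connected to
# your Azure-enabled project. Account name, key, and file names are placeholders.
$context = New-AzStorageContext -StorageAccountName "YourStorageAccountName" `
    -StorageAccountKey "YourStorageAccountKey"
Set-AzStorageFileContent -Context $context -ShareName "ssdtexecution" `
    -Source "C:\SSIS\YourChildPackage1.dtsx" -Path "YourChildPackage1.dtsx" -Force
# The resulting UNC path to use in the Execute Package Task is:
# \\YourStorageAccountName.file.core.windows.net\ssdtexecution\YourChildPackage1.dtsx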
Switching package protection level
Executing SSIS packages in Azure doesn't support EncryptSensitiveWithUserKey /EncryptAllWithUserKey
protection levels. Consequently, if your packages are configured to use those, we'll temporarily convert them
into using EncryptSensitiveWithPassword /EncryptAllWithPassword protection levels, respectively. We'll
also randomly generate encryption passwords when we upload your packages into Azure Files for executions
on your Azure-SSIS IR.

NOTE
If your packages contain Execute Package Tasks that refer to child packages configured to use
EncryptSensitiveWithUserKey /EncryptAllWithUserKey protection levels, you need to manually reconfigure those
child packages to use EncryptSensitiveWithPassword /EncryptAllWithPassword protection levels, respectively,
before executing your packages.

If your packages are already configured to use EncryptSensitiveWithPassword /EncryptAllWithPassword
protection levels, we'll keep them unchanged. We'll still randomly generate encryption passwords when we
upload your packages into Azure Files for executions on your Azure-SSIS IR.
Switching package execution environments
If you parameterize your project/packages in Project Deployment Model, you can create multiple VS
configurations to switch package execution environments. In this way, you can assign environment-specific
values to your project/package parameters at run-time. We recommend that you have at least two different VS
configurations for package executions in the local and cloud environments, so you can Azure-enable your
projects against the cloud configuration. Here's a step-by-step example of switching package execution
environments between your local machine and Azure:
1. Let's say your package contains a File System Task that sets the attributes of a file. When you run it on
your local machine, it sets the attributes of a file stored on your local file system. When you run it on your
Azure-SSIS IR, you want it to set the attributes of a file stored in Azure Files. First, create a package
parameter of string type and name it FilePath to hold the value of target file path.

2. Next, on the General page of File System Task Editor window, parameterize the SourceVariable
property in Source Connection section with the FilePath package parameter.
3. By default, you have an existing VS configuration for package executions in the local environment named
Development . Create a new VS configuration for package executions in the cloud environment named
Azure (see Creating a new VS configuration) if you haven't done so already.
4. When viewing the parameters of your package, select the Add Parameters to Configurations button
to open the Manage Parameter Values window for your package. Next, assign different values of
target file path to the FilePath package parameter under the Development and Azure configurations.

5. Azure-enable your project against the cloud configuration (see Azure-enabling existing SSIS projects) if
you haven't done so already. Next, configure Azure-enabled settings to enable Windows authentication
for your Azure-SSIS IR to access Azure Files (see Configuring Azure-enabled settings) if you haven't done
so already.
6. Execute your package in Azure. You can switch your package execution environment back to your local
machine by selecting the Development configuration.

Using package configuration file


If you use package configuration files in Package Deployment Model, you can assign environment-specific
values to your package properties at run-time. We'll automatically upload those files with your packages into
Azure Files for executions on your Azure-SSIS IR.
Checking package execution logs
After starting your package execution, we'll format and display its logs in the Progress window of SSDT. For a
long-running package, we'll update its logs every minute. You can cancel your package execution immediately by
selecting the Stop button in the SSDT toolbar. You can also temporarily find the raw data of its logs in the
following UNC path, but we'll clean it up after one day:
\\<YourStorageAccountName>.file.core.windows.net\ssdtexecution\<YourProjectName-FirstConnectTime>\<YourPackageName-tmp-ExecutionTime>\logs
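If you want to grab the raw log files programmatically before the one-day cleanup, a hedged sketch using the Az.Storage module might look like the following; the storage account name, key, folder names, and local destination are placeholders.

# Minimal sketch: download the raw SSDT execution logs from the ssdtexecution
# file share before they're cleaned up. All names below are placeholders.
$context = New-AzStorageContext -StorageAccountName "YourStorageAccountName" `
    -StorageAccountKey "YourStorageAccountKey"
New-Item -ItemType Directory -Path "C:\Temp\SsisLogs" -Force | Out-Null
Get-AzStorageFile -Context $context -ShareName "ssdtexecution" `
    -Path "YourProjectName-FirstConnectTime/YourPackageName-tmp-ExecutionTime/logs" |
    Get-AzStorageFile |
    Get-AzStorageFileContent -Destination "C:\Temp\SsisLogs" -Force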

Current limitations
The Azure-enabled SSDT supports only commercial/global cloud regions and doesn't support
governmental/national cloud regions for now.

Next steps
Once you're satisfied with running your packages in Azure from SSDT, you can deploy and run them as Execute
SSIS Package activities in ADF pipelines. See Running SSIS packages as Execute SSIS Package activities in ADF
pipelines.
Run SSIS packages by using Azure SQL Managed
Instance Agent
4/22/2021 • 5 minutes to read

This article describes how to run a SQL Server Integration Services (SSIS) package by using Azure SQL
Managed Instance Agent. This feature provides behaviors that are similar to when you schedule SSIS packages
by using SQL Server Agent in your on-premises environment.
With this feature, you can run SSIS packages that are stored in SSISDB in a SQL Managed Instance, a file system
like Azure Files, or an Azure-SSIS integration runtime package store.

Prerequisites
To use this feature, download and install the latest SQL Server Management Studio (SSMS). Version support
details are as follows:
To run packages in SSISDB or file system, install SSMS version 18.5 or above.
To run packages in package store, install SSMS version 18.6 or above.
You also need to provision an Azure-SSIS integration runtime in Azure Data Factory that uses a SQL Managed
Instance as its endpoint server.

Run an SSIS package in SSISDB


In this procedure, you use SQL Managed Instance Agent to invoke an SSIS package that's stored in SSISDB.
1. In the latest version of SSMS, connect to a SQL Managed Instance.
2. Create a new agent job and a new job step. Under SQL Server Agent , right-click the Jobs folder, and
then select New Job .
3. On the New Job Step page, select SQL Server Integration Services Package as the type.

4. On the Package tab, select SSIS Catalog as the package location.


5. Because SSISDB is in a SQL Managed Instance, you don't need to specify authentication.
6. Specify an SSIS package from SSISDB.
7. On the Configuration tab, you can:
Specify parameter values under Parameters .
Override values under Connection Managers .
Override the property and choose the logging level under Advanced .
8. Select OK to save the agent job configuration.
9. Start the agent job to run the SSIS package.
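If you prefer to start the job from a script rather than the SSMS UI, a hedged sketch using msdb.dbo.sp_start_job through the SqlServer PowerShell module is shown below; the SQL Managed Instance endpoint, credentials, and job name are placeholders.

# Minimal sketch: start the agent job that runs your SSIS package.
# The endpoint, credentials, and job name are placeholders.
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance "your-sqlmi-name.public.xxxxxx.database.windows.net,3342" `
    -Database "msdb" -Username "yourAdminLogin" -Password "yourPassword" `
    -Query "EXEC dbo.sp_start_job @job_name = N'RunMySsisPackage';"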

Run an SSIS package in the file system


In this procedure, you use SQL Managed Instance Agent to run an SSIS package that's stored in the file system.
1. In the latest version of SSMS, connect to a SQL Managed Instance.
2. Create a new agent job and a new job step. Under SQL Server Agent , right-click the Jobs folder, and
then select New Job .
3. On the New Job Step page, select SQL Server Integration Services Package as the type.

4. On the Package tab:


a. For Package location , select File system .
b. For File source type :
If your package is uploaded to Azure Files, select Azure file share .

The package path is \\<storage account name>.file.core.windows.net\<file share name>\<package name>.dtsx .
Under Package file access credential , enter the Azure storage account name and account key
to access the Azure file share. The domain is set as Azure .
If your package is uploaded to a network share, select Network share .
The package path is the UNC path of your package file with its .dtsx extension.
Enter the corresponding domain, username, and password to access the network share
package file.
c. If your package file is encrypted with a password, select Encryption password and enter the
password.
5. On the Configurations tab, enter the configuration file path if you need a configuration file to run the
SSIS package. If you store your configuration in Azure Files, its configuration path will be
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .

6. On the Execution options tab, you can choose whether to use Windows authentication or 32-bit
runtime to run the SSIS package.
7. On the Logging tab, you can choose the logging path and corresponding logging access credential to
store the log files. By default, the logging path is the same as the package folder path, and the logging
access credential is the same as the package access credential. If you store your logs in Azure Files, your
logging path will be \\<storage account name>.file.core.windows.net\<file share name>\<log folder name>.
8. On the Set values tab, you can enter the property path and value to override the package properties.
For example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .

9. Select OK to save the agent job configuration.


10. Start the agent job to run the SSIS package.

Run an SSIS package in the package store


In this procedure, you use SQL Managed Instance Agent to run an SSIS package that's stored in the Azure-SSIS
IR package store.
1. In the latest version of SSMS, connect to a SQL Managed Instance.
2. Create a new agent job and a new job step. Under SQL Server Agent , right-click the Jobs folder, and
then select New Job .

3. On the New Job Step page, select SQL Server Integration Services Package as the type.
4. On the Package tab:
a. For Package location , select Package Store .
b. For Package path :
The package path is <package store name>\<folder name>\<package name> .
c. If your package file is encrypted with a password, select Encryption password and enter the
password.
5. On the Configurations tab, enter the configuration file path if you need a configuration file to run the
SSIS package. If you store your configuration in Azure Files, its configuration path will be
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .

6. On the Execution options tab, you can choose whether to use Windows authentication or 32-bit
runtime to run the SSIS package.
7. On the Logging tab, you can choose the logging path and corresponding logging access credential to
store the log files. By default, the logging path is the same as the package folder path, and the logging
access credential is the same as the package access credential. If you store your logs in Azure Files, your
logging path will be \\<storage account name>.file.core.windows.net\<file share name>\<log folder name>.
8. On the Set values tab, you can enter the property path and value to override the package properties.
For example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .

9. Select OK to save the agent job configuration.


10. Start the agent job to run the SSIS package.

Cancel SSIS package execution


To cancel package execution from a SQL Managed Instance Agent job, take the following steps instead of directly
stopping the agent job:
1. Find your SQL agent jobId from msdb.dbo.sysjobs .
2. Find the corresponding SSIS executionId based on the job ID, by using this query:

select * from '{table for job execution}' where parameter_value = 'SQL_Agent_Job_{jobId}' order by
execution_id desc

If your SSIS packages are in SSISDB, then use ssisdb.internal.execution_parameter_values as the table
for job execution. If your SSIS packages are in the file system, then use
ssisdb.internal.execution_parameter_values_noncatalog . A scripted end-to-end example follows these steps.
3. Right-click the SSISDB catalog, and then select Active Operations .

4. Stop the corresponding operation based on executionId .
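A scripted version of this cancellation flow, under the assumption that your packages run from SSISDB, might look like the following hedged sketch; it uses the SqlServer PowerShell module and the catalog.stop_operation stored procedure, and all server names, credentials, and the job name are placeholders.

# Minimal sketch: find the SSIS execution started by a given agent job and stop it.
# Assumes packages are in SSISDB; all names and credentials are placeholders.
Import-Module SqlServer
$conn = @{
    ServerInstance = "your-sqlmi-name.public.xxxxxx.database.windows.net,3342"
    Username = "yourAdminLogin"
    Password = "yourPassword"
}
# 1. Find the agent job ID by name.
$jobId = (Invoke-Sqlcmd @conn -Database "msdb" `
    -Query "SELECT job_id FROM dbo.sysjobs WHERE name = N'RunMySsisPackage';").job_id
# 2. Find the most recent SSIS execution created by that job.
$execId = (Invoke-Sqlcmd @conn -Database "SSISDB" -Query "
    SELECT TOP 1 execution_id FROM internal.execution_parameter_values
    WHERE parameter_value = 'SQL_Agent_Job_$jobId' ORDER BY execution_id DESC;").execution_id
# 3. Stop the corresponding operation in the SSIS catalog.
Invoke-Sqlcmd @conn -Database "SSISDB" -Query "EXEC catalog.stop_operation @operation_id = $execId;"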

Next steps
You can also schedule SSIS packages by using Azure Data Factory. For step-by-step instructions, see Azure Data
Factory event trigger.
Run SQL Server Integration Services packages with
the Azure-enabled dtexec utility
3/5/2021 • 6 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes the Azure-enabled dtexec (AzureDTExec) command prompt utility. It's used to run SQL
Server Integration Services (SSIS) packages on the Azure-SSIS Integration Runtime (IR) in Azure Data Factory.
The traditional dtexec utility comes with SQL Server. For more information, see dtexec utility. It's often invoked
by third-party orchestrators or schedulers, such as ActiveBatch and Control-M, to run SSIS packages on-
premises.
The modern AzureDTExec utility comes with SQL Server Management Studio (SSMS). It can also be
invoked by third-party orchestrators or schedulers to run SSIS packages in Azure. It facilitates the lifting and
shifting or migration of your SSIS packages to the cloud. After migration, if you want to keep using third-party
orchestrators or schedulers in your day-to-day operations, they can now invoke AzureDTExec instead of dtexec.
AzureDTExec runs your packages as Execute SSIS Package activities in Data Factory pipelines. For more
information, see Run SSIS packages as Azure Data Factory activities.
AzureDTExec can be configured via SSMS to use an Azure Active Directory (Azure AD) application that generates
pipelines in your data factory. It can also be configured to access file systems, file shares, or Azure Files where
you store your packages. Based on the values you give for its invocation options, AzureDTExec generates and
runs a unique Data Factory pipeline with an Execute SSIS Package activity in it. Invoking AzureDTExec with the
same values for its options reruns the existing pipeline.

Prerequisites
To use AzureDTExec, download and install the latest version of SSMS, which is version 18.3 or later. Download it
from this website.

Configure the AzureDTExec utility


Installing SSMS on your local machine also installs AzureDTExec. To configure its settings, start SSMS with the
Run as administrator option. Then select Tools > Migrate to Azure > Configure Azure-enabled DTExec .

This action opens an AzureDTExecConfig window that needs to be opened with administrative privileges for it
to write into the AzureDTExec.settings file. If you haven't run SSMS as an administrator, a User Account Control
(UAC) window opens. Enter your admin password to elevate your privileges.

In the AzureDTExecConfig window, enter your configuration settings as follows:


ApplicationId : Enter the unique identifier of the Azure AD app that you create with the right permissions to
generate pipelines in your data factory. For more information, see Create an Azure AD app and service
principal via Azure portal.
AuthenticationKey : Enter the authentication key for your Azure AD app.
TenantId : Enter the unique identifier of the Azure AD tenant, under which your Azure AD app is created.
DataFactory : Enter the name of your data factory in which unique pipelines with Execute SSIS Package
activity in them are generated based on the values of options provided when you invoke AzureDTExec.
IRName : Enter the name of the Azure-SSIS IR in your data factory, on which the packages specified in their
Universal Naming Convention (UNC) path will run when you invoke AzureDTExec.
PipelineNameHashStrLen : Enter the length of hash strings to be generated from the values of options you
provide when you invoke AzureDTExec. The strings are used to form unique names for Data Factory pipelines
that run your packages on the Azure-SSIS IR. Usually a length of 32 characters is sufficient.
ResourceGroup : Enter the name of the Azure resource group in which your data factory was created.
SubscriptionId : Enter the unique identifier of the Azure subscription, under which your data factory was
created.
LogAccessDomain : Enter the domain credential to access your log folder in its UNC path when you write
log files, which is required when LogPath is specified and LogLevel isn't null .
LogAccessPassword : Enter the password credential to access your log folder in its UNC path when you
write log files, which is required when LogPath is specified and LogLevel isn't null .
LogAccessUserName : Enter the username credential to access your log folder in its UNC path when you
write log files, which is required when LogPath is specified and LogLevel isn't null .
LogLevel : Enter the selected scope of logging from predefined null , Basic , Verbose , or Performance
options for your package executions on the Azure-SSIS IR.
LogPath : Enter the UNC path of the log folder, into which log files from your package executions on the
Azure-SSIS IR are written.
PackageAccessDomain : Enter the domain credential to access your packages in their UNC path that's
specified when you invoke AzureDTExec.
PackageAccessPassword : Enter the password credential to access your packages in their UNC path that's
specified when you invoke AzureDTExec.
PackageAccessUserName : Enter the username credential to access your packages in their UNC path that's
specified when you invoke AzureDTExec.
To store your packages and log files in file systems or file shares on-premises, join your Azure-SSIS IR to a
virtual network connected to your on-premises network so that it can fetch your packages and write your log
files. For more information, see Join an Azure-SSIS IR to a virtual network.
To avoid showing sensitive values written into the AzureDTExec.settings file in plain text, we encode them into
strings of Base64 encoding. When you invoke AzureDTExec, all Base64-encoded strings are decoded back into
their original values. You can further secure the AzureDTExec.settings file by limiting the accounts that can access
it.
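For example, a hedged sketch that tightens NTFS permissions on the settings file might look like this; the assumption that AzureDTExec.settings sits next to AzureDTExec.exe under the default SSMS 18 installation path, and the specific account list, are only illustrations.

# Minimal sketch: restrict AzureDTExec.settings so that only Administrators and
# SYSTEM can read it. The file location below is an assumption (next to
# AzureDTExec.exe in a default SSMS 18 install); adjust it for your machine.
$settings = "C:\Program Files (x86)\Microsoft SQL Server Management Studio 18\Common7\IDE\CommonExtensions\Microsoft\SSIS\150\Binn\AzureDTExec.settings"
icacls $settings /inheritance:r /grant:r "BUILTIN\Administrators:(R)" "NT AUTHORITY\SYSTEM:(R)"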

Invoke the AzureDTExec utility


You can invoke AzureDTExec at the command-line prompt and provide the relevant values for specific options in
your use-case scenario.
The utility is installed at {SSMS Folder}\Common7\IDE\CommonExtensions\Microsoft\SSIS\150\Binn . You can add its
path to the 'PATH' environment variable for it to be invoked from anywhere.
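For example, a hedged PowerShell sketch that appends the default SSMS 18 installation folder to your user-level PATH might look like this; adjust the folder if your SSMS version or install location differs.

# Minimal sketch: add the AzureDTExec folder to the user-scoped PATH.
# The folder below assumes a default SSMS 18 installation; adjust as needed.
$azureDtexecDir = "C:\Program Files (x86)\Microsoft SQL Server Management Studio 18\Common7\IDE\CommonExtensions\Microsoft\SSIS\150\Binn"
[Environment]::SetEnvironmentVariable(
    "PATH",
    ([Environment]::GetEnvironmentVariable("PATH", "User") + ";" + $azureDtexecDir),
    "User")
# Consoles opened after this change can invoke AzureDTExec.exe from anywhere.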

> cd "C:\Program Files (x86)\Microsoft SQL Server Management Studio


18\Common7\IDE\CommonExtensions\Microsoft\SSIS\150\Binn"
> AzureDTExec.exe ^
/F \\MyStorageAccount.file.core.windows.net\MyFileShare\MyPackage.dtsx ^
/Conf \\MyStorageAccount.file.core.windows.net\MyFileShare\MyConfig.dtsConfig ^
/Conn "MyConnectionManager;Data Source=MyDatabaseServer.database.windows.net;User
ID=MyAdminUsername;Password=MyAdminPassword;Initial Catalog=MyDatabase" ^
/Set \package.variables[MyVariable].Value;MyValue ^
/De MyEncryptionPassword

Invoking AzureDTExec offers similar options as invoking dtexec. For more information, see dtexec Utility. Here
are the options that are currently supported:
/F[ile] : Loads a package that's stored in file system, file share, or Azure Files. As the value for this option, you
can specify the UNC path for your package file in file system, file share, or Azure Files with its .dtsx extension.
If the UNC path specified contains any space, put quotation marks around the whole path.
/Conf[igFile] : Specifies a configuration file to extract values from. Using this option, you can set a run-time
configuration for your package that differs from the one specified at design time. You can store different
settings in an XML configuration file and then load them before your package execution. For more
information, see SSIS package configurations. To specify the value for this option, use the UNC path for your
configuration file in file system, file share, or Azure Files with its dtsConfig extension. If the UNC path
specified contains any space, put quotation marks around the whole path.
/Conn[ection] : Specifies connection strings for existing connection managers in your package. Using this
option, you can set run-time connection strings for existing connection managers in your package that differ
from the ones specified at design time. Specify the value for this option as follows:
connection_manager_name_or_id;connection_string [[;connection_manager_name_or_id;connection_string]...] .
/Set : Overrides the configuration of a parameter, variable, property, container, log provider, Foreach
enumerator, or connection in your package. This option can be specified multiple times. Specify the value for
this option as follows: property_path;value . For example, \package.variables[counter].Value;1 overrides the
value of counter variable as 1. You can use the Package Configuration wizard to find, copy, and paste the
value of property_path for items in your package whose value you want to override. For more information,
see Package Configuration wizard.
/De[crypt] : Sets the decryption password for your package that's configured with the
EncryptAllWithPassword /EncryptSensitiveWithPassword protection level.

NOTE
Invoking AzureDTExec with new values for its options generates a new pipeline, except for the option /De[crypt] .

Next steps
After unique pipelines with the Execute SSIS Package activity in them are generated and run when you invoke
AzureDTExec, they can be monitored on the Data Factory portal. You can also assign Data Factory triggers to
them if you want to orchestrate/schedule them using Data Factory. For more information, see Run SSIS
packages as Data Factory activities.

WARNING
The generated pipeline is expected to be used only by AzureDTExec. Its properties or parameters might change in the
future, so don't modify or reuse them for any other purposes. Modifications might break AzureDTExec. If this happens,
delete the pipeline. AzureDTExec generates a new pipeline the next time it's invoked.
Run an SSIS package with the Execute SSIS Package
activity in Azure Data Factory
7/2/2021 • 30 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to run a SQL Server Integration Services (SSIS) package in an Azure Data Factory
pipeline by using the Execute SSIS Package activity.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Create an Azure-SSIS integration runtime (IR) if you don't have one already by following the step-by-step
instructions in the Tutorial: Provisioning Azure-SSIS IR.

Run a package in the Azure portal


In this section, you use the Data Factory user interface (UI) or app to create a Data Factory pipeline with an
Execute SSIS Package activity that runs your SSIS package.
Create a pipeline with an Execute SSIS Package activity
In this step, you use the Data Factory UI or app to create a pipeline. You add an Execute SSIS Package activity to
the pipeline and configure it to run your SSIS package.
1. On your Data Factory overview or home page in the Azure portal, select the Author & Monitor tile to
start the Data Factory UI or app in a separate tab.
On the home page, select Orchestrate .

2. In the Activities toolbox, expand General . Then drag an Execute SSIS Package activity to the pipeline
designer surface.
Select the Execute SSIS Package activity object to configure its General , Settings , SSIS Parameters ,
Connection Managers , and Property Overrides tabs.
General tab
On the General tab of Execute SSIS Package activity, complete the following steps.

1. For Name , enter the name of your Execute SSIS Package activity.
2. For Description , enter the description of your Execute SSIS Package activity.
3. For Timeout , enter the maximum amount of time your Execute SSIS Package activity can run. The default is 7
days; the format is D.HH:MM:SS.
4. For Retry , enter the maximum number of retry attempts for your Execute SSIS Package activity.
5. For Retry interval , enter the number of seconds between each retry attempt for your Execute SSIS
Package activity. The default is 30 seconds.
6. Select the Secure output check box to choose whether you want to exclude the output of your Execute
SSIS Package activity from logging.
7. Select the Secure input check box to choose whether you want to exclude the input of your Execute SSIS
Package activity from logging.
Settings tab
On the Settings tab of Execute SSIS Package activity, complete the following steps.
1. For Azure-SSIS IR , select the designated Azure-SSIS IR to run your Execute SSIS Package activity.
2. For Description , enter the description of your Execute SSIS Package activity.
3. Select the Windows authentication check box to choose whether you want to use Windows
authentication to access data stores, such as SQL servers/file shares on-premises or Azure Files.
If you select this check box, enter the values for your package execution credentials in the Domain ,
Username , and Password boxes. For example, to access Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .

4. Select the 32-Bit runtime check box to choose whether your package needs 32-bit runtime to run.
5. For Package location , select SSISDB , File System (Package) , File System (Project) , Embedded
package , or Package store .
Package location: SSISDB

SSISDB as your package location is automatically selected if your Azure-SSIS IR was provisioned with an SSIS
catalog (SSISDB) hosted by Azure SQL Database server/Managed Instance or you can select it yourself. If it's
selected, complete the following steps.
1. If your Azure-SSIS IR is running and the Manual entries check box is cleared, browse and select your
existing folders, projects, packages, and environments from SSISDB. Select Refresh to fetch your newly
added folders, projects, packages, or environments from SSISDB, so that they're available for browsing
and selection. To browse and select the environments for your package executions, you must configure
your projects beforehand to add those environments as references from the same folders under SSISDB.
For more information, see Create and map SSIS environments.
2. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
3. If your Azure-SSIS IR isn't running or the Manual entries check box is selected, enter your package and
environment paths from SSISDB directly in the following formats:
<folder name>/<project name>/<package name>.dtsx and <folder name>/<environment name> .

Package location: File System (Package)

File System (Package) as your package location is automatically selected if your Azure-SSIS IR was
provisioned without SSISDB or you can select it yourself. If it's selected, complete the following steps.
1. Specify your package to run by providing a Universal Naming Convention (UNC) path to your package
file (with .dtsx ) in the Package path box. You can browse and select your package by selecting Browse
file storage or enter its path manually. For example, if you store your package in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<package name>.dtsx .

2. If you configure your package in a separate file, you also need to provide a UNC path to your
configuration file (with .dtsConfig ) in the Configuration path box. You can browse and select your
configuration by selecting Browse file storage or enter its path manually. For example, if you store your
configuration in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .

3. Specify the credentials to access your package and configuration files. If you previously entered the
values for your package execution credentials (for Windows authentication ), you can reuse them by
selecting the Same as package execution credentials check box. Otherwise, enter the values for your
package access credentials in the Domain , Username , and Password boxes. For example, if you store
your package and configuration in Azure Files, the domain is Azure , the username is
<storage account name> , and the password is <storage account key> .

Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .

These credentials are also used to access your child packages in Execute Package Task that are referenced
by their own path and other configurations specified in your packages.
4. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SQL Server Data Tools (SSDT), enter the value for your password in the
Encryption password box. Alternatively, you can use a secret stored in your Azure Key Vault as its value
(see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
If you used the EncryptAllWithUserKey protection level, it's unsupported. You need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
5. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
6. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
7. Specify the credentials to access your log folder. If you previously entered the values for your package
access credentials (see above), you can reuse them by selecting the Same as package access
credentials check box. Otherwise, enter the values for your logging access credentials in the Domain ,
Username , and Password boxes. For example, if you store your logs in Azure Files, the domain is Azure
, the username is <storage account name> , and the password is <storage account key> . Alternatively, you
can use secrets stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: File System (Project)

If you select File System (Project) as your package location, complete the following steps.
1. Specify your package to run by providing a UNC path to your project file (with .ispac ) in the Project
path box and a package file (with .dtsx ) from your project in the Package name box. You can browse
and select your project by selecting Browse file storage or enter its path manually. For example, if you
store your project in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<project name>.ispac .

2. Specify the credentials to access your project and package files. If you previously entered the values for
your package execution credentials (for Windows authentication ), you can reuse them by selecting the
Same as package execution credentials check box. Otherwise, enter the values for your package
access credentials in the Domain , Username , and Password boxes. For example, if you store your
project and package in Azure Files, the domain is Azure , the username is <storage account name> , and
the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .

These credentials are also used to access your child packages in Execute Package Task that are referenced
from the same project.
3. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value (see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values on the
SSIS Parameters , Connection Managers , or Property Overrides tabs (see below).
If you used the EncryptAllWithUserKey protection level, it's unsupported. You need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
4. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
5. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
6. Specify the credentials to access your log folder. If you previously entered the values for your package
access credentials (see above), you can reuse them by selecting the Same as package access
credentials check box. Otherwise, enter the values for your logging access credentials in the Domain ,
Username , and Password boxes. For example, if you store your logs in Azure Files, the domain is Azure
, the username is <storage account name> , and the password is <storage account key> . Alternatively, you
can use secrets stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: Embedded package

If you select Embedded package as your package location, complete the following steps.
1. Drag and drop your package file (with .dtsx ) or Upload it from a file folder into the box provided. Your
package will be automatically compressed and embedded in the activity payload. Once embedded, you
can Download your package later for editing. You can also Parameterize your embedded package by
assigning it to a pipeline parameter that can be used in multiple activities, hence optimizing the size of
your pipeline payload. Embedding project files (with .ispac ) is currently unsupported, so you can't use
SSIS parameters/connection managers with project-level scope in your embedded packages.
2. If your embedded package is not all encrypted and we detect the use of Execute Package Task (EPT) in it,
the Execute Package Task check box will be automatically selected and your child packages that are
referenced by their file system path will be automatically added, so you can also embed them.
If we can't detect the use of EPT, you need to manually select the Execute Package Task check box and
add your child packages that are referenced by their file system path one by one, so you can also embed
them. If your child packages are stored in SQL Server database (MSDB), you can't embed them, so you
need to ensure that your Azure-SSIS IR can access MSDB to fetch them using their SQL Server references.
Embedding project files (with .ispac ) is currently unsupported, so you can't use project-based
references for your child packages.
3. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value. To do so, select the AZURE
KEY VAULT check box next to it. Select or edit your existing key vault linked service or create a new one.
Then select the secret name and version for your value. When you create or edit your key vault linked
service, you can select or edit your existing key vault or create a new one. Make sure to grant Data
Factory managed identity access to your key vault if you haven't done so already. You can also enter your
secret directly in the following format: <key vault linked service name>/<secret name>/<secret version> .
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
If you used the EncryptAllWithUserKey protection level, it's unsupported. You need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
4. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
5. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
6. Specify the credentials to access your log folder by entering their values in the Domain , Username , and
Password boxes. For example, if you store your logs in Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> . Alternatively, you can use secrets
stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: Package store

If you select Package store as your package location, complete the following steps.

1. For Package store name , select an existing package store that's attached to your Azure-SSIS IR.
2. Specify your package to run by providing its path (without .dtsx ) from the selected package store in the
Package path box. If the selected package store is on top of file system/Azure Files, you can browse and
select your package by selecting Browse file storage , otherwise you can enter its path in the format of
<folder name>\<package name> . You can also import new packages into the selected package store via SQL
Server Management Studio (SSMS) similar to the legacy SSIS package store. For more information, see
Manage SSIS packages with Azure-SSIS IR package stores.
3. If you configure your package in a separate file, you need to provide a UNC path to your configuration
file (with .dtsConfig ) in the Configuration path box. You can browse and select your configuration by
selecting Browse file storage or enter its path manually. For example, if you store your configuration in
Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .

4. Select the Configuration access credentials check box to choose whether you want to specify the
credentials to access your configuration file separately. This is needed when the selected package store is
on top of SQL Server database (MSDB) hosted by your Azure SQL Managed Instance or doesn't also store
your configuration file.
If you previously entered the values for your package execution credentials (for Windows
authentication ), you can reuse them by selecting the Same as package execution credentials check
box. Otherwise, enter the values for your configuration access credentials in the Domain , Username ,
and Password boxes. For example, if you store your configuration in Azure Files, the domain is Azure ,
the username is <storage account name> , and the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .

5. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value (see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
If you used the EncryptAllWithUserKey protection level, it's unsupported. You need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
6. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
7. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
8. Specify the credentials to access your log folder by entering their values in the Domain , Username , and
Password boxes. For example, if you store your logs in Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> . Alternatively, you can use secrets
stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
SSIS Parameters tab
On the SSIS Parameters tab of Execute SSIS Package activity, complete the following steps.
1. If your Azure-SSIS IR is running, SSISDB is selected as your package location, and the Manual entries
check box on the Settings tab is cleared, the existing SSIS parameters in your selected project and
package from SSISDB are displayed for you to assign values to them. Otherwise, you can enter them one
by one to assign values to them manually. Make sure that they exist and are correctly entered for your
package execution to succeed.
2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive parameters to assign values to
them on this tab.
When you assign values to your parameters, you can add dynamic content by using expressions, functions, Data
Factory system variables, and Data Factory pipeline parameters or variables.
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the AZURE KEY
VAULT check box next to them. Select or edit your existing key vault linked service or create a new one. Then
select the secret name and version for your value. When you create or edit your key vault linked service, you can
select or edit your existing key vault or create a new one. Make sure to grant Data Factory managed identity
access to your key vault if you haven't done so already. You can also enter your secret directly in the following
format: <key vault linked service name>/<secret name>/<secret version> .
Connection Managers tab
On the Connection Managers tab of Execute SSIS Package activity, complete the following steps.
1. If your Azure-SSIS IR is running, SSISDB is selected as your package location, and the Manual entries
check box on the Settings tab is cleared, the existing connection managers in your selected project and
package from SSISDB are displayed for you to assign values to their properties. Otherwise, you can enter
them one by one to assign values to their properties manually. Make sure that they exist and are correctly
entered for your package execution to succeed.
You can obtain the correct SCOPE , NAME , and PROPERTY names for any connection manager by
opening the package that contains it on SSDT. After the package is opened, select the relevant connection
manager to show the names and values for all of its properties on the Properties window of SSDT. With
this info, you can override the values of any connection manager properties at run-time.

For example, without modifying your original package on SSDT, you can convert its on-premises-to-on-
premises data flows running on SQL Server into on-premises-to-cloud data flows running on SSIS IR in
ADF by overriding the values of ConnectByProxy , ConnectionString , and
ConnectUsingManagedIdentity properties in existing connection managers at run-time.
These run-time overrides can enable Self-Hosted IR (SHIR) as a proxy for SSIS IR when accessing data on
premises, see Configuring SHIR as a proxy for SSIS IR, and Azure SQL Database/Managed Instance
connections using the latest MSOLEDBSQL driver that in turn enables Azure Active Directory (AAD)
authentication with ADF managed identity, see Configuring AAD authentication with ADF managed
identity for OLEDB connections.
2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive connection manager properties
to assign values to them on this tab.
When you assign values to your connection manager properties, you can add dynamic content by using
expressions, functions, Data Factory system variables, and Data Factory pipeline parameters or variables.
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the AZURE KEY
VAULT check box next to them. Select or edit your existing key vault linked service or create a new one. Then
select the secret name and version for your value. When you create or edit your key vault linked service, you can
select or edit your existing key vault or create a new one. Make sure to grant Data Factory managed identity
access to your key vault if you haven't done so already. You can also enter your secret directly in the following
format: <key vault linked service name>/<secret name>/<secret version> .
Property Overrides tab
On the Property Overrides tab of Execute SSIS Package activity, complete the following steps.

1. Enter the paths of existing properties in your selected package one by one to assign values to them
manually. Make sure that they exist and are correctly entered for your package execution to succeed. For
example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .
You can obtain the correct PROPERTY PATH for any package property by opening the package that
contains it on SSDT. After the package is opened, select its control flow and Configurations property on
the Properties window of SSDT. Next, select the ellipsis (...) button next to its Configurations property
to open the Package Configurations Organizer that's normally used to create package configurations
in Package Deployment Model.

On the Package Configurations Organizer , select the Enable package configurations check box
and the Add... button to open the Package Configuration Wizard .
On the Package Configuration Wizard , select the XML configuration file item in Configuration
type dropdown menu and the Specify configuration settings directly button, enter your
configuration file name, and select the Next > button.
Finally, select the package properties whose paths you want and then select the Next > button. You can now
see, copy, and paste the package property paths you want and save them in your configuration file. With this
info, you can override the values of any package properties at run-time.

2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive package properties to assign
values to them on this tab.
When you assign values to your package properties, you can add dynamic content by using expressions,
functions, Data Factory system variables, and Data Factory pipeline parameters or variables.
The values assigned in configuration files and on the SSIS Parameters tab can be overridden by using the
Connection Managers or Property Overrides tabs. The values assigned on the Connection Managers tab
can also be overridden by using the Property Overrides tab.
To validate the pipeline configuration, select Validate on the toolbar. To close the Pipeline Validation Report ,
select >> .
To publish the pipeline to Data Factory, select Publish All .
Run the pipeline
In this step, you trigger a pipeline run.
1. To trigger a pipeline run, select Trigger on the toolbar, and select Trigger now .
2. In the Pipeline Run window, select Finish .
Monitor the pipeline
1. Switch to the Monitor tab on the left. You see the pipeline run and its status along with other
information, such as the Run Start time. To refresh the view, select Refresh .

2. Select the View Activity Runs link in the Actions column. You see only one activity run because the
pipeline has only one activity. It's the Execute SSIS Package activity.

3. Run the following query against the SSISDB database in your SQL server to verify that the package
executed.

select * from catalog.executions


4. You can also get the SSISDB execution ID from the output of the pipeline activity run and use the ID to
check more comprehensive execution logs and error messages in SQL Server Management Studio.
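If you'd rather script this monitoring step, the following hedged sketch uses the Az.DataFactory module to list the activity runs for a pipeline run and surface the Execute SSIS Package activity's output, which includes execution details such as the SSISDB execution ID when the package runs from SSISDB; the resource names are placeholders, and $runId is assumed to come from Invoke-AzDataFactoryV2Pipeline.

# Minimal sketch: inspect the Execute SSIS Package activity run for a pipeline run.
# Resource names are placeholders; $runId comes from Invoke-AzDataFactoryV2Pipeline.
$activityRuns = Get-AzDataFactoryV2ActivityRun `
    -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" `
    -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date).AddHours(1)
# The Output property carries the execution details reported by the activity.
$activityRuns | Where-Object { $_.ActivityType -eq "ExecuteSSISPackage" } |
    Select-Object ActivityName, Status, Output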

Schedule the pipeline with a trigger


You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule, such as hourly
or daily. For an example, see Create a data factory - Data Factory UI.
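If you prefer scripting over the UI, a hedged sketch for attaching and starting a schedule trigger with the Az.DataFactory module might look like the following; MyTrigger.json and all resource names are placeholders you'd define yourself.

# Minimal sketch: create and start a schedule trigger for the pipeline.
# MyTrigger.json is a placeholder file containing a ScheduleTrigger definition
# that references your pipeline; all names below are placeholders.
Set-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "DailySsisTrigger" `
    -DefinitionFile "C:\ADF\MyTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" -Name "DailySsisTrigger" -Force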

Run a package with PowerShell


In this section, you use Azure PowerShell to create a Data Factory pipeline with an Execute SSIS Package activity
that runs your SSIS package.
Install the latest Azure PowerShell modules by following the step-by-step instructions in How to install and
configure Azure PowerShell.
Create a data factory with Azure -SSIS IR
You can either use an existing data factory that already has Azure-SSIS IR provisioned or create a new data
factory with Azure-SSIS IR. Follow the step-by-step instructions in the Tutorial: Deploy SSIS packages to Azure
via PowerShell.
Create a pipeline with an Execute SSIS Package activity
In this step, you create a pipeline with an Execute SSIS Package activity. The activity runs your SSIS package.
1. Create a JSON file named RunSSISPackagePipeline.json in the C:\ADF\RunSSISPackage folder with content
similar to the following example.

IMPORTANT
Replace object names, descriptions, and paths, property or parameter values, passwords, and other variable values
before you save the file.

{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [{
"name": "MySSISActivity",
"description": "My SSIS package/activity description",
"type": "ExecuteSSISPackage",
"typeProperties": {
"connectVia": {
"referenceName": "MyAzureSSISIR",
"type": "IntegrationRuntimeReference"
},
"executionCredential": {
"domain": "MyExecutionDomain",
"username": "MyExecutionUsername",
"password": {
"type": "SecureString",
"value": "MyExecutionPassword"
}
},
"runtime": "x64",
"loggingLevel": "Basic",
"packageLocation": {
"type": "SSISDB",
"packagePath": "MyFolder/MyProject/MyPackage.dtsx"
},
"environmentPath": "MyFolder/MyEnvironment",
"projectParameters": {
"project_param_1": {
"value": "123"
},
"project_param_2": {
"value": {
"value": "@pipeline().parameters.MyProjectParameter",
"type": "Expression"
}
}
},
"packageParameters": {
"package_param_1": {
"value": "345"
"value": "345"
},
"package_param_2": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyPackageParameter"
}
}
},
"projectConnectionManagers": {
"MyAdonetCM": {
"username": {
"value": "MyConnectionUsername"
},
"password": {
"value": {
"type": "SecureString",
"value": "MyConnectionPassword"
}
}
}
},
"packageConnectionManagers": {
"MyOledbCM": {
"username": {
"value": {
"value": "@pipeline().parameters.MyConnectionUsername",
"type": "Expression"
}
},
"password": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyConnectionPassword",
"secretVersion": "MyConnectionPasswordVersion"
}
}
}
},
"propertyOverrides": {
"\\Package.MaxConcurrentExecutables": {
"value": 8,
"isSensitive": false
}
}
},
"policy": {
"timeout": "0.01:00:00",
"retry": 0,
"retryIntervalInSeconds": 30
}
}]
}
}

To execute packages stored in file system/Azure Files, enter the values for your package and log location
properties as follows:
{
"packageLocation": {
"type": "File",
"packagePath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyPackage.dtsx",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
}
}
},
"logLocation": {
"type": "File",
"logPath": "//MyStorageAccount.file.core.windows.net/MyFileShare/MyLogFolder",
"typeProperties": {
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyAccountKey"
}
}
}
}
}

To execute packages within projects stored in file system/Azure Files, enter the values for your package
location properties as follows:
{
"packageLocation": {
"type": "File",
"packagePath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyProject.ispac:MyPackage.dtsx",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"userName": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
}
}
}
}

To execute embedded packages, enter the values for your package location properties as follows:

{
"packageLocation": {
"type": "InlinePackage",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"packageName": "MyPackage.dtsx",
"packageContent":"My compressed/uncompressed package content",
"packageLastModifiedDate": "YYYY-MM-DDTHH:MM:SSZ UTC-/+HH:MM"
}
}
}

To execute packages stored in package stores, enter the values for your package and configuration
location properties as follows:
{
"packageLocation": {
"type": "PackageStore",
"packagePath": "myPackageStore/MyFolder/MyPackage",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
},
"configurationPath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyConfiguration.dtsConfig",
"configurationAccessCredential": {
"domain": "Azure",
"userName": "MyStorageAccount",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyAccountKey"
}
}
}
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. To create the pipeline RunSSISPackagePipeline , run the Set-AzDataFactoryV2Pipeline cmdlet.

$DFPipeLine = Set-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"

Here's the sample output:

PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification], [outputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Run the pipeline


Use the Invoke-AzDataFactoryV2Pipeline cmdlet to run the pipeline. The cmdlet returns the pipeline run ID
for future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName $DFPipeLine.Name

Monitor the pipeline


Run the following PowerShell script to continuously check the pipeline run status until it finishes copying the
data. Copy or paste the following script in the PowerShell window, and select Enter.

while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

You can also monitor the pipeline by using the Azure portal. For step-by-step instructions, see Monitor the
pipeline.
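
If you prefer to stay in PowerShell, the following is a minimal sketch (assuming the $ResGrp, $DataFactory, and
$RunId variables from the previous steps) that lists the activity runs of the pipeline run and prints the output of
the Execute SSIS Package activity, which is where the SSISDB execution ID mentioned earlier surfaces when the
package runs in SSISDB.

$ActivityRuns = Get-AzDataFactoryV2ActivityRun -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -PipelineRunId $RunId `
    -RunStartedAfter (Get-Date).AddHours(-1) `
    -RunStartedBefore (Get-Date).AddHours(1)

$ActivityRuns | ForEach-Object {
    Write-Output ("Activity: " + $_.ActivityName + " Status: " + $_.Status)
    $_.Output   # inspect this object for the SSISDB execution ID
}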
Schedule the pipeline with a trigger
In the previous step, you ran the pipeline on demand. You can also create a schedule trigger to run the pipeline
on a schedule, such as hourly or daily.
1. Create a JSON file named MyTrigger.json in the C:\ADF\RunSSISPackage folder with the following
content:

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}]
}
}

2. In Azure PowerShell, switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2Trigger cmdlet, which creates the trigger.

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile ".\MyTrigger.json"

4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger cmdlet.

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"

5. Confirm that the trigger is started by running the Get-AzDataFactoryV2Trigger cmdlet.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name "MyTrigger"

6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter "2017-12-06" `
    -TriggerRunStartedBefore "2017-12-09"

Run the following query against the SSISDB database in your SQL server to verify that the package
executed.

select * from catalog.executions
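
You can also run the same verification query from PowerShell instead of SQL Server Management Studio. The
following is a minimal sketch that assumes the SqlServer PowerShell module is installed and that the server name,
user name, and password are placeholders you replace.

Import-Module SqlServer

Invoke-Sqlcmd -ServerInstance "<your server name>.database.windows.net" `
    -Database "SSISDB" `
    -Username "<user name>" `
    -Password "<password>" `
    -Query "SELECT TOP 10 execution_id, folder_name, project_name, package_name, status FROM catalog.executions ORDER BY execution_id DESC"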

Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in Azure Data Factory pipelines
Run an SSIS package with the Stored Procedure
activity in Azure Data Factory
7/2/2021 • 10 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to run an SSIS package in an Azure Data Factory pipeline by using a Stored Procedure
activity.

Prerequisites
Azure SQL Database
The walkthrough in this article uses Azure SQL Database to host the SSIS catalog. You can also use Azure SQL
Managed Instance.

Create an Azure-SSIS integration runtime


Create an Azure-SSIS integration runtime if you don't have one by following the step-by-step instruction in the
Tutorial: Deploy SSIS packages.

Data Factory UI (Azure portal)


In this section, you use Data Factory UI to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Create a data factory
First step is to create a data factory by using the Azure portal.
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only
in Microsoft Edge and Google Chrome web browsers.
2. Navigate to the Azure portal.
3. Click New on the left menu, click Data + Analytics , and click Data Factory .
4. In the New data factory page, enter ADFTutorialDataFactory for the name .
The name of the Azure data factory must be globally unique . If you see the following error for the
name field, change the name of the data factory (for example, yournameADFTutorialDataFactory). See
Data Factory - Naming Rules article for naming rules for Data Factory artifacts.

5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version .
8. Select the location for the data factory. Only locations that are supported by Data Factory are shown in
the drop-down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight,
etc.) used by data factory can be in other locations.
9. Select Pin to dashboard .
10. Click Create .
11. On the dashboard, you see the following tile with status: Deploying data factory .
12. After the creation is complete, you see the Data Factory page as shown in the image.

13. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) application in a
separate tab.
Create a pipeline with stored procedure activity
In this step, you use the Data Factory UI to create a pipeline. You add a stored procedure activity to the pipeline
and configure it to run the SSIS package by using the sp_executesql stored procedure.
1. In the home page, click Orchestrate :

2. In the Activities toolbox, expand General , and drag-drop Stored Procedure activity to the pipeline
designer surface.
3. In the properties window for the stored procedure activity, switch to the SQL Account tab, and click +
New . You create a connection to the database in Azure SQL Database that hosts the SSIS Catalog (SSISDB
database).
4. In the New Linked Service window, do the following steps:
a. Select Azure SQL Database for Type .
b. Select the Default Azure Integration Runtime to connect to the Azure SQL Database that hosts the
SSISDB database.

c. Select the Azure SQL Database that hosts the SSISDB database for the Server name field.
d. Select SSISDB for Database name .
e. For User name , enter the name of user who has access to the database.
f. For Password , enter the password of the user.
g. Test the connection to the database by clicking Test connection button.
h. Save the linked service by clicking the Save button.
5. In the properties window, switch to the Stored Procedure tab from the SQL Account tab, and do the
following steps:
a. Select Edit .
b. For the Stored procedure name field, enter sp_executesql .
c. Click + New in the Stored procedure parameters section.
d. For name of the parameter, enter stmt .
e. For type of the parameter, enter String .
f. For value of the parameter, enter the following SQL query:
In the SQL query, specify the right values for the folder_name , project_name , and
package_name parameters.

DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150)

EXEC @return_value = [SSISDB].[catalog].[create_execution]
    @folder_name=N'<FOLDER name in SSIS Catalog>',
    @project_name=N'<PROJECT name in SSIS Catalog>',
    @package_name=N'<PACKAGE name>.dtsx',
    @use32bitruntime=0, @runinscaleout=1, @useanyworker=1,
    @execution_id=@exe_id OUTPUT

EXEC [SSISDB].[catalog].[set_execution_parameter_value] @exe_id,
    @object_type=50, @parameter_name=N'SYNCHRONIZED', @parameter_value=1

EXEC [SSISDB].[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0

IF (SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@exe_id) <> 7
BEGIN
    SET @err_msg = N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20))
    RAISERROR(@err_msg, 15, 1)
END
6. To validate the pipeline configuration, click Validate on the toolbar. To close the Pipeline Validation
Report , click >> .

7. Publish the pipeline to Data Factory by clicking Publish All button.


Run and monitor the pipeline
In this section, you trigger a pipeline run and then monitor it.
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger now .

2. In the Pipeline Run window, select Finish .


3. Switch to the Monitor tab on the left. You see the pipeline run and its status along with other information
(such as Run Start time). To refresh the view, click Refresh .

4. Click View Activity Runs link in the Actions column. You see only one activity run as the pipeline has
only one activity (stored procedure activity).
5. You can run the following query against the SSISDB database in SQL Database to verify that the package
executed.

select * from catalog.executions

NOTE
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.). For an
example, see Create a data factory - Data Factory UI.

Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

In this section, you use Azure PowerShell to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
Create a data factory
You can either use the same data factory that has the Azure-SSIS IR or create a separate data factory. The
following procedure provides steps to create a data factory. You create a pipeline with a stored procedure
activity in this data factory. The stored procedure activity executes a stored procedure in the SSISDB database to
run your SSIS package.
1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes,
and then run the command. For example: "adfrg" .
$resourceGroupName = "ADFTutorialResourceGroup";

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again

2. To create the Azure resource group, run the following command:

$ResGrp = New-AzResourceGroup $resourceGroupName -location 'eastus'

If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.

3. Define a variable for the data factory name.

IMPORTANT
Update the data factory name to be globally unique.

$DataFactoryName = "ADFTutorialFactory";

4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:

$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location -Name $DataFactoryName

Note the following points:


The name of the Azure data factory must be globally unique. If you receive the following error, change the
name and try again.

The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.

To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factor y : Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
Create an Azure SQL Database linked service
Create a linked service to link your database that hosts the SSIS catalog to your data factory. Data Factory uses
information in this linked service to connect to SSISDB database, and executes a stored procedure to run an SSIS
package.
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADF\RunSSISPackage folder
with the following content:
IMPORTANT
Replace <servername>, <username>, and <password> with values of your Azure SQL Database before saving
the file.

{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:
<servername>.database.windows.net,1433;Database=SSISDB;User ID=<username>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

2. In Azure PowerShell , switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service:
AzureSqlDatabaseLinkedService .

Set-AzDataFactoryV2LinkedService -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "AzureSqlDatabaseLinkedService" `
    -File ".\AzureSqlDatabaseLinkedService.json"

Create a pipeline with stored procedure activity


In this step, you create a pipeline with a stored procedure activity. The activity invokes the sp_executesql stored
procedure to run your SSIS package.
1. Create a JSON file named RunSSISPackagePipeline.json in the C:\ADF\RunSSISPackage folder with
the following content:

IMPORTANT
Replace <FOLDER NAME>, <PROJECT NAME>, <PACKAGE NAME> with names of folder, project, and package in
the SSIS catalog before saving the file.
{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [
{
"name": "My SProc Activity",
"description":"Runs an SSIS package",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "sp_executesql",
"storedProcedureParameters": {
"stmt": {
"value": "DECLARE @return_value INT, @exe_id BIGINT, @err_msg
NVARCHAR(150) EXEC @return_value=[SSISDB].[catalog].[create_execution] @folder_name=N'<FOLDER
NAME>', @project_name=N'<PROJECT NAME>', @package_name=N'<PACKAGE NAME>', @use32bitruntime=0,
@runinscaleout=1, @useanyworker=1, @execution_id=@exe_id OUTPUT EXEC [SSISDB].[catalog].
[set_execution_parameter_value] @exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED',
@parameter_value=1 EXEC [SSISDB].[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0
IF(SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@exe_id)<>7 BEGIN SET
@err_msg=N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20))
RAISERROR(@err_msg,15,1) END"
}
}
}
}
]
}
}

2. To create the pipeline RunSSISPackagePipeline , run the Set-AzDataFactoryV2Pipeline cmdlet.

$DFPipeLine = Set-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"

Here is the sample output:

PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification], [outputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}

Create a pipeline run


Use the Invoke-AzDataFactoryV2Pipeline cmdlet to run the pipeline. The cmdlet returns the pipeline run ID
for future monitoring.

$RunId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName `
    -PipelineName $DFPipeLine.Name

Monitor the pipeline run


Run the following PowerShell script to continuously check the pipeline run status until it finishes copying the
data. Copy/paste the following script in the PowerShell window, and press ENTER.
while ($True) {
    $Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
        -DataFactoryName $DataFactory.DataFactoryName `
        -PipelineRunId $RunId

if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}

Start-Sleep -Seconds 10
}

Create a trigger
In the previous step, you invoked the pipeline on-demand. You can also create a schedule trigger to run the
pipeline on a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following
content:

{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}
]
}
}

2. In Azure PowerShell , switch to the C:\ADF\RunSSISPackage folder.


3. Run the Set-AzDataFactoryV2Trigger cmdlet, which creates the trigger.

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile ".\MyTrigger.json"

4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger cmdlet.

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"
5. Confirm that the trigger is started by running the Get-AzDataFactoryV2Trigger cmdlet.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name "MyTrigger"

6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-06" -TriggerRunStartedBefore "2017-12-09"

You can run the following query against the SSISDB database in SQL Database to verify that the package
executed.

select * from catalog.executions

Next steps
You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
How to start and stop Azure-SSIS Integration
Runtime on a schedule
7/2/2021 • 14 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to schedule the starting and stopping of Azure-SSIS Integration Runtime (IR) by using
Azure Data Factory (ADF). Azure-SSIS IR is the ADF compute resource dedicated to executing SQL Server
Integration Services (SSIS) packages. Running Azure-SSIS IR has a cost associated with it, so you typically want
to run your IR only when you need to execute SSIS packages in Azure and stop it when you no longer need it.
You can use the ADF User Interface (UI)/app or Azure PowerShell to manually start or stop your IR.
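
For example, a minimal sketch of the manual start and stop commands looks like the following; the resource
group, data factory, and Azure-SSIS IR names are placeholders you replace with your own.

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyAzureSsisDataFactory" -Name "MyAzureSSISIR" -Force

# ... execute your SSIS packages while the IR is running ...

Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyAzureSsisDataFactory" -Name "MyAzureSSISIR" -Force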
Alternatively, you can create Web activities in ADF pipelines to start/stop your IR on schedule, e.g. starting it in
the morning before executing your daily ETL workloads and stopping it in the afternoon after they are done. You
can also chain an Execute SSIS Package activity between two Web activities that start and stop your IR, so your
IR will start/stop on demand, just in time before/after your package execution. For more info about Execute SSIS
Package activity, see Run an SSIS package using Execute SSIS Package activity in ADF pipeline article.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Prerequisites
If you have not provisioned your Azure-SSIS IR already, provision it by following instructions in the tutorial.

Create and schedule ADF pipelines that start and/or stop Azure-SSIS IR
This section shows you how to use Web activities in ADF pipelines to start/stop your Azure-SSIS IR on schedule
or start & stop it on demand. We will guide you to create three pipelines:
1. The first pipeline contains a Web activity that starts your Azure-SSIS IR.
2. The second pipeline contains a Web activity that stops your Azure-SSIS IR.
3. The third pipeline contains an Execute SSIS Package activity chained between two Web activities that
start/stop your Azure-SSIS IR.
After you create and test those pipelines, you can create a schedule trigger and associate it with any pipeline.
The schedule trigger defines a schedule for running the associated pipeline.
For example, you can create two triggers, the first one is scheduled to run daily at 6 AM and associated with the
first pipeline, while the second one is scheduled to run daily at 6 PM and associated with the second pipeline. In
this way, you have a period between 6 AM to 6 PM every day when your IR is running, ready to execute your
daily ETL workloads.
If you create a third trigger that is scheduled to run daily at midnight and associated with the third pipeline, that
pipeline will run at midnight every day, starting your IR just before package execution, subsequently executing
your package, and immediately stopping your IR just after package execution, so your IR will not be running idly.
Create your ADF
1. Sign in to Azure portal.
2. Click New on the left menu, click Data + Analytics , and click Data Factory .
3. In the New data factory page, enter MyAzureSsisDataFactory for Name .

The name of your ADF must be globally unique. If you receive the following error, change the name of
your ADF (e.g. yournameMyAzureSsisDataFactory) and try creating it again. See Data Factory - Naming
Rules article to learn about naming rules for ADF artifacts.
Data factory name MyAzureSsisDataFactory is not available

4. Select your Azure Subscription under which you want to create your ADF.
5. For Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of your new resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources article.
6. For Version , select V2 .
7. For Location , select one of the locations supported for ADF creation from the drop-down list.
8. Select Pin to dashboard .
9. Click Create .
10. On Azure dashboard, you will see the following tile with status: Deploying Data Factory .

11. After the creation is complete, you can see your ADF page as shown below.
12. Click Author & Monitor to launch ADF UI/app in a separate tab.
Create your pipelines
1. In the home page, select Orchestrate .

2. In Activities toolbox, expand General menu, and drag & drop a Web activity onto the pipeline designer
surface. In General tab of the activity properties window, change the activity name to startMyIR . Switch
to Settings tab, and do the following actions.
a. For URL , enter the following URL for the REST API that starts the Azure-SSIS IR, replacing
{subscriptionId} , {resourceGroupName} , {factoryName} , and {integrationRuntimeName} with the
actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01
Alternatively, you can also copy & paste the resource ID of your IR from its monitoring page on
ADF UI/app to replace the following part of the above URL:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}

b. For Method , select POST .


c. For Body , enter {"message":"Start my IR"} .
d. For Authentication , select MSI to use the managed identity for your ADF, see Managed identity
for Data Factory article for more info.
e. For Resource , enter https://management.azure.com/ .
3. Clone the first pipeline to create a second one, changing the activity name to stopMyIR and replacing the
following properties.
a. For URL , enter the following URL for the REST API that stops the Azure-SSIS IR, replacing
{subscriptionId} , {resourceGroupName} , {factoryName} , and {integrationRuntimeName} with the
actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/stop?api-version=2018-06-01

b. For Body , enter {"message":"Stop my IR"} .


4. Create a third pipeline, drag & drop an Execute SSIS Package activity from Activities toolbox onto the
pipeline designer surface, and configure it following the instructions in Invoke an SSIS package using
Execute SSIS Package activity in ADF article. Alternatively, you can use a Stored Procedure activity
instead and configure it following the instructions in Invoke an SSIS package using Stored Procedure
activity in ADF article. Next, chain the Execute SSIS Package/Stored Procedure activity between two Web
activities that start/stop your IR, similar to those Web activities in the first/second pipelines.

5. Assign the Contributor role to the managed identity of your ADF on the ADF itself, so that Web activities in its
pipelines can call the REST API to start/stop the Azure-SSIS IRs provisioned in it. On your ADF page in Azure portal, click
Access control (IAM) , click + Add role assignment , and then on Add role assignment blade, do
the following actions.
a. For Role , select Contributor .
b. For Assign access to , select Azure AD user, group, or service principal .
c. For Select , search for your ADF name and select it.
d. Click Save .

6. Validate your ADF and all pipeline settings by clicking Validate all/Validate on the factory/pipeline
toolbar. Close Factory/Pipeline Validation Output by clicking >> button. A PowerShell sketch of the same
start/stop REST calls that these Web activities make follows this list.
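
As referenced in the last step, the following is a minimal sketch of the same start/stop REST calls made from
Azure PowerShell with Invoke-AzRestMethod (Az.Accounts module); the subscription, resource group, factory,
and IR name segments are placeholders you replace with your own values.

$Path = "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}" +
        "/providers/Microsoft.DataFactory/factories/{factoryName}" +
        "/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01"

Invoke-AzRestMethod -Path $Path -Method POST

# Replace /start with /stop in $Path to stop the IR instead.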
Test run your pipelines
1. Select Test Run on the toolbar for each pipeline and see Output window in the bottom pane.

2. To test the third pipeline, launch SQL Server Management Studio (SSMS). In Connect to Server window,
do the following actions.
a. For Server name , enter <your server name>.database.windows.net .
b. Select Options >> .
c. For Connect to database , select SSISDB .
d. Select Connect .
e. Expand Integration Services Catalogs -> SSISDB -> Your folder -> Projects -> Your SSIS project
-> Packages .
f. Right-click the specified SSIS package to run and select Reports -> Standard Reports -> All
Executions .
g. Verify that it ran.

Schedule your pipelines


Now that your pipelines work as you expected, you can create triggers to run them at specified cadences. For
details about associating triggers with pipelines, see Trigger the pipeline on a schedule article.
1. On the pipeline toolbar, select Trigger and select New/Edit .

2. In Add Triggers pane, select + New .


3. In New Trigger pane, do the following actions:
a. For Name , enter a name for the trigger. In the following example, Run daily is the trigger name.
b. For Type , select Schedule .
c. For Start Date (UTC) , enter a start date and time in UTC.
d. For Recurrence , enter a cadence for the trigger. In the following example, it is Daily once.
e. For End , select No End or enter an end date and time after selecting On Date .
f. Select Activated to activate the trigger immediately after you publish the whole ADF settings.
g. Select Next .

4. In Trigger Run Parameters page, review any warning, and select Finish .
5. Publish the whole ADF settings by selecting Publish All in the factory toolbar.

Monitor your pipelines and triggers in Azure portal


1. To monitor trigger runs and pipeline runs, use Monitor tab on the left of ADF UI/app. For detailed steps,
see Monitor the pipeline article.

2. To view the activity runs associated with a pipeline run, select the first link (View Activity Runs ) in
Actions column. For the third pipeline, you will see three activity runs, one for each chained activity in
the pipeline (Web activity to start your IR, Stored Procedure activity to execute your package, and Web
activity to stop your IR). To view the pipeline runs again, select Pipelines link at the top.
3. To view the trigger runs, select Trigger Runs from the drop-down list under Pipeline Runs at the top.

Monitor your pipelines and triggers with PowerShell


Use scripts like the following examples to monitor your pipelines and triggers.
1. Get the status of a pipeline run.

Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResourceGroupName -DataFactoryName


$DataFactoryName -PipelineRunId $myPipelineRun

2. Get info about a trigger.

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -


Name "myTrigger"

3. Get the status of a trigger run.

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName


-TriggerName "myTrigger" -TriggerRunStartedAfter "2018-07-15" -TriggerRunStartedBefore "2018-07-16"

Create and schedule Azure Automation runbook that starts/stops Azure-SSIS IR
In this section, you will learn to create an Azure Automation runbook that executes a PowerShell script, starting or
stopping your Azure-SSIS IR on a schedule. This is useful when you want to execute additional scripts
before/after starting/stopping your IR for pre/post-processing.
Create your Azure Automation account
If you do not have an Azure Automation account already, create one by following the instructions in this step.
For detailed steps, see Create an Azure Automation account article. As part of this step, you create an Azure
Run As account (a service principal in your Azure Active Directory) and assign it a Contributor role in your
Azure subscription. Ensure that it is the same subscription that contains your ADF with Azure SSIS IR. Azure
Automation will use this account to authenticate to Azure Resource Manager and operate on your resources.
1. Launch Microsoft Edge or Google Chrome web browser. Currently, ADF UI/app is only supported in
Microsoft Edge and Google Chrome web browsers.
2. Sign in to Azure portal.
3. Select New on the left menu, select Monitoring + Management , and select Automation .
4. In Add Automation Account pane, do the following actions.
a. For Name , enter a name for your Azure Automation account.
b. For Subscription , select the subscription that has your ADF with Azure-SSIS IR.
c. For Resource group , select Create new to create a new resource group or Use existing to select
an existing one.
d. For Location , select a location for your Azure Automation account.
e. Confirm Create Azure Run As account as Yes . A service principal will be created in your Azure
Active Directory and assigned a Contributor role in your Azure subscription.
f. Select Pin to dashboard to display it permanently in Azure dashboard.
g. Select Create .

5. You will see the deployment status of your Azure Automation account in Azure dashboard and
notifications.
6. You will see the homepage of your Azure Automation account after it is created successfully.

Import ADF modules


1. Select Modules in SHARED RESOURCES section on the left menu and verify whether you have
Az.DataFactory + Az.Profile in the list of modules.

2. If you do not have Az.DataFactory , go to the PowerShell Gallery for Az.DataFactory module, select
Deploy to Azure Automation , select your Azure Automation account, and then select OK . Go back to
view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of
Az.DataFactory module changed to Available .

3. If you do not have Az.Profile , go to the PowerShell Gallery for Az.Profile module, select Deploy to
Azure Automation , select your Azure Automation account, and then select OK . Go back to view
Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of the
Az.Profile module changed to Available .
Create your PowerShell runbook
The following section provides steps for creating a PowerShell runbook. The script associated with your runbook
either starts or stops your Azure-SSIS IR, based on the command you specify for the OPERATION parameter. This
section does not provide the complete details for creating a runbook. For more information, see the Create a
runbook article.
1. Switch to Runbooks tab and select + Add a runbook from the toolbar.

2. Select Create a new runbook and do the following actions:


a. For Name , enter StartStopAzureSsisRuntime .
b. For Runbook type , select PowerShell .
c. Select Create .

3. Copy & paste the following PowerShell script to your runbook script window. Save and then publish your
runbook by using Save and Publish buttons on the toolbar.
Param
(
[Parameter (Mandatory= $true)]
[String] $ResourceGroupName,

[Parameter (Mandatory= $true)]


[String] $DataFactoryName,

[Parameter (Mandatory= $true)]


[String] $AzureSSISName,

[Parameter (Mandatory= $true)]


[String] $Operation
)

$connectionName = "AzureRunAsConnection"
try
{
# Get the connection "AzureRunAsConnection "
$servicePrincipalConnection=Get-AutomationConnection -Name $connectionName

"Logging in to Azure..."
Connect-AzAccount `
-ServicePrincipal `
-TenantId $servicePrincipalConnection.TenantId `
-ApplicationId $servicePrincipalConnection.ApplicationId `
-CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
if (!$servicePrincipalConnection)
{
$ErrorMessage = "Connection $connectionName not found."
throw $ErrorMessage
} else{
Write-Error -Message $_.Exception
throw $_.Exception
}
}

if($Operation -eq "START" -or $operation -eq "start")


{
"##### Starting #####"
    Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force
}
elseif($Operation -eq "STOP" -or $operation -eq "stop")
{
"##### Stopping #####"
    Stop-AzDataFactoryV2IntegrationRuntime -DataFactoryName $DataFactoryName -Name $AzureSSISName `
        -ResourceGroupName $ResourceGroupName -Force
}
"##### Completed #####"

4. Test your runbook by selecting Start button on the toolbar.

5. In Start Runbook pane, do the following actions:


a. For RESOURCE GROUP NAME , enter the name of resource group that has your ADF with Azure-
SSIS IR.
b. For DATA FACTORY NAME , enter the name of your ADF with Azure-SSIS IR.
c. For AZURESSISNAME , enter the name of Azure-SSIS IR.
d. For OPERATION , enter START .
e. Select OK .

6. In the job window, select Output tile. In the output window, wait for the message ##### Completed
##### after you see ##### Starting ##### . Starting Azure-SSIS IR takes approximately 20 minutes.
Close Job window and get back to Runbook window.

7. Repeat the previous two steps using STOP as the value for OPERATION . Start your runbook again by
selecting Start button on the toolbar. Enter your resource group, ADF, and Azure-SSIS IR names. For
OPERATION , enter STOP . In the output window, wait for the message ##### Completed ##### after
you see ##### Stopping ##### . Stopping Azure-SSIS IR does not take as long as starting it. Close Job
window and get back to Runbook window.
8. You can also trigger your runbook via a webhook that can be created by selecting the Webhooks menu
item or on a schedule that can be created by selecting the Schedules menu item as specified below.
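
If you choose the webhook route, the following is a minimal sketch (assuming the Az.Automation module and
placeholder account, resource group, data factory, and IR names) that creates a webhook whose parameter values
are fixed at creation time and then triggers it with a plain HTTP POST.

$Webhook = New-AzAutomationWebhook -ResourceGroupName "MyResourceGroup" `
    -AutomationAccountName "MyAutomationAccount" `
    -RunbookName "StartStopAzureSsisRuntime" `
    -Name "StartSsisIrWebhook" `
    -IsEnabled $true `
    -ExpiryTime (Get-Date).AddYears(1) `
    -Parameters @{ ResourceGroupName = "MyResourceGroup"; DataFactoryName = "MyAzureSsisDataFactory"; AzureSSISName = "MyAzureSSISIR"; Operation = "START" } `
    -Force

# The webhook URI is only returned at creation time, so capture and store it now.
Invoke-RestMethod -Method Post -Uri $Webhook.WebhookURI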

Create schedules for your runbook to start/stop Azure-SSIS IR


In the previous section, you have created your Azure Automation runbook that can either start or stop Azure-
SSIS IR. In this section, you will create two schedules for your runbook. When configuring the first schedule, you
specify START for OPERATION . Similarly, when configuring the second one, you specify STOP for
OPERATION . For detailed steps to create schedules, see Create a schedule article.
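
As an alternative to the portal steps below, the following is a minimal sketch (assuming the Az.Automation module
and placeholder account, resource group, data factory, and IR names) that creates a daily schedule and links it to
the runbook with START as the OPERATION value; repeat it with a later start time and STOP for the second schedule.

$Schedule = New-AzAutomationSchedule -ResourceGroupName "MyResourceGroup" `
    -AutomationAccountName "MyAutomationAccount" `
    -Name "Start IR daily" `
    -StartTime (Get-Date).AddMinutes(10) `
    -DayInterval 1

Register-AzAutomationScheduledRunbook -ResourceGroupName "MyResourceGroup" `
    -AutomationAccountName "MyAutomationAccount" `
    -RunbookName "StartStopAzureSsisRuntime" `
    -ScheduleName $Schedule.Name `
    -Parameters @{ ResourceGroupName = "MyResourceGroup"; DataFactoryName = "MyAzureSsisDataFactory"; AzureSSISName = "MyAzureSSISIR"; Operation = "START" }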
1. In Runbook window, select Schedules , and select + Add a schedule on the toolbar.
2. In Schedule Runbook pane, do the following actions:
a. Select Link a schedule to your runbook .
b. Select Create a new schedule .
c. In New Schedule pane, enter Start IR daily for Name .
d. For Starts , enter a time that is a few minutes past the current time.
e. For Recurrence , select Recurring .
f. For Recur every , enter 1 and select Day .
g. Select Create .

3. Switch to Parameters and run settings tab. Specify your resource group, ADF, and Azure-SSIS IR
names. For OPERATION , enter START and select OK . Select OK again to see the schedule on Schedules
page of your runbook.

4. Repeat the previous two steps to create a schedule named Stop IR daily . Enter a time that is at least 30
minutes after the time you specified for Start IR daily schedule. For OPERATION , enter STOP and
select OK . Select OK again to see the schedule on Schedules page of your runbook.
5. In Runbook window, select Jobs on the left menu. You should see the jobs created by your schedules at
the specified times and their statuses. You can see the job details, such as its output, similar to what you
have seen after you tested your runbook.
6. After you are done testing, disable your schedules by editing them. Select Schedules on the left menu,
select Start IR daily/Stop IR daily , and select No for Enabled .

Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication
Join an Azure-SSIS integration runtime to a virtual
network
7/16/2021 • 31 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When using SQL Server Integration Services (SSIS) in Azure Data Factory, you should join your Azure-SSIS
integration runtime (IR) to an Azure virtual network in the following scenarios:
You want to connect to on-premises data stores from SSIS packages that run on your Azure-SSIS IR
without configuring or managing a self-hosted IR as proxy.
You want to host SSIS catalog database (SSISDB) in Azure SQL Database with IP firewall rules/virtual
network service endpoints or in SQL Managed Instance with private endpoint.
You want to connect to Azure resources configured with virtual network service endpoints from SSIS
packages that run on your Azure-SSIS IR.
You want to connect to data stores/resources configured with IP firewall rules from SSIS packages that
run on your Azure-SSIS IR.
Data Factory lets you join your Azure-SSIS IR to a virtual network created through the classic deployment model
or the Azure Resource Manager deployment model.

IMPORTANT
The classic virtual network is being deprecated, so use the Azure Resource Manager virtual network instead. If you already
use the classic virtual network, switch to the Azure Resource Manager virtual network as soon as possible.

The configuring an Azure-SQL Server Integration Services (SSIS) integration runtime (IR) to join a virtual
network tutorial shows the minimum steps via Azure portal. This article expands on the tutorial and describes all
the optional tasks:
If you are using virtual network (classic).
If you bring your own public IP addresses for the Azure-SSIS IR.
If you use your own Domain Name System (DNS) server.
If you use a network security group (NSG) on the subnet.
If you use Azure ExpressRoute or a user-defined route (UDR).
If you use customized Azure-SSIS IR.
If you use Azure PowerShell provisioning.

Access to on-premises data stores


If your SSIS packages access on-premises data stores, you can join your Azure-SSIS IR to a virtual network that
is connected to the on-premises network. Or you can configure and manage a self-hosted IR as proxy for your
Azure-SSIS IR. For more information, see Configure a self-hosted IR as a proxy for an Azure-SSIS IR.
When joining your Azure-SSIS IR to a virtual network, remember these important points:
If no virtual network is connected to your on-premises network, first create an Azure Resource Manager
virtual network for your Azure-SSIS IR to join. Then configure a site-to-site VPN gateway connection or
ExpressRoute connection from that virtual network to your on-premises network.
If an Azure Resource Manager virtual network is already connected to your on-premises network in the
same location as your Azure-SSIS IR, you can join the IR to that virtual network.
If a classic virtual network is already connected to your on-premises network in a different location from
your Azure-SSIS IR, you can create an Azure Resource Manager virtual network for your Azure-SSIS IR to
join. Then configure a classic-to-Azure Resource Manager virtual network connection.
If an Azure Resource Manager virtual network is already connected to your on-premises network in a
different location from your Azure-SSIS IR, you can first create an Azure Resource Manager virtual
network for your Azure-SSIS IR to join. Then configure an Azure Resource Manager-to-Azure Resource
Manager virtual network connection.

Hosting the SSIS catalog in SQL Database


If you host your SSIS catalog in an Azure SQL Database with virtual network service endpoints, make sure that
you join your Azure-SSIS IR to the same virtual network and subnet.
If you host your SSIS catalog in SQL Managed Instance with private endpoint, make sure that you join your
Azure-SSIS IR to the same virtual network, but in a different subnet than the managed instance. To join your
Azure-SSIS IR to a different virtual network than the SQL Managed Instance, we recommend either virtual
network peering (which is limited to the same region) or a connection from virtual network to virtual network.
For more information, see Connect your application to Azure SQL Managed Instance.

Access to Azure services


If your SSIS packages access Azure resources that support virtual network service endpoints and you want to
secure access to those resources from Azure-SSIS IR, you can join your Azure-SSIS IR to a virtual network
subnet configured for virtual network service endpoints and then add a virtual network rule to the relevant
Azure resources to allow access from the same subnet.
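
For example, the following is a minimal sketch (assuming the Az.Sql and Az.Storage modules; all resource names
and the subnet ID are placeholders) that adds virtual network rules allowing the Azure-SSIS IR's subnet to reach a
logical SQL server and a storage account over their service endpoints.

$SubnetId = "/subscriptions/<subscriptionId>/resourceGroups/<vnetResourceGroup>/providers/Microsoft.Network/virtualNetworks/<vnetName>/subnets/<subnetName>"

New-AzSqlServerVirtualNetworkRule -ResourceGroupName "MySqlResourceGroup" `
    -ServerName "mysqlserver" `
    -VirtualNetworkRuleName "AllowSsisIrSubnet" `
    -VirtualNetworkSubnetId $SubnetId

Add-AzStorageAccountNetworkRule -ResourceGroupName "MyStorageResourceGroup" `
    -Name "mystorageaccount" `
    -VirtualNetworkResourceId $SubnetId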

Access to data sources protected by IP firewall rule


If your SSIS packages access data stores/resources that allow only specific static public IP addresses and you
want to secure access to those resources from Azure-SSIS IR, you can associate public IP addresses with Azure-
SSIS IR while joining it to a virtual network and then add an IP firewall rule to the relevant resources to allow
access from those IP addresses. There are two alternative ways to do this:
When you create Azure-SSIS IR, you can bring your own public IP addresses and specify them via Data
Factory UI or SDK. Only the outbound internet connectivity of Azure-SSIS IR will use your provided public IP
addresses and other devices in the subnet will not use them.
You can also set up Virtual Network NAT for the subnet that Azure-SSIS IR will join, so that all outbound
connectivity in this subnet uses your specified public IP addresses (a sketch follows this list).
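
A minimal sketch of the Virtual Network NAT option follows; it assumes the Az.Network module, uses placeholder
names, region, and address prefix, and the exact parameter set can vary across Az.Network versions.

$Pip = New-AzPublicIpAddress -ResourceGroupName "MyNetworkRG" -Name "MySsisIrOutboundPip" `
    -Location "WestEurope" -Sku Standard -AllocationMethod Static

$Nat = New-AzNatGateway -ResourceGroupName "MyNetworkRG" -Name "MySsisIrNatGateway" `
    -Location "WestEurope" -Sku Standard -PublicIpAddress $Pip

# Associate the NAT gateway with the subnet that the Azure-SSIS IR will join.
$Vnet = Get-AzVirtualNetwork -ResourceGroupName "MyNetworkRG" -Name "MyVnet"
Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $Vnet -Name "MySsisIrSubnet" `
    -AddressPrefix "10.0.1.0/24" -NatGateway $Nat | Set-AzVirtualNetwork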
In all cases, the virtual network can be deployed only through the Azure Resource Manager deployment model.
The following sections provide more details.

Virtual network configuration


Set up your virtual network to meet these requirements:
Make sure that Microsoft.Batch is a registered provider under the subscription of your virtual network
subnet that hosts the Azure-SSIS IR. If you use a classic virtual network, also join MicrosoftAzureBatch to
the Classic Virtual Machine Contributor role for that virtual network.
Make sure you have the required permissions. For more information, see Set up permissions.
Select the proper subnet to host the Azure-SSIS IR. For more information, see Select the subnet.
If you bring your own public IP addresses for the Azure-SSIS IR, see Select the static public IP addresses
If you use your own Domain Name System (DNS) server on the virtual network, see Set up the DNS
server.
If you use a network security group (NSG) on the subnet, see Set up an NSG.
If you use Azure ExpressRoute or a user-defined route (UDR), see Use Azure ExpressRoute or a UDR.
Make sure the virtual network's resource group (or the public IP addresses' resource group if you bring
your own public IP addresses) can create and delete certain Azure network resources. For more
information, see Set up the resource group.
If you customize your Azure-SSIS IR as described in Custom setup for Azure-SSIS IR, our internal process
to manage its nodes will consume private IP addresses from a predefined range of 172.16.0.0 to
172.31.255.255. Consequently, please make sure that the private IP address ranges of your virtual or on-
premises networks don't collide with this range.
This diagram shows the required connections for your Azure-SSIS IR:

Set up permissions
The user who creates the Azure-SSIS IR must have the following permissions:
If you're joining your SSIS IR to an Azure Resource Manager virtual network, you have two options:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/*
permission, which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission. If you also want to bring your own
public IP addresses for Azure-SSIS IR while joining it to an Azure Resource Manager virtual
network, please also include Microsoft.Network/publicIPAddresses/*/join/action permission in the
role (a sketch for creating such a role follows this list).
If you're joining your SSIS IR to a classic virtual network, we recommend that you use the built-in Classic
Virtual Machine Contributor role. Otherwise you have to define a custom role that includes the
permission to join the virtual network.
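
As referenced above, the following is a minimal sketch (assuming the Az.Resources module; the role name and
subscription ID are placeholders) that clones a built-in role as a template and keeps only the join permissions
named in this section.

$Role = Get-AzRoleDefinition -Name "Network Contributor"
$Role.Id = $null
$Role.IsCustom = $true
$Role.Name = "Azure-SSIS IR virtual network join"          # placeholder role name
$Role.Description = "Can join subnets and public IP addresses for an Azure-SSIS IR"
$Role.Actions.Clear()
$Role.Actions.Add("Microsoft.Network/virtualNetworks/*/join/action")
$Role.Actions.Add("Microsoft.Network/publicIPAddresses/*/join/action")
$Role.AssignableScopes.Clear()
$Role.AssignableScopes.Add("/subscriptions/<subscriptionId>")
New-AzRoleDefinition -Role $Role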
Select the subnet
As you choose a subnet:
Don't select the GatewaySubnet to deploy an Azure-SSIS IR. It's dedicated for virtual network gateways.
Ensure that the subnet you select has enough available address space for the Azure-SSIS IR to use. Leave
available IP addresses for at least two times the IR node number. Azure reserves some IP addresses
within each subnet. These addresses can't be used. The first and last IP addresses of the subnets are
reserved for protocol conformance, and three more addresses are used for Azure services. For more
information, see Are there any restrictions on using IP addresses within these subnets?
Don't use a subnet that is exclusively occupied by other Azure services (for example, SQL Database,
SQL Managed Instance, App Service, and so on).
Select the static public IP addresses
If you want to bring your own static public IP addresses for Azure-SSIS IR while joining it to a virtual network,
make sure they meet the following requirements (a sketch for creating such addresses follows this list):
Exactly two unused ones that are not already associated with other Azure resources should be provided.
The extra one will be used when we periodically upgrade your Azure-SSIS IR. Note that one public IP
address cannot be shared among your active Azure-SSIS IRs.
They should both be static ones of standard type. Refer to SKUs of Public IP Address for more details.
They should both have a DNS name. If you have not provided a DNS name when creating them, you can
do so on Azure portal.
They and the virtual network should be under the same subscription and in the same region.
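
As referenced above, the following is a minimal sketch (assuming the Az.Network module; names, region, and DNS
labels are placeholders) that creates the two standard static public IP addresses, each with a DNS name, in the
same region as your virtual network.

"ssisir-pip-1", "ssisir-pip-2" | ForEach-Object {
    New-AzPublicIpAddress -ResourceGroupName "MyNetworkRG" `
        -Name $_ `
        -Location "WestEurope" `
        -Sku Standard `
        -AllocationMethod Static `
        -DomainNameLabel ("myssisir-" + $_)
}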
Set up the DNS server
If you need to use your own DNS server in a virtual network joined by your Azure-SSIS IR to resolve your
private host name, make sure it can also resolve global Azure host names (for example, an Azure Storage blob
named <your storage account>.blob.core.windows.net ).
One recommended approach is below:
Configure the custom DNS to forward requests to Azure DNS. You can forward unresolved DNS records to
the IP address of the Azure recursive resolvers (168.63.129.16) on your own DNS server.
For more information, see Name resolution that uses your own DNS server.

NOTE
Please use a Fully Qualified Domain Name (FQDN) for your private host name (for example, use
<your_private_server>.contoso.com instead of <your_private_server> ). Alternatively, you can use a standard
custom setup on your Azure-SSIS IR to automatically append your own DNS suffix (for example contoso.com ) to any
unqualified single label domain name and turn it into an FQDN before using it in DNS queries, see standard custom setup
samples.

Set up an NSG
If you need to implement an NSG for the subnet used by your Azure-SSIS IR, allow inbound and outbound
traffic through the following ports:
Inbound requirement of Azure-SSIS IR

Direction: Inbound
Transport protocol: TCP
Source: BatchNodeManagement
Source port range: *
Destination: VirtualNetwork
Destination port range: 29876, 29877 (if you join the IR to an Azure Resource Manager virtual network); 10100, 20100, 30100 (if you join the IR to a classic virtual network)
Comments: The Data Factory service uses these ports to communicate with the nodes of your Azure-SSIS IR in the virtual network. Whether or not you create a subnet-level NSG, Data Factory always configures an NSG at the level of the network interface cards (NICs) attached to the virtual machines that host the Azure-SSIS IR. Only inbound traffic from Data Factory IP addresses on the specified ports is allowed by that NIC-level NSG. Even if you open these ports to internet traffic at the subnet level, traffic from IP addresses that aren't Data Factory IP addresses is blocked at the NIC level.

Direction: Inbound
Transport protocol: TCP
Source: CorpNetSaw
Source port range: *
Destination: VirtualNetwork
Destination port range: 3389
Comments: (Optional) This rule is only required when a Microsoft supporter asks the customer to open it for advanced troubleshooting, and it can be closed right after troubleshooting. The CorpNetSaw service tag permits only secure access workstations on the Microsoft corporate network to use remote desktop. This service tag can't be selected from the portal and is only available via Azure PowerShell or Azure CLI. At the NIC-level NSG, port 3389 is open by default, and we allow you to control port 3389 at the subnet-level NSG; meanwhile, Azure-SSIS IR has disallowed port 3389 outbound by default via a Windows firewall rule on each IR node for protection.

Outbound requirement of Azure-SSIS IR

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: AzureCloud
Destination port range: 443
Comments: The nodes of your Azure-SSIS IR in the virtual network use this port to access Azure services, such as Azure Storage and Azure Event Hubs.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: Internet
Destination port range: 80
Comments: (Optional) The nodes of your Azure-SSIS IR in the virtual network use this port to download a certificate revocation list from the internet. If you block this traffic, you might experience performance downgrade when starting the IR and lose the capability to check the certificate revocation list for certificate usage. If you want to further narrow down the destination to certain FQDNs, refer to the Use Azure ExpressRoute or UDR section.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: Sql
Destination port range: 1433, 11000-11999
Comments: (Optional) This rule is only required when the nodes of your Azure-SSIS IR in the virtual network access an SSISDB hosted by your server. If your server connection policy is set to Proxy instead of Redirect , only port 1433 is needed. This outbound security rule isn't applicable to an SSISDB hosted by your SQL Managed Instance in the virtual network or SQL Database configured with private endpoint.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: VirtualNetwork
Destination port range: 1433, 11000-11999
Comments: (Optional) This rule is only required when the nodes of your Azure-SSIS IR in the virtual network access an SSISDB hosted by your SQL Managed Instance in the virtual network or SQL Database configured with private endpoint. If your server connection policy is set to Proxy instead of Redirect , only port 1433 is needed.

Direction: Outbound
Transport protocol: TCP
Source: VirtualNetwork
Source port range: *
Destination: Storage
Destination port range: 445
Comments: (Optional) This rule is only required when you want to execute SSIS packages stored in Azure Files.

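As a companion to the inbound sketch above, outbound rules can be added to the same NSG. This hedged example covers only the required port 443 rule and the optional SSISDB rule; names, priorities, and bracketed placeholders are illustrative:

# Fetch the NSG used by the Azure-SSIS IR subnet.
$nsg = Get-AzNetworkSecurityGroup -Name "ssisir-subnet-nsg" -ResourceGroupName "[your resource group]"

# Required: let IR nodes reach Azure services such as Azure Storage and Azure Event Hubs.
$nsg | Add-AzNetworkSecurityRuleConfig -Name "Allow-AzureCloud-443-Outbound" `
    -Direction Outbound -Access Allow -Protocol Tcp -Priority 100 `
    -SourceAddressPrefix VirtualNetwork -SourcePortRange * `
    -DestinationAddressPrefix AzureCloud -DestinationPortRange 443 | Out-Null

# Optional: only needed when the IR accesses an SSISDB hosted by Azure SQL Database.
$nsg | Add-AzNetworkSecurityRuleConfig -Name "Allow-Sql-Outbound" `
    -Direction Outbound -Access Allow -Protocol Tcp -Priority 110 `
    -SourceAddressPrefix VirtualNetwork -SourcePortRange * `
    -DestinationAddressPrefix Sql -DestinationPortRange 1433,"11000-11999" | Out-Null

# Persist the rule changes.
$nsg | Set-AzNetworkSecurityGroup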
Use Azure ExpressRoute or UDR


If you want to inspect outbound traffic from your Azure-SSIS IR, you can route traffic initiated from the Azure-SSIS IR to
an on-premises firewall appliance via Azure ExpressRoute force tunneling (advertising a BGP route, 0.0.0.0/0, to the
virtual network) or to a Network Virtual Appliance (NVA) or Azure Firewall via UDRs.
To make the whole scenario work, you need to ensure the following:
The firewall appliance allows the outbound traffic required by the Azure-SSIS IR.
Inbound traffic between Azure Batch management services and the Azure-SSIS IR isn't routed to the firewall
appliance; otherwise, the traffic breaks because of an asymmetric routing problem. Routes must be defined for
inbound traffic so that the traffic can reply back the same way it came in. You can define specific UDRs that route
traffic between Azure Batch management services and the Azure-SSIS IR with the next hop type Internet .
For example, if your Azure-SSIS IR is located in UK South and you want to inspect outbound traffic through
Azure Firewall, you would first get an IP range list of the service tag BatchNodeManagement.UKSouth from the service
tags IP range download link or through the Service Tag Discovery API. Then apply the following UDRs of the related
IP range routes with the next hop type Internet, along with the 0.0.0.0/0 route with the next hop type
Virtual appliance .

NOTE
This approach incurs an additional maintenance cost. Regularly check the IP ranges and add new IP ranges into your UDRs
to avoid breaking the Azure-SSIS IR. We recommend checking the IP ranges monthly, because when a new IP appears in
the service tag, it takes another month before it goes into effect.

To make the setup of the UDR rules easier, you can run the following PowerShell script to add UDR rules for Azure Batch
management services:

$Location = "[location of your Azure-SSIS IR]"
$RouteTableResourceGroupName = "[name of Azure resource group that contains your Route Table]"
$RouteTableResourceName = "[resource name of your Azure Route Table]"
$RouteTable = Get-AzRouteTable -ResourceGroupName $RouteTableResourceGroupName -Name $RouteTableResourceName
$ServiceTags = Get-AzNetworkServiceTag -Location $Location
$BatchServiceTagName = "BatchNodeManagement." + $Location
$UdrRulePrefixForBatch = $BatchServiceTagName
if ($ServiceTags -ne $null)
{
    $BatchIPRanges = $ServiceTags.Values | Where-Object { $_.Name -ieq $BatchServiceTagName }
    if ($BatchIPRanges -ne $null)
    {
        Write-Host "Start to add rule for your route table..."
        for ($i = 0; $i -lt $BatchIPRanges.Properties.AddressPrefixes.Count; $i++)
        {
            $UdrRuleName = "$($UdrRulePrefixForBatch)_$($i)"
            Add-AzRouteConfig -Name $UdrRuleName `
                -AddressPrefix $BatchIPRanges.Properties.AddressPrefixes[$i] `
                -NextHopType "Internet" `
                -RouteTable $RouteTable `
                | Out-Null
            Write-Host "Add rule $UdrRuleName to your route table..."
        }
        Set-AzRouteTable -RouteTable $RouteTable
    }
}
else
{
    Write-Host "Failed to fetch service tags, please confirm that your Location is valid."
}

For the firewall appliance to allow outbound traffic, you need to allow outbound traffic to the same ports required
in the NSG outbound rules:
Port 443 with destination as Azure Cloud services.
If you use Azure Firewall, you can specify a network rule with the AzureCloud service tag (a PowerShell
sketch follows this list). For firewalls of other types, you can either simply allow all destinations for port 443
or allow the following FQDNs based on the type of your Azure environment:

Azure Public
- Azure Data Factory (Management): *.frontend.clouddatahub.net
- Azure Storage (Management): *.blob.core.windows.net, *.table.core.windows.net
- Azure Container Registry (Custom Setup): *.azurecr.io
- Event Hub (Logging): *.servicebus.windows.net
- Microsoft Logging service (Internal Use): gcs.prod.monitoring.core.windows.net, prod.warmpath.msftcloudes.com, azurewatsonanalysis-prod.core.windows.net

Azure Government
- Azure Data Factory (Management): *.frontend.datamovement.azure.us
- Azure Storage (Management): *.blob.core.usgovcloudapi.net, *.table.core.usgovcloudapi.net
- Azure Container Registry (Custom Setup): *.azurecr.us
- Event Hub (Logging): *.servicebus.usgovcloudapi.net
- Microsoft Logging service (Internal Use): fairfax.warmpath.usgovcloudapi.net, azurewatsonanalysis.usgovcloudapp.net

Azure China 21Vianet
- Azure Data Factory (Management): *.frontend.datamovement.azure.cn
- Azure Storage (Management): *.blob.core.chinacloudapi.cn, *.table.core.chinacloudapi.cn
- Azure Container Registry (Custom Setup): *.azurecr.cn
- Event Hub (Logging): *.servicebus.chinacloudapi.cn
- Microsoft Logging service (Internal Use): mooncake.warmpath.chinacloudapi.cn, azurewatsonanalysis.chinacloudapp.cn


As for the FQDNs of Azure Storage, Azure Container Registry and Event Hub, you can also choose to
enable the following service endpoints for your virtual network so that network traffic to these endpoints
goes through Azure backbone network instead of being routed to your firewall appliance:
Microsoft.Storage
Microsoft.ContainerRegistry
Microsoft.EventHub
Port 80 with destination as CRL download sites.
You should allow the following FQDNs, which are used as Certificate Revocation List (CRL) download sites for
certificates that the Azure-SSIS IR needs for management purposes:
crl.microsoft.com:80
mscrl.microsoft.com:80
crl3.digicert.com:80
crl4.digicert.com:80
ocsp.digicert.com:80
cacerts.digicert.com:80
If you're using certificates that have a different CRL, include their download sites as well; see the Certificate
Revocation List documentation to learn more.
If you disallow this traffic, you might experience performance degradation when starting the Azure-SSIS IR and
lose the capability to check the certificate revocation list for certificate usage, which isn't recommended from
a security point of view.
Port 1433, 11000-11999 with destination as Azure SQL Database (only required when the nodes of your
Azure-SSIS IR in the virtual network access an SSISDB hosted by your server).
If you use Azure Firewall, you can specify a network rule with the Azure SQL service tag; otherwise, you can
allow the specific Azure SQL URL as the destination in your firewall appliance.
Port 445 with destination as Azure Storage (only required when you execute SSIS packages stored in
Azure Files).
If you use Azure Firewall, you can specify a network rule with the Storage service tag; otherwise, you can
allow the specific Azure file storage URL as the destination in your firewall appliance.
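If you use Azure Firewall, the service-tag-based allowances above can be grouped into a network rule collection. A minimal hedged sketch follows; the firewall, rule, and collection names, priorities, and bracketed placeholders are illustrative:

# Look up your existing Azure Firewall.
$azFw = Get-AzFirewall -Name "[your firewall name]" -ResourceGroupName "[your resource group]"

# Required outbound 443 to Azure services, plus optional SSISDB access via the Sql service tag.
$httpsRule = New-AzFirewallNetworkRule -Name "ssisir-azurecloud-443" -Protocol TCP `
    -SourceAddress "[your Azure-SSIS IR subnet address range]" `
    -DestinationAddress AzureCloud -DestinationPort 443
$sqlRule = New-AzFirewallNetworkRule -Name "ssisir-sql" -Protocol TCP `
    -SourceAddress "[your Azure-SSIS IR subnet address range]" `
    -DestinationAddress Sql -DestinationPort 1433,"11000-11999"

# Group the rules into an allow collection and push the change to the firewall.
$collection = New-AzFirewallNetworkRuleCollection -Name "ssisir-outbound" -Priority 200 `
    -Rule $httpsRule,$sqlRule -ActionType Allow
$azFw.AddNetworkRuleCollection($collection)
Set-AzFirewall -AzureFirewall $azFw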

NOTE
For Azure SQL and Azure Storage, if you configure virtual network service endpoints on your subnet, then traffic between
the Azure-SSIS IR and Azure SQL in the same region, or Azure Storage in the same region or its paired region, is routed
directly to the Microsoft Azure backbone network instead of your firewall appliance.

If you don't need the capability to inspect outbound traffic from the Azure-SSIS IR, you can simply apply a route to force
all traffic to the next hop type Internet , as sketched below:
In an Azure ExpressRoute scenario, you can apply a 0.0.0.0/0 route with the next hop type Internet on the
subnet that hosts the Azure-SSIS IR.
In an NVA scenario, you can modify the existing 0.0.0.0/0 route applied on the subnet that hosts the Azure-
SSIS IR by changing the next hop type from Virtual appliance to Internet .
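A hedged PowerShell sketch of applying such a default route; the route table and route names are placeholders, and in the NVA scenario you would modify your existing 0.0.0.0/0 route with Set-AzRouteConfig instead of adding a new one:

# Fetch the route table associated with the subnet that hosts the Azure-SSIS IR.
$rt = Get-AzRouteTable -Name "[your route table name]" -ResourceGroupName "[your resource group]"

# ExpressRoute scenario: add a default route that keeps the IR traffic on the Azure backbone.
Add-AzRouteConfig -Name "DefaultToInternet" -AddressPrefix "0.0.0.0/0" -NextHopType Internet -RouteTable $rt | Out-Null

# NVA scenario (alternative): change the next hop type of the existing default route instead.
# Set-AzRouteConfig -Name "[your existing default route name]" -AddressPrefix "0.0.0.0/0" -NextHopType Internet -RouteTable $rt | Out-Null

# Persist the route table changes.
Set-AzRouteTable -RouteTable $rt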

NOTE
Specifying a route with the next hop type Internet doesn't mean that all traffic goes over the internet. As long as the
destination address is for one of Azure's services, Azure routes the traffic directly to the service over Azure's backbone
network, rather than routing the traffic to the internet.

Set up the resource group


The Azure-SSIS IR needs to create certain network resources under the same resource group as the virtual
network. These resources include:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer.
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip.
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup.

NOTE
You can now bring your own static public IP addresses for Azure-SSIS IR. In this scenario, we will create only the Azure
load balancer and network security group under the same resource group as your static public IP addresses instead of the
virtual network.

Those resources will be created when your Azure-SSIS IR starts. They'll be deleted when your Azure-SSIS IR
stops. If you bring your own static public IP addresses for Azure-SSIS IR, your own static public IP addresses
won't be deleted when your Azure-SSIS IR stops. To avoid blocking your Azure-SSIS IR from stopping, don't
reuse these network resources in your other resources.
Make sure that you have no resource lock on the resource group/subscription to which the virtual network/your
static public IP addresses belong. If you configure a read-only/delete lock, starting and stopping your Azure-
SSIS IR will fail, or it will stop responding.
Make sure that you don't have an Azure Policy assignment that prevents the following resources from being
created under the resource group/subscription to which the virtual network/your static public IP addresses
belong:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
Microsoft.Network/PublicIPAddresses
Make sure that your subscription has enough resource quota for the above three network resources.
Specifically, for each Azure-SSIS IR created in a virtual network, you need to reserve two free quota units for each of
the above three network resources. The extra quota unit is used when we periodically upgrade your Azure-
SSIS IR.
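To verify the lock and quota prerequisites above before you start your Azure-SSIS IR, a minimal hedged sketch; the resource names are placeholders, and the usage names in the filter reflect how they typically appear, so adjust them if your Get-AzNetworkUsage output differs:

# Check for read-only/delete locks on the resource group that contains your virtual network.
Get-AzResourceLock -ResourceGroupName "[resource group of your virtual network]"

# Check remaining quota for the three network resource types that the Azure-SSIS IR creates.
Get-AzNetworkUsage -Location "[your region]" |
    Where-Object { $_.Name.Value -in "LoadBalancers","NetworkSecurityGroups","PublicIPAddresses" } |
    Select-Object @{ n = "Resource"; e = { $_.Name.LocalizedValue } }, CurrentValue, Limit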
FAQ
How can I protect the public IP address exposed on my Azure-SSIS IR for inbound connection? Is it
possible to remove the public IP address?
Right now, a public IP address is automatically created when your Azure-SSIS IR joins a virtual
network. We do have a NIC-level NSG that allows only Azure Batch management services to make inbound
connections to your Azure-SSIS IR. You can also specify a subnet-level NSG for inbound protection.
If you don't want any public IP address to be exposed, consider configuring a self-hosted IR as proxy for
your Azure-SSIS IR instead of joining your Azure-SSIS IR to a virtual network, if this applies to your
scenario.
Can I add the public IP address of my Azure-SSIS IR to the firewall's allow list for my data sources?
You can now bring your own static public IP addresses for Azure-SSIS IR. In this case, you can add your IP
addresses to the firewall's allow list for your data sources. You can also consider other options below to
secure data access from your Azure-SSIS IR depending on your scenario:
If your data source is on premises, after connecting a virtual network to your on-premises network
and joining your Azure-SSIS IR to the virtual network subnet, you can then add the private IP address
range of that subnet to the firewall's allow list for your data source.
If your data source is an Azure service that supports virtual network service endpoints, you can
configure a virtual network service endpoint on your virtual network subnet and join your Azure-SSIS
IR to that subnet. You can then add a virtual network rule with that subnet to the firewall for your data
source.
If your data source is a non-Azure cloud service, you can use a UDR to route outbound traffic from
your Azure-SSIS IR to an NVA/Azure Firewall via a static public IP address. You can then add the static
public IP address of your NVA/Azure Firewall to the firewall's allow list for your data source.
If none of the above options meets your needs, consider configuring a self-hosted IR as proxy for your
Azure-SSIS IR. You can then add the static public IP address of the machine that hosts your self-hosted
IR to the firewall's allow list for your data source.
Why do I need to provide two static public IP addresses if I want to bring my own for Azure-SSIS IR?
Azure-SSIS IR is automatically updated on a regular basis. New nodes are created during upgrade and
old ones will be deleted. However, to avoid downtime, the old nodes will not be deleted until the new
ones are ready. Thus, your first static public IP address used by the old nodes cannot be released
immediately and we need your second static public IP address to create the new nodes.
I've brought my own static public IP addresses for Azure-SSIS IR, but why does it still fail to access my data
sources?
Confirm that the two static public IP addresses are both added to the firewall's allow list for your data
sources. Each time your Azure-SSIS IR is upgraded, its static public IP address is switched between the
two brought by you. If you add only one of them to the allow list, data access for your Azure-SSIS IR
will be broken after its upgrade.
If your data source is an Azure service, please check whether you have configured it with virtual
network service endpoints. If that's the case, the traffic from Azure-SSIS IR to your data source will
switch to use the private IP addresses managed by Azure services and adding your own static public
IP addresses to the firewall's allow list for your data source will not take effect.

Azure portal (Data Factory UI)


This section shows you how to join an existing Azure-SSIS IR to a virtual network (classic or Azure Resource
Manager) by using the Azure portal and Data Factory UI.
Before joining your Azure-SSIS IR to the virtual network, you need to properly configure the virtual network.
Follow the steps in the section that applies to your type of virtual network (classic or Azure Resource Manager).
Then follow the steps in the third section to join your Azure-SSIS IR to the virtual network.
Configure an Azure Resource Manager virtual network
Use the portal to configure an Azure Resource Manager virtual network before you try to join an Azure-SSIS IR
to it.
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks.
4. Filter for and select your virtual network in the list.
5. On the Virtual network page, select Properties.
6. Select the copy button for RESOURCE ID to copy the resource ID for the virtual network to the
clipboard. Save the ID from the clipboard in OneNote or a file.
7. On the left menu, select Subnets . Ensure that the number of available addresses is greater than the
nodes in your Azure-SSIS IR.
8. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory portal,
the Azure Batch provider is automatically registered for you.)
a. In the Azure portal, on the left menu, select Subscriptions .
b. Select your subscription.
c. On the left, select Resource providers , and confirm that Microsoft.Batch is a registered
provider.
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
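If you prefer to verify or register the Azure Batch provider from PowerShell rather than the portal, a brief sketch:

# Check the registration state of the Azure Batch resource provider in the current subscription.
Get-AzResourceProvider -ProviderNamespace Microsoft.Batch | Select-Object ProviderNamespace, RegistrationState

# Register the provider if it isn't registered yet.
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch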
Configure a classic virtual network
Use the portal to configure a classic virtual network before you try to join an Azure-SSIS IR to it.
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. Sign in to the Azure portal.
3. Select More services. Filter for and select Virtual networks (classic).
4. Filter for and select your virtual network in the list.
5. On the Virtual network (classic) page, select Properties.

6. Select the copy button for RESOURCE ID to copy the resource ID for the classic network to the
clipboard. Save the ID from the clipboard in OneNote or a file.
7. On the left menu, select Subnets . Ensure that the number of available addresses is greater than the
nodes in your Azure-SSIS IR.
8. Join MicrosoftAzureBatch to the Classic Virtual Machine Contributor role for the virtual network.
a. On the left menu, select Access control (IAM) , and select the Role assignments tab.

b. Select Add role assignment .


c. On the Add role assignment page, for Role, select Classic Virtual Machine Contributor. In
the Select box, paste ddbf3205-c6bd-46ae-8127-60eb93363864, and then select Microsoft
Azure Batch from the list of search results.

d. Select Save to save the settings and close the page.


e. Confirm that you see Microsoft Azure Batch in the list of contributors.

9. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory portal,
the Azure Batch provider is automatically registered for you.)
a. In the Azure portal, on the left menu, select Subscriptions .
b. Select your subscription.
c. On the left, select Resource providers , and confirm that Microsoft.Batch is a registered
provider.
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription. You can delete it later.
Join the Azure-SSIS IR to a virtual network
After you've configured your Azure Resource Manager virtual network or classic virtual network, you can join
the Azure-SSIS IR to the virtual network:
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. In the Azure portal, on the left menu, select Data factories. If you don't see Data factories on the
menu, select More services, and then in the INTELLIGENCE + ANALYTICS section, select Data
factories.

3. Select your data factory with the Azure-SSIS IR in the list. You see the home page for your data factory.
Select the Author & Monitor tile. You see the Data Factory UI on a separate tab.
4. In the Data Factory UI, switch to the Edit tab, select Connections , and switch to the Integration
Runtimes tab.

5. If your Azure-SSIS IR is running, in the Integration Runtimes list, in the Actions column, select the
Stop button for your Azure-SSIS IR. You can't edit your Azure-SSIS IR until you stop it.

6. In the Integration Runtimes list, in the Actions column, select the Edit button for your Azure-SSIS IR.

7. On the integration runtime setup panel, advance through the General Settings and SQL Settings
sections by selecting the Next button.
8. On the Advanced Settings section:
a. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to
create certain network resources, and optionally bring your own static public IP
addresses check box.
b. For Subscription , select the Azure subscription that has your virtual network.
c. For Location, the same location as your integration runtime is selected.
d. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
e. For VNet Name , select the name of your virtual network. It should be the same one used for SQL
Database with virtual network service endpoints or SQL Managed Instance with private endpoint
to host SSISDB. Or it should be the same one connected to your on-premises network. Otherwise,
it can be any virtual network to bring your own static public IP addresses for Azure-SSIS IR.
f. For Subnet Name , select the name of subnet for your virtual network. It should be the same one
used for SQL Database with virtual network service endpoints to host SSISDB. Or it should be a
different subnet from the one used for SQL Managed Instance with private endpoint to host
SSISDB. Otherwise, it can be any subnet to bring your own static public IP addresses for Azure-
SSIS IR.
g. Select the Bring static public IP addresses for your Azure-SSIS Integration Runtime
check box to choose whether you want to bring your own static public IP addresses for Azure-SSIS
IR, so you can allow them on the firewall for your data sources.
If you select the check box, complete the following steps.
a. For First static public IP address , select the first static public IP address that meets the
requirements for your Azure-SSIS IR. If you don't have any, click Create new link to create
static public IP addresses on Azure portal and then click the refresh button here, so you can
select them.
b. For Second static public IP address , select the second static public IP address that meets
the requirements for your Azure-SSIS IR. If you don't have any, click Create new link to
create static public IP addresses on Azure portal and then click the refresh button here, so
you can select them.
h. Select VNet Validation. If the validation is successful, select Continue.
9. On the Summary section, review all settings for your Azure-SSIS IR. Then select Update.
10. Start your Azure-SSIS IR by selecting the Start button in the Actions column for your Azure-SSIS IR. It
takes about 20 to 30 minutes to start an Azure-SSIS IR that joins a virtual network.
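While the IR is starting, you can optionally poll its status from PowerShell instead of watching the portal; a minimal hedged sketch with placeholder names:

# Check the status of the Azure-SSIS IR after selecting Start; the state progresses from Starting to Started.
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "[your resource group]" `
    -DataFactoryName "[your data factory name]" `
    -Name "[your Azure-SSIS IR name]" `
    -Status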

Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Define the variables


$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
$AzureSSISName = "[your Azure-SSIS IR name]"
# Virtual network info: Classic or Azure Resource Manager
# REQUIRED if you use SQL Database with IP firewall rules/virtual network service endpoints or SQL Managed Instance with private endpoint to host SSISDB, or if you require access to on-premises data without configuring a self-hosted IR. We recommend an Azure Resource Manager virtual network, because classic virtual networks will be deprecated soon.
$VnetId = "[your virtual network resource ID or leave it empty]"
# WARNING: Use the same subnet as the one used for SQL Database with virtual network service endpoints, or a different subnet from the one used for SQL Managed Instance with a private endpoint
$SubnetName = "[your subnet name or leave it empty]"
# Public IP address info: OPTIONAL to provide two standard static public IP addresses with DNS name under the same subscription and in the same region as your virtual network
$FirstPublicIP = "[your first public IP address resource ID or leave it empty]"
$SecondPublicIP = "[your second public IP address resource ID or leave it empty]"

Configure a virtual network


Before you can join your Azure-SSIS IR to a virtual network, you need to configure the virtual network. To
automatically configure virtual network permissions and settings for your Azure-SSIS IR to join the virtual
network, add the following script:

# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
    # Register to the Azure Batch resource provider
    $BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
    $BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
    Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
    while(!(Get-AzResourceProvider -ProviderNamespace "Microsoft.Batch").RegistrationState.Contains("Registered"))
    {
        Start-Sleep -s 10
    }
    if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
    {
        # Assign the VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
    }
}

Create an Azure-SSIS IR and join it to a virtual network


You can create an Azure-SSIS IR and join it to a virtual network at the same time. For the complete script and
instructions, see Create an Azure-SSIS IR.
Join an existing Azure-SSIS IR to a virtual network
The Create an Azure-SSIS IR article shows you how to create an Azure-SSIS IR and join it to a virtual network in
the same script. If you already have an Azure-SSIS IR, follow these steps to join it to the virtual network:
1. Stop the Azure-SSIS IR.
2. Configure the Azure-SSIS IR to join the virtual network.
3. Start the Azure-SSIS IR.
Stop the Azure-SSIS IR
You have to stop the Azure-SSIS IR before you can join it to a virtual network. This command releases all of its
nodes and stops billing:
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

Configure virtual network settings for the Azure-SSIS IR to join


To configure settings for the virtual network that the Azure-SSIS IR will join, use this script:

# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
    # Register to the Azure Batch resource provider
    $BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
    $BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
    Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
    while(!(Get-AzResourceProvider -ProviderNamespace "Microsoft.Batch").RegistrationState.Contains("Registered"))
    {
        Start-Sleep -s 10
    }
    if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
    {
        # Assign VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
    }
}

Configure the Azure-SSIS IR


To join your Azure-SSIS IR to a virtual network, run the Set-AzDataFactoryV2IntegrationRuntime command:

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -VnetId $VnetId `
    -Subnet $SubnetName

# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
    $publicIPs = @($FirstPublicIP, $SecondPublicIP)
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -PublicIPs $publicIPs
}

Start the Azure-SSIS IR


To start the Azure-SSIS IR, run the following command:

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

This command takes 20 to 30 minutes to finish.

Next steps
For more information about Azure-SSIS IR, see the following articles:
Azure-SSIS IR. This article provides general conceptual information about IRs, including Azure-SSIS IR.
Tutorial: Deploy SSIS packages to Azure. This tutorial provides step-by-step instructions to create your Azure-
SSIS IR. It uses Azure SQL Database to host the SSIS catalog.
Create an Azure-SSIS IR. This article expands on the tutorial. It provides instructions about using Azure SQL
Database with virtual network service endpoints or SQL Managed Instance in a virtual network to host the
SSIS catalog. It shows how to join your Azure-SSIS IR to a virtual network.
Monitor an Azure-SSIS IR. This article shows you how to get information about your Azure-SSIS IR. It
provides status descriptions for the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or delete your Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding nodes.
Configure a self-hosted IR as a proxy for an Azure-
SSIS IR in Azure Data Factory
7/21/2021 • 11 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to run SQL Server Integration Services (SSIS) packages on an Azure-SSIS Integration
Runtime (Azure-SSIS IR) in Azure Data Factory (ADF) with a self-hosted integration runtime (self-hosted IR)
configured as a proxy.
With this feature, you can access data and run tasks on premises without having to join your Azure-SSIS IR to a
virtual network. The feature is useful when your corporate network has a configuration too complex or a policy
too restrictive for you to inject your Azure-SSIS IR into it.
This feature can only be enabled on SSIS Data Flow Task and Execute SQL/Process Tasks for now.
Enabled on Data Flow Task, this feature will break it down into two staging tasks whenever applicable:
On-premises staging task : This task runs your data flow component that connects to an on-premises data
store on your self-hosted IR. It moves data from the on-premises data store into a staging area in your Azure
Blob Storage or vice versa.
Cloud staging task : This task runs your data flow component that doesn't connect to an on-premises data
store on your Azure-SSIS IR. It moves data from the staging area in your Azure Blob Storage to a cloud data
store or vice versa.
If your Data Flow Task moves data from on premises to cloud, then the first and second staging tasks will be on-
premises and cloud staging tasks, respectively. If your Data Flow Task moves data from cloud to on premises,
then the first and second staging tasks will be cloud and on-premises staging tasks, respectively. If your Data
Flow Task moves data from on premises to on premises, then the first and second staging tasks will be both on-
premises staging tasks. If your Data Flow Task moves data from cloud to cloud, then this feature isn't applicable.
Enabled on Execute SQL/Process Tasks, this feature will run them on your self-hosted IR.
Other benefits and capabilities of this feature allow you to, for example, set up your self-hosted IR in regions that
are not yet supported by an Azure-SSIS IR, and allow the public static IP address of your self-hosted IR on the
firewall of your data sources.

Prepare the self-hosted IR


To use this feature, you first create a data factory and set up an Azure-SSIS IR in it. If you have not already done
so, follow the instructions in Set up an Azure-SSIS IR.
You then set up your self-hosted IR in the same data factory where your Azure-SSIS IR is set up. To do so, see
Create a self-hosted IR.
Finally, you download and install the latest version of self-hosted IR, as well as the additional drivers and
runtime, on your on-premises machine or Azure virtual machine (VM), as follows:
Download and install the latest version of self-hosted IR.
If you use Object Linking and Embedding Database (OLEDB), Open Database Connectivity (ODBC), or
ADO.NET connectors in your packages, download and install the relevant drivers on the same machine
where your self-hosted IR is installed, if you haven't done so already.
If you use the earlier version of the OLEDB driver for SQL Server (SQL Server Native Client [SQLNCLI]),
download the 64-bit version.
If you use the latest version of OLEDB driver for SQL Server (MSOLEDBSQL), download the 64-bit
version.
If you use OLEDB/ODBC/ADO.NET drivers for other database systems, such as PostgreSQL, MySQL,
Oracle, and so on, you can download the 64-bit versions from their websites.
If you use data flow components from Azure Feature Pack in your packages, download and install Azure
Feature Pack for SQL Server 2017 on the same machine where your self-hosted IR is installed, if you
haven't done so already.
If you haven't done so already, download and install the 64-bit version of Visual C++ (VC) runtime on the
same machine where your self-hosted IR is installed.
Enable Windows authentication for on-premises tasks
If on-premises staging tasks and Execute SQL/Process Tasks on your self-hosted IR require Windows
authentication, you must also configure Windows authentication feature on your Azure-SSIS IR.
Your on-premises staging tasks and Execute SQL/Process Tasks will be invoked with the self-hosted IR service
account (NT SERVICE\DIAHostService, by default), and your data stores will be accessed with the Windows
authentication account. Both accounts require certain security policies to be assigned to them. On the self-
hosted IR machine, go to Local Security Policy > Local Policies > User Rights Assignment , and then do
the following:
1. Assign the Adjust memory quotas for a process and Replace a process level token policies to the self-
hosted IR service account. This should occur automatically when you install your self-hosted IR with the
default service account. If it doesn't, assign those policies manually. If you use a different service account,
assign the same policies to it.
2. Assign the Log on as a service policy to the Windows Authentication account.

Prepare the Azure Blob Storage linked service for staging


If you haven't already done so, create an Azure Blob Storage linked service in the same data factory where your
Azure-SSIS IR is set up. To do so, see Create an Azure Data Factory linked service. Be sure to do the following:
For Data Store , select Azure Blob Storage .
For Connect via integration runtime , select AutoResolveIntegrationRuntime (not your self-hosted IR),
so we can ignore it and use your Azure-SSIS IR instead to fetch access credentials for your Azure Blob
Storage.
For Authentication method , select Account key , SAS URI , Ser vice Principal , Managed Identity , or
User-Assigned Managed Identity .

TIP
If you select the Ser vice Principal method, grant your service principal at least a Storage Blob Data Contributor role.
For more information, see Azure Blob Storage connector. If you select the Managed Identity /User-Assigned
Managed Identity method, grant the specified system/user-assigned managed identity for your ADF a proper role to
access Azure Blob Storage. For more information, see Access Azure Blob Storage using Azure Active Directory (Azure AD)
authentication with the specified system/user-assigned managed identity for your ADF.
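As a hedged illustration of the role grant mentioned in the tip above, you might assign the Storage Blob Data Contributor role to the specified system/user-assigned managed identity for your ADF on the staging storage account; the object ID and scope below are placeholders:

# Grant the Data Factory managed identity access to the staging Azure Blob Storage account.
New-AzRoleAssignment -ObjectId "[object ID of the system/user-assigned managed identity for your ADF]" `
    -RoleDefinitionName "Storage Blob Data Contributor" `
    -Scope "/subscriptions/[subscription ID]/resourceGroups/[resource group]/providers/Microsoft.Storage/storageAccounts/[storage account name]"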
Configure an Azure-SSIS IR with your self-hosted IR as a proxy
Having prepared your self-hosted IR and Azure Blob Storage linked service for staging, you can now configure
your new or existing Azure-SSIS IR with the self-hosted IR as a proxy in your data factory portal or app. Before
you do so, though, if your existing Azure-SSIS IR is already running, you can stop, edit, and then restart it.
1. In the Integration runtime setup pane, skip past the General settings and Deployment settings
pages by selecting the Continue button.
2. On the Advanced settings page, do the following:
a. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS
Integration Runtime check box.
b. In the Self-Hosted Integration Runtime drop-down list, select your existing self-hosted IR as a
proxy for the Azure-SSIS IR.
c. In the Staging storage linked ser vice drop-down list, select your existing Azure Blob Storage
linked service or create a new one for staging.
d. In the Staging path box, specify a blob container in your selected Azure Storage account or leave
it empty to use a default one for staging.
e. Select the Continue button.
You can also configure your new or existing Azure-SSIS IR with the self-hosted IR as a proxy by using
PowerShell.
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
$AzureSSISName = "[your Azure-SSIS IR name]"
# Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
        -DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

    if(![string]::IsNullOrEmpty($DataProxyStagingPath))
    {
        Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $DataFactoryName `
            -Name $AzureSSISName `
            -DataProxyStagingPath $DataProxyStagingPath
    }
}
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force

Enable SSIS packages to use a proxy


By using the latest SSDT as either the SSIS Projects extension for Visual Studio or a standalone installer, you can
find a new ConnectByProxy property in the connection managers for supported data flow components and
ExecuteOnProxy property in Execute SQL/Process Tasks.

Download the SSIS Projects extension for Visual Studio


Download the standalone installer
When you design new packages containing Data Flow Tasks with components that access data on premises, you
can enable the ConnectByProxy property by setting it to True in the Properties pane of the relevant connection
managers.
When you design new packages containing Execute SQL/Process Tasks that run on premises, you can enable the
ExecuteOnProxy property by setting it to True in the Properties pane of the relevant tasks themselves.
You can also enable the ConnectByProxy / ExecuteOnProxy properties when you run existing packages, without
having to manually change them one by one. There are two options:
Option A : Open, rebuild, and redeploy the project containing those packages with the latest SSDT to run
on your Azure-SSIS IR. You can then enable the ConnectByProxy property by setting it to True for the
relevant connection managers that appear on the Connection Managers tab of Execute Package pop-
up window when you're running packages from SSMS.

You can also enable the ConnectByProxy property by setting it to True for the relevant connection
managers that appear on the Connection Managers tab of Execute SSIS Package activity when you're
running packages in Data Factory pipelines.
Option B: Redeploy the project containing those packages to run on your SSIS IR. You can then enable
the ConnectByProxy / ExecuteOnProxy properties by providing their property paths,
\Package.Connections[YourConnectionManagerName].Properties[ConnectByProxy] /
\Package\YourExecuteSQLTaskName.Properties[ExecuteOnProxy] /
\Package\YourExecuteProcessTaskName.Properties[ExecuteOnProxy] , and setting them to True as property
overrides on the Advanced tab of Execute Package pop-up window when you're running packages
from SSMS.

You can also enable the ConnectByProxy / ExecuteOnProxy properties by providing their property paths,
\Package.Connections[YourConnectionManagerName].Properties[ConnectByProxy] /
\Package\YourExecuteSQLTaskName.Properties[ExecuteOnProxy] /
\Package\YourExecuteProcessTaskName.Properties[ExecuteOnProxy] , and setting them to True as property
overrides on the Property Overrides tab of the Execute SSIS Package activity when you're running
packages in Data Factory pipelines.
Debug the on-premises tasks and cloud staging tasks
On your self-hosted IR, you can find the runtime logs in the C:\ProgramData\SSISTelemetry folder and the
execution logs of on-premises staging tasks and Execute SQL/Process Tasks in the
C:\ProgramData\SSISTelemetry\ExecutionLog folder. You can find the execution logs of cloud staging tasks in
your SSISDB, specified logging file paths, or Azure Monitor depending on whether you store your packages in
SSISDB, enable Azure Monitor integration, etc. You can also find the unique IDs of on-premises staging tasks in
the execution logs of cloud staging tasks.

If you've raised customer support tickets, you can select the Send logs button on Diagnostics tab of
Microsoft Integration Runtime Configuration Manager that's installed on your self-hosted IR to send
recent operation/execution logs for us to investigate.

Billing for the on-premises tasks and cloud staging tasks


The on-premises staging tasks and Execute SQL/Process Tasks that run on your self-hosted IR are billed
separately, just as any data movement activities that run on a self-hosted IR are billed. This is specified in the
Azure Data Factory data pipeline pricing article.
The cloud staging tasks that run on your Azure-SSIS IR aren't billed separately, but your running Azure-SSIS
IR is billed as specified in the Azure-SSIS IR pricing article.

Enable custom/3rd party data flow components


To enable your custom/3rd party data flow components to access data on premises using self-hosted IR as a
proxy for Azure-SSIS IR, follow these instructions:
1. Install your custom/3rd party data flow components targeting SQL Server 2017 on Azure-SSIS IR via
standard/express custom setups.
2. Create the following DTSPath registry keys on self-hosted IR if they don’t exist already:
a. Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\140\SSIS\Setup\DTSPath set to C:\Program Files\Microsoft SQL Server\140\DTS\
b. Computer\HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\Microsoft SQL Server\140\SSIS\Setup\DTSPath set to C:\Program Files (x86)\Microsoft SQL Server\140\DTS\
3. Install your custom/3rd party data flow components targeting SQL Server 2017 on self-hosted IR under
the DTSPath above and ensure that your installation process:
a. Creates <DTSPath> , <DTSPath>/Connections , <DTSPath>/PipelineComponents , and
<DTSPath>/UpgradeMappings folders if they don't exist already.
b. Creates your own XML file for extension mappings in <DTSPath>/UpgradeMappings folder.
c. Installs all assemblies referenced by your custom/3rd party data flow component assemblies in
the global assembly cache (GAC).
Here are examples from our partners, Theobald Software and Aecorsoft, who have adapted their data flow
components to use our express custom setup and self-hosted IR as a proxy for Azure-SSIS IR.

Enforce TLS 1.2


If you need to use strong cryptography/more secure network protocol (TLS 1.2) and disable older SSL/TLS
versions at the same time on your self-hosted IR, you can download and run the main.cmd script that can be
found in the CustomSetupScript/UserScenarios/TLS 1.2 folder of our public preview blob container. Using Azure
Storage Explorer, you can connect to our public preview blob container by entering the following SAS URI:
https://ssisazurefileshare.blob.core.windows.net/publicpreview?sp=rl&st=2020-03-25T04:00:00Z&se=2025-03-25T04:00:00Z&sv=2019-02-02&sr=c&sig=WAD3DATezJjhBCO3ezrQ7TUZ8syEUxZZtGIhhP6Pt4I%3D

Current limitations
Only data flow components that are built-in/preinstalled on Azure-SSIS IR Standard Edition, except
Hadoop/HDFS/DQS components, are currently supported, see all built-in/preinstalled components on Azure-
SSIS IR.
Only custom/3rd party data flow components that are written in managed code (.NET Framework) are
currently supported - Those written in native code (C++) are currently unsupported.
Changing variable values in both on-premises and cloud staging tasks is currently unsupported.
Changing variable values of type object in on-premises staging tasks won't be reflected in other tasks.
ParameterMapping in OLEDB Source is currently unsupported. As a workaround, please use SQL Command
From Variable as the AccessMode and use Expression to insert your variables/parameters in a SQL
command. As an illustration, see the ParameterMappingSample.dtsx package that can be found in the
SelfHostedIRProxy/Limitations folder of our public preview blob container. Using Azure Storage Explorer, you
can connect to our public preview blob container by entering the above SAS URI.

Next steps
After you've configured your self-hosted IR as a proxy for your Azure-SSIS IR, you can deploy and run your
packages to access data on-premises as Execute SSIS Package activities in Data Factory pipelines. To learn how,
see Run SSIS packages as Execute SSIS Package activities in Data Factory pipelines.
Enable Azure Active Directory authentication for
Azure-SSIS integration runtime
7/21/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article shows you how to enable Azure Active Directory (Azure AD) authentication with the specified
system/user-assigned managed identity for your Azure Data Factory (ADF) and use it instead of conventional
authentication methods (like SQL authentication) to:
Create an Azure-SSIS integration runtime (IR) that will in turn provision SSIS catalog database (SSISDB)
in Azure SQL Database server/Managed Instance on your behalf.
Connect to various Azure resources when running SSIS packages on Azure-SSIS IR.
For more info about the managed identity for your ADF, see Managed identity for Data Factory.

NOTE
In this scenario, Azure AD authentication with the specified system/user-assigned managed identity for your ADF
is only used in the creation and subsequent starting operations of your Azure-SSIS IR that will in turn provision
and connect to SSISDB. For SSIS package executions, your Azure-SSIS IR will still connect to SSISDB using SQL
authentication with fully managed accounts that are created during SSISDB provisioning.
If you have already created your Azure-SSIS IR using SQL authentication, you can not reconfigure it to use Azure
AD authentication via PowerShell at this time, but you can do so via Azure portal/ADF app.

NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Enable Azure AD authentication on Azure SQL Database


Azure SQL Database supports creating a database with an Azure AD user. First, you need to create an Azure AD
group with the specified system/user-assigned managed identity for your ADF as a member. Next, you need to
set an Azure AD user as the Active Directory admin for your Azure SQL Database server and then connect to it
on SQL Server Management Studio (SSMS) using that user. Finally, you need to create a contained user
representing the Azure AD group, so the specified system/user-assigned managed identity for your ADF can be
used by Azure-SSIS IR to create SSISDB on your behalf.
Create an Azure AD group with the specified system/user-assigned managed identity for your ADF as a
member
You can use an existing Azure AD group or create a new one using Azure AD PowerShell.
1. Install the Azure AD PowerShell module.
2. Sign in using Connect-AzureAD , run the following cmdlet to create a group, and save it in a variable:
$Group = New-AzureADGroup -DisplayName "SSISIrGroup" `
-MailEnabled $false `
-SecurityEnabled $true `
-MailNickName "NotSet"

The result looks like the following example, which also displays the variable value:

$Group

ObjectId DisplayName Description


-------- ----------- -----------
6de75f3c-8b2f-4bf4-b9f8-78cc60a18050 SSISIrGroup

3. Add the specified system/user-assigned managed identity for your ADF to the group. You can follow the
Managed identity for Data Factory article to get the Object ID of specified system/user-assigned managed
identity for your ADF (e.g. 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc, but do not use the Application ID
for this purpose).

Add-AzureAdGroupMember -ObjectId $Group.ObjectId -RefObjectId 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc

You can also check the group membership afterwards.

Get-AzureAdGroupMember -ObjectId $Group.ObjectId

Configure Azure AD authentication for Azure SQL Database


You can Configure and manage Azure AD authentication for Azure SQL Database using the following steps:
1. In the Azure portal, select All services -> SQL servers from the left-hand navigation.
2. Select your Azure SQL Database server to be configured with Azure AD authentication.
3. In the Settings section of the blade, select Active Directory admin.
4. In the command bar, select Set admin.
5. Select an Azure AD user account to be made administrator of the server, and then select Select.
6. In the command bar, select Save.
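If you'd rather set the Active Directory admin from PowerShell than the portal, a minimal hedged sketch with placeholder names:

# Set an Azure AD user or group as the Active Directory admin of your Azure SQL Database server.
Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "[your resource group]" `
    -ServerName "[your Azure SQL Database server name]" `
    -DisplayName "[display name of the Azure AD admin account or group]"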
Create a contained user in Azure SQL Database representing the Azure AD group
For this next step, you need SSMS.
1. Start SSMS.
2. In the Connect to Server dialog, enter your server name in the Server name field.
3. In the Authentication field, select Active Directory - Universal with MFA support (you can also use
the other two Active Directory authentication types, see Configure and manage Azure AD authentication
for Azure SQL Database).
4. In the User name field, enter the name of the Azure AD account that you set as the server administrator, e.g.
[email protected].
5. Select Connect and complete the sign-in process.
6. In the Object Explorer, expand the Databases -> System Databases folder.
7. Right-click on the master database and select New query.
8. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [SSISIrGroup] FROM EXTERNAL PROVIDER

The command should complete successfully, creating a contained user to represent the group.
9. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.

ALTER ROLE dbmanager ADD MEMBER [SSISIrGroup]

The command should complete successfully, granting the contained user the ability to create a database
(SSISDB).
10. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, first make sure that the steps to grant permissions to
the master database have finished successfully. Then, right-click on the SSISDB database and select
New query.
11. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [SSISIrGroup] FROM EXTERNAL PROVIDER

The command should complete successfully, creating a contained user to represent the group.
12. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.

ALTER ROLE db_owner ADD MEMBER [SSISIrGroup]

The command should complete successfully, granting the contained user the ability to access SSISDB.

Enable Azure AD authentication on Azure SQL Managed Instance


Azure SQL Managed Instance supports creating a database with the specified system/user-assigned managed
identity for your ADF directly. You need not join the specified system/user-assigned managed identity for your
ADF to an Azure AD group nor create a contained user representing that group in Azure SQL Managed Instance.
Configure Azure AD authentication for Azure SQL Managed Instance
Follow the steps in Provision an Azure AD administrator for Azure SQL Managed Instance.
Add the specified system/user-assigned managed identity for your ADF as a user in Azure SQL Managed
Instance
For this next step, you need SSMS.
1. Start SSMS.
2. Connect to Azure SQL Managed Instance using SQL Server account that is a sysadmin . This is a
temporary limitation that will be removed once the support for Azure AD server principals (logins) on
Azure SQL Managed Instance becomes generally available. You will see the following error if you try to
use an Azure AD admin account to create the login: Msg 15247, Level 16, State 1, Line 1 User does not
have permission to perform this action.
3. In the Object Explorer, expand the Databases -> System Databases folder.
4. Right-click on the master database and select New query.
5. In the query window, execute the following T-SQL script to add the specified system/user-assigned
managed identity for your ADF as a user.

CREATE LOGIN [{your managed identity name}] FROM EXTERNAL PROVIDER


ALTER SERVER ROLE [dbcreator] ADD MEMBER [{your managed identity name}]
ALTER SERVER ROLE [securityadmin] ADD MEMBER [{your managed identity name}]

If you use the system managed identity for your ADF, then your managed identity name should be your
ADF name. If you use a user-assigned managed identity for your ADF, then your managed identity name
should be the specified user-assigned managed identity name.
The command should complete successfully, granting the system/user-assigned managed identity for
your ADF the ability to create a database (SSISDB).
6. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, first make sure that the steps to grant permissions to
the master database have finished successfully. Then, right-click on the SSISDB database and select
New query.
7. In the query window, enter the following T-SQL command, and select Execute on the toolbar.

CREATE USER [{your managed identity name}] FOR LOGIN [{your managed identity name}] WITH
DEFAULT_SCHEMA = dbo
ALTER ROLE db_owner ADD MEMBER [{your managed identity name}]

The command should complete successfully, granting the system/user-assigned managed identity for
your ADF the ability to access SSISDB.

Provision Azure-SSIS IR in Azure portal/ADF app


When you provision your Azure-SSIS IR in Azure portal/ADF app, on the Deployment settings page, select the
Create SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed Instance to store
your projects/packages/environments/execution logs check box and select either the Use AAD
authentication with the system managed identity for Data Factory or Use AAD authentication with
a user-assigned managed identity for Data Factory check box to choose the Azure AD authentication method
for Azure-SSIS IR to access your database server that hosts SSISDB.
For more information, see Create an Azure-SSIS IR in ADF.

Provision Azure-SSIS IR with PowerShell


To provision your Azure-SSIS IR with PowerShell, do the following things:
1. Install Azure PowerShell module.
2. In your script, do not set CatalogAdminCredential parameter. For example:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode `
    -CatalogServerEndpoint $SSISDBServerEndpoint `
    -CatalogPricingTier $SSISDBPricingTier

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName

Run SSIS packages using Azure AD authentication with the specified system/user-assigned managed identity for your ADF
When you run SSIS packages on Azure-SSIS IR, you can use Azure AD authentication with the specified
system/user-assigned managed identity for your ADF to connect to various Azure resources. Currently we
support Azure AD authentication with the specified system/user-assigned managed identity for your ADF on the
following connection managers.
OLEDB Connection Manager
ADO.NET Connection Manager
Azure Storage Connection Manager
Access data stores and file shares with Windows
authentication from SSIS packages in Azure
3/22/2021 • 7 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can use Windows authentication to access data stores, such as SQL Servers, file shares, Azure Files, etc. from
SSIS packages running on your Azure-SSIS Integration Runtime (IR) in Azure Data Factory (ADF). Your data
stores can be on premises, hosted on Azure Virtual Machines (VMs), or running in Azure as managed services. If
they are on premises, you need to join your Azure-SSIS IR to a virtual network that is connected to your
on-premises network. For more information, see Join Azure-SSIS IR to a Microsoft Azure Virtual Network.
There are four methods to access data stores with Windows authentication from SSIS packages running on your
Azure-SSIS IR:

The table below summarizes each method by its effective scope, setup step, how packages access connected
resources, how many credential sets it supports, and the types of connected resources it supports.

Method: Setting up an activity-level execution context
Effective scope: Per Execute SSIS Package activity.
Setup step: Configure the Windows authentication property to set up an "Execution/Run as" context when running SSIS packages as Execute SSIS Package activities in ADF pipelines. For more info, see Configure Execute SSIS Package activity.
Method in packages: Access resources directly in packages, for example, use a UNC path to access file shares or Azure Files: \\YourFileShareServerName\YourFolderName or \\YourAzureStorageAccountName.file.core.windows.net\YourFolderName
Number of credential sets: Supports only one credential set for all connected resources.
Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share); SQL Servers on premises/Azure VMs with Windows authentication; other resources with Windows authentication.

Method: Setting up a catalog-level execution context
Effective scope: Per Azure-SSIS IR, but is overridden when setting up an activity-level execution context (see above).
Setup step: Execute the SSISDB catalog.set_execution_credential stored procedure to set up an "Execution/Run as" context. For more info, see the rest of this article below.
Method in packages: Access resources directly in packages, for example, use a UNC path to access file shares or Azure Files: \\YourFileShareServerName\YourFolderName or \\YourAzureStorageAccountName.file.core.windows.net\YourFolderName
Number of credential sets: Supports only one credential set for all connected resources.
Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share); SQL Servers on premises/Azure VMs with Windows authentication; other resources with Windows authentication.

Method: Persisting credentials via cmdkey command
Effective scope: Per Azure-SSIS IR, but is overridden when setting up an activity/catalog-level execution context (see above).
Setup step: Execute the cmdkey command in a custom setup script (main.cmd) when provisioning your Azure-SSIS IR. For example, if you use file shares, Azure Files, or SQL Server:
cmdkey /add:YourFileShareServerName /user:YourDomainName\YourUsername /pass:YourPassword
cmdkey /add:YourAzureStorageAccountName.file.core.windows.net /user:azure\YourAzureStorageAccountName /pass:YourAccessKey
cmdkey /add:YourSQLServerFullyQualifiedDomainNameOrIPAddress:YourSQLServerPort /user:YourDomainName\YourUsername /pass:YourPassword
For more info, see Customize setup for Azure-SSIS IR.
Method in packages: Access resources directly in packages, for example, use a UNC path to access file shares or Azure Files: \\YourFileShareServerName\YourFolderName or \\YourAzureStorageAccountName.file.core.windows.net\YourFolderName
Number of credential sets: Supports multiple credential sets for different connected resources.
Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share); SQL Servers on premises/Azure VMs with Windows authentication; other resources with Windows authentication.

Method: Mounting drives at package execution time (non-persistent)
Effective scope: Per package.
Setup step: Execute the net use command in an Execute Process Task that is added at the beginning of control flow in your packages, for example, net use D: \\YourFileShareServerName\YourFolderName
Method in packages: Access file shares via mapped drives.
Number of credential sets: Supports multiple drives for different file shares.
Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share).

WARNING
If you don't use any of the above methods to access data stores with Windows authentication, packages that
depend on Windows authentication can't access those data stores and will fail at run time.

The rest of this article describes how to configure SSIS catalog (SSISDB) hosted in SQL Database/SQL Managed
Instance to run packages on Azure-SSIS IR that use Windows authentication to access data stores.

You can only use one set of credentials


When you use Windows authentication in an SSIS package, you can only use one set of credentials. The domain
credentials that you provide when you follow the steps in this article apply to all package executions - interactive
or scheduled - on your Azure-SSIS IR until you change or remove them. If your package has to connect to
multiple data stores with different sets of credentials, you should consider the above alternative methods.

Provide domain credentials for Windows authentication


To provide domain credentials that let packages use Windows authentication to access data stores on premises,
do the following things:
1. With SQL Server Management Studio (SSMS) or another tool, connect to SQL Database/SQL Managed
Instance that hosts SSISDB. For more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following stored procedure and provide the appropriate domain credentials:

catalog.set_execution_credential @user='<your user name>', @domain='<your domain name>', @password='<your password>'

4. Run your SSIS packages. The packages use the credentials that you provided to access data stores on
premises with Windows authentication.
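
If you prefer scripting over SSMS, a minimal sketch like the following runs the same stored procedure from PowerShell. It assumes the SqlServer module is installed; the server name and SQL admin login are placeholders:

Import-Module SqlServer

# Set the SSISDB execution credential without opening a query window in SSMS.
Invoke-Sqlcmd -ServerInstance "yourserver.database.windows.net" `
    -Database "SSISDB" `
    -Username "yourSqlAdminUser" `
    -Password "yourSqlAdminPassword" `
    -Query "EXEC catalog.set_execution_credential @user = N'<your user name>', @domain = N'<your domain name>', @password = N'<your password>'"
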
View domain credentials
To view the active domain credentials, do the following things:
1. With SSMS or another tool, connect to SQL Database/SQL Managed Instance that hosts SSISDB. For
more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following query and check the output:

SELECT *
FROM catalog.master_properties
WHERE property_name = 'EXECUTION_DOMAIN' OR property_name = 'EXECUTION_USER'

Clear domain credentials


To clear and remove the credentials that you provided as described in this article, do the following things:
1. With SSMS or another tool, connect to SQL Database/SQL Managed Instance that hosts SSISDB. For
more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following stored procedure:

catalog.set_execution_credential @user='', @domain='', @password=''

Connect to a SQL Server on premises


To check whether you can connect to a SQL Server on premises, do the following things:
1. To run this test, find a non-domain-joined computer.
2. On the non-domain-joined computer, run the following command to start SSMS with the domain
credentials that you want to use:
runas.exe /netonly /user:<domain>\<username> SSMS.exe

3. From SSMS, check whether you can connect to the SQL Server on premises.
Prerequisites
To access a SQL Server on premises from packages running in Azure, do the following things:
1. In SQL Server Configuration Manager, enable TCP/IP protocol.
2. Allow access through Windows firewall. For more info, see Configure Windows firewall to access SQL
Server.
3. Join your Azure-SSIS IR to a Microsoft Azure Virtual Network that is connected to the SQL Server on
premises. For more info, see Join Azure-SSIS IR to a Microsoft Azure Virtual Network.
4. Use SSISDB catalog.set_execution_credential stored procedure to provide credentials as described in
this article.
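
Before testing Windows authentication itself, you can verify basic network reachability from a machine on the same virtual network with a quick sketch like the following (the server name and port are placeholders); this checks only the TCP endpoint, not the credentials:

# Returns TcpTestSucceeded = True when the SQL Server endpoint is reachable.
Test-NetConnection -ComputerName "yoursqlserver.contoso.local" -Port 1433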

Connect to a file share on premises


To check whether you can connect to a file share on premises, do the following things:
1. To run this test, find a non-domain-joined computer.
2. On the non-domain-joined computer, run the following commands. These commands open a command
prompt window with the domain credentials that you want to use and then test connectivity to the file
share on premises by getting a directory listing.

runas.exe /netonly /user:<domain>\<username> cmd.exe


dir \\fileshare

3. Check whether the directory listing is returned for the file share on premises.
Prerequisites
To access a file share on premises from packages running in Azure, do the following things:
1. Allow access through Windows firewall.
2. Join your Azure-SSIS IR to a Microsoft Azure Virtual Network that is connected to the file share on
premises. For more info, see Join Azure-SSIS IR to a Microsoft Azure Virtual Network.
3. Use SSISDB catalog.set_execution_credential stored procedure to provide credentials as described in
this article.

Connect to a file share on Azure VM


To access a file share on Azure VM from packages running in Azure, do the following things:
1. With SSMS or another tool, connect to SQL Database/SQL Managed Instance that hosts SSISDB. For
more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following stored procedure and provide the appropriate domain credentials:

catalog.set_execution_credential @domain = N'.', @user = N'username of local account on Azure virtual machine', @password = N'password'

Connect to a file share in Azure Files
For more info about Azure Files, see Azure Files.
To access a file share in Azure Files from packages running in Azure, do the following things:
1. With SSMS or another tool, connect to SQL Database/SQL Managed Instance that hosts SSISDB. For
more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following stored procedure and provide the appropriate domain credentials:

catalog.set_execution_credential @domain = N'Azure', @user = N'<storage-account-name>', @password = N'<storage-account-key>'

Next steps
Deploy your packages. For more info, see Deploy an SSIS project to Azure with SSMS.
Run your packages. For more info, see Run SSIS packages in Azure with SSMS.
Schedule your packages. For more info, see Schedule SSIS packages in Azure.
Open and save files on premises and in Azure with
SSIS packages deployed in Azure
3/22/2021 • 2 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to open and save files on premises and in Azure when you lift and shift SSIS packages
that use local file systems into SSIS in Azure.

Save temporary files


If you need to store and process temporary files during a single package execution, packages can use the
current working directory ( . ) or temporary folder ( %TEMP% ) of your Azure-SSIS Integration Runtime nodes.
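
For example, here is a minimal sketch of a helper script that a package might launch through an Execute Process Task (the file name is hypothetical), staging its scratch data under the node's temporary folder:

# Write, use, and clean up a scratch file in the node-local %TEMP% folder.
$tempFile = Join-Path $env:TEMP "staging_output.csv"
"col1,col2" | Set-Content -Path $tempFile
# ... process the file here ...
Remove-Item $tempFile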

Use on-premises file shares


To continue to use on-premises file shares when you lift and shift packages that use local file systems into
SSIS in Azure, do the following things:
1. Transfer files from local file systems to on-premises file shares.
2. Join the on-premises file shares to an Azure virtual network.
3. Join your Azure-SSIS IR to the same virtual network. For more info, see Join an Azure-SSIS integration
runtime to a virtual network.
4. Connect your Azure-SSIS IR to the on-premises file shares inside the same virtual network by setting up
access credentials that use Windows authentication. For more info, see Connect to data and file shares
with Windows Authentication.
5. Update local file paths in your packages to UNC paths pointing to on-premises file shares. For example,
update C:\abc.txt to \\<on-prem-server-name>\<share-name>\abc.txt .

Use Azure file shares


To use Azure Files when you lift and shift packages that use local file systems into SSIS in Azure, do the
following things:
1. Transfer files from local file systems to Azure Files. For more info, see Azure Files.
2. Connect your Azure-SSIS IR to Azure Files by setting up access credentials that use Windows
authentication. For more info, see Connect to data and file shares with Windows Authentication.
3. Update local file paths in your packages to UNC paths pointing to Azure Files. For example, update
C:\abc.txt to \\<storage-account-name>.file.core.windows.net\<share-name>\abc.txt .
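
As a quick local test of the UNC path before you update your packages, a hedged sketch like the following (storage account, share, and key are placeholders) persists the Azure Files credential with cmdkey and checks that the path resolves:

# Persist the Azure Files credential for the current Windows session/user.
cmdkey /add:yourstorageaccount.file.core.windows.net /user:azure\yourstorageaccount /pass:yourStorageAccountKey

# Confirm that the UNC path your packages will use is reachable.
Test-Path "\\yourstorageaccount.file.core.windows.net\yourshare\abc.txt"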

Next steps
Deploy your packages. For more info, see Deploy an SSIS project to Azure with SSMS.
Run your packages. For more info, see Run SSIS packages in Azure with SSMS.
Schedule your packages. For more info, see Schedule SSIS packages in Azure.
Provision Enterprise Edition for the Azure-SSIS
Integration Runtime
3/5/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


The Enterprise Edition of the Azure-SSIS Integration Runtime lets you use the following advanced and premium
features:
Change Data Capture (CDC) components
Oracle, Teradata, and SAP BW connectors
SQL Server Analysis Services (SSAS) and Azure Analysis Services (AAS) connectors and transformations
Fuzzy Grouping and Fuzzy Lookup transformations
Term Extraction and Term Lookup transformations
Some of these features require you to install additional components to customize the Azure-SSIS IR. For more
info about how to install additional components, see Custom setup for the Azure-SSIS integration runtime.

Enterprise features
ENTERPRISE FEATURES DESCRIPTIONS

CDC components The CDC Source, Control Task, and Splitter Transformation
are preinstalled on the Azure-SSIS IR Enterprise Edition. To
connect to Oracle, you also need to install the CDC Designer
and Service on another computer.

Oracle connectors The Oracle Connection Manager, Source, and Destination are
preinstalled on the Azure-SSIS IR Enterprise Edition. You also
need to install the Oracle Call Interface (OCI) driver, and if
necessary configure the Oracle Transport Network Substrate
(TNS), on the Azure-SSIS IR. For more info, see Custom setup
for the Azure-SSIS integration runtime.

Teradata connectors You need to install the Teradata Connection Manager, Source, and Destination, as well as
the Teradata Parallel Transporter (TPT) API and Teradata ODBC driver, on the Azure-SSIS IR Enterprise Edition.
For more info, see Custom setup for the Azure-SSIS integration runtime.

SAP BW connectors The SAP BW Connection Manager, Source, and Destination are preinstalled on the Azure-SSIS
IR Enterprise Edition. You also need to install the SAP BW driver on the Azure-SSIS IR. These connectors
support SAP BW 7.0 or earlier versions. To connect to later versions of SAP BW or other SAP products, you can
purchase and install SAP connectors from third-party ISVs on the Azure-SSIS IR. For more info about how to
install additional components, see Custom setup for the Azure-SSIS integration runtime.

Analysis Services components The Data Mining Model Training Destination, the Dimension
Processing Destination, and the Partition Processing
Destination, as well as the Data Mining Query
Transformation, are preinstalled on the Azure-SSIS IR
Enterprise Edition. All these components support SQL Server
Analysis Services (SSAS), but only the Partition Processing
Destination supports Azure Analysis Services (AAS). To
connect to SSAS, you also need to configure Windows
Authentication credentials in SSISDB. In addition to these
components, the Analysis Services Execute DDL Task, the
Analysis Services Processing Task, and the Data Mining
Query Task are also preinstalled on the Azure-SSIS IR
Standard/Enterprise Edition.

Fuzzy Grouping and Fuzzy Lookup transformations The Fuzzy Grouping and Fuzzy Lookup transformations are
preinstalled on the Azure-SSIS IR Enterprise Edition. These
components support both SQL Server and Azure SQL
Database for storing reference data.

Term Extraction and Term Lookup transformations The Term Extraction and Term Lookup transformations are
preinstalled on the Azure-SSIS IR Enterprise Edition. These
components support both SQL Server and Azure SQL
Database for storing reference data.

Instructions
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

1. Download and install Azure PowerShell.


2. When you provision or reconfigure the Azure-SSIS IR with PowerShell, run
Set-AzDataFactoryV2IntegrationRuntime with Enterprise as the value for the Edition parameter before
you start the Azure-SSIS IR. Here is a sample script:

$MyAzureSsisIrEdition = "Enterprise"

Set-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName `
    -Edition $MyAzureSsisIrEdition

Start-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -Name $MyAzureSsisIrName `
    -ResourceGroupName $MyResourceGroupName
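
As an optional follow-up sketch, you can confirm the edition once the IR is running; this assumes the same variables as the sample above:

# Inspect the IR's detailed status, which includes the configured edition.
Get-AzDataFactoryV2IntegrationRuntime -DataFactoryName $MyDataFactoryName `
    -ResourceGroupName $MyResourceGroupName `
    -Name $MyAzureSsisIrName `
    -Status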

Next steps
Custom setup for the Azure-SSIS integration runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Built-in and preinstalled components on Azure-SSIS
Integration Runtime
3/5/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article lists all built-in and preinstalled components, such as clients, drivers, providers, connection
managers, data sources/destinations/transformations, and tasks on SSIS Integration Runtime (IR) in Azure Data
Factory (ADF). To provision SSIS IR in ADF, follow the instructions in Provision Azure-SSIS IR.

Built-in and preinstalled clients, drivers, and providers on Azure-SSIS IR

TYPE NAME - VERSION - PLATFORM

Built-in clients/drivers/providers Access Database Engine 2016 Redistributable - RTM - X64

Microsoft Analysis Management Objects - 15.0.1000.81 - X64

Microsoft Analysis Services OLEDB Provider - 15.0.1000.81 - X64

Microsoft SQL Server 2012 Native Client - 11.4.7462.6 - X64

Microsoft ODBC Driver 13 for SQL Server - 14.0.900.902 - X64

Microsoft OLEDB Driver 18 for SQL Server - 18.1.0.0 - X64

Microsoft OLEDB Provider for DB2 - 6.0 - X64

SharePoint Online Client Components SDK - 15.4711.1001 - X64

Built-in and preinstalled connection managers on Azure-SSIS IR


TYPE NAME

Built-in connection managers ADO Connection Manager

ADO.NET Connection Manager

Analysis Services Connection Manager

Excel Connection Manager

File Connection Manager

Flat File Connection Manager

FTP Connection Manager

Hadoop Connection Manager

HTTP Connection Manager

MSMQ Connection Manager

Multiple Files Connection Manager

Multiple Flat Files Connection Manager

ODBC Connection Manager

OLEDB Connection Manager

SAP BW Connection Manager (Enterprise Edition)

SMO Connection Manager

SMTP Connection Manager

SQL Server Compact Edition Connection Manager

WMI Connection Manager

Preinstalled connection managers (Azure Feature Pack) Azure Data Lake Analytics Connection Manager
Azure Data Lake Store Connection Manager

Azure HDInsight Connection Manager

Azure Resource Manager Connection Manager

Azure Storage Connection Manager

Azure Subscription Connection Manager

Built-in and preinstalled data sources on Azure-SSIS IR


TYPE NAME

Built-in data sources ADO.NET Source

CDC Source (Enterprise Edition)

Excel Source

Flat File Source

HDFS File Source

OData Source

ODBC Source

OLEDB Source

Raw File Source

SAP BW Source (Enterprise Edition)

XML Source

Preinstalled data sources (Azure Feature Pack + Power Query Source) Azure Blob Source

Azure Data Lake Store Source

Flexible File Source

Power Query Source

Built-in and preinstalled data destinations on Azure-SSIS IR


TYPE NAME

Built-in data destinations ADO.NET Destination

Data Mining Model Training Destination (Enterprise Edition)

DataReader Destination

Data Streaming Destination

Dimension Processing Destination (Enterprise Edition)

Excel Destination

Flat File Destination

HDFS File Destination

ODBC Destination

OLEDB Destination

Partition Processing Destination (Enterprise Edition)

Raw File Destination

Recordset Destination

SAP BW Destination (Enterprise Edition)

SQL Server Compact Edition Destination

SQL Server Destination

Preinstalled data destinations ( Azure Feature Pack ) Azure Blob Destination

Azure Data Lake Store Destination

Flexible File Destination

Built-in and preinstalled data transformations on Azure-SSIS IR


TYPE NAME

Built-in auditing transformations Audit Transformation

Row Count Transformation

Built-in BI transformations Data Mining Query Transformation (Enterprise Edition)

DQS Cleansing Transformation

Fuzzy Grouping Transformation (Enterprise Edition)

Fuzzy Lookup Transformation (Enterprise Edition)

Term Extraction Transformation (Enterprise Edition)

Term Lookup Transformation (Enterprise Edition)



Built-in row transformations Character Map Transformation

Copy Column Transformation

Data Conversion Transformation

Derived Column Transformation

Export Column Transformation

Import Column Transformation

OLE DB Command Transformation

Script Component

Built-in rowset transformations Aggregate Transformation

Percentage Sampling Transformation

Pivot Transformation

Row Sampling Transformation

Sort Transformation

Unpivot Transformation

Built-in split and join transformations Balanced Data Distributor Transformation

Cache Transform

CDC Splitter (Enterprise Edition)

Conditional Split Transformation

Lookup Transformation

Merge Join Transformation

Merge Transformation

Multicast Transformation

Slowly Changing Dimension Transformation

Union All Transformation

Built-in and preinstalled tasks on Azure-SSIS IR


TYPE NAME

Built-in Analysis Services tasks Analysis Services Execute DDL Task

Analysis Services Processing Task

Data Mining Query Task



Built-in data flow tasks Data Flow Task

Built-in data preparation tasks CDC Control Task (Enterprise Edition)

Check Database Integrity Task

Data Profiling Task

File System Task

FTP Task

Hadoop File System Task

Hadoop Hive Task

Hadoop Pig Task

Web Service Task

XML Task

Built-in maintenance tasks Back Up Database Task

Execute T-SQL Statement Task

History Cleanup Task

Maintenance Cleanup Task

Notify Operator Task

Rebuild Index Task

Reorganize Index Task

Select Objects to Transfer

Shrink Database Task

Transfer Database Task

Transfer Error Messages Task

Transfer Jobs Task

Transfer Logins Task

Transfer Master Stored Procedures Task

Transfer SQL Server Objects Task

Update Statistics Task

Built-in scripting tasks Script Task

Built-in SQL Server tasks Bulk Insert Task

Execute SQL Task



Built-in workflow tasks Execute Package Task

Execute Process Task

Execute SQL Server Agent Job Task

Expression Task

Message Queue Task

Send Mail Task

WMI Data Reader Task

WMI Event Watcher Task

Preinstalled tasks ( Azure Feature Pack ) Azure Blob Download Task

Azure Blob Upload Task

Azure Data Lake Analytics Task

Azure Data Lake Store File System Task

Azure HDInsight Create Cluster Task

Azure HDInsight Delete Cluster Task

Azure HDInsight Hive Task

Azure HDInsight Pig Task

Azure SQL Azure Synapse Analytics Upload Task

Flexible File Task

Next steps
To install additional custom/Open Source/3rd party components on your SSIS IR, follow the instructions in
Customize Azure-SSIS IR.
Customize the setup for an Azure-SSIS Integration
Runtime
5/25/2021 • 19 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can customize your Azure-SQL Server Integration Services (SSIS) Integration Runtime (IR) in Azure Data
Factory (ADF) via custom setups. They allow you to add your own steps during the provisioning or
reconfiguration of your Azure-SSIS IR.
By using custom setups, you can alter the default operating configuration or environment of your Azure-SSIS IR.
For example, you can start additional Windows services, persist access credentials for file shares, or use only
strong cryptography and a more secure network protocol (TLS 1.2). Or you can install additional components, such as
assemblies, drivers, or extensions, on each node of your Azure-SSIS IR. They can be custom-made, Open Source,
or 3rd party components. For more information about built-in/preinstalled components, see Built-in/preinstalled
components on Azure-SSIS IR.
You can do custom setups on your Azure-SSIS IR in either of two ways:
Standard custom setup with a script : Prepare a script and its associated files, and upload them all
together to a blob container in your Azure Storage account. You then provide a Shared Access Signature
(SAS) Uniform Resource Identifier (URI) for your blob container when you set up or reconfigure your Azure-
SSIS IR. Each node of your Azure-SSIS IR then downloads the script and its associated files from your blob
container and runs your custom setup with elevated permissions. When your custom setup is finished, each
node uploads the standard output of execution and other logs to your blob container.
Express custom setup without a script : Run some common system configurations and Windows
commands or install some popular or recommended additional components without using any scripts.
You can install both free (unlicensed) and paid (licensed) components with standard and express custom setups.
If you're an independent software vendor (ISV), see Develop paid or licensed components for Azure-SSIS IR.

IMPORTANT
To benefit from future enhancements, we recommend using v3 or later series of nodes for your Azure-SSIS IR with custom
setup.

Current limitations
The following limitations apply only to standard custom setups:
If you want to use gacutil.exe in your script to install assemblies in the global assembly cache (GAC), you
need to provide gacutil.exe as part of your custom setup. Or you can use the copy that's provided in the
Sample folder of our Public Preview blob container, see the Standard custom setup samples section
below.
If you want to reference a subfolder in your script, msiexec.exe doesn't support the .\ notation to
reference the root folder. Use a command such as msiexec /i "MySubfolder\MyInstallerx64.msi" ...
instead of msiexec /i ".\MySubfolder\MyInstallerx64.msi" ... .
Administrative shares, or hidden network shares that are automatically created by Windows, are currently
not supported on the Azure-SSIS IR.
The IBM iSeries Access ODBC driver is not supported on the Azure-SSIS IR. You might see installation
errors during your custom setup. If you do, contact IBM support for assistance.

Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

To customize your Azure-SSIS IR, you need the following items:


An Azure subscription
Provision your Azure-SSIS IR
An Azure Storage account. Not required for express custom setups. For standard custom setups, you
upload and store your custom setup script and its associated files in a blob container. The custom setup
process also uploads its execution logs to the same blob container.

Instructions
You can provision or reconfigure your Azure-SSIS IR with custom setups on ADF UI. If you want to do the same
using PowerShell, download and install Azure PowerShell.
Standard custom setup
To provision or reconfigure your Azure-SSIS IR with standard custom setups on ADF UI, complete the following
steps.
1. Prepare your custom setup script and its associated files (for example, .bat, .cmd, .exe, .dll, .msi, or .ps1
files).
You must have a script file named main.cmd, which is the entry point of your custom setup.
To ensure that the script can be silently executed, you should test it on your local machine first.
If you want additional logs generated by other tools (for example, msiexec.exe) to be uploaded to your
blob container, specify the predefined environment variable, CUSTOM_SETUP_SCRIPT_LOG_DIR , as the log
folder in your scripts (for example, msiexec /i xxx.msi /quiet /lv
%CUSTOM_SETUP_SCRIPT_LOG_DIR%\install.log).
2. Download, install, and open Azure Storage Explorer.
a. Under Local and Attached , right-click Storage Accounts , and then select Connect to Azure
Storage .
b. Select Storage account or ser vice , select Account name and key , and then select Next .
c. Enter your Azure Storage account name and key, select Next , and then select Connect .

d. Under your connected Azure Storage account, right-click Blob Containers , select Create Blob
Container , and name the new blob container.
e. Select the new blob container, and upload your custom setup script and its associated files. Make sure
that you upload main.cmd at the top level of your blob container, not in any folder. Your blob container
should contain only the necessary custom setup files, so downloading them to your Azure-SSIS IR later
won't take a long time. The maximum duration of a custom setup is currently set at 45 minutes before it
times out. This includes the time to download all files from your blob container and install them on the
Azure-SSIS IR. If setup requires more time, raise a support ticket.

f. Right-click the blob container, and then select Get Shared Access Signature .
g. Create the SAS URI for your blob container with a sufficiently long expiration time and with
read/write/list permission. You need the SAS URI to download and run your custom setup script and its
associated files. This happens whenever any node of your Azure-SSIS IR is reimaged or restarted. You
also need write permission to upload setup execution logs.

IMPORTANT
Ensure that the SAS URI doesn't expire and the custom setup resources are always available during the whole
lifecycle of your Azure-SSIS IR, from creation to deletion, especially if you regularly stop and start your Azure-SSIS
IR during this period.

h. Copy and save the SAS URI of your blob container.


3. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box on the Advanced settings page of Integration
runtime setup pane. Next, enter the SAS URI of your blob container in the Custom setup container
SAS URI text box.
After your standard custom setup finishes and your Azure-SSIS IR starts, you can find all custom setup logs in
the main.cmd.log folder of your blob container. They include the standard output of main.cmd and other
execution logs.
Express custom setup
To provision or reconfigure your Azure-SSIS IR with express custom setups on ADF UI, complete the following
steps.
1. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box on the Advanced settings page of Integration
runtime setup pane.
2. Select New to open the Add express custom setup pane, and then select a type in the Express
custom setup type drop-down list. We currently offer express custom setups for running cmdkey
command, adding environment variables, installing Azure PowerShell, and installing licensed
components.
Running cmdkey command
If you select the Run cmdkey command type for your express custom setup, you can run the Windows
cmdkey command on your Azure-SSIS IR. To do so, enter your targeted computer name or domain name,
username or account name, and password or account key in the /Add , /User , and /Pass text boxes, respectively.
This will allow you to persist access credentials for SQL Servers, file shares, or Azure Files on your Azure-SSIS IR.
For example, to access Azure Files, you can enter YourAzureStorageAccountName.file.core.windows.net ,
azure\YourAzureStorageAccountName , and YourAzureStorageAccountKey for /Add , /User , and /Pass , respectively.
This is similar to running the Windows cmdkey command on your local machine. Only one express custom
setup to run cmdkey command is supported for now. To run multiple cmdkey commands, use a standard
custom setup instead.
Adding environment variables
If you select the Add environment variable type for your express custom setup, you can add a Windows
environment variable on your Azure-SSIS IR. To do so, enter your environment variable name and value in the
Variable name and Variable value text boxes, respectively. This will allow you to use the environment
variable in your packages that run on Azure-SSIS IR, for example in Script Components/Tasks. This is similar to
running the Windows set command on your local machine.
Installing Azure PowerShell
If you select the Install Azure PowerShell type for your express custom setup, you can install the Az module
of PowerShell on your Azure-SSIS IR. To do so, enter the Az module version number (x.y.z) you want from a list
of supported ones. This will allow you to run Azure PowerShell cmdlets/scripts in your packages to manage
Azure resources, for example Azure Analysis Services (AAS).
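
For example, here is a hedged sketch of the kind of Az script a package might launch (say, through an Execute Process Task) after this express custom setup has run; the service principal details are placeholders:

# Sign in non-interactively and list Azure Analysis Services servers in the subscription.
$securePassword = ConvertTo-SecureString "yourClientSecret" -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential("yourApplicationId", $securePassword)
Connect-AzAccount -ServicePrincipal -Credential $credential -Tenant "yourTenantId"
Get-AzResource -ResourceType "Microsoft.AnalysisServices/servers" | Select-Object Name, ResourceGroupName
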
Installing licensed components
If you select the Install licensed component type for your express custom setup, you can then select an
integrated component from our ISV partners in the Component name drop-down list:
If you select the SentryOne's Task Factory component, you can install the Task Factory suite of
components from SentryOne on your Azure-SSIS IR by entering the product license key that you
purchased from them in the License key box. The current integrated version is 2020.21.2.
If you select the oh22's HEDDA.IO component, you can install the HEDDA.IO data quality/cleansing
component from oh22 on your Azure-SSIS IR. To do so, you need to purchase their service beforehand.
The current integrated version is 1.0.14 .
If you select the oh22's SQLPhonetics.NET component, you can install the SQLPhonetics.NET data
quality/matching component from oh22 on your Azure-SSIS IR. To do so, enter the product license key
that you purchased from them beforehand in the License key text box. The current integrated version is
1.0.45 .
If you select the KingswaySoft's SSIS Integration Toolkit component, you can install the SSIS
Integration Toolkit suite of connectors for CRM/ERP/marketing/collaboration apps, such as Microsoft
Dynamics/SharePoint/Project Server, Oracle/Salesforce Marketing Cloud, etc. from KingswaySoft on your
Azure-SSIS IR by entering the product license key that you purchased from them in the License key box.
The current integrated version is 20.2 .
If you select the KingswaySoft's SSIS Productivity Pack component, you can install the SSIS
Productivity Pack suite of components from KingswaySoft on your Azure-SSIS IR by entering the product
license key that you purchased from them in the License key box. The current integrated version is 20.2 .
If you select the Theobald Software's Xtract IS component, you can install the Xtract IS suite of
connectors for SAP system (ERP, S/4HANA, BW) from Theobald Software on your Azure-SSIS IR by
dragging & dropping/uploading the product license file that you purchased from them into the License
file box. The current integrated version is 6.5.13.18 .
If you select the AecorSoft's Integration Ser vice component, you can install the Integration Service
suite of connectors for SAP and Salesforce systems from AecorSoft on your Azure-SSIS IR. To do so, enter
the product license key that you purchased from them beforehand in the License key text box. The
current integrated version is 3.0.00 .
If you select the CData's SSIS Standard Package component, you can install the SSIS Standard
Package suite of most popular components from CData, such as Microsoft SharePoint connectors, on
your Azure-SSIS IR. To do so, enter the product license key that you purchased from them beforehand in
the License key text box. The current integrated version is 19.7354 .
If you select the CData's SSIS Extended Package component, you can install the SSIS Extended
Package suite of all components from CData, such as Microsoft Dynamics 365 Business Central
connectors and other components in their SSIS Standard Package , on your Azure-SSIS IR. To do so,
enter the product license key that you purchased from them beforehand in the License key text box. The
current integrated version is 19.7354 . Due to its large size, to avoid installation timeout, please ensure
that your Azure-SSIS IR has at least 4 CPU cores per node.
Your added express custom setups will appear on the Advanced settings page. To remove them, select their
check boxes, and then select Delete .
Azure PowerShell
To provision or reconfigure your Azure-SSIS IR with custom setups using Azure PowerShell, complete the
following steps.
1. If your Azure-SSIS IR is already started/running, stop it first.
2. You can then add or remove custom setups by running the Set-AzDataFactoryV2IntegrationRuntime
cmdlet before you start your Azure-SSIS IR.

$ResourceGroupName = "[your Azure resource group name]"


$DataFactoryName = "[your data factory name]"
$AzureSSISName = "[your Azure-SSIS IR name]"
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup where your script and its associated files are stored
$ExpressCustomSetup = "[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22is.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.IntegrationService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom setup without script

# Add custom setup parameters if you use standard/express custom setups


if(![string]::IsNullOrEmpty($SetupScriptContainerSasUri))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-SetupScriptContainerSasUri $SetupScriptContainerSasUri
}
if(![string]::IsNullOrEmpty($ExpressCustomSetup))
{
if($ExpressCustomSetup -eq "RunCmdkey")
{
$addCmdkeyArgument = "YourFileShareServerName or
YourAzureStorageAccountName.file.core.windows.net"
$userCmdkeyArgument = "YourDomainName\YourUsername or azure\YourAzureStorageAccountName"
$passCmdkeyArgument = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourPassword or YourAccessKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.CmdkeySetup($addCmdkeyArgument, $userCmdkeyArgument,
$passCmdkeyArgument)
}
if($ExpressCustomSetup -eq "SetEnvironmentVariable")
{
$variableName = "YourVariableName"
$variableValue = "YourVariableValue"
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.EnvironmentVariableSetup($variableName, $variableValue)
}
if($ExpressCustomSetup -eq "InstallAzurePowerShell")
{
$moduleVersion = "YourAzModuleVersion"
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.AzPowerShellSetup($moduleVersion)
}
if($ExpressCustomSetup -eq "SentryOne.TaskFactory")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.SQLPhonetics.NET")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "oh22is.HEDDA.IO")
{
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup)
}
if($ExpressCustomSetup -eq "KingswaySoft.IntegrationToolkit")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "KingswaySoft.ProductivityPack")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "Theobald.XtractIS")
{
$jsonData = Get-Content -Raw -Path YourLicenseFile.json
$jsonData = $jsonData -replace '\s',''
$jsonData = $jsonData.replace('"','\"')
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString($jsonData)
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "AecorSoft.IntegrationService")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Standard")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
if($ExpressCustomSetup -eq "CData.Extended")
{
$licenseKey = New-Object
Microsoft.Azure.Management.DataFactory.Models.SecureString("YourLicenseKey")
$setup = New-Object
Microsoft.Azure.Management.DataFactory.Models.ComponentSetup($ExpressCustomSetup, $licenseKey)
}
# Create an array of one or more express custom setups
$setups = New-Object System.Collections.ArrayList
$setups.Add($setup)

    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -ExpressCustomSetup $setups
}
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force

Standard custom setup samples


To view and reuse some samples of standard custom setups, complete the following steps.
1. Connect to our Public Preview blob container using Azure Storage Explorer.
a. Under Local and Attached , right-click Storage Accounts , and then select Connect to Azure
Storage .

b. Select Blob container , select Shared access signature URL (SAS) , and then select Next .
c. In the Blob container SAS URL text box, enter the SAS URI for our Public Preview blob container
below, select Next , and then select Connect .
https://ssisazurefileshare.blob.core.windows.net/publicpreview?sp=rl&st=2020-03-25T04:00:00Z&se=2025-03-25T04:00:00Z&sv=2019-02-02&sr=c&sig=WAD3DATezJjhBCO3ezrQ7TUZ8syEUxZZtGIhhP6Pt4I%3D

d. In the left pane, select the connected publicpreview blob container, and then double-click the
CustomSetupScript folder. In this folder are the following items:
A Sample folder, which contains a custom setup to install a basic task on each node of your Azure-
SSIS IR. The task does nothing but sleep for a few seconds. The folder also contains a gacutil folder,
whose entire content (gacutil.exe, gacutil.exe.config, and 1033\gacutlrc.dll) can be copied as is to
your blob container.
A UserScenarios folder, which contains several custom setup samples from real user scenarios. If
you want to install multiple samples on your Azure-SSIS IR, you can combine their custom setup
script (main.cmd) files into a single one and upload it with all of their associated files into your
blob container.

e. Double-click the UserScenarios folder to find the following items:


A .NET FRAMEWORK 3.5 folder, which contains a custom setup script (main.cmd) to install an
earlier version of the .NET Framework on each node of your Azure-SSIS IR. This version might be
required by some custom components.
A BCP folder, which contains a custom setup script (main.cmd) to install SQL Server command-line
utilities (MsSqlCmdLnUtils.msi) on each node of your Azure-SSIS IR. One of those utilities is the
bulk copy program (bcp).
A DNS SUFFIX folder, which contains a custom setup script (main.cmd) to append your own DNS
suffix (for example test.com) to any unqualified single label domain name and turn it into a Fully
Qualified Domain Name (FQDN) before using it in DNS queries from your Azure-SSIS IR.
An EXCEL folder, which contains a custom setup script (main.cmd) to install some C# assemblies
and libraries on each node of your Azure-SSIS IR. You can use them in Script Tasks to dynamically
read and write Excel files.
First, download ExcelDataReader.dll and DocumentFormat.OpenXml.dll, and then upload them all
together with main.cmd to your blob container. Alternatively, if you just want to use the standard
Excel connectors (Connection Manager, Source, and Destination), the Access Redistributable that
contains them is already preinstalled on your Azure-SSIS IR, so you don't need any custom setup.
A MYSQL ODBC folder, which contains a custom setup script (main.cmd) to install the MySQL
ODBC drivers on each node of your Azure-SSIS IR. This setup lets you use the ODBC connectors
(Connection Manager, Source, and Destination) to connect to the MySQL server.
First, download the latest 64-bit and 32-bit versions of the MySQL ODBC driver installers (for
example, mysql-connector-odbc-8.0.13-winx64.msi and mysql-connector-odbc-8.0.13-win32.msi),
and then upload them all together with main.cmd to your blob container.
If Data Source Name (DSN) is used in connection, DSN configuration is needed in setup script. For
example: C:\Windows\SysWOW64\odbcconf.exe /A {CONFIGSYSDSN "MySQL ODBC 8.0 Unicode
Driver" "DSN=<dsnname>|PORT=3306|SERVER=<servername>"}
An ORACLE ENTERPRISE folder, which contains a custom setup script (main.cmd) and silent
installation config file (client.rsp) to install the Oracle connectors and OCI driver on each node of
your Azure-SSIS IR Enterprise Edition. This setup lets you use the Oracle Connection Manager,
Source, and Destination to connect to the Oracle server.
First, download Microsoft Connectors v5.0 for Oracle (AttunitySSISOraAdaptersSetup.msi and
AttunitySSISOraAdaptersSetup64.msi) from Microsoft Download Center and the latest Oracle
client (for example, winx64_12102_client.zip) from Oracle. Next, upload them all together with
main.cmd and client.rsp to your blob container. If you use TNS to connect to Oracle, you also need
to download tnsnames.ora, edit it, and upload it to your blob container. In this way, it can be copied
to the Oracle installation folder during setup.
An ORACLE STANDARD ADO.NET folder, which contains a custom setup script (main.cmd) to install
the Oracle ODP.NET driver on each node of your Azure-SSIS IR. This setup lets you use the
ADO.NET Connection Manager, Source, and Destination to connect to the Oracle server.
First, download the latest Oracle ODP.NET driver (for example,
ODP.NET_Managed_ODAC122cR1.zip), and then upload it together with main.cmd to your blob
container.
An ORACLE STANDARD ODBC folder, which contains a custom setup script (main.cmd) to install
the Oracle ODBC driver on each node of your Azure-SSIS IR. The script also configures the Data
Source Name (DSN). This setup lets you use the ODBC Connection Manager, Source, and
Destination or Power Query Connection Manager and Source with the ODBC data source type to
connect to the Oracle server.
First, download the latest Oracle Instant Client (Basic Package or Basic Lite Package) and ODBC
Package, and then upload them all together with main.cmd to your blob container:
Download 64-bit packages (Basic Package: instantclient-basic-windows.x64-18.3.0.0.0dbru.zip;
Basic Lite Package: instantclient-basiclite-windows.x64-18.3.0.0.0dbru.zip; ODBC Package:
instantclient-odbc-windows.x64-18.3.0.0.0dbru.zip)
Download 32-bit packages (Basic Package: instantclient-basic-nt-18.3.0.0.0dbru.zip; Basic Lite
Package: instantclient-basiclite-nt-18.3.0.0.0dbru.zip; ODBC Package: instantclient-odbc-nt-
18.3.0.0.0dbru.zip)
An ORACLE STANDARD OLEDB folder, which contains a custom setup script (main.cmd) to install
the Oracle OLEDB driver on each node of your Azure-SSIS IR. This setup lets you use the OLEDB
Connection Manager, Source, and Destination to connect to the Oracle server.
First, download the latest Oracle OLEDB driver (for example, ODAC122010Xcopy_x64.zip), and
then upload it together with main.cmd to your blob container.
A POSTGRESQL ODBC folder, which contains a custom setup script (main.cmd) to install the
PostgreSQL ODBC drivers on each node of your Azure-SSIS IR. This setup lets you use the ODBC
Connection Manager, Source, and Destination to connect to the PostgreSQL server.
First, download the latest 64-bit and 32-bit versions of PostgreSQL ODBC driver installers (for
example, psqlodbc_x64.msi and psqlodbc_x86.msi), and then upload them all together with
main.cmd to your blob container.
A SAP BW folder, which contains a custom setup script (main.cmd) to install the SAP .NET
connector assembly (librfc32.dll) on each node of your Azure-SSIS IR Enterprise Edition. This setup
lets you use the SAP BW Connection Manager, Source, and Destination to connect to the SAP BW
server.
First, upload the 64-bit or the 32-bit version of librfc32.dll from the SAP installation folder
together with main.cmd to your blob container. The script then copies the SAP assembly to the
%windir%\SysWow64 or %windir%\System32 folder during setup.
A STORAGE folder, which contains a custom setup script (main.cmd) to install Azure PowerShell on
each node of your Azure-SSIS IR. This setup lets you deploy and run SSIS packages that run Azure
PowerShell cmdlets/scripts to manage your Azure Storage.
Copy main.cmd, a sample AzurePowerShell.msi (or use the latest version), and storage.ps1 to your
blob container. Use PowerShell.dtsx as a template for your packages. The package template
combines an Azure Blob Download Task, which downloads a modifiable PowerShell script
(storage.ps1), and an Execute Process Task, which executes the script on each node.
A TERADATA folder, which contains a custom setup script (main.cmd), its associated file
(install.cmd), and installer packages (.msi). These files install the Teradata connectors, the Teradata
Parallel Transporter (TPT) API, and the ODBC driver on each node of your Azure-SSIS IR Enterprise
Edition. This setup lets you use the Teradata Connection Manager, Source, and Destination to
connect to the Teradata server.
First, download the Teradata Tools and Utilities 15.x zip file (for example,
TeradataToolsAndUtilitiesBase__windows_indep.15.10.22.00.zip), and then upload it together with
the previously mentioned .cmd and .msi files to your blob container.
A TLS 1.2 folder, which contains a custom setup script (main.cmd) to use only strong
cryptography/more secure network protocol (TLS 1.2) on each node of your Azure-SSIS IR. The
script also disables older SSL/TLS versions (SSL 3.0, TLS 1.0, TLS 1.1) at the same time.
A ZULU OPENJDK folder, which contains a custom setup script (main.cmd) and PowerShell file
(install_openjdk.ps1) to install the Zulu OpenJDK on each node of your Azure-SSIS IR. This setup
lets you use Azure Data Lake Store and Flexible File connectors to process ORC and Parquet files.
For more information, see Azure Feature Pack for Integration Services.
First, download the latest Zulu OpenJDK (for example, zulu8.33.0.1-jdk8.0.192-win_x64.zip), and
then upload it together with main.cmd and install_openjdk.ps1 to your blob container.

f. To reuse these standard custom setup samples, copy the content of selected folder to your blob
container.
2. When you provision or reconfigure your Azure-SSIS IR on ADF UI, select the Customize your Azure-
SSIS Integration Runtime with additional system configurations/component installations
check box on the Advanced settings page of Integration runtime setup pane. Next, enter the SAS
URI of your blob container in the Custom setup container SAS URI text box.
3. When you provision or reconfigure your Azure-SSIS IR using Azure PowerShell, stop it if it's already
started/running, run the Set-AzDataFactoryV2IntegrationRuntime cmdlet with the SAS URI of your blob
container as the value for SetupScriptContainerSasUri parameter, and then start your Azure-SSIS IR.
4. After your standard custom setup finishes and your Azure-SSIS IR starts, you can find all custom setup
logs in the main.cmd.log folder of your blob container. They include the standard output of main.cmd and
other execution logs.

Next steps
Set up the Enterprise Edition of Azure-SSIS IR
Develop paid or licensed components for Azure-SSIS IR
Install paid or licensed custom components for the
Azure-SSIS integration runtime
3/5/2021 • 3 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how an ISV can develop and install paid or licensed custom components for SQL Server
Integration Services (SSIS) packages that run in Azure in the Azure-SSIS integration runtime.

The problem
The nature of the Azure-SSIS integration runtime presents several challenges, which make the typical licensing
methods used for the on-premises installation of custom components inadequate. As a result, the Azure-SSIS IR
requires a different approach.
The nodes of the Azure-SSIS IR are volatile and can be allocated or released at any time. For example, you
can start or stop nodes to manage the cost, or scale up and down through various node sizes. As a result,
binding a third-party component license to a particular node by using machine-specific info such as MAC
address or CPU ID is no longer viable.
You can also scale the Azure-SSIS IR in or out, so that the number of nodes can shrink or expand at any
time.

The solution
As a result of the limitations of traditional licensing methods described in the previous section, the Azure-SSIS IR
provides a new solution. This solution uses Windows environment variables and SSIS system variables for the
license binding and validation of third-party components. ISVs can use these variables to obtain unique and
persistent info for an Azure-SSIS IR, such as Cluster ID and Cluster Node Count. With this info, ISVs can then
bind the license for their component to an Azure-SSIS IR as a cluster. This binding uses an ID that doesn't
change when customers start or stop, scale up or down, scale in or out, or reconfigure the Azure-SSIS IR in any
way.
The following diagram shows the typical installation, activation and license binding, and validation flows for
third-party components that use these new variables:
Instructions
1. ISVs can offer their licensed components in various SKUs or tiers (for example, single node, up to 5 nodes,
up to 10 nodes, and so forth). The ISV provides the corresponding Product Key when customers purchase
a product. The ISV can also provide an Azure Storage blob container that contains an ISV Setup script and
associated files. Customers can copy these files into their own storage container and modify them with
their own Product Key (for example, by running IsvSetup.exe -pid xxxx-xxxx-xxxx ). Customers can then
provision or reconfigure the Azure-SSIS IR with the SAS URI of their container as parameter. For more
info, see Custom setup for the Azure-SSIS integration runtime.
2. When the Azure-SSIS IR is provisioned or reconfigured, ISV Setup runs on each node to query the
Windows environment variables, SSIS_CLUSTERID and SSIS_CLUSTERNODECOUNT . Then the Azure-SSIS IR
submits its Cluster ID and the Product Key for the licensed product to the ISV Activation Server to
generate an Activation Key.
3. After receiving the Activation Key, ISV Setup can store the key locally on each node (for example, in the
Registry).
4. When customers run a package that uses the ISV's licensed component on a node of the Azure-SSIS IR,
the package reads the locally stored Activation Key and validates it against the node's Cluster ID. The
package can also optionally report the Cluster Node Count to the ISV activation server.
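
As an illustration of step 2, a hedged PowerShell sketch of what an ISV setup script might do on each node is shown below; the activation endpoint URL is hypothetical:

# Read the cluster-scoped environment variables exposed on Azure-SSIS IR nodes.
$clusterId = [Environment]::GetEnvironmentVariable("SSIS_CLUSTERID")
$nodeCount = [Environment]::GetEnvironmentVariable("SSIS_CLUSTERNODECOUNT")

# Send them, together with the customer's Product Key, to the ISV activation service.
$body = @{ clusterId = $clusterId; nodeCount = $nodeCount; productKey = "xxxx-xxxx-xxxx" } | ConvertTo-Json
Invoke-RestMethod -Uri "https://activation.example.com/api/activate" -Method Post -Body $body -ContentType "application/json"
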
Here is an example of code that validates the activation key and reports the cluster node count:
public override DTSExecResult Validate(Connections connections, VariableDispenser variableDispenser, IDTSComponentEvents componentEvents, IDTSLogging log)
{
    Variables vars = null;
    variableDispenser.LockForRead("System::ClusterID");
    variableDispenser.LockForRead("System::ClusterNodeCount");
    variableDispenser.GetVariables(ref vars);

    // Validate Activation Key with ClusterID
    // Report on ClusterNodeCount

    vars.Unlock();
    return base.Validate(connections, variableDispenser, componentEvents, log);
}

ISV partners
You can find a list of ISV partners who have adapted their components and extensions for the Azure-SSIS IR at
the end of this blog post - Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.

Next steps
Custom setup for the Azure-SSIS integration runtime
Enterprise Edition of the Azure-SSIS Integration Runtime
Configure the Azure-SSIS Integration Runtime for
high performance
3/5/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how to configure an Azure-SSIS Integration Runtime (IR) for high performance. The Azure-
SSIS IR allows you to deploy and run SQL Server Integration Services (SSIS) packages in Azure. For more
information about the Azure-SSIS IR, see the Integration runtime article. For information about deploying and running
SSIS packages on Azure, see Lift and shift SQL Server Integration Services workloads to the cloud.

IMPORTANT
This article contains performance results and observations from in-house testing done by members of the SSIS
development team. Your results may vary. Do your own testing before you finalize your configuration settings, which
affect both cost and performance.

Properties to configure
The following portion of a configuration script shows the properties that you can configure when you create an
Azure-SSIS Integration Runtime. For the complete PowerShell script and description, see Deploy SQL Server
Integration Services packages to Azure.
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like
"`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$DataFactoryLocation = "EastUS"

### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS
packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-
factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium
features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your
existing SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit (AHB)
option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to
max(2 x number of cores, 8) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup
script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database
with virtual network service endpoints/SQL Managed Instance/on-premises data, Azure Resource Manager virtual
network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used
with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used
for your SQL Managed Instance

### SSISDB info


$SSISDBServerEndpoint = "[your server name or managed instance name.DNS prefix].database.windows.net" #
WARNING: Please ensure that there is no existing SSISDB, so we can prepare and manage one on your behalf
# Authentication info: SQL or Azure Active Directory (AAD)
$SSISDBServerAdminUserName = "[your server admin username for SQL authentication or leave it empty for AAD
authentication]"
$SSISDBServerAdminPassword = "[your server admin password for SQL authentication or leave it empty for AAD
authentication]"
$SSISDBPricingTier = "[Basic|S0|S1|S2|S3|S4|S6|S7|S9|S12|P1|P2|P4|P6|P11|P15|…|ELASTIC_POOL(name =
<elastic_pool_name>) for Azure SQL Database or leave it empty for SQL Managed Instance]"

AzureSSISLocation
AzureSSISLocation is the location for the integration runtime worker node. The worker node maintains a
constant connection to the SSIS Catalog database (SSISDB) in Azure SQL Database. Set the AzureSSISLocation
to the same location as the logical SQL server that hosts SSISDB, which lets the integration runtime work as
efficiently as possible.

AzureSSISNodeSize
Data Factory, including the Azure-SSIS IR, supports the following options:
Standard_A4_v2
Standard_A8_v2
Standard_D1_v2
Standard_D2_v2
Standard_D3_v2
Standard_D4_v2
Standard_D2_v3
Standard_D4_v3
Standard_D8_v3
Standard_D16_v3
Standard_D32_v3
Standard_D64_v3
Standard_E2_v3
Standard_E4_v3
Standard_E8_v3
Standard_E16_v3
Standard_E32_v3
Standard_E64_v3
In the unofficial in-house testing by the SSIS engineering team, the D series appears to be more suitable for SSIS
package execution than the A series.
The performance/price ratio of the D series is higher than that of the A series, and the performance/price ratio of the v3 series is higher than that of the v2 series.
The throughput of the D series is higher than that of the A series at the same price, and the throughput of the v3 series is higher than that of the v2 series at the same price.
The v2 series nodes of the Azure-SSIS IR are not suitable for custom setup, so please use the v3 series nodes instead. If you already use v2 series nodes, please switch to v3 series nodes as soon as possible.
The E series comprises memory-optimized VM sizes that provide a higher memory-to-CPU ratio than other machines. If your package requires a lot of memory, consider choosing an E series VM.
Configure for execution speed
If you don't have many packages to run, and you want packages to run quickly, use the information in the
following chart to choose a virtual machine type suitable for your scenario.
This data represents a single package execution on a single worker node. The package loads 3 million records
with first name and last name columns from Azure Blob Storage, generates a full name column, and writes the
records that have the full name longer than 20 characters to Azure Blob Storage.
The y-axis is the number of packages that completed execution in one hour. Note that this is only the test
result of one memory-consuming package. To learn the throughput of your own package, it's recommended
that you run the test yourself.
Configure for overall throughput
If you have lots of packages to run, and you care most about the overall throughput, use the information in the
following chart to choose a virtual machine type suitable for your scenario.
The y-axis is the number of packages that completed execution in one hour. Note that this is only the test
result of one memory-consuming package. To learn the throughput of your own package, it's recommended
that you run the test yourself.

AzureSSISNodeNumber
AzureSSISNodeNumber adjusts the scalability of the integration runtime. The throughput of the integration
runtime is proportional to the AzureSSISNodeNumber . Set the AzureSSISNodeNumber to a small value at
first, monitor the throughput of the integration runtime, then adjust the value for your scenario. To reconfigure
the worker node count, see Manage an Azure-SSIS integration runtime.
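For example, a minimal Azure PowerShell sketch of reconfiguring the worker node count, assuming the Az.DataFactory module and illustrative resource names (myResourceGroup, myDataFactory, myAzureSsisIr); the IR must be stopped before it can be reconfigured:

# Stop the Azure-SSIS IR before reconfiguring it
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" -Name "myAzureSsisIr" -Force

# Scale out to 4 worker nodes
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" -Name "myAzureSsisIr" -NodeCount 4

# Start the Azure-SSIS IR again so the new node count takes effect
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" -Name "myAzureSsisIr" -Force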

AzureSSISMaxParallelExecutionsPerNode
When you're already using a powerful worker node to run packages, increasing
AzureSSISMaxParallelExecutionsPerNode may increase the overall throughput of the integration runtime. If
you want to increase its maximum value, use Azure PowerShell to update
AzureSSISMaxParallelExecutionsPerNode (a sketch follows the sizing table and guidelines below). You can
estimate the appropriate value based on the cost of your package and the following configurations for the
worker nodes. For more information, see General-purpose virtual machine sizes.

Size | vCPUs | Memory (GiB) | Temp storage (SSD) GiB | Max temp storage throughput: IOPS / read MBps / write MBps | Max data disks / throughput: IOPS | Max NICs / expected network performance (Mbps)
Standard_D1_v2 | 1 | 3.5 | 50 | 3000 / 46 / 23 | 2 / 2x500 | 2 / 750
Standard_D2_v2 | 2 | 7 | 100 | 6000 / 93 / 46 | 4 / 4x500 | 2 / 1500
Standard_D3_v2 | 4 | 14 | 200 | 12000 / 187 / 93 | 8 / 8x500 | 4 / 3000
Standard_D4_v2 | 8 | 28 | 400 | 24000 / 375 / 187 | 16 / 16x500 | 8 / 6000
Standard_A4_v2 | 4 | 8 | 40 | 4000 / 80 / 40 | 8 / 8x500 | 4 / 1000
Standard_A8_v2 | 8 | 16 | 80 | 8000 / 160 / 80 | 16 / 16x500 | 8 / 2000
Standard_D2_v3 | 2 | 8 | 50 | 3000 / 46 / 23 | 4 / 6x500 | 2 / 1000
Standard_D4_v3 | 4 | 16 | 100 | 6000 / 93 / 46 | 8 / 12x500 | 2 / 2000
Standard_D8_v3 | 8 | 32 | 200 | 12000 / 187 / 93 | 16 / 24x500 | 4 / 4000
Standard_D16_v3 | 16 | 64 | 400 | 24000 / 375 / 187 | 32 / 48x500 | 8 / 8000
Standard_D32_v3 | 32 | 128 | 800 | 48000 / 750 / 375 | 32 / 96x500 | 8 / 16000
Standard_D64_v3 | 64 | 256 | 1600 | 96000 / 1000 / 500 | 32 / 192x500 | 8 / 30000
Standard_E2_v3 | 2 | 16 | 50 | 3000 / 46 / 23 | 4 / 6x500 | 2 / 1000
Standard_E4_v3 | 4 | 32 | 100 | 6000 / 93 / 46 | 8 / 12x500 | 2 / 2000
Standard_E8_v3 | 8 | 64 | 200 | 12000 / 187 / 93 | 16 / 24x500 | 4 / 4000
Standard_E16_v3 | 16 | 128 | 400 | 24000 / 375 / 187 | 32 / 48x500 | 8 / 8000
Standard_E32_v3 | 32 | 256 | 800 | 48000 / 750 / 375 | 32 / 96x500 | 8 / 16000
Standard_E64_v3 | 64 | 432 | 1600 | 96000 / 1000 / 500 | 32 / 192x500 | 8 / 30000

Here are the guidelines for setting the right value for the AzureSSISMaxParallelExecutionsPerNode
property:
1. Set it to a small value at first.
2. Increase it by a small amount to check whether the overall throughput is improved.
3. Stop increasing the value when the overall throughput reaches the maximum value.
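As referenced above, a minimal Azure PowerShell sketch of updating this property on a stopped Azure-SSIS IR, assuming the Az.DataFactory module and the same illustrative names as the earlier node-count sketch; stop the IR first and start it again afterwards:

# Raise the maximum number of parallel executions per node, then restart the IR for it to take effect
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" -Name "myAzureSsisIr" `
    -MaxParallelExecutionsPerNode 16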

SSISDBPricingTier
SSISDBPricingTier is the pricing tier for the SSIS Catalog database (SSISDB) in Azure SQL Database. This
setting affects the maximum number of workers in the IR instance, the speed to queue a package execution, and
the speed to load the execution log.
If you don't care about the speed to queue package executions and to load the execution log, you can
choose the lowest database pricing tier. Azure SQL Database with Basic pricing supports 8 workers in an
integration runtime instance.
Choose a more powerful database than Basic if the worker count is more than 8, or the core count is
more than 50. Otherwise the database becomes the bottleneck of the integration runtime instance and
the overall performance is negatively impacted.
Choose a more powerful database such as S3 if the logging level is set to verbose. According to our
unofficial in-house testing, the S3 pricing tier can support SSIS package execution with 2 nodes, a parallel
count of 128, and the verbose logging level.
You can also adjust the database pricing tier based on database transaction unit (DTU) usage information
available on the Azure portal, as sketched below.
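For example, a minimal Azure PowerShell sketch of scaling SSISDB in Azure SQL Database to the S3 pricing tier, assuming the Az.Sql module and an illustrative server name:

# Scale SSISDB to the S3 pricing tier
Set-AzSqlDatabase -ResourceGroupName "myResourceGroup" -ServerName "myssisdbserver" `
    -DatabaseName "SSISDB" -RequestedServiceObjectiveName "S3"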

Design for high performance


Designing an SSIS package to run on Azure is different from designing a package for on-premises execution.
Instead of combining multiple independent tasks in the same package, separate them into several packages for
more efficient execution in the Azure-SSIS IR. Create a package execution for each package, so that they don’t
have to wait for each other to finish. This approach benefits from the scalability of the Azure-SSIS integration
runtime and improves the overall throughput.
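For illustration, a minimal sketch of starting several independent SSISDB package executions that don't wait for each other, using the SSISDB catalog stored procedures through Invoke-Sqlcmd from the SqlServer PowerShell module; the server name, credentials, folder, project, and package names are placeholders, and in practice you would typically start each package from its own Execute SSIS Package activity in an ADF pipeline instead:

# Start one independent SSISDB execution per package; start_execution returns without waiting for completion
$packages = @('Package1.dtsx', 'Package2.dtsx', 'Package3.dtsx')
foreach ($package in $packages) {
    $tsql = "
DECLARE @execution_id BIGINT;
EXEC [SSISDB].[catalog].[create_execution]
    @folder_name = N'MyFolder', @project_name = N'MyProject',
    @package_name = N'$package', @use32bitruntime = 0,
    @execution_id = @execution_id OUTPUT;
EXEC [SSISDB].[catalog].[start_execution] @execution_id;"
    Invoke-Sqlcmd -ServerInstance "myssisdbserver.database.windows.net" -Database "SSISDB" `
        -Username "myadmin" -Password "myPassword" -Query $tsql
}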
Next steps
Learn more about the Azure-SSIS Integration Runtime. See Azure-SSIS Integration Runtime.
Configure Azure-SSIS integration runtime for
business continuity and disaster recovery (BCDR)
3/26/2021 • 10 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure SQL Database/Managed Instance and SQL Server Integration Services (SSIS) in Azure Data Factory (ADF)
can be combined as the recommended all-Platform as a Service (PaaS) solution for SQL Server migration. You
can deploy your SSIS projects into SSIS catalog database (SSISDB) hosted by Azure SQL Database/Managed
Instance and run your SSIS packages on Azure SSIS integration runtime (IR) in ADF.
For business continuity and disaster recovery (BCDR), Azure SQL Database/Managed Instance can be configured
with a geo-replication/failover group, where SSISDB in a primary Azure region with read-write access (primary
role) will be continuously replicated to a secondary region with read-only access (secondary role). When a
disaster occurs in the primary region, a failover will be triggered, where the primary and secondary SSISDBs will
swap roles.
For BCDR, you can also configure a dual standby Azure-SSIS IR pair that works in sync with the Azure SQL
Database/Managed Instance failover group. This gives you a pair of running Azure-SSIS IRs: at any given time,
only one can access the primary SSISDB to fetch and execute packages, as well as write package execution logs
(primary role), while the other can do the same only for packages deployed somewhere else, for example in
Azure Files (secondary role). When SSISDB failover occurs, the primary and secondary Azure-SSIS IRs will also
swap roles, and if both are running, there'll be near-zero downtime.
This article describes how to configure Azure-SSIS IR with Azure SQL Database/Managed Instance failover
group for BCDR.

Configure a dual standby Azure-SSIS IR pair with Azure SQL Database


failover group
To configure a dual standby Azure-SSIS IR pair that works in sync with Azure SQL Database failover group,
complete the following steps.
1. Using Azure portal/ADF UI, you can create a new Azure-SSIS IR with your primary Azure SQL Database
server to host SSISDB in the primary region. If you have an existing Azure-SSIS IR that's already attached
to SSISDB hosted by your primary Azure SQL Database server and it's still running, you need to stop it first
to reconfigure it. This will be your primary Azure-SSIS IR.
When selecting to use SSISDB on the Deployment settings page of Integration runtime setup pane,
select also the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover
check box. For Dual standby pair name , enter a name to identify your pair of primary and secondary
Azure-SSIS IRs. When you complete the creation of your primary Azure-SSIS IR, it will be started and
attached to a primary SSISDB that will be created on your behalf with read-write access. If you've just
reconfigured it, you need to restart it.
2. Using Azure portal, you can check whether the primary SSISDB has been created on the Overview page
of your primary Azure SQL Database server. Once it's created, you can create a failover group for your
primary and secondary Azure SQL Database servers and add SSISDB to it on the Failover groups page.
Once your failover group is created, you can check whether the primary SSISDB has been replicated to a
secondary one with read-only access on the Overview page of your secondary Azure SQL Database
server.
3. Using Azure portal/ADF UI, you can create another Azure-SSIS IR with your secondary Azure SQL
Database server to host SSISDB in the secondary region. This will be your secondary Azure-SSIS IR. For
complete BCDR, make sure that all resources it depends on are also created in the secondary region, for
example Azure Storage for storing custom setup script/files, ADF for orchestration/scheduling package
executions, etc.
When selecting to use SSISDB on the Deployment settings page of Integration runtime setup pane,
select also the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover
check box. For Dual standby pair name , enter the same name to identify your pair of primary and
secondary Azure-SSIS IRs. When you complete the creation of your secondary Azure-SSIS IR, it will be
started and attached to the secondary SSISDB.
4. If you want to have a near-zero downtime when SSISDB failover occurs, keep both of your Azure-SSIS IRs
running. Only your primary Azure-SSIS IR can access the primary SSISDB to fetch and execute packages,
as well as write package execution logs, while your secondary Azure-SSIS IR can only do the same for
packages deployed somewhere else, for example in Azure Files.
If you want to minimize your running cost, you can stop your secondary Azure-SSIS IR after it's created.
When SSISDB failover occurs, your primary and secondary Azure-SSIS IRs will swap roles. If your
primary Azure-SSIS IR is stopped, you need to restart it. Depending on whether it's injected into a virtual
network and the injection method used, it will take up to 5 minutes or around 20-30 minutes to start
running.
5. If you use ADF for orchestration/scheduling package executions, make sure that all relevant ADF pipelines
with Execute SSIS Package activities and associated triggers are copied to your secondary ADF with the
triggers initially disabled. When SSISDB failover occurs, you need to enable them (see the sketch after these steps).
6. You can test your Azure SQL Database failover group and check on Azure-SSIS IR monitoring page in
ADF portal whether your primary and secondary Azure-SSIS IRs have swapped roles.
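For step 5, a minimal Azure PowerShell sketch of enabling the copied triggers in your secondary ADF after failover, assuming the Az.DataFactory module and illustrative names:

# Enable (start) all triggers in the secondary data factory after SSISDB failover
Get-AzDataFactoryV2Trigger -ResourceGroupName "mySecondaryRG" -DataFactoryName "mySecondaryADF" |
    ForEach-Object {
        Start-AzDataFactoryV2Trigger -ResourceGroupName "mySecondaryRG" `
            -DataFactoryName "mySecondaryADF" -Name $_.Name -Force
    }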

Configure a dual standby Azure-SSIS IR pair with Azure SQL Managed


Instance failover group
To configure a dual standby Azure-SSIS IR pair that works in sync with Azure SQL Managed Instance failover
group, complete the following steps.
1. Using Azure portal, you can create a failover group for your primary and secondary Azure SQL Managed
Instances on the Failover groups page of your primary Azure SQL Managed Instance.
2. Using Azure portal/ADF UI, you can create a new Azure-SSIS IR with your primary Azure SQL Managed
Instance to host SSISDB in the primary region. If you have an existing Azure-SSIS IR that's already
attached to SSISDB hosted by your primary Azure SQL Managed Instance and it's still running, you need to
stop it first to reconfigure it. This will be your primary Azure-SSIS IR.
When selecting to use SSISDB on the Deployment settings page of Integration runtime setup pane,
select also the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover
check box. For Dual standby pair name , enter a name to identify your pair of primary and secondary
Azure-SSIS IRs. When you complete the creation of your primary Azure-SSIS IR, it will be started and
attached to a primary SSISDB that will be created on your behalf with read-write access. If you've just
reconfigured it, you need to restart it. You can also check whether the primary SSISDB has been
replicated to a secondary one with read-only access on the Overview page of your secondary Azure
SQL Managed Instance.
3. Using Azure portal/ADF UI, you can create another Azure-SSIS IR with your secondary Azure SQL
Managed Instance to host SSISDB in the secondary region. This will be your secondary Azure-SSIS IR. For
complete BCDR, make sure that all resources it depends on are also created in the secondary region, for
example Azure Storage for storing custom setup script/files, ADF for orchestration/scheduling package
executions, etc.
When selecting to use SSISDB on the Deployment settings page of Integration runtime setup pane,
select also the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover
check box. For Dual standby pair name , enter the same name to identify your pair of primary and
secondary Azure-SSIS IRs. When you complete the creation of your secondary Azure-SSIS IR, it will be
started and attached to the secondary SSISDB.
4. Azure SQL Managed Instance can secure sensitive data in databases, such as SSISDB, by encrypting them
using Database Master Key (DMK). DMK itself is in turn encrypted using Service Master Key (SMK) by
default. At the time of writing, Azure SQL Managed Instance failover group doesn't replicate SMK from
the primary Azure SQL Managed Instance, so DMK and in turn SSISDB can't be decrypted on the
secondary Azure SQL Managed Instance after failover occurs. To work around this, you can add a
password encryption for DMK to be decrypted on the secondary Azure SQL Managed Instance. Using
SSMS, complete the following steps.
a. Run the following command for SSISDB in your primary Azure SQL Managed Instance to add a
password for encrypting DMK.

ALTER MASTER KEY ADD ENCRYPTION BY PASSWORD = 'YourPassword'

b. Run the following command for SSISDB in both your primary and secondary Azure SQL Managed
Instances to add the new password for decrypting DMK.

EXEC sp_control_dbmasterkey_password @db_name = N'SSISDB', @password = N'YourPassword', @action = N'add'

5. If you want to have a near-zero downtime when SSISDB failover occurs, keep both of your Azure-SSIS IRs
running. Only your primary Azure-SSIS IR can access the primary SSISDB to fetch and execute packages,
as well as write package execution logs, while your secondary Azure-SSIS IR can only do the same for
packages deployed somewhere else, for example in Azure Files.
If you want to minimize your running cost, you can stop your secondary Azure-SSIS IR after it's created.
When SSISDB failover occurs, your primary and secondary Azure-SSIS IRs will swap roles. If your
primary Azure-SSIS IR is stopped, you need to restart it. Depending on whether it's injected into a virtual
network and the injection method used, it will take up to 5 minutes or around 20-30 minutes to start
running.
6. If you use Azure SQL Managed Instance Agent for orchestration/scheduling package executions, make
sure that all relevant SSIS jobs with their job steps and associated schedules are copied to your
secondary Azure SQL Managed Instance with the schedules initially disabled. Using SSMS, complete the
following steps.
a. For each SSIS job, right-click and select the Script Job as, CREATE To, and New Query Editor
Window dropdown menu items to generate its script.
b. For each generated SSIS job script, find the command to execute sp_add_job stored procedure
and modify/remove the value assignment to @owner_login_name argument as necessary.
c. For each updated SSIS job script, run it on your secondary Azure SQL Managed Instance to copy
the job with its job steps and associated schedules.
d. Using the following script, create a new T-SQL job to enable/disable SSIS job schedules based on
the primary/secondary SSISDB role, respectively, in both your primary and secondary Azure SQL
Managed Instances and run it regularly. When SSISDB failover occurs, SSIS job schedules that
were disabled will be enabled and vice versa.

IF (SELECT TOP 1 role_desc FROM SSISDB.sys.dm_geo_replication_link_status WHERE partner_database = 'SSISDB') = 'PRIMARY'
BEGIN
IF (SELECT enabled FROM msdb.dbo.sysschedules WHERE schedule_id = <ScheduleID>) = 0
EXEC msdb.dbo.sp_update_schedule @schedule_id = <ScheduleID >, @enabled = 1
END
ELSE
BEGIN
IF (SELECT enabled FROM msdb.dbo.sysschedules WHERE schedule_id = <ScheduleID>) = 1
EXEC msdb.dbo.sp_update_schedule @schedule_id = <ScheduleID >, @enabled = 0
END

7. If you use ADF for orchestration/scheduling package executions, make sure that all relevant ADF pipelines
with Execute SSIS Package activities and associated triggers are copied to your secondary ADF with the
triggers initially disabled. When SSISDB failover occurs, you need to enable them.
8. You can test your Azure SQL Managed Instance failover group and check on Azure-SSIS IR monitoring
page in ADF portal whether your primary and secondary Azure-SSIS IRs have swapped roles.

Attach a new Azure-SSIS IR to existing SSISDB hosted by Azure SQL


Database/Managed Instance
If a disaster occurs and impacts your existing Azure-SSIS IR but not Azure SQL Database/Managed Instance in
the same region, you can replace it with a new one in another region. To attach your existing SSISDB hosted by
Azure SQL Database/Managed Instance to a new Azure-SSIS IR, complete the following steps.
1. If your existing Azure-SSIS IR is still running, you need to stop it first using Azure portal/ADF UI or Azure
PowerShell. If the disaster also impacts ADF in the same region, you can skip this step.
2. Using SSMS, run the following command for SSISDB in your Azure SQL Database/Managed Instance to
update the metadata that will allow connections from your new ADF/Azure-SSIS IR.

EXEC [catalog].[failover_integration_runtime] @data_factory_name = 'YourNewADF', @integration_runtime_name = 'YourNewAzureSSISIR'

3. Using Azure portal/ADF UI or Azure PowerShell, create your new ADF/Azure-SSIS IR named
YourNewADF/YourNewAzureSSISIR, respectively, in another region. If you use Azure portal/ADF UI, you
can ignore the test connection error on Deployment settings page of Integration runtime setup
pane.
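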

Next steps
You can consider these other configuration options for your Azure-SSIS IR:
Configure package stores for your Azure-SSIS IR
Configure custom setups for your Azure-SSIS IR
Configure virtual network injection for your Azure-SSIS IR
Configure self-hosted IR as a proxy for your Azure-SSIS IR
How to clean up SSISDB logs automatically
7/19/2021 • 12 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Once you provision an Azure-SQL Server Integration Services (SSIS) integration runtime (IR) in Azure Data
Factory (ADF), you can use it to run SSIS packages deployed into:
SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed Instance (Project Deployment Model)
file system, Azure Files, or SQL Server database (MSDB) hosted by Azure SQL Managed Instance (Package
Deployment Model)
In the Project Deployment Model, your Azure-SSIS IR will deploy SSIS projects into SSISDB, fetch SSIS packages
to run from SSISDB, and write package execution logs back into SSISDB. To manage the accumulated logs, we've
provided relevant SSISDB properties and stored procedure that can be invoked automatically via ADF, Azure
SQL Managed Instance Agent, or Elastic Database Jobs.

SSISDB log clean-up properties and stored procedure


To configure SSISDB log clean-up properties, you can connect to SSISDB hosted by your Azure SQL Database
server/Managed Instance using SQL Server Management Studio (SSMS); see Connecting to SSISDB. Once
connected, on the Object Explorer window of SSMS, you can expand the Integration Services Catalogs
node, right-click the SSISDB subnode, and select the Properties menu item to open the Catalog Properties
dialog box. On the Catalog Properties dialog box, you can find the following SSISDB log clean-up properties:
Clean Logs Periodically : Enables automatic clean-up of package execution logs, by default set to True.
Retention Period (days) : Specifies the maximum age of retained logs (in days), by default set to 365 and
older logs are deleted by automatic clean-up.
Periodically Remove Old Versions : Enables automatic clean-up of stored project versions, by default set
to True.
Maximum Number of Versions per Project : Specifies the maximum number of stored project versions,
by default set to 10 and older versions are deleted by automatic clean-up.
Once SSISDB log clean-up properties are configured, you can invoke the relevant SSISDB stored procedure,
[internal].[cleanup_server_retention_window_exclusive] , to clean up logs automatically via ADF, Azure SQL
Managed Instance Agent, or Elastic Database Jobs.
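If you prefer scripting over the SSMS dialog box, here's a minimal sketch of setting two of these properties with the SSISDB catalog.configure_catalog stored procedure through Invoke-Sqlcmd from the SqlServer PowerShell module; the server name and credentials are placeholders:

# Set the retention window to 60 days and keep at most 5 stored versions per project
$tsql = @"
EXEC [SSISDB].[catalog].[configure_catalog] @property_name = N'RETENTION_WINDOW', @property_value = 60;
EXEC [SSISDB].[catalog].[configure_catalog] @property_name = N'MAX_PROJECT_VERSIONS', @property_value = 5;
"@
Invoke-Sqlcmd -ServerInstance "myssisdbserver.database.windows.net" -Database "SSISDB" `
    -Username "myadmin" -Password "myPassword" -Query $tsql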

Clean up SSISDB logs automatically via ADF


Regardless of whether you use an Azure SQL Database server or SQL Managed Instance to host SSISDB, you can
always use ADF to clean up SSISDB logs automatically. To do so, you can prepare an Execute SSIS Package
activity in an ADF pipeline with an embedded package containing a single Execute SQL Task that invokes the
relevant SSISDB stored procedure. See example 4) in our blog: Run Any SQL Anywhere in 3 Easy Steps with SSIS
in Azure Data Factory.
Once your ADF pipeline is prepared, you can attach a schedule trigger to run it periodically, see How to trigger
ADF pipeline on a schedule.
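For example, a minimal Azure PowerShell sketch of attaching a daily schedule trigger to such a clean-up pipeline, assuming the Az.DataFactory module, an existing pipeline named CleanupSSISDBLogPipeline, and illustrative resource names:

# Define a schedule trigger that runs the clean-up pipeline once a day
$triggerJson = @"
{
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2021-08-01T00:00:00Z", "timeZone": "UTC" }
    },
    "pipelines": [
      { "pipelineReference": { "type": "PipelineReference", "referenceName": "CleanupSSISDBLogPipeline" } }
    ]
  }
}
"@
Set-Content -Path ".\CleanupSSISDBLogTrigger.json" -Value $triggerJson

# Create the trigger from the definition file and start it so it begins firing on schedule
Set-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" -DataFactoryName "myDataFactory" `
    -Name "CleanupSSISDBLogTrigger" -DefinitionFile ".\CleanupSSISDBLogTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" -DataFactoryName "myDataFactory" `
    -Name "CleanupSSISDBLogTrigger" -Force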

Clean up SSISDB logs automatically via Azure SQL Managed Instance


Agent
If you use Azure SQL Managed Instance to host SSISDB, you can also use its built-in job orchestrator/scheduler,
Azure SQL Managed Instance Agent, to clean up SSISDB logs automatically. If SSISDB is recently created in your
Azure SQL Managed Instance, we've also created a T-SQL job called SSIS Server Maintenance Job under
Azure SQL Managed Instance Agent for this purpose. It's by default disabled and configured with a schedule to
run daily. If you want to enable it and/or reconfigure its schedule, you can do so by connecting to your Azure
SQL Managed Instance using SSMS. Once connected, on the Object Explorer window of SSMS, you can
expand the SQL Server Agent node, expand the Jobs subnode, and double-click the SSIS Server
Maintenance Job to enable/reconfigure it.
If your Azure SQL Managed Instance Agent doesn't yet have the SSIS Server Maintenance Job created under
it, you can add it manually by running the following T-SQL script on your Azure SQL Managed Instance.

USE msdb
IF EXISTS(SELECT * FROM sys.server_principals where name = '##MS_SSISServerCleanupJobLogin##')
DROP LOGIN ##MS_SSISServerCleanupJobLogin##

DECLARE @loginPassword nvarchar(256)


SELECT @loginPassword = REPLACE (CONVERT( nvarchar(256), CRYPT_GEN_RANDOM( 64 )), N'''', N'''''')
EXEC ('CREATE LOGIN ##MS_SSISServerCleanupJobLogin## WITH PASSWORD =''' +@loginPassword + ''', CHECK_POLICY
= OFF')
ALTER LOGIN ##MS_SSISServerCleanupJobLogin## DISABLE

USE master
GRANT VIEW SERVER STATE TO ##MS_SSISServerCleanupJobLogin##

USE SSISDB
IF EXISTS (SELECT name FROM sys.database_principals WHERE name = '##MS_SSISServerCleanupJobUser##')
DROP USER ##MS_SSISServerCleanupJobUser##
CREATE USER ##MS_SSISServerCleanupJobUser## FOR LOGIN ##MS_SSISServerCleanupJobLogin##
GRANT EXECUTE ON [internal].[cleanup_server_retention_window_exclusive] TO ##MS_SSISServerCleanupJobUser##
GRANT EXECUTE ON [internal].[cleanup_server_project_version] TO ##MS_SSISServerCleanupJobUser##

USE msdb
EXEC dbo.sp_add_job
@job_name = N'SSIS Server Maintenance Job',
@enabled = 0,
@owner_login_name = '##MS_SSISServerCleanupJobLogin##',
@description = N'Runs every day. The job removes operation records from the database that are outside
the retention window and maintains a maximum number of versions per project.'

DECLARE @IS_server_name NVARCHAR(30)


SELECT @IS_server_name = CONVERT(NVARCHAR, SERVERPROPERTY('ServerName'))
EXEC sp_add_jobserver @job_name = N'SSIS Server Maintenance Job',
@server_name = @IS_server_name

EXEC sp_add_jobstep
@job_name = N'SSIS Server Maintenance Job',
@step_name = N'SSIS Server Operation Records Maintenance',
@subsystem = N'TSQL',
@command = N'
DECLARE @role int
SET @role = (SELECT [role] FROM [sys].[dm_hadr_availability_replica_states] hars INNER JOIN [sys].
[availability_databases_cluster] adc ON hars.[group_id] = adc.[group_id] WHERE hars.[is_local] = 1 AND adc.
[database_name] =''SSISDB'')
IF DB_ID(''SSISDB'') IS NOT NULL AND (@role IS NULL OR @role = 1)
EXEC [SSISDB].[internal].[cleanup_server_retention_window_exclusive]',
@database_name = N'msdb',
@on_success_action = 3,
@retry_attempts = 3,
@retry_interval = 3;

EXEC sp_add_jobstep
@job_name = N'SSIS Server Maintenance Job',
@step_name = N'SSIS Server Max Version Per Project Maintenance',
@subsystem = N'TSQL',
@command = N'
DECLARE @role int
SET @role = (SELECT [role] FROM [sys].[dm_hadr_availability_replica_states] hars INNER JOIN [sys].
[availability_databases_cluster] adc ON hars.[group_id] = adc.[group_id] WHERE hars.[is_local] = 1 AND adc.
[database_name] =''SSISDB'')
IF DB_ID(''SSISDB'') IS NOT NULL AND (@role IS NULL OR @role = 1)
EXEC [SSISDB].[internal].[cleanup_server_project_version]',
@database_name = N'msdb',
@retry_attempts = 3,
@retry_interval = 3;

EXEC sp_add_jobschedule
@job_name = N'SSIS Server Maintenance Job',
@name = 'SSISDB Scheduler',
@enabled = 1,
@freq_type = 4, /*daily*/
@freq_interval = 1,/*every day*/
@freq_subday_type = 0x1,
@active_start_date = 20001231,
@active_end_date = 99991231,
@active_start_time = 0,
@active_end_time = 120000

Clean up SSISDB logs automatically via Elastic Database Jobs


If you use Azure SQL Database server to host SSISDB, it doesn't have a built-in job orchestrator/scheduler, so
you must use an external component, e.g. ADF (see above) or Elastic Database Jobs (see the rest of this section),
to clean up SSISDB logs automatically.
Elastic Database Jobs is an Azure service that can automate and run jobs against a database or group of
databases. You can schedule, run, and monitor these jobs by using Azure portal, Azure PowerShell, T-SQL, or
REST APIs. Use Elastic Database Jobs to invoke the relevant SSISDB stored procedure for log clean-up one time
or on a schedule. You can choose the schedule interval based on SSISDB resource usage to avoid heavy
database load.
For more info, see Manage groups of databases with Elastic Database Jobs.
The following sections describe how to invoke the relevant SSISDB stored procedure,
[internal].[cleanup_server_retention_window_exclusive] , which removes SSISDB logs that are outside the
configured retention window.
Configure Elastic Database Jobs using Azure PowerShell
IMPORTANT
Using this Azure feature from PowerShell requires the AzureRM module installed. This is an older module only available
for Windows PowerShell 5.1 that no longer receives new features. The Az and AzureRM modules are not compatible
when installed for the same versions of PowerShell. If you need both versions:
1. Uninstall the Az module from a PowerShell 5.1 session.
2. Install the AzureRM module from a PowerShell 5.1 session.
3. Download and install PowerShell Core 6.x or later.
4. Install the Az module in a PowerShell Core session.

The following Azure PowerShell scripts create a new Elastic Job that invokes SSISDB log clean-up stored
procedure. For more info, see Create an Elastic Job agent using PowerShell.
Create parameters

# Parameters needed to create your job database


param(
$ResourceGroupName = $(Read-Host "Please enter an existing resource group name"),
$AgentServerName = $(Read-Host "Please enter the name of an existing Azure SQL Database server, for example
myjobserver, to hold your job database"),
$SSISDBLogCleanupJobDB = $(Read-Host "Please enter a name for your job database to be created in the given
Azure SQL Database server"),

# Your job database should be a clean, empty S0 or higher service tier. We set S0 as default.
$PricingTier = "S0",

# Parameters needed to create your Elastic Job agent


$SSISDBLogCleanupAgentName = $(Read-Host "Please enter a name for your Elastic Job agent"),

# Parameters needed to create credentials in your job database for connecting to SSISDB
$PasswordForSSISDBCleanupUser = $(Read-Host "Please provide a new password for the log clean-up job user to
connect to SSISDB"),

# Parameters needed to create the login and user for SSISDB


$SSISDBServerEndpoint = $(Read-Host "Please enter the name of target Azure SQL Database server that contains
SSISDB, for example myssisdbserver") + '.database.windows.net',
$SSISDBServerAdminUserName = $(Read-Host "Please enter the target server admin username for SQL
authentication"),
$SSISDBServerAdminPassword = $(Read-Host "Please enter the target server admin password for SQL
authentication"),
$SSISDBName = "SSISDB",

# Parameters needed to set the job schedule for invoking SSISDB log clean-up stored procedure
$RunJobOrNot = $(Read-Host "Please indicate whether you want to run the job to clean up SSISDB logs outside
the retention window immediately (Y/N). Make sure the retention window is set properly before running the
following scripts as deleted logs cannot be recovered."),
$IntervalType = $(Read-Host "Please enter the interval type for SSISDB log clean-up schedule: Year, Month,
Day, Hour, Minute, Second are supported."),
$IntervalCount = $(Read-Host "Please enter the count of interval type for SSISDB log clean-up schedule."),

# The start time for SSISDB log clean-up schedule is set to current time by default.
$StartTime = (Get-Date)
)

Invoke SSISDB log clean-up stored procedure

# Install the latest PowerShell PackageManagement module that PowerShellGet v1.6.5 depends on
Find-Package PackageManagement -RequiredVersion 1.1.7.2 | Install-Package -Force

# You may need to restart your PowerShell session


# Install the latest PowerShellGet module that adds the -AllowPrerelease flag to Install-Module
Find-Package PowerShellGet -RequiredVersion 1.6.5 | Install-Package -Force

# Install AzureRM.Sql preview cmdlets side by side with the existing AzureRM.Sql version
Install-Module -Name AzureRM.Sql -AllowPrerelease -Force

# Sign in to your Azure account


Connect-AzureRmAccount

# Create your job database for defining SSISDB log clean-up job and tracking the job history
Write-Output "Creating a blank SQL database to be used as your job database ..."
$JobDatabase = New-AzureRmSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $AgentServerName -
DatabaseName $SSISDBLogCleanupJobDB -RequestedServiceObjectiveName $PricingTier
$JobDatabase

# Enable Elastic Database Jobs preview in your Azure subscription


Register-AzureRmProviderFeature -FeatureName sqldb-JobAccounts -ProviderNamespace Microsoft.Sql

# Create your Elastic Job agent


Write-Output "Creating your Elastic Job agent..."
$JobAgent = $JobDatabase | New-AzureRmSqlElasticJobAgent -Name $SSISDBLogCleanupAgentName
$JobAgent

# Create job credentials in your job database for connecting to SSISDB in target server
Write-Output "Creating job credentials for connecting to SSISDB..."
$JobCredSecure = ConvertTo-SecureString -String $PasswordForSSISDBCleanupUser -AsPlainText -Force
$JobCred = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList
"SSISDBLogCleanupUser", $JobCredSecure
$JobCred = $JobAgent | New-AzureRmSqlElasticJobCredential -Name "SSISDBLogCleanupUser" -Credential $JobCred

# Create the job user login in master database of target server


Write-Output "Grant permissions on the master database of target server..."
$Params = @{
'Database' = 'master'
'ServerInstance' = $SSISDBServerEndpoint
'Username' = $SSISDBServerAdminUserName
'Password' = $SSISDBServerAdminPassword
'OutputSqlErrors' = $true
'Query' = "CREATE LOGIN SSISDBLogCleanupUser WITH PASSWORD = '" + $PasswordForSSISDBCleanupUser + "'"
}
Invoke-SqlCmd @Params

# Create SSISDB log clean-up user from login in SSISDB and grant it permissions to invoke SSISDB log clean-
up stored procedure
Write-Output "Grant appropriate permissions on SSISDB..."
$TargetDatabase = $SSISDBName
$CreateJobUser = "CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser"
$GrantStoredProcedureExecution = "GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO
SSISDBLogCleanupUser"

$TargetDatabase | ForEach-Object -Process {


$Params.Database = $_
$Params.Query = $CreateJobUser
Invoke-SqlCmd @Params
$Params.Query = $GrantStoredProcedureExecution
Invoke-SqlCmd @Params
}

# Create your target group that includes only SSISDB to clean up


Write-Output "Creating your target group that includes only SSISDB to clean up..."
$SSISDBTargetGroup = $JobAgent | New-AzureRmSqlElasticJobTargetGroup -Name "SSISDBTargetGroup"
$SSISDBTargetGroup | Add-AzureRmSqlElasticJobTarget -ServerName $SSISDBServerEndpoint -Database $SSISDBName

# Create your job to invoke SSISDB log clean-up stored procedure


Write-Output "Creating your job to invoke SSISDB log clean-up stored procedure..."
$JobName = "CleanupSSISDBLog"
$Job = $JobAgent | New-AzureRmSqlElasticJob -Name $JobName -RunOnce
$Job

# Add your job step to invoke internal.cleanup_server_retention_window_exclusive


Write-Output "Adding your job step to invoke SSISDB log clean-up stored procedure..."
$SqlText = "EXEC internal.cleanup_server_retention_window_exclusive"
$Job | Add-AzureRmSqlElasticJobStep -Name "Step to invoke SSISDB log clean-up stored procedure" -
TargetGroupName $SSISDBTargetGroup.TargetGroupName -CredentialName $JobCred.CredentialName -CommandText
$SqlText

# Run your job to immediately invoke SSISDB log clean-up stored procedure once
if ($RunJobOrNot -eq 'Y')
{
Write-Output "Invoking SSISDB log clean-up stored procedure immediately..."
$JobExecution = $Job | Start-AzureRmSqlElasticJob
$JobExecution
}

# Schedule your job to invoke SSISDB log clean-up stored procedure periodically, deleting SSISDB logs
outside the retention window
Write-Output "Starting your schedule to invoke SSISDB log clean-up stored procedure periodically..."
$Job | Set-AzureRmSqlElasticJob -IntervalType $IntervalType -IntervalCount $IntervalCount -StartTime
$StartTime -Enable

Configure Elastic Database Jobs using T -SQL


The following T-SQL scripts create a new Elastic Job that invokes SSISDB log clean-up stored procedure. For
more info, see Use T-SQL to create and manage Elastic Database Jobs.
1. Identify an empty S0/higher service tier of Azure SQL Database or create a new one for your job
database. Then create an Elastic Job Agent in Azure portal.
2. In your job database, create credentials for connecting to SSISDB in your target server.

-- Connect to the job database specified when creating your job agent.
-- Create a database master key if one doesn't already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD= '<EnterStrongPasswordHere>';

-- Create credentials for SSISDB log clean-up.


CREATE DATABASE SCOPED CREDENTIAL SSISDBLogCleanupCred WITH IDENTITY = 'SSISDBLogCleanupUser', SECRET
= '<EnterStrongPasswordHere>';

3. Define your target group that includes only SSISDB to clean up.

-- Connect to your job database.


-- Add your target group.
EXEC jobs.sp_add_target_group 'SSISDBTargetGroup'

-- Add SSISDB to your target group


EXEC jobs.sp_add_target_group_member 'SSISDBTargetGroup',
@target_type = 'SqlDatabase',
@server_name = '<EnterSSISDBTargetServerName>',
@database_name = 'SSISDB'

-- View your recently created target group and its members.


SELECT * FROM jobs.target_groups WHERE target_group_name = 'SSISDBTargetGroup';
SELECT * FROM jobs.target_group_members WHERE target_group_name = 'SSISDBTargetGroup';

4. Create SSISDB log clean-up user from login in SSISDB and grant it permissions to invoke SSISDB log
clean-up stored procedure. For detailed guidance, see Manage logins.

-- Connect to the master database of target server that hosts SSISDB


CREATE LOGIN SSISDBLogCleanupUser WITH PASSWORD = '<strong_password>';

-- Connect to SSISDB
CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser;
GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO SSISDBLogCleanupUser
5. Create your job and add your job step to invoke SSISDB log clean-up stored procedure.

-- Connect to your job database.


-- Add your job to invoke SSISDB log clean-up stored procedure.
EXEC jobs.sp_add_job @job_name='CleanupSSISDBLog', @description='Remove SSISDB logs outside the
configured retention window'

-- Add your job step to invoke internal.cleanup_server_retention_window_exclusive


EXEC jobs.sp_add_jobstep @job_name='CleanupSSISDBLog',
@command=N'EXEC internal.cleanup_server_retention_window_exclusive',
@credential_name='SSISDBLogCleanupCred',
@target_group_name='SSISDBTargetGroup'

6. Before continuing, make sure you set the retention window properly. SSISDB logs outside this window
will be deleted and can't be recovered. You can then run your job immediately to start SSISDB log clean-
up.

-- Connect to your job database.


-- Run your job immediately to invoke SSISDB log clean-up stored procedure.
declare @je uniqueidentifier
exec jobs.sp_start_job 'CleanupSSISDBLog', @job_execution_id = @je output

-- Watch SSISDB log clean-up results


select @je
select * from jobs.job_executions where job_execution_id = @je

7. Optionally, you can delete SSISDB logs outside the retention window on a schedule. Configure your job
parameters as follows.

-- Connect to your job database.


EXEC jobs.sp_update_job
@job_name='CleanupSSISDBLog',
@enabled=1,
@schedule_interval_type='<EnterIntervalType(Month,Day,Hour,Minute,Second)>',
@schedule_interval_count='<EnterDetailedIntervalValue>',
@schedule_start_time='<EnterProperStartTimeForSchedule>',
@schedule_end_time='<EnterProperEndTimeForSchedule>'

Monitor SSISDB log clean-up job using Azure portal


You can monitor SSISDB log clean-up job in Azure portal. For each execution, you can see its status, start time,
and end time.
Monitor SSISDB log clean-up job using T -SQL
You can also use T-SQL to view the execution history of SSISDB log clean-up job.

-- Connect to your job database.


-- View all SSISDB log clean-up job executions.
SELECT * FROM jobs.job_executions WHERE job_name = 'CleanupSSISDBLog'
ORDER BY start_time DESC

-- View all active executions.


SELECT * FROM jobs.job_executions WHERE is_active = 1
ORDER BY start_time DESC

Next steps
To manage and monitor your Azure-SSIS IR, see the following articles.
Reconfigure the Azure-SSIS integration runtime
Monitor the Azure-SSIS integration runtime.
Use Azure SQL Managed Instance with SQL Server
Integration Services (SSIS) in Azure Data Factory
3/26/2021 • 8 minutes to read • Edit Online

APPLIES TO: Azure Data Factory Azure Synapse Analytics


You can now move your SQL Server Integration Services (SSIS) projects, packages, and workloads to the Azure
cloud. Deploy, run, and manage SSIS projects and packages on Azure SQL Database or SQL Managed Instance
with familiar tools such as SQL Server Management Studio (SSMS). This article highlights the following specific
areas when using Azure SQL Managed Instance with Azure-SSIS integration runtime (IR):
Provision an Azure-SSIS IR with SSIS catalog (SSISDB) hosted by Azure SQL Managed Instance
Execute SSIS packages by Azure SQL Managed Instance Agent job
Clean up SSISDB logs by Azure SQL Managed Instance Agent job
Azure-SSIS IR failover with Azure SQL Managed Instance
Migrate on-premises SSIS workloads to SSIS in ADF with Azure SQL Managed Instance as database
workload destination

Provision Azure-SSIS IR with SSISDB hosted by Azure SQL Managed


Instance
Prerequisites
1. Enable Azure Active Directory (Azure AD) on Azure SQL Managed Instance, when choosing Azure Active
Directory authentication.
2. Choose how to connect SQL Managed Instance, over private endpoint or over public endpoint:
Over private endpoint (preferred)
a. Choose the virtual network for Azure-SSIS IR to join:
Inside the same virtual network as the managed instance, with different subnet .
Inside a different virtual network than the managed instance, via virtual network
peering (which is limited to the same region due to Global VNet peering constraints) or
a connection from virtual network to virtual network.
For more info on SQL Managed Instance connectivity, see Connect your application to
Azure SQL Managed Instance.
b. Configure virtual network.
Over public endpoint
Azure SQL Managed Instance can provide connectivity over public endpoints. Inbound and
outbound requirements must be met to allow traffic between SQL Managed Instance and the Azure-SSIS IR:
When the Azure-SSIS IR is not inside a virtual network (preferred)
Inbound requirement of SQL Managed Instance, to allow inbound traffic from the Azure-SSIS IR.

Transport protocol | Source | Source port range | Destination | Destination port range
TCP | AzureCloud service tag | * | VirtualNetwork | 3342
For more information, see Allow public endpoint traffic on the network security group.
When the Azure-SSIS IR is inside a virtual network
There is a special scenario when SQL Managed Instance is in a region that the Azure-SSIS IR
doesn't support and the Azure-SSIS IR is inside a virtual network without VNet peering, due to the
Global VNet peering limitation. In this scenario, the Azure-SSIS IR inside a virtual network
connects to SQL Managed Instance over the public endpoint. Use the Network Security
Group (NSG) rules below to allow traffic between SQL Managed Instance and the Azure-SSIS IR:
a. Inbound requirement of SQL Managed Instance, to allow inbound traffic from the Azure-SSIS IR.

Transport protocol | Source | Source port range | Destination | Destination port range
TCP | Static IP address of the Azure-SSIS IR (for details, see Bring Your Own Public IP for Azure-SSIS IR) | * | VirtualNetwork | 3342

b. Outbound requirement of Azure-SSIS IR, to allow outbound traffic to SQL Managed Instance.

Transport protocol | Source | Source port range | Destination | Destination port range
TCP | VirtualNetwork | * | SQL Managed Instance public endpoint IP address | 3342

Configure virtual network


1. User permission. The user who creates the Azure-SSIS IR must have a role assignment at least on the
Azure Data Factory resource, with one of the options below:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/* permission,
which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission. If you also want to bring your own
public IP addresses for Azure-SSIS IR while joining it to an Azure Resource Manager virtual network,
also include Microsoft.Network/publicIPAddresses/*/join/action permission in the role.
2. Virtual network.
a. Make sure that the virtual network's resource group can create and delete certain Azure network
resources.
The Azure-SSIS IR needs to create certain network resources under the same resource group as
the virtual network. These resources include:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip
Those resources will be created when your Azure-SSIS IR starts. They'll be deleted when your
Azure-SSIS IR stops. To avoid blocking your Azure-SSIS IR from stopping, don't reuse these
network resources in your other resources.
b. Make sure that you have no resource lock on the resource group/subscription to which the virtual
network belongs. If you configure a read-only/delete lock, starting and stopping your Azure-SSIS
IR will fail, or it will stop responding.
c. Make sure that you don't have an Azure policy that prevents the following resources from being
created under the resource group/subscription to which the virtual network belongs:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
d. Allow traffic on Network Security Group (NSG) rules, to allow traffic between SQL Managed
Instance and the Azure-SSIS IR, and traffic needed by the Azure-SSIS IR.
a. Inbound requirement of SQL Managed Instance, to allow inbound traffic from the Azure-SSIS IR.

Transport protocol | Source | Source port range | Destination | Destination port range | Comments
TCP | VirtualNetwork | * | VirtualNetwork | 1433, 11000-11999 | If your SQL Database server connection policy is set to Proxy instead of Redirect, only port 1433 is needed.
b. Outbound requirement of Azure-SSIS IR, to allow outbound traffic to SQL Managed Instance, and other traffic needed by the Azure-SSIS IR.

Transport protocol | Source | Source port range | Destination | Destination port range | Comments
TCP | VirtualNetwork | * | VirtualNetwork | 1433, 11000-11999 | Allow outbound traffic to SQL Managed Instance. If the connection policy is set to Proxy instead of Redirect, only port 1433 is needed.
TCP | VirtualNetwork | * | AzureCloud | 443 | The nodes of your Azure-SSIS IR in the virtual network use this port to access Azure services, such as Azure Storage and Azure Event Hubs.
TCP | VirtualNetwork | * | Internet | 80 | (Optional) The nodes of your Azure-SSIS IR in the virtual network use this port to download a certificate revocation list from the internet. If you block this traffic, you might experience performance downgrade when starting the IR and lose the capability to check the certificate revocation list for certificate usage. If you want to further narrow down the destination to certain FQDNs, refer to Use Azure ExpressRoute or User Defined Route (UDR).
TCP | VirtualNetwork | * | Storage | 445 | (Optional) This rule is only required when you want to execute SSIS packages stored in Azure Files.
c. Inbound requirement of Azure-SSIS IR, to allow traffic needed by the Azure-SSIS IR.

Transport protocol | Source | Source port range | Destination | Destination port range | Comments
TCP | BatchNodeManagement | * | VirtualNetwork | 29876, 29877 (if you join the IR to a Resource Manager virtual network) or 10100, 20100, 30100 (if you join the IR to a classic virtual network) | The Data Factory service uses these ports to communicate with the nodes of your Azure-SSIS IR in the virtual network. Whether or not you create a subnet-level NSG, Data Factory always configures an NSG at the level of the network interface cards (NICs) attached to the virtual machines that host the Azure-SSIS IR. Only inbound traffic from Data Factory IP addresses on the specified ports is allowed by that NIC-level NSG. Even if you open these ports to internet traffic at the subnet level, traffic from IP addresses that aren't Data Factory IP addresses is blocked at the NIC level.
TCP | CorpNetSaw | * | VirtualNetwork | 3389 | (Optional) This rule is only required when a Microsoft supporter asks the customer to open it for advanced troubleshooting, and it can be closed right after troubleshooting. The CorpNetSaw service tag permits only secure access workstations on the Microsoft corporate network to use remote desktop. This service tag can't be selected from the portal and is only available via Azure PowerShell or Azure CLI. At the NIC-level NSG, port 3389 is open by default, and we allow you to control port 3389 at the subnet-level NSG; meanwhile, the Azure-SSIS IR has disallowed port 3389 outbound by default via a Windows firewall rule on each IR node for protection.

e. See virtual network configuration for more info:


If you bring your own public IP addresses for the Azure-SSIS IR
If you use your own Domain Name System (DNS) server
If you use Azure ExpressRoute or a user-defined route (UDR)
If you use customized Azure-SSIS IR
Provision Azure-SSIS Integration Runtime
1. Select SQL Managed Instance private endpoint or public endpoint.
When provisioning Azure-SSIS IR in Azure portal/ADF app, on SQL Settings page, use SQL Managed
Instance private endpoint or public endpoint when creating SSIS catalog (SSISDB).
The public endpoint host name comes in the format <mi_name>.public.<dns_zone>.database.windows.net,
and the port used for the connection is 3342.

2. Select Azure AD authentication when applicable.


For more info about how to enable Azure AD authentication, see Enable Azure AD on Azure SQL
Managed Instance.
3. Join the Azure-SSIS IR to the virtual network when applicable.
On advanced setting page, select the Virtual Network and subnet to join.
When inside the same virtual network as SQL Managed Instance, choose a different subnet than SQL
Managed Instance.
For more information about how to join Azure-SSIS IR into a virtual network, see Join an Azure-SSIS
integration runtime to a virtual network.
For more info about how to create an Azure-SSIS IR, see Create an Azure-SSIS integration runtime in Azure Data
Factory.

Clean up SSISDB logs


The SSISDB log retention policy is defined by the following properties in catalog.catalog_properties:
OPERATION_CLEANUP_ENABLED
When the value is TRUE, operation details and operation messages older than RETENTION_WINDOW
(days) are deleted from the catalog. When the value is FALSE, all operation details and operation
messages are stored in the catalog. Note: a SQL Server job performs the operation cleanup.
RETENTION_WINDOW
The number of days that operation details and operation messages are stored in the catalog. When the
value is -1, the retention window is infinite. Note: If no cleanup is desired, set
OPERATION_CLEANUP_ENABLED to FALSE.
To remove SSISDB logs that are outside the retention window set by the administrator, you can trigger the
stored procedure [internal].[cleanup_server_retention_window_exclusive] . Optionally, you can schedule SQL
Managed Instance agent job execution to trigger the stored procedure.
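For illustration, a minimal sketch of invoking the clean-up stored procedure on demand through Invoke-Sqlcmd from the SqlServer PowerShell module, using the SQL Managed Instance public endpoint format and port described earlier; the instance name and credentials are placeholders:

# Remove SSISDB logs that fall outside the configured retention window
Invoke-Sqlcmd -ServerInstance "mymi.public.dnszone.database.windows.net,3342" -Database "SSISDB" `
    -Username "myadmin" -Password "myPassword" `
    -Query "EXEC [internal].[cleanup_server_retention_window_exclusive]"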

Next steps
Execute SSIS packages by Azure SQL Managed Instance Agent job
Set up Business continuity and disaster recovery (BCDR)
Migrate on-premises SSIS workloads to SSIS in ADF
Migrate SQL Server Agent jobs to ADF with SSMS

APPLIES TO: Azure Data Factory Azure Synapse Analytics


When you migrate on-premises SQL Server Integration Services (SSIS) workloads to SSIS in ADF, after the SSIS
packages are migrated, you can do a batch migration of SQL Server Agent jobs with the job step type SQL Server
Integration Services Package to Azure Data Factory (ADF) pipelines/activities/schedule triggers via the SQL Server
Management Studio (SSMS) SSIS Job Migration Wizard.
In general, for selected SQL Agent jobs with applicable job step types, the SSIS Job Migration Wizard can:
map on-premises SSIS package locations to the locations the packages are migrated to, which are accessible by SSIS
in ADF.

NOTE
Only the File System package location is supported.

migrate applicable jobs with applicable job steps to corresponding ADF resources as below:

SQL AGENT JOB OBJECT | ADF RESOURCE | NOTES

SQL Agent job | pipeline | Name of the pipeline will be "Generated for <job name>". Built-in agent jobs are not applicable: SSIS Server Maintenance Job, syspolicy_purge_history, collection_set_*, mdw_purge_data_*, sysutility_*.

SSIS job step | Execute SSIS Package activity | Name of the activity will be <step name>. The proxy account used in the job step will be migrated as the Windows authentication of this activity. Execution options except "Use 32-bit runtime" defined in the job step will be ignored in migration. Verification defined in the job step will be ignored in migration.

schedule | schedule trigger | Name of the schedule trigger will be "Generated for <schedule name>". Below options in the SQL Agent job schedule will be ignored in migration: second-level interval, "Start automatically when SQL Server Agent starts", "Start whenever the CPUs become idle", weekday and weekend day. Below are the differences after a SQL Agent job schedule is migrated to an ADF schedule trigger: an ADF schedule trigger's subsequent run is independent of the execution state of the antecedent triggered run, and the ADF schedule trigger recurrence configuration differs from the Daily frequency in a SQL Agent job.

generate Azure Resource Manager (ARM) templates in a local output folder, and deploy them to the data factory directly
or manually later. For more information about ADF Resource Manager templates, see Microsoft.DataFactory
resource types.

Prerequisites
The feature described in this article requires SQL Server Management Studio version 18.5 or higher. To get the
latest version of SSMS, see Download SQL Server Management Studio (SSMS).

Migrate SSIS jobs to ADF


1. In SSMS, in Object Explorer, select SQL Server Agent, select Jobs, then right-click and select Migrate
SSIS Jobs to ADF .

2. Sign in to Azure, then select the Azure subscription, data factory, and integration runtime. Azure Storage is optional;
it's used in the package location mapping step if the SSIS jobs to be migrated have SSIS File System
packages.
3. Map the paths of SSIS packages and configuration files in SSIS jobs to destination paths that migrated
pipelines can access. In this mapping step, you can:
a. Select a source folder, then Add Mapping .
b. Update source folder path. Valid paths are folder paths or parent folder paths of packages.
c. Update destination folder path. Default is relative path to the default Storage account, which is
selected in step 1.
d. Delete a selected mapping via Delete Mapping .
4. Select the applicable jobs to migrate, and configure the settings of the corresponding Execute SSIS Package
activity.
Default Setting applies to all selected steps by default. For more information about each property, see the
Settings tab for the Execute SSIS Package activity when the package location is File System (Package).

Step Setting configures the setting for a selected step.

Apply Default Setting: selected by default. Clear it to configure the setting for the selected step only.
For more information about other properties, see the Settings tab for the Execute SSIS Package activity
when the package location is File System (Package).
5. Generate and deploy the ARM template.
a. Select or enter the output path for the ARM templates of the migrated ADF pipelines. The folder will be
created automatically if it doesn't exist.
b. Select the option Deploy ARM templates to your data factory:
Default is unselected. You can deploy the generated ARM templates manually later (see the sketch after these steps).
Select it to deploy the generated ARM templates to the data factory directly.

6. Migrate, then check results.
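If you deploy the generated ARM templates manually later, a minimal Azure PowerShell sketch is shown below; the resource group name and template file names are placeholders, and the actual file names produced in your output folder may differ.

# Placeholder paths; point these at the templates generated in your output folder
New-AzResourceGroupDeployment `
    -ResourceGroupName "YourDataFactoryResourceGroup" `
    -TemplateFile "C:\MigrationOutput\arm_template.json" `
    -TemplateParameterFile "C:\MigrationOutput\arm_template_parameters.json"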


Next steps
Run and monitor pipeline
Manage packages with Azure-SSIS Integration
Runtime package store

APPLIES TO: Azure Data Factory Azure Synapse Analytics


To lift & shift your on-premises SQL Server Integration Services (SSIS) workloads to the cloud, you can
provision Azure-SSIS Integration Runtime (IR) in Azure Data Factory (ADF). For more information, see Provision
an Azure-SSIS IR. An Azure-SSIS IR supports:
Running packages deployed into SSIS catalog (SSISDB) hosted by Azure SQL Database server/Managed
Instance (Project Deployment Model)
Running packages deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by Azure
SQL Managed Instance (Package Deployment Model)
When you use Package Deployment Model, you can choose whether you want to provision your Azure-SSIS IR
with package stores. They provide a package management layer on top of file system, Azure Files, or MSDB
hosted by Azure SQL Managed Instance. Azure-SSIS IR package store allows you to import/export/delete/run
packages and monitor/stop running packages via SQL Server Management Studio (SSMS) similar to the legacy
SSIS package store.

Connect to Azure-SSIS IR
Once your Azure-SSIS IR is provisioned, you can connect to it to browse its package stores on SSMS.

On the Object Explorer window of SSMS, select Azure-SSIS Integration Runtime in the Connect drop-
down menu. Next, sign in to Azure and select the relevant subscription, ADF, and Azure-SSIS IR that you've
provisioned with package stores. Your Azure-SSIS IR will appear with Running Packages and Stored
Packages nodes underneath. Expand the Stored Packages node to see your package stores underneath.
Expand your package stores to see folders and packages underneath. You may be asked to enter the access
credentials for your package stores, if SSMS fails to connect to them automatically. For example, if you expand a
package store on top of MSDB, you may be asked to connect to your Azure SQL Managed Instance first.
Manage folders and packages
After you connect to your Azure-SSIS IR on SSMS, you can right-click on any package stores, folders, or
packages to pop up a menu and select New Folder, Import Package, Export Package, Delete, or Refresh.
Select New Folder to create a new folder for imported packages.
Select Import Package to import packages from File System, SQL Server (MSDB), or the legacy SSIS
Package Store into your package store.
Depending on the package location to import from, select the relevant Server/Authentication type,
enter the access credentials if necessary, select the Package path, and enter the new Package name.
When importing packages, their protection level can't be changed. To change it, use SQL Server Data
Tools (SSDT) or the dtutil command-line utility.

NOTE
Importing SSIS packages into Azure-SSIS IR package stores can only be done one-by-one and will simply copy
them into the underlying MSDB/file system/Azure Files while preserving their SQL Server/SSIS version.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will upgrade
them into SSIS 2017 packages at run-time. Executing higher-version packages is unsupported.
Additionally, since legacy SSIS package stores are bound to specific SQL Server version and accessible only on
SSMS for that version, lower-version packages in legacy SSIS package stores need to be exported into file system
first using the designated SSMS version before they can be imported into Azure-SSIS IR package stores using
SSMS 2019 or later versions.
Alternatively, to import multiple SSIS packages into Azure-SSIS IR package stores while switching their protection
level, you can use dtutil command line utility, see Deploying multiple packages with dtutil.

Select Export Package to export packages from your package store into File System, SQL Server
(MSDB), or the legacy SSIS Package Store.
Depending on the package location to export into, select the relevant Server/Authentication type,
enter the access credentials if necessary, and select the Package path. When exporting packages, if
they're encrypted, enter the passwords to decrypt them first and then you can change their protection
level, for example to avoid storing any sensitive data or to encrypt it or all data with user key or
password.

NOTE
Exporting SSIS packages from Azure-SSIS IR package stores can only be done one-by-one and doing so without
switching their protection level will simply copy them while preserving their SQL Server/SSIS version, otherwise it
will upgrade them into SSIS 2019 or later-version packages.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will upgrade
them into SSIS 2017 packages at run-time. Executing higher-version packages is unsupported.
Alternatively, to export multiple SSIS packages from Azure-SSIS IR package stores while switching their protection
level, you can use dtutil command line utility, see Deploying multiple packages with dtutil.

Select Delete to delete existing folders/packages from your package store.


Select Refresh to show newly added folders/packages in your package store.

Execute packages
After you connect to your Azure-SSIS IR on SSMS, you can right-click on any stored packages to pop up a menu
and select Run Package . This will open the Execute Package Utility dialog, where you can configure your
package executions on Azure-SSIS IR as Execute SSIS Package activities in ADF pipelines.
The General , Configurations , Execution Options , and Logging pages of Execute Package Utility dialog
correspond to the Settings tab of Execute SSIS Package activity. On these pages, you can enter the encryption
password for your package and access information for your package configuration file. You can also enter your
package execution credentials and properties, as well as the access information for your log folder. The Set
Values page of the Execute Package Utility dialog corresponds to the Property Overrides tab of the Execute SSIS
Package activity, where you can enter your existing package properties to override. For more information, see
Run SSIS packages as Execute SSIS Package activities in ADF pipelines.
When you select the Execute button, a new ADF pipeline with Execute SSIS Package activity will be
automatically generated and triggered. If an ADF pipeline with the same settings already exists, it will be rerun
and a new pipeline won't be generated. The ADF pipeline and Execute SSIS Package activity will be named
Pipeline_SSMS_YourPackageName_HashString and Activity_SSMS_YourPackageName , respectively.
Monitor and stop running packages
After you connect to your Azure-SSIS IR on SSMS, you can expand the Running Packages node to see your
currently running packages underneath. Right-click on any of them to pop up a menu and select Stop or
Refresh .
Select Stop to cancel the currently running ADF pipeline that runs the package as Execute SSIS Package
activity.
Select Refresh to show newly running packages from your package stores.

Monitor Azure-SSIS IR and edit package stores


After you connect to your Azure-SSIS IR on SSMS, you can right-click on it to pop up a menu and select Go to
Azure Data Factory portal or Refresh.
Select Go to Azure Data Factory portal to open the Integration runtimes page of the ADF monitoring
hub, where you can monitor your Azure-SSIS IR. On the PACKAGE STORES tile, you can see the number
of package stores that are attached to your Azure-SSIS IR. Selecting that number will pop up a window
where you can edit ADF linked services that store the access information for your package stores.

Select Refresh to show newly added folders/packages in your package stores and running packages
from your package stores.
Deploying multiple packages with dtutil
To lift & shift your on-premises SSIS workloads onto SSIS in ADF while maintaining the legacy Package
Deployment Model, you need to deploy your packages from file system, MSDB hosted by SQL Server, or legacy
SSIS package stores into Azure Files, MSDB hosted by Azure SQL Managed Instance, or Azure-SSIS IR package
stores. At the same time, you should also switch their protection level from encryption by user key to
unencrypted or encryption by password if you haven't done so already.
You can use dtutil command line utility that comes with SQL Server/SSIS installation to deploy multiple
packages in batches. It's bound to specific SSIS version, so if you use it to deploy lower-version packages
without switching their protection level, it will simply copy them while preserving their SSIS version. If you use it
to deploy them and switch their protection level at the same time, it will upgrade them into its SSIS version.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will
upgrade them into SSIS 2017 packages at run-time. Executing higher-version packages is unsupported.
Consequently, to avoid run-time upgrades, deploying packages to run on Azure-SSIS IR in Package Deployment
Model should use dtutil 2017 that comes with SQL Server/SSIS 2017 installation. You can download and install
the free SQL Server/SSIS 2017 Developer Edition for this purpose. Once installed, you can find dtutil 2017 on
this folder: YourLocalDrive:\Program Files\Microsoft SQL Server\140\DTS\Binn .
Deploying multiple packages from file system on premises into Azure Files with dtutil
To deploy multiple packages from file system into Azure Files and switch their protection level at the same time,
you can run the following commands at a command prompt. Please replace all strings that are specific to your
case.

REM Persist the access credentials for Azure Files on your local machine
cmdkey /ADD:YourStorageAccountName.file.core.windows.net /USER:azure\YourStorageAccountName /PASS:YourStorageAccountKey

REM Connect Azure Files to a drive on your local machine
net use Z: \\YourStorageAccountName.file.core.windows.net\YourFileShare /PERSISTENT:Yes

REM Go to a local folder where you store your packages
cd YourLocalDrive:\...\YourPackageFolder

REM Run dtutil in a loop to deploy your packages from the local folder into Azure Files while switching their protection level
for %f in (*.dtsx) do dtutil.exe /FILE %f /ENCRYPT FILE;Z:\%f;2;YourEncryptionPassword

To run the above commands in a batch file, replace %f with %%f .


To deploy multiple packages from legacy SSIS package stores on top of file system into Azure Files and switch
their protection level at the same time, you can use the same commands, but replace
YourLocalDrive:\...\YourPackageFolder with a local folder used by legacy SSIS package stores:
YourLocalDrive:\Program Files\Microsoft SQL
Server\YourSQLServerDefaultCompatibilityLevel\DTS\Packages\YourPackageFolder
. For example, if your legacy SSIS package store is bound to SQL Server 2016, go to
YourLocalDrive:\Program Files\Microsoft SQL Server\130\DTS\Packages\YourPackageFolder . You can find the value
for YourSQLServerDefaultCompatibilityLevel from a list of SQL Server default compatibility levels.
If you've configured Azure-SSIS IR package stores on top of Azure Files, your deployed packages will appear in
them when you connect to your Azure-SSIS IR on SSMS 2019 or later versions.
Deploying multiple packages from MSDB on premises into MSDB in Azure with dtutil
To deploy multiple packages from MSDB hosted by SQL Server or legacy SSIS package stores on top of MSDB
into MSDB hosted by Azure SQL Managed Instance and switch their protection level at the same time, you can
connect to your SQL Server on SSMS, right-click on the Databases->System Databases->msdb node on the Object
Explorer of SSMS to open a New Query window, and run the following T-SQL script. Please replace all strings
that are specific to your case:

BEGIN
SELECT 'dtutil /SQL '+f.foldername+'\'+NAME+' /ENCRYPT SQL;'+f.foldername+'\'+NAME+';2;YourEncryptionPassword /DestServer YourSQLManagedInstanceEndpoint /DestUser YourSQLAuthUsername /DestPassword YourSQLAuthPassword'
FROM msdb.dbo.sysssispackages p
inner join msdb.dbo.sysssispackagefolders f
ON p.folderid = f.folderid
END

To use the private/public endpoint of your Azure SQL Managed Instance, replace
YourSQLManagedInstanceEndpoint with YourSQLMIName.YourDNSPrefix.database.windows.net /
YourSQLMIName.public.YourDNSPrefix.database.windows.net,3342 , respectively.

The script will generate dtutil command lines for all packages in MSDB that you can multiselect, copy & paste,
and run at a command prompt.

dtutil /SQL YourFolder\YourPackage1 /ENCRYPT SQL;YourFolder\YourPackage1;2;YourEncryptionPassword /DestServer YourSQLManagedInstanceEndpoint /DestUser YourUserName /DestPassword YourPassword
dtutil /SQL YourFolder\YourPackage2 /ENCRYPT SQL;YourFolder\YourPackage2;2;YourEncryptionPassword /DestServer YourSQLManagedInstanceEndpoint /DestUser YourUserName /DestPassword YourPassword
dtutil /SQL YourFolder\YourPackage3 /ENCRYPT SQL;YourFolder\YourPackage3;2;YourEncryptionPassword /DestServer YourSQLManagedInstanceEndpoint /DestUser YourUserName /DestPassword YourPassword

If you've configured Azure-SSIS IR package stores on top of MSDB, your deployed packages will appear in them
when you connect to your Azure-SSIS IR on SSMS 2019 or later versions.
Deploying multiple packages from MSDB on premises into Azure Files with dtutil
To deploy multiple packages from MSDB hosted by SQL Server or legacy SSIS package stores on top of MSDB
into Azure Files and switch their protection level at the same time, you can connect to your SQL Server on
SSMS, right-click on the Databases->System Databases->msdb node on the Object Explorer of SSMS to open a New
Query window, and run the following T-SQL script. Please replace all strings that are specific to your case:
BEGIN
SELECT 'dtutil /SQL '+f.foldername+'\'+NAME+' /ENCRYPT FILE;Z:\'+f.foldername+'\'+NAME+'.dtsx;2;YourEncryptionPassword'
FROM msdb.dbo.sysssispackages p
inner join msdb.dbo.sysssispackagefolders f
ON p.folderid = f.folderid
END

The script will generate dtutil command lines for all packages in MSDB that you can multiselect, copy & paste,
and run at a command prompt.

REM Persist the access credentials for Azure Files on your local machine
cmdkey /ADD:YourStorageAccountName.file.core.windows.net /USER:azure\YourStorageAccountName /PASS:YourStorageAccountKey

REM Connect Azure Files to a drive on your local machine
net use Z: \\YourStorageAccountName.file.core.windows.net\YourFileShare /PERSISTENT:Yes

REM Multiselect, copy & paste, and run the T-SQL-generated dtutil command lines to deploy your packages from MSDB on premises into Azure Files while switching their protection level
dtutil /SQL YourFolder\YourPackage1 /ENCRYPT FILE;Z:\YourFolder\YourPackage1.dtsx;2;YourEncryptionPassword
dtutil /SQL YourFolder\YourPackage2 /ENCRYPT FILE;Z:\YourFolder\YourPackage2.dtsx;2;YourEncryptionPassword
dtutil /SQL YourFolder\YourPackage3 /ENCRYPT FILE;Z:\YourFolder\YourPackage3.dtsx;2;YourEncryptionPassword

If you've configured Azure-SSIS IR package stores on top of Azure Files, your deployed packages will appear in
them when you connect to your Azure-SSIS IR on SSMS 2019 or later versions.

Next steps
You can rerun/edit the auto-generated ADF pipelines with Execute SSIS Package activities or create new ones on
ADF portal. For more information, see Run SSIS packages as Execute SSIS Package activities in ADF pipelines.
Create a trigger that runs a pipeline on a schedule

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article provides information about the schedule trigger and the steps to create, start, and monitor a
schedule trigger. For other types of triggers, see Pipeline execution and triggers.
When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, etc.) for the trigger,
and associate it with a pipeline. Pipelines and triggers have a many-to-many relationship. Multiple triggers can
kick off a single pipeline. A single trigger can kick off multiple pipelines.
The following sections provide steps to create a schedule trigger in different ways.

Data Factory UI
You can create a schedule trigger to schedule a pipeline to run periodically (hourly, daily, etc.).

NOTE
For a complete walkthrough of creating a pipeline and a schedule trigger, which associates the trigger with the pipeline,
and runs and monitors the pipeline, see Quickstart: create a data factory using Data Factory UI.

1. Switch to the Edit tab, shown with a pencil symbol.

2. Select Trigger on the menu, then select New/Edit .


3. On the Add Triggers page, select Choose trigger..., then select +New .

4. On the New Trigger page, do the following steps:


a. Confirm that Schedule is selected for Type .
b. Specify the start datetime of the trigger for Start Date. It's set to the current datetime in
Coordinated Universal Time (UTC) by default.
c. Specify the time zone that the trigger will be created in. The time zone setting will apply to Start
Date, End Date, and Schedule Execution Times in Advanced recurrence options. Changing the
Time Zone setting will not automatically change your start date. Make sure the Start Date is correct
in the specified time zone. Note that the scheduled execution time of the trigger is considered
after the Start Date (ensure the Start Date is at least 1 minute earlier than the execution time; otherwise the
trigger will fire the pipeline in the next recurrence).

NOTE
For time zones that observe daylight saving, trigger time will auto-adjust for the twice a year change. To
opt out of the daylight saving change, please select a time zone that does not observe daylight saving, for
instance UTC

d. Specify Recurrence for the trigger. Select one of the values from the drop-down list (Every
minute, Hourly, Daily, Weekly, and Monthly). Enter the multiplier in the text box. For example, if you
want the trigger to run once for every 15 minutes, you select Every Minute, and enter 15 in the
text box.
e. In the Recurrence , if you choose "Day(s), Week(s) or Month(s)" from the drop-down, you can find
"Advanced recurrence options".
f. To specify an end date time, select Specify an End Date , and specify Ends On, then select OK .
There is a cost associated with each pipeline run. If you are testing, you may want to ensure that
the pipeline is triggered only a couple of times. However, ensure that there is enough time for the
pipeline to run between the publish time and the end time. The trigger comes into effect only after
you publish the solution to Data Factory, not when you save the trigger in the UI.
5. In the New Trigger window, select Yes in the Activated option, then select OK . You can use this
checkbox to deactivate the trigger later.

6. In the New Trigger window, review the warning message, then select OK .
7. Select Publish all to publish the changes to Data Factory. Until you publish the changes to Data Factory,
the trigger doesn't start triggering the pipeline runs.
8. Switch to the Pipeline runs tab on the left, then select Refresh to refresh the list. You will see the
pipeline runs triggered by the scheduled trigger. Notice the values in the Triggered By column. If you
use the Trigger Now option, you will see the manual trigger run in the list.

9. Switch to the Trigger Runs \ Schedule view.

Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

This section shows you how to use Azure PowerShell to create, start, and monitor a schedule trigger. To see this
sample working, first go through the Quickstart: Create a data factory by using Azure PowerShell. Then, add the
following code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The
trigger is associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:

IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of
the endTime element to one hour past the current UTC time.
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2017-12-08T00:00:00Z",
"endTime": "2017-12-08T01:00:00Z",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "Adfv2QuickStartPipeline"
},
"parameters": {
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}
}
]
}
}

In the JSON snippet:


The type element of the trigger is set to "ScheduleTrigger".
The frequency element is set to "Minute" and the interval element is set to 15. As such, the
trigger runs the pipeline every 15 minutes between the start and end times.
The timeZone element specifies the time zone that the trigger is created in. This setting affects
both startTime and endTime.
The endTime element is one hour after the value of the startTime element. As such, the trigger
runs the pipeline 15 minutes, 30 minutes, and 45 minutes after the start time. Don't forget to
update the start time to the current UTC time, and the end time to one hour past the start time.

IMPORTANT
For the UTC time zone, startTime and endTime need to follow the format 'yyyy-MM-ddTHH:mm:ssZ', while for
other time zones, startTime and endTime follow 'yyyy-MM-ddTHH:mm:ss'.
Per the ISO 8601 standard, the Z suffix on a timestamp marks the datetime as UTC and makes the
timeZone field redundant, while a missing Z suffix for the UTC time zone results in an error upon trigger
activation.

The trigger is associated with the Adfv2QuickStartPipeline pipeline. To associate multiple
pipelines with a trigger, add more pipelineReference sections.
The pipeline in the Quickstart takes two parameter values: inputPath and outputPath. And you
pass values for these parameters from the trigger.
2. Create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"
3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

4. Start the trigger by using the Start-AzDataFactoryV2Trigger cmdlet:

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
the information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"

NOTE
Trigger times of schedule triggers are specified as UTC timestamps. TriggerRunStartedAfter and
TriggerRunStartedBefore also expect UTC timestamps.
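If you want to pause the trigger after testing (for example, before the endTime elapses), a minimal sketch using the corresponding stop cmdlet with the same variables from the Quickstart is:

# Stop the trigger so it no longer fires pipeline runs
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"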

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

.NET SDK
This section shows you how to use the .NET SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the .NET SDK. Then, add the following
code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The trigger is
associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
To create and start a schedule trigger that runs every 15 minutes, add the following code to the main method:
// Create the trigger
Console.WriteLine("Creating the trigger");

// Set the start time to the current UTC time


DateTime startTime = DateTime.UtcNow;

// Specify values for the inputPath and outputPath parameters


Dictionary<string, object> pipelineParameters = new Dictionary<string, object>();
pipelineParameters.Add("inputPath", "adftutorial/input");
pipelineParameters.Add("outputPath", "adftutorial/output");

// Create a schedule trigger


string triggerName = "MyTrigger";
ScheduleTrigger myTrigger = new ScheduleTrigger()
{
Pipelines = new List<TriggerPipelineReference>()
{
// Associate the Adfv2QuickStartPipeline pipeline with the trigger
new TriggerPipelineReference()
{
PipelineReference = new PipelineReference(pipelineName),
Parameters = pipelineParameters,
}
},
Recurrence = new ScheduleTriggerRecurrence()
{
        // Set the start time to the current UTC time and the end time to one hour after the start time
StartTime = startTime,
TimeZone = "UTC",
EndTime = startTime.AddHours(1),
Frequency = RecurrenceFrequency.Minute,
Interval = 15,
}
};

// Now, create the trigger by invoking the CreateOrUpdate method


TriggerResource triggerResource = new TriggerResource()
{
Properties = myTrigger
};
client.Triggers.CreateOrUpdate(resourceGroup, dataFactoryName, triggerName, triggerResource);

// Start the trigger


Console.WriteLine("Starting the trigger");
client.Triggers.Start(resourceGroup, dataFactoryName, triggerName);

To create triggers in a time zone other than UTC, the following settings are required:

<<ClientInstance>>.SerializationSettings.DateFormatHandling =
Newtonsoft.Json.DateFormatHandling.IsoDateFormat;
<<ClientInstance>>.SerializationSettings.DateTimeZoneHandling =
Newtonsoft.Json.DateTimeZoneHandling.Unspecified;
<<ClientInstance>>.SerializationSettings.DateParseHandling = DateParseHandling.None;
<<ClientInstance>>.DeserializationSettings.DateParseHandling = DateParseHandling.None;
<<ClientInstance>>.DeserializationSettings.DateFormatHandling =
Newtonsoft.Json.DateFormatHandling.IsoDateFormat;
<<ClientInstance>>.DeserializationSettings.DateTimeZoneHandling =
Newtonsoft.Json.DateTimeZoneHandling.Unspecified;

To monitor a trigger run, add the following code before the last Console.WriteLine statement in the sample:
// Check that the trigger runs every 15 minutes
Console.WriteLine("Trigger runs. You see the output every 15 minutes");

for (int i = 0; i < 3; i++)


{
System.Threading.Thread.Sleep(TimeSpan.FromMinutes(15));
List<TriggerRun> triggerRuns = client.Triggers.ListRuns(resourceGroup, dataFactoryName,
triggerName, DateTime.UtcNow.AddMinutes(-15 * (i + 1)), DateTime.UtcNow.AddMinutes(2)).ToList();
Console.WriteLine("{0} trigger runs found", triggerRuns.Count);

foreach (TriggerRun run in triggerRuns)


{
foreach (KeyValuePair<string, string> triggeredPipeline in run.TriggeredPipelines)
{
PipelineRun triggeredPipelineRun = client.PipelineRuns.Get(resourceGroup,
dataFactoryName, triggeredPipeline.Value);
Console.WriteLine("Pipeline run ID: {0}, Status: {1}", triggeredPipelineRun.RunId,
triggeredPipelineRun.Status);
List<ActivityRun> runs = client.ActivityRuns.ListByPipelineRun(resourceGroup,
dataFactoryName, triggeredPipelineRun.RunId, run.TriggerRunTimestamp.Value,
run.TriggerRunTimestamp.Value.AddMinutes(20)).ToList();
}
}
}

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

Python SDK
This section shows you how to use the Python SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the Python SDK. Then, add the following
code block after the "monitor the pipeline run" code block in the Python script. This code creates a schedule
trigger that runs every 15 minutes between the specified start and end times. Update the start_time variable to
the current UTC time, and the end_time variable to one hour past the current UTC time.

# Create a trigger
tr_name = 'mytrigger'
scheduler_recurrence = ScheduleTriggerRecurrence(frequency='Minute', interval='15', start_time='2017-12-12T04:00:00Z', end_time='2017-12-12T05:00:00Z', time_zone='UTC')
pipeline_parameters = {'inputPath':'adftutorial/input', 'outputPath':'adftutorial/output'}
pipelines_to_run = []
pipeline_reference = PipelineReference('copyPipeline')
pipelines_to_run.append(TriggerPipelineReference(pipeline_reference, pipeline_parameters))
tr_properties = ScheduleTrigger(description='My scheduler trigger', pipelines = pipelines_to_run,
recurrence=scheduler_recurrence)
adf_client.triggers.create_or_update(rg_name, df_name, tr_name, tr_properties)

# Start the trigger


adf_client.triggers.start(rg_name, df_name, tr_name)

To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.

Azure Resource Manager template


You can use an Azure Resource Manager template to create a trigger. For step-by-step instructions, see Create an
Azure data factory by using a Resource Manager template.

Pass the trigger start time to a pipeline


Azure Data Factory version 1 supports reading or writing partitioned data by using the system variables
SliceStart, SliceEnd, WindowStart, and WindowEnd. In the current version of Azure Data Factory, you can
achieve this behavior by using a pipeline parameter. The start time and scheduled time for the trigger are set as
the value for the pipeline parameter. In the following example, the scheduled time for the trigger is passed as a
value to the pipeline scheduledRunTime parameter:

"parameters": {
"scheduledRunTime": "@trigger().scheduledTime"
}

JSON schema
The following JSON definition shows you how to create a schedule trigger with scheduling and recurrence:

{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Month>>,
"interval": <<int>>, // Optional, specifies how often to fire (default to 1)
"startTime": <<datetime>>,
"endTime": <<datetime - optional>>,
"timeZone": "UTC"
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-23>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-59>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>" : "<parameter 2 Value>"
}
}
]
}
}

IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.

Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling of a trigger:

JSON PROPERTY | DESCRIPTION

startTime | A Date-Time value. For simple schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value. For the UTC time zone, the format is 'yyyy-MM-ddTHH:mm:ssZ'; for other time zones, the format is 'yyyy-MM-ddTHH:mm:ss'.

endTime | The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past. This property is optional. For the UTC time zone, the format is 'yyyy-MM-ddTHH:mm:ssZ'; for other time zones, the format is 'yyyy-MM-ddTHH:mm:ss'.

timeZone | The time zone the trigger is created in. This setting impacts startTime, endTime, and schedule. See the list of supported time zones.

recurrence | A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.

frequency | The unit of frequency at which the trigger recurs. The supported values include "minute," "hour," "day," "week," and "month."

interval | A positive integer that denotes the interval for the frequency value, which determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week," the trigger recurs every 3 weeks.

schedule | The recurrence schedule for the trigger. A trigger with a specified frequency value alters its recurrence based on a recurrence schedule. The schedule property contains modifications for the recurrence that are based on minutes, hours, weekdays, month days, and week number.

IMPORTANT
For the UTC time zone, startTime and endTime need to follow the format 'yyyy-MM-ddTHH:mm:ssZ', while for other
time zones, startTime and endTime follow 'yyyy-MM-ddTHH:mm:ss'.
Per the ISO 8601 standard, the Z suffix on a timestamp marks the datetime as UTC and makes the timeZone field redundant,
while a missing Z suffix for the UTC time zone results in an error upon trigger activation.

Schema defaults, limits, and examples


JSON PROPERTY | TYPE | REQUIRED | DEFAULT VALUE | VALID VALUES | EXAMPLE

startTime | String | Yes | None | ISO-8601 Date-Times | For the UTC time zone: "startTime" : "2013-01-09T09:30:00Z". For another time zone: "startTime" : "2013-01-09T09:30:00".

timeZone | String | Yes | None | Time Zone Values | "UTC"

recurrence | Object | Yes | None | Recurrence object | "recurrence" : { "frequency" : "monthly", "interval" : 1 }

interval | Number | No | 1 | 1 to 1,000 | "interval":10

endTime | String | Yes | None | A Date-Time value that represents a time in the future. | For the UTC time zone: "endTime" : "2013-02-09T09:30:00Z". For another time zone: "endTime" : "2013-02-09T09:30:00".

schedule | Object | No | None | Schedule object | "schedule" : { "minute" : [30], "hour" : [8,17] }

Time zone option

Here are some of the time zones supported for schedule triggers:

TIME ZONE | UTC OFFSET (NON-DAYLIGHT SAVING) | TIMEZONE VALUE | OBSERVES DAYLIGHT SAVING | TIMESTAMP FORMAT

Coordinated Universal Time | 0 | UTC | No | 'yyyy-MM-ddTHH:mm:ssZ'
Pacific Time (PT) | -8 | Pacific Standard Time | Yes | 'yyyy-MM-ddTHH:mm:ss'
Central Time (CT) | -6 | Central Standard Time | Yes | 'yyyy-MM-ddTHH:mm:ss'
Eastern Time (ET) | -5 | Eastern Standard Time | Yes | 'yyyy-MM-ddTHH:mm:ss'
Greenwich Mean Time (GMT) | 0 | GMT Standard Time | Yes | 'yyyy-MM-ddTHH:mm:ss'
Central European Standard Time | +1 | W. Europe Standard Time | Yes | 'yyyy-MM-ddTHH:mm:ss'
India Standard Time (IST) | +5:30 | India Standard Time | No | 'yyyy-MM-ddTHH:mm:ss'
China Standard Time | +8 | China Standard Time | No | 'yyyy-MM-ddTHH:mm:ss'

This list is incomplete. For the complete list of time zone options, explore the Trigger creation page in the Data Factory
portal.
startTime property
The following table shows you how the startTime property controls a trigger run:

STARTTIME VALUE | RECURRENCE WITHOUT SCHEDULE | RECURRENCE WITH SCHEDULE

Start time in past | Calculates the first future execution time after the start time and runs at that time. Runs subsequent executions based on calculating from the last execution time. See the example that follows this table. | The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Start time in future or at present | Runs once at the specified start time. Runs subsequent executions based on calculating from the last execution time. | The trigger starts no sooner than the specified start time. The first occurrence is based on the schedule that's calculated from the start time. Runs subsequent executions based on the recurrence schedule.

Let's see an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00, the start time is 2017-04-07 14:00, and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is at 2017-04-09 at 14:00. The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00pm, so the next instance is two days
from that time, which is 2017-04-09 at 2:00pm.
The first execution time is the same even if the startTime value is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are at 2017-04-11 at 2:00pm, then 2017-04-13 at 2:00pm, then 2017-04-15 at 2:00pm, and so on.
Finally, when the hours or minutes aren't set in the schedule for a trigger, the hours or minutes of the first
execution are used as the defaults.
schedule property
On one hand, the use of a schedule can limit the number of trigger executions. For example, if a trigger with a
monthly frequency is scheduled to run only on day 31, the trigger runs only in those months that have a 31st
day.
On the other hand, a schedule can also expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the 1st and 2nd days of the month, rather
than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting. The evaluation starts with week number, and then month day, weekday, hour, and finally, minute.
The following table describes the schedule elements in detail:

JSON ELEMENT | DESCRIPTION | VALID VALUES

minutes | Minutes of the hour at which the trigger runs. | Integer; Array of integers

hours | Hours of the day at which the trigger runs. | Integer; Array of integers

weekDays | Days of the week on which the trigger runs. The value can be specified with a weekly frequency only. | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday; Array of day values (maximum array size is 7); Day values are not case-sensitive

monthlyOccurrences | Days of the month on which the trigger runs. The value can be specified with a monthly frequency only. | Array of monthlyOccurrence objects: { "day": day, "occurrence": occurrence }. The day attribute is the day of the week on which the trigger runs. For example, a monthlyOccurrences property with a day value of {Sunday} means every Sunday of the month. The day attribute is required. The occurrence attribute is the occurrence of the specified day during the month. For example, a monthlyOccurrences property with day and occurrence values of {Sunday, -1} means the last Sunday of the month. The occurrence attribute is optional.

monthDays | Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. | Any value <= -1 and >= -31; Any value >= 1 and <= 31; Array of values

Examples of trigger recurrence schedules


This section provides examples of recurrence schedules and focuses on the schedule object and its elements.
The examples assume that the interval value is 1, and that the frequency value is correct according to the
schedule definition. For example, you can't have a frequency value of "day" and also have a "monthDays"
modification in the schedule object. Restrictions such as these are mentioned in the table in the previous
section.

EXAMPLE | DESCRIPTION

{"hours":[5]} | Run at 5:00 AM every day.

{"minutes":[15], "hours":[5]} | Run at 5:15 AM every day.

{"minutes":[15], "hours":[5,17]} | Run at 5:15 AM and 5:15 PM every day.

{"minutes":[15,45], "hours":[5,17]} | Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.

{"minutes":[0,15,30,45]} | Run every 15 minutes.

{"hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]} | Run every hour. This trigger runs every hour. The minutes are controlled by the startTime value, when a value is specified. If a value is not specified, the minutes are controlled by the creation time. For example, if the start time or creation time (whichever applies) is 12:25 PM, the trigger runs at 00:25, 01:25, 02:25, ..., and 23:25. This schedule is equivalent to having a trigger with a frequency value of "hour," an interval value of 1, and no schedule. This schedule can be used with different frequency and interval values to create other triggers. For example, when the frequency value is "month," the schedule runs only once a month, rather than every day, when the frequency value is "day."

{"minutes":[0]} | Run every hour on the hour. This trigger runs every hour on the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so on. This schedule is equivalent to a trigger with a frequency value of "hour" and a startTime value of zero minutes, or no schedule but a frequency value of "day." If the frequency value is "week" or "month," the schedule executes one day a week or one day a month only, respectively.

{"minutes":[15]} | Run at 15 minutes past every hour. This trigger runs every hour at 15 minutes past the hour starting at 00:15 AM, 1:15 AM, 2:15 AM, and so on, and ending at 11:15 PM.

{"hours":[17], "weekDays":["saturday"]} | Run at 5:00 PM on Saturdays every week.

{"hours":[17], "weekDays":["monday", "wednesday", "friday"]} | Run at 5:00 PM on Monday, Wednesday, and Friday every week.

{"minutes":[15,45], "hours":[17], "weekDays":["monday", "wednesday", "friday"]} | Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and Friday every week.

{"minutes":[0,15,30,45], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} | Run every 15 minutes on weekdays.

{"minutes":[0,15,30,45], "hours": [9, 10, 11, 12, 13, 14, 15, 16], "weekDays":["monday", "tuesday", "wednesday", "thursday", "friday"]} | Run every 15 minutes on weekdays between 9:00 AM and 4:45 PM.

{"weekDays":["tuesday", "thursday"]} | Run on Tuesdays and Thursdays at the specified start time.

{"minutes":[0], "hours":[6], "monthDays":[28]} | Run at 6:00 AM on the 28th day of every month (assuming a frequency value of "month").

{"minutes":[0], "hours":[6], "monthDays":[-1]} | Run at 6:00 AM on the last day of the month. To run a trigger on the last day of a month, use -1 instead of day 28, 29, 30, or 31.

{"minutes":[0], "hours":[6], "monthDays":[1,-1]} | Run at 6:00 AM on the first and last day of every month.

{"monthDays":[1,14]} | Run on the first and 14th day of every month at the specified start time.

{"minutes":[0], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1}]} | Run on the first Friday of every month at 5:00 AM.

{"monthlyOccurrences":[{"day":"friday", "occurrence":1}]} | Run on the first Friday of every month at the specified start time.

{"monthlyOccurrences":[{"day":"friday", "occurrence":-3}]} | Run on the third Friday from the end of the month, every month, at the specified start time.

{"minutes":[15], "hours":[5], "monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} | Run on the first and last Friday of every month at 5:15 AM.

{"monthlyOccurrences":[{"day":"friday", "occurrence":1},{"day":"friday", "occurrence":-1}]} | Run on the first and last Friday of every month at the specified start time.

{"monthlyOccurrences":[{"day":"friday", "occurrence":5}]} | Run on the fifth Friday of every month at the specified start time. When there's no fifth Friday in a month, the pipeline doesn't run, since it's scheduled to run only on fifth Fridays. To run the trigger on the last occurring Friday of the month, consider using -1 instead of 5 for the occurrence value.

{"minutes":[0,15,30,45], "monthlyOccurrences":[{"day":"friday", "occurrence":-1}]} | Run every 15 minutes on the last Friday of the month.

{"minutes":[15,45], "hours":[5,17], "monthlyOccurrences":[{"day":"wednesday", "occurrence":3}]} | Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the third Wednesday of every month.

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
To learn how to reference trigger metadata in a pipeline, see Reference Trigger Metadata in Pipeline Runs
Create a trigger that runs a pipeline on a tumbling
window

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article provides steps to create, start, and monitor a tumbling window trigger. For general information
about triggers and the supported types, see Pipeline execution and triggers.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals. A tumbling window trigger has a one-to-one relationship with a pipeline and can only reference a
singular pipeline. A tumbling window trigger is a heavier-weight alternative to a schedule trigger, offering a
suite of features for complex scenarios (dependency on other tumbling window triggers, rerunning a failed job,
and setting a user retry for pipelines). To further understand the difference between a schedule trigger and a
tumbling window trigger, please visit here.

Data Factory UI
1. To create a tumbling window trigger in the Data Factory UI, select the Triggers tab, and then select New .
2. After the trigger configuration pane opens, select Tumbling Window , and then define your tumbling
window trigger properties.
3. When you're done, select Save .

Tumbling window trigger type properties


A tumbling window has the following trigger type properties:
{
"name": "MyTriggerName",
"properties": {
"type": "TumblingWindowTrigger",
"runtimeState": "<<Started/Stopped/Disabled - readonly>>",
"typeProperties": {
"frequency": <<Minute/Hour>>,
"interval": <<int>>,
"startTime": "<<datetime>>",
"endTime": <<datetime – optional>>,
"delay": <<timespan – optional>>,
"maxConcurrency": <<int>> (required, max allowed: 50),
"retryPolicy": {
"count": <<int - optional, default: 0>>,
"intervalInSeconds": <<int>>,
},
"dependsOn": [
{
"type": "TumblingWindowTriggerDependencyReference",
"size": <<timespan – optional>>,
"offset": <<timespan – optional>>,
"referenceTrigger": {
"referenceName": "MyTumblingWindowDependency1",
"type": "TriggerReference"
}
},
{
"type": "SelfDependencyTumblingWindowTriggerReference",
"size": <<timespan – optional>>,
"offset": <<timespan>>
}
]
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyPipelineName"
},
"parameters": {
"parameter1": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowStartTime,'-dd-MM-
yyyy-HH-mm-ss-ffff'))}"
},
"parameter2": {
"type": "Expression",
"value": "@{concat('output',formatDateTime(trigger().outputs.windowEndTime,'-dd-MM-yyyy-
HH-mm-ss-ffff'))}"
},
"parameter3": "https://mydemo.azurewebsites.net/api/demoapi"
}
}
}
}

The following table provides a high-level overview of the major JSON elements that are related to recurrence
and scheduling of a tumbling window trigger:

JSON ELEMENT | DESCRIPTION | TYPE | ALLOWED VALUES | REQUIRED

type | The type of the trigger. The type is the fixed value "TumblingWindowTrigger". | String | "TumblingWindowTrigger" | Yes

runtimeState | The current state of the trigger run time. Note: This element is <readOnly>. | String | "Started," "Stopped," "Disabled" | Yes

frequency | A string that represents the frequency unit (minutes or hours) at which the trigger recurs. If the startTime date values are more granular than the frequency value, the startTime dates are considered when the window boundaries are computed. For example, if the frequency value is hourly and the startTime value is 2017-09-01T10:10:10Z, the first window is (2017-09-01T10:10:10Z, 2017-09-01T11:10:10Z). | String | "minute," "hour" | Yes

interval | A positive integer that denotes the interval for the frequency value, which determines how often the trigger runs. For example, if the interval is 3 and the frequency is "hour," the trigger recurs every 3 hours. Note: The minimum window interval is 5 minutes. | Integer | A positive integer. | Yes

startTime | The first occurrence, which can be in the past. The first trigger interval is (startTime, startTime + interval). | DateTime | A DateTime value. | Yes

endTime | The last occurrence, which can be in the past. | DateTime | A DateTime value. | Yes

delay | The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay. The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn't alter the window startTime. For example, a delay value of 00:10:00 implies a delay of 10 minutes. | Timespan (hh:mm:ss) | A timespan value where the default is 00:00:00. | No

maxConcurrency | The number of simultaneous trigger runs that are fired for windows that are ready. For example, to back fill hourly runs for yesterday results in 24 windows. If maxConcurrency = 10, trigger events are fired only for the first 10 windows (00:00-01:00 - 09:00-10:00). After the first 10 triggered pipeline runs are complete, trigger runs are fired for the next 10 windows (10:00-11:00 - 19:00-20:00). Continuing with this example of maxConcurrency = 10, if there are 10 windows ready, there are 10 total pipeline runs. If there's only 1 window ready, there's only 1 pipeline run. | Integer | An integer between 1 and 50. | Yes

retryPolicy: Count | The number of retries before the pipeline run is marked as "Failed." | Integer | An integer, where the default is 0 (no retries). | No

retryPolicy: intervalInSeconds | The delay between retry attempts specified in seconds. | Integer | The number of seconds, where the default is 30. | No

dependsOn: type | The type of TumblingWindowTriggerReference. Required if a dependency is set. | String | "TumblingWindowTriggerDependencyReference", "SelfDependencyTumblingWindowTriggerReference" | No

dependsOn: size | The size of the dependency tumbling window. | Timespan (hh:mm:ss) | A positive timespan value where the default is the window size of the child trigger. | No

dependsOn: offset | The offset of the dependency trigger. | Timespan (hh:mm:ss) | A timespan value that must be negative in a self-dependency. If no value is specified, the window is the same as the trigger itself. | Self-Dependency: Yes; Other: No

NOTE
After a tumbling window trigger is published, interval and frequency can't be edited.

WindowStart and WindowEnd system variables


You can use the WindowStart and WindowEnd system variables of the tumbling window trigger in your
pipeline definition (that is, for part of a query). Pass the system variables as parameters to your pipeline in the
trigger definition. The following example shows you how to pass these variables as parameters:
{
    "name": "MyTriggerName",
    "properties": {
        "type": "TumblingWindowTrigger",
        ...
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "MyPipelineName"
            },
            "parameters": {
                "MyWindowStart": {
                    "type": "Expression",
                    "value": "@{concat('output',formatDateTime(trigger().outputs.windowStartTime,'-dd-MM-yyyy-HH-mm-ss-ffff'))}"
                },
                "MyWindowEnd": {
                    "type": "Expression",
                    "value": "@{concat('output',formatDateTime(trigger().outputs.windowEndTime,'-dd-MM-yyyy-HH-mm-ss-ffff'))}"
                }
            }
        }
    }
}

To use the WindowStart and WindowEnd system variable values in the pipeline definition, use your "MyWindowStart" and "MyWindowEnd" parameters, accordingly.
Execution order of windows in a backfill scenario
If the startTime of the trigger is in the past, then based on the formula M = (CurrentTime - TriggerStartTime) / TumblingWindowSize, the trigger generates M backfill (past) runs in parallel, honoring trigger concurrency, before executing the future runs. The order of execution for windows is deterministic, from oldest to newest intervals. Currently, this behavior can't be modified.
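For a rough sense of how many backfill runs to expect, you can evaluate the formula yourself. The following Azure PowerShell sketch illustrates the calculation; the start time and window size are assumed example values, not read from your factory:

# Sketch only: estimate M, the number of past windows a tumbling window trigger will
# backfill. The start time and window size below are assumed example values.
$triggerStartTime = [datetime]::Parse("2021-01-01T00:00:00Z").ToUniversalTime()
$windowSize       = New-TimeSpan -Hours 1          # frequency = Hour, interval = 1
$currentTime      = (Get-Date).ToUniversalTime()

# M = (CurrentTime - TriggerStartTime) / TumblingWindowSize
$m = [math]::Floor(($currentTime - $triggerStartTime).TotalSeconds / $windowSize.TotalSeconds)
Write-Output "Expect roughly $m backfill windows, executed from oldest to newest."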
Existing TriggerResource elements
The following points apply to updating existing TriggerResource elements:
The value for the frequency element (or window size) of the trigger, along with the interval element, cannot be changed once the trigger is created. This is required for proper functioning of triggerRun reruns and dependency evaluations.
If the value for the endTime element of the trigger changes (is added or updated), the state of the windows that are already processed is not reset. The trigger honors the new endTime value. If the new endTime value is before the windows that are already executed, the trigger stops. Otherwise, the trigger stops when the new endTime value is encountered.
User assigned retries of pipelines
In case of pipeline failures, a tumbling window trigger can retry the execution of the referenced pipeline automatically, using the same input parameters, without user intervention. This can be specified using the property "retryPolicy" in the trigger definition.
Tumbling window trigger dependency
If you want to make sure that a tumbling window trigger is executed only after the successful execution of
another tumbling window trigger in the data factory, create a tumbling window trigger dependency.
Cancel tumbling window run
You can cancel runs for a tumbling window trigger if the specific window is in the Waiting, Waiting on Dependency, or Running state.
If the window is in the Running state, cancel the associated pipeline run, and the trigger run will be marked as Canceled afterwards.
If the window is in the Waiting or Waiting on Dependency state, you can cancel the window from Monitoring.

You can also rerun a canceled window. The rerun will take the latest published definitions of the trigger, and dependencies for the specified window will be re-evaluated upon rerun.
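If you prefer to check window states from Azure PowerShell rather than the Monitoring view, a sketch along the following lines lists recent trigger runs and their status; the resource names are assumptions, and the exact status strings returned by the service may differ:

# Sketch: list recent runs for a tumbling window trigger and show their status, so you
# can spot windows that are still waiting, waiting on a dependency, or running.
# $ResourceGroupName and $DataFactoryName are assumed to be set already.
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter (Get-Date).AddDays(-1) `
    -TriggerRunStartedBefore (Get-Date) |
    Select-Object TriggerRunId, TriggerRunTimestamp, Status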

Sample for Azure PowerShell


NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

This section shows you how to use Azure PowerShell to create, start, and monitor a trigger.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:

IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of the endTime element to one hour past the current UTC time.
{
    "name": "PerfTWTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Minute",
            "interval": "15",
            "startTime": "2017-09-08T05:30:00Z",
            "delay": "00:00:01",
            "retryPolicy": {
                "count": 2,
                "intervalInSeconds": 30
            },
            "maxConcurrency": 50
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "DynamicsToBlobPerfPipeline"
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime"
            }
        },
        "runtimeState": "Started"
    }
}

2. Create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:

Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"

3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

4. Start the trigger by using the Start-AzDataFactoryV2Trigger cmdlet:

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:

Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger"

6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get information about the trigger runs, execute the following command periodically. Update the TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger definition:

Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"

To monitor trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
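When you're finished with the sample, you can optionally stop and delete the trigger. A minimal cleanup sketch, reusing the variable names from the steps above:

# Stop the trigger so it no longer fires, then remove it from the data factory.
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -Force
Remove-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -Force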

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a tumbling window trigger dependency.
To learn how to reference trigger metadata in a pipeline, see Reference trigger metadata in pipeline runs.
Create a tumbling window trigger dependency
3/5/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article provides steps to create a dependency on a tumbling window trigger. For general information about
Tumbling Window triggers, see How to create tumbling window trigger.
In order to build a dependency chain and make sure that a trigger is executed only after the successful execution
of another trigger in the data factory, use this advanced feature to create a tumbling window dependency.
For a demonstration on how to create dependent pipelines in your Azure Data Factory using tumbling window
trigger, watch the following video:

Create a dependency in the Data Factory UI


To create a dependency on a trigger, select Trigger > Advanced > New , and then choose the trigger to depend on with the appropriate offset and size. Select Finish and publish the data factory changes for the dependencies to take effect.

Tumbling window dependency properties


A tumbling window trigger with a dependency has the following properties:
{
    "name": "MyTriggerName",
    "properties": {
        "type": "TumblingWindowTrigger",
        "runtimeState": <<Started/Stopped/Disabled - readonly>>,
        "typeProperties": {
            "frequency": <<Minute/Hour>>,
            "interval": <<int>>,
            "startTime": <<datetime>>,
            "endTime": <<datetime – optional>>,
            "delay": <<timespan – optional>>,
            "maxConcurrency": <<int>> (required, max allowed: 50),
            "retryPolicy": {
                "count": <<int - optional, default: 0>>,
                "intervalInSeconds": <<int>>
            },
            "dependsOn": [
                {
                    "type": "TumblingWindowTriggerDependencyReference",
                    "size": <<timespan – optional>>,
                    "offset": <<timespan – optional>>,
                    "referenceTrigger": {
                        "referenceName": "MyTumblingWindowDependency1",
                        "type": "TriggerReference"
                    }
                },
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "size": <<timespan – optional>>,
                    "offset": <<timespan>>
                }
            ]
        }
    }
}

The following table provides the list of attributes needed to define a Tumbling Window dependency.

type: All the existing tumbling window triggers are displayed in this dropdown. Choose the trigger to take a dependency on. Type: TumblingWindowTriggerDependencyReference or SelfDependencyTumblingWindowTriggerReference. Required: Yes.

offset: Offset of the dependency trigger. Provide a value in timespan format; both negative and positive offsets are allowed. This property is mandatory if the trigger depends on itself, and in all other cases it is optional. A self-dependency should always be a negative offset. If no value is specified, the window is the same as the trigger itself. Type: Timespan (hh:mm:ss). Required: Yes for self-dependency; otherwise, No.

size: Size of the dependency tumbling window. Provide a positive timespan value. This property is optional. Type: Timespan (hh:mm:ss). Required: No.

NOTE
A tumbling window trigger can depend on a maximum of five other triggers.

Tumbling window self-dependency properties


In scenarios where the trigger shouldn't proceed to the next window until the preceding window is successfully
completed, build a self-dependency. A self-dependency trigger that's dependent on the success of earlier runs of
itself within the preceding hour will have the properties indicated in the following code.

NOTE
If your triggered pipeline relies on the output of pipelines in previously triggered windows, we recommend using only tumbling window trigger self-dependency. To limit parallel trigger runs, set the maximum trigger concurrency.

{
    "name": "DemoSelfDependency",
    "properties": {
        "runtimeState": "Started",
        "pipeline": {
            "pipelineReference": {
                "referenceName": "Demo",
                "type": "PipelineReference"
            }
        },
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2018-10-04T00:00:00Z",
            "delay": "00:01:00",
            "maxConcurrency": 50,
            "retryPolicy": {
                "intervalInSeconds": 30
            },
            "dependsOn": [
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "size": "01:00:00",
                    "offset": "-01:00:00"
                }
            ]
        }
    }
}
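If you deploy triggers from scripts rather than the UI, a definition like the one above can be published and started with the same Azure PowerShell cmdlets shown in the tumbling window trigger article. A minimal sketch, in which the file path and variable names are assumptions:

# Sketch: publish and start the self-dependent trigger defined above, assuming the JSON
# has been saved to the file below and $ResourceGroupName/$DataFactoryName are set.
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "DemoSelfDependency" -DefinitionFile "C:\ADFv2QuickStartPSH\DemoSelfDependency.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "DemoSelfDependency" -Force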

Usage scenarios and examples


Below are illustrations of scenarios and usage of tumbling window dependency properties.
Dependency offset
Dependency size

Self-dependency

Dependency on another tumbling window trigger


A daily telemetry processing job that depends on another daily job aggregating the last seven days of output and generating seven-day rolling window streams:
Dependency on itself
A daily job with no gaps in the output streams of the job:

Monitor dependencies
You can monitor the dependency chain and the corresponding windows from the trigger run monitoring page.
Navigate to Monitoring > Trigger Runs . If a Tumbling Window trigger has dependencies, Trigger Name will
bear a hyperlink to dependency monitoring view.

Click through the trigger name to view trigger dependencies. The right-hand panel shows detailed trigger run information, such as RunID, window time, status, and so on.
You can see the status of the dependencies and windows for each dependent trigger. If one of the dependency triggers fails, you must successfully rerun it in order for the dependent trigger to run.
A tumbling window trigger will wait on dependencies for seven days before timing out. After seven days, the trigger run will fail.
For a more visual way to view the trigger dependency schedule, select the Gantt view.

Transparent boxes show the dependency windows for each downstream dependent trigger, while solid colored boxes above show individual window runs. Here are some tips for interpreting the Gantt chart view:
A transparent box renders blue when dependent windows are in a pending or running state.
After all windows succeed for a dependent trigger, the transparent box turns green.
A transparent box renders red when a dependent window fails. Look for a solid red box to identify the failed window run.
To rerun a window in the Gantt chart view, select the solid color box for the window, and an action panel will pop up with details and rerun options.

Next steps
Review How to create a tumbling window trigger
Create a trigger that runs a pipeline in response to
a storage event
4/2/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes the Storage Event Triggers that you can create in your Data Factory pipelines.
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
For a ten-minute introduction and demonstration of this feature, watch the following video:

NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with
the Event Grid resource provider. For more info, see Resource providers and types. You must be able to do the
Microsoft.EventGrid/eventSubscriptions/* action. This action is part of the EventGrid EventSubscription Contributor built-
in role.
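If you're not sure whether the Event Grid resource provider is registered on your subscription, a quick Azure PowerShell check and registration looks roughly like this (a sketch; run it against the subscription that hosts your storage account):

# Show the current registration state of the Event Grid resource provider.
Get-AzResourceProvider -ProviderNamespace Microsoft.EventGrid |
    Select-Object ProviderNamespace, RegistrationState -Unique

# Register the provider if it isn't registered yet.
Register-AzResourceProvider -ProviderNamespace Microsoft.EventGrid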

Data Factory UI
This section shows you how to create a storage event trigger within the Azure Data Factory User Interface.
1. Switch to the Edit tab, shown with a pencil symbol.
2. Select Trigger on the menu, then select New/Edit .
3. On the Add Triggers page, select Choose trigger..., then select +New .
4. Select trigger type Storage Event
5. Select your storage account from the Azure subscription dropdown or manually using its Storage account
resource ID. Choose which container you wish the events to occur on. Container selection is required, but
be mindful that selecting all containers can lead to a large number of events.

NOTE
The Storage Event Trigger currently supports only Azure Data Lake Storage Gen2 and General-purpose version 2
storage accounts. Due to an Azure Event Grid limitation, Azure Data Factory only supports a maximum of 500
storage event triggers per storage account. If you hit the limit, please contact support for recommendations and
increasing the limit upon evaluation by Event Grid team.

NOTE
To create a new or modify an existing Storage Event Trigger, the Azure account used to log into Data Factory and
publish the storage event trigger must have appropriate role based access control (Azure RBAC) permission on
the storage account. No additional permission is required: Service Principal for the Azure Data Factory does not
need special permission to either the Storage account or Event Grid. For more information about access control,
see Role based access control section.

6. The Blob path begins with and Blob path ends with properties allow you to specify the containers, folders, and blob names for which you want to receive events. Your storage event trigger requires at least one of these properties to be defined. You can use a variety of patterns for both the Blob path begins with and Blob path ends with properties, as shown in the examples later in this article.
Blob path begins with: The blob path must start with a folder path. Valid values include 2018/ and
2018/april/shoes.csv . This field can't be selected if a container isn't selected.
Blob path ends with: The blob path must end with a file name or extension. Valid values include shoes.csv and .csv . Container and folder names, when specified, must be separated by a /blobs/ segment. For example, a container named 'orders' can have a value of /orders/blobs/2018/april/shoes.csv . To specify a folder in any container, omit the leading '/' character. For example, april/shoes.csv will trigger an event on any file named shoes.csv in a folder called 'april' in any container.
Note that Blob path begins with and ends with are the only pattern matching allowed in Storage
Event Trigger. Other types of wildcard matching aren't supported for the trigger type.
7. Select whether your trigger will respond to a Blob created event, Blob deleted event, or both. In your
specified storage location, each event will trigger the Data Factory pipelines associated with the trigger.
8. Select whether or not your trigger ignores blobs with zero bytes.
9. After you configure your trigger, click on Next: Data preview . This screen shows the existing blobs matched by your storage event trigger configuration. Make sure your filters are specific. Configuring filters that are too broad can match a large number of files created/deleted and may significantly impact your cost. Once your filter conditions have been verified, click Finish .
10. To attach a pipeline to this trigger, go to the pipeline canvas and click Trigger and select New/Edit . When
the side nav appears, click on the Choose trigger... dropdown and select the trigger you created. Click
Next: Data preview to confirm the configuration is correct and then Next to validate the Data preview
is correct.
11. If your pipeline has parameters, you can specify them on the trigger runs parameter side nav. The storage
event trigger captures the folder path and file name of the blob into the properties
@triggerBody().folderPath and @triggerBody().fileName . To use the values of these properties in a
pipeline, you must map the properties to pipeline parameters. After mapping the properties to
parameters, you can access the values captured by the trigger through the
@pipeline().parameters.parameterName expression throughout the pipeline. For detailed explanation, see
Reference Trigger Metadata in Pipelines

In the preceding example, the trigger is configured to fire when a blob path ending in .csv is created in the
folder event-testing in the container sample-data. The folderPath and fileName properties capture the
location of the new blob. For example, when MoviesDB.csv is added to the path sample-data/event-
testing, @triggerBody().folderPath has a value of sample-data/event-testing and
@triggerBody().fileName has a value of moviesDB.csv . These values are mapped, in the example, to the
pipeline parameters sourceFolder and sourceFile , which can be used throughout the pipeline as
@pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFile respectively.

NOTE
If you are creating your pipeline and trigger in Azure Synapse Analytics, you must use
@trigger().outputs.body.fileName and @trigger().outputs.body.folderPath as parameters. Those two
properties capture blob information. Use those properties instead of using @triggerBody().fileName and
@triggerBody().folderPath .

12. Click Finish once you are done.

JSON schema
The following table provides an overview of the schema elements that are related to storage event triggers:

scope: The Azure Resource Manager resource ID of the Storage Account. Type: String. Allowed values: Azure Resource Manager ID. Required: Yes.

events: The type of events that cause this trigger to fire. Type: Array. Allowed values: Microsoft.Storage.BlobCreated, Microsoft.Storage.BlobDeleted. Required: Yes, any combination of these values.

blobPathBeginsWith: The blob path must begin with the pattern provided for the trigger to fire. For example, /records/blobs/december/ only fires the trigger for blobs in the december folder under the records container. Type: String. Required: Provide a value for at least one of these properties: blobPathBeginsWith or blobPathEndsWith .

blobPathEndsWith: The blob path must end with the pattern provided for the trigger to fire. For example, december/boxes.csv only fires the trigger for blobs named boxes in a december folder. Type: String. Required: You have to provide a value for at least one of these properties: blobPathBeginsWith or blobPathEndsWith .

ignoreEmptyBlobs: Whether or not zero-byte blobs will trigger a pipeline run. By default, this is set to true. Type: Boolean. Allowed values: true or false. Required: No.

Examples of storage event triggers


This section provides examples of storage event trigger settings.

IMPORTANT
You have to include the /blobs/ segment of the path, as shown in the following examples, whenever you specify
container and folder, container and file, or container, folder, and file. For blobPathBeginsWith , the Data Factory UI will
automatically add /blobs/ between the folder and container name in the trigger JSON.

Blob path begins with: /containername/ . Receives events for any blob in the container.

Blob path begins with: /containername/blobs/foldername/ . Receives events for any blobs in the containername container and foldername folder.

Blob path begins with: /containername/blobs/foldername/subfoldername/ . You can also reference a subfolder.

Blob path begins with: /containername/blobs/foldername/file.txt . Receives events for a blob named file.txt in the foldername folder under the containername container.

Blob path ends with: file.txt . Receives events for a blob named file.txt in any path.

Blob path ends with: /containername/blobs/file.txt . Receives events for a blob named file.txt under container containername .

Blob path ends with: foldername/file.txt . Receives events for a blob named file.txt in foldername folder under any container.
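Putting the schema and path patterns together, a complete storage event trigger definition might look like the following sketch, published with the Az.DataFactory cmdlets used elsewhere in this documentation. The trigger type name BlobEventsTrigger, the resource IDs, the paths, and the pipeline name are assumptions for illustration; adjust them to your environment:

# Sketch: write a storage event trigger definition to a file and publish it.
# All names, IDs, and the "BlobEventsTrigger" type name are assumptions based on the
# schema table above, not values taken from a real factory.
$triggerJson = @'
{
    "name": "MyStorageEventTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "blobPathBeginsWith": "/sample-data/blobs/event-testing/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "MyPipelineName", "type": "PipelineReference" }
            }
        ]
    }
}
'@
Set-Content -Path "C:\ADFv2QuickStartPSH\MyStorageEventTrigger.json" -Value $triggerJson
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "MyStorageEventTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyStorageEventTrigger.json"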

Role-based access control


Azure Data Factory uses Azure role-based access control (Azure RBAC) to ensure that unauthorized access to listen to, subscribe to updates from, and trigger pipelines linked to blob events is strictly prohibited.
To successfully create a new or update an existing Storage Event Trigger, the Azure account signed into the Data Factory needs to have appropriate access to the relevant storage account. Otherwise, the operation will fail with Access Denied.
Data Factory needs no special permission to your Event Grid, and you do not need to assign special RBAC permission to the Data Factory service principal for the operation.
Any of the following RBAC settings works for the storage event trigger (a role-assignment sketch follows this list):
Owner role on the storage account
Contributor role on the storage account
Microsoft.EventGrid/EventSubscriptions/Write permission on the storage account /subscriptions/####/resourceGroups/####/providers/Microsoft.Storage/storageAccounts/storageAccountName
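For example, granting the signed-in user the Contributor role on the storage account can be scripted roughly as follows; the sign-in name and the scope placeholders are assumptions:

# Sketch: assign the Contributor role on the storage account to the user who will create
# the storage event trigger. Replace the sign-in name and the scope placeholders.
New-AzRoleAssignment -SignInName "user@contoso.com" `
    -RoleDefinitionName "Contributor" `
    -Scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>"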
In order to understand how Azure Data Factory delivers the two promises, let's take a step back and take a sneak peek behind the scenes. Here are the high-level workflows for the integration among Data Factory, Storage, and Event Grid.
Create a new Storage Event Trigger
This high-level workflow describes how Azure Data Factory interacts with Event Grid to create a Storage Event Trigger.

Two noticeable callouts from the workflows:


Azure Data Factory makes no direct contact with Storage account. Request to create a subscription is
instead relayed and processed by Event Grid. Hence, Data Factory needs no permission to Storage
account for this step.
Access control and permission checking happen on Azure Data Factory side. Before ADF sends a request
to subscribe to storage event, it checks the permission for the user. More specifically, it checks whether
the Azure account signed in and attempting to create the Storage Event trigger has appropriate access to
the relevant storage account. If the permission check fails, trigger creation also fails.
Storage event trigger Data Factory pipeline run
This high-level workflow describes how a storage event triggers a pipeline run through Event Grid.
When it comes to an event triggering a pipeline in Data Factory, there are three noticeable callouts in the workflow:
Event Grid uses a push model: it relays the message as soon as possible when storage drops the message into the system. This is different from a messaging system, such as Kafka, where a pull model is used.
The event trigger on Azure Data Factory serves as an active listener to the incoming message and it properly triggers the associated pipeline.
The Storage Event Trigger itself makes no direct contact with the Storage account.
That said, if you have a Copy or other activity inside the pipeline to process the data in the Storage account, Data Factory will make direct contact with Storage, using the credentials stored in the Linked Service. Ensure that the Linked Service is set up appropriately.
However, if you make no reference to the Storage account in the pipeline, you do not need to grant Data Factory permission to access the Storage account.

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
To learn how to reference trigger metadata in a pipeline, see Reference trigger metadata in pipeline runs.
Create a custom event trigger to run a pipeline in
Azure Data Factory (preview)
5/7/2021 • 4 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Event-driven architecture (EDA) is a common data integration pattern that involves production, detection,
consumption, and reaction to events. Data integration scenarios often require Azure Data Factory customers to
trigger pipelines when certain events occur. Data Factory native integration with Azure Event Grid now covers
custom topics. You send events to an event grid topic. Data Factory subscribes to the topic, listens, and then
triggers pipelines accordingly.

NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with
the Event Grid resource provider. For more information, see Resource providers and types. You must be able to do the
Microsoft.EventGrid/eventSubscriptions/* action. This action is part of the EventGrid EventSubscription Contributor
built-in role.

If you combine pipeline parameters and a custom event trigger, you can parse and reference custom data
payloads in pipeline runs. Because the data field in a custom event payload is a free-form, JSON key-value
structure, you can control event-driven pipeline runs.

IMPORTANT
If a key referenced in parameterization is missing in the custom event payload, trigger run will fail. You'll get an error
that states the expression cannot be evaluated because property keyName doesn't exist. In this case, no pipeline run
will be triggered by the event.

Set up a custom topic in Event Grid


To use the custom event trigger in Data Factory, you need to first set up a custom topic in Event Grid.
Go to Azure Event Grid and create the topic yourself. For more information on how to create the custom topic,
see Azure Event Grid portal tutorials and CLI tutorials.

NOTE
The workflow is different from Storage Event Trigger. Here, Data Factory doesn't set up the topic for you.

Data Factory expects events to follow the Event Grid event schema. Make sure event payloads have the
following fields:
[
    {
        "topic": string,
        "subject": string,
        "id": string,
        "eventType": string,
        "eventTime": string,
        "data": {
            object-unique-to-each-publisher
        },
        "dataVersion": string,
        "metadataVersion": string
    }
]
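To test a custom event trigger end to end, you can publish an event that conforms to this schema directly to the custom topic. The following Azure PowerShell sketch uses the topic's key and endpoint; the topic name, resource group, subject, event type, and data fields are all assumptions:

# Sketch: publish a test event to a custom Event Grid topic so that a subscribed custom
# event trigger can fire. Topic name, resource group, and payload values are assumed.
$topic = Get-AzEventGridTopic -ResourceGroupName "myResourceGroup" -Name "myCustomTopic"
$key   = (Get-AzEventGridTopicKey -ResourceGroupName "myResourceGroup" -Name "myCustomTopic").Key1

$events = @(
    @{
        id          = [guid]::NewGuid().ToString()
        subject     = "factories/demo"
        eventType   = "copycompleted"
        eventTime   = (Get-Date).ToUniversalTime().ToString("o")
        data        = @{ sourceFolder = "sample-data"; sourceFile = "moviesDB.csv" }
        dataVersion = "1.0"
    }
)

Invoke-RestMethod -Uri $topic.Endpoint -Method Post `
    -Headers @{ "aeg-sas-key" = $key } `
    -Body (ConvertTo-Json -InputObject $events -Depth 5) `
    -ContentType "application/json"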

Use Data Factory to create a custom event trigger


1. Go to Azure Data Factory and sign in.
2. Switch to the Edit tab. Look for the pencil icon.
3. Select Trigger on the menu and then select New/Edit .
4. On the Add Triggers page, select Choose trigger , and then select +New .
5. Select Custom events for Type .

6. Select your custom topic from the Azure subscription dropdown or manually enter the event topic scope.

NOTE
To create or modify a custom event trigger in Data Factory, you need to use an Azure account with appropriate
role-based access control (Azure RBAC). No additional permission is required. The Data Factory service principle
does not require special permission to your Event Grid. For more information about access control, see the Role-
based access control section.

7. The Subject begins with and Subject ends with properties allow you to filter for trigger events. Both
properties are optional.
8. Use + New to add Event Types to filter on. The list of custom event triggers uses an OR relationship: when a custom event has an eventType property that matches one on the list, a pipeline run is triggered. The event type is case insensitive. For example, in the following screenshot, the trigger matches all copycompleted or copysucceeded events that have a subject that begins with factories.
9. A custom event trigger can parse and send a custom data payload to your pipeline. You create the
pipeline parameters, and then fill in the values on the Parameters page. Use the format
@triggerBody().event.data._keyName_ to parse the data payload and pass values to the pipeline
parameters.
For a detailed explanation, see the following articles:
Reference trigger metadata in pipelines
System variables in custom event trigger
10. After you've entered the parameters, select OK .

JSON schema
The following table provides an overview of the schema elements that are related to custom event triggers:

scope: The Azure Resource Manager resource ID of the event grid topic. Type: String. Allowed values: Azure Resource Manager ID. Required: Yes.

events: The type of events that cause this trigger to fire. Type: Array of strings. Required: Yes, at least one value is expected.

subjectBeginsWith: The subject field must begin with the provided pattern for the trigger to fire. For example, factories only fires the trigger for event subjects that start with factories. Type: String. Required: No.

subjectEndsWith: The subject field must end with the provided pattern for the trigger to fire. Type: String. Required: No.

Role-based access control


Azure Data Factory uses Azure RBAC to prohibit unauthorized access. To function properly, Data Factory requires
access to:
Listen to events.
Subscribe to updates from events.
Trigger pipelines linked to custom events.
To successfully create or update a custom event trigger, you need to sign in to Data Factory with an Azure
account that has appropriate access. Otherwise, the operation will fail with an Access Denied error.
Data Factory doesn't require special permission to your Event Grid. You also do not need to assign special Azure
RBAC permission to the Data Factory service principal for the operation.
Specifically, you need Microsoft.EventGrid/EventSubscriptions/Write permission on /subscriptions/####/resourceGroups/####/providers/Microsoft.EventGrid/topics/someTopics .

Next steps
Get detailed information about trigger execution.
Learn how to reference trigger metadata in pipeline runs.
Reference trigger metadata in pipeline runs
3/17/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article describes how trigger metadata, such as the trigger start time, can be used in a pipeline run.
A pipeline sometimes needs to understand and read metadata from the trigger that invokes it. For instance, with a tumbling window trigger run, the pipeline will process different data slices or folders based upon the window start and end time. In Azure Data Factory, we use parameterization and system variables to pass metadata from the trigger to the pipeline.
This pattern is especially useful for the tumbling window trigger, where the trigger provides the window start and end time, and the custom event trigger, where the trigger parses and processes values in a custom defined data field.

NOTE
Different trigger types provide different metadata. For more information, see System Variables.

Data Factory UI
This section shows you how to pass metadata information from the trigger to the pipeline, within the Azure Data Factory user interface.
1. Go to the authoring canvas and edit a pipeline.
2. Click on the blank canvas to bring up pipeline settings. Do not select any activity. You may need to pull up the settings panel from the bottom of the canvas, as it may have been collapsed.
3. Select the Parameters section and select + New to add parameters.
4. Add triggers to the pipeline by clicking on + Trigger .
5. Create or attach a trigger to the pipeline, and click OK .
6. On the following page, fill in trigger metadata for each parameter. Use the format defined in System Variables to retrieve trigger information. You don't need to fill in the information for all parameters, just the ones that will assume trigger metadata values. For instance, here we assign the trigger run start time to parameter_1.

7. To use the values in the pipeline, reference the parameters as @pipeline().parameters.parameterName (not the system variable) in the pipeline definition. For instance, in our case, to read the trigger start time, we reference @pipeline().parameters.parameter_1.

JSON schema
To pass trigger information to pipeline runs, both the trigger and the pipeline JSON need to be updated with a parameters section.
Pipeline definition
Under the properties section, add parameter definitions to the parameters section.

{
"name": "demo_pipeline",
"properties": {
"activities": [
{
"name": "demo_activity",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "@pipeline().parameters.parameter_2",
"type": "Expression"
},
"method": "GET"
}
}
],
"parameters": {
"parameter_1": {
"type": "string"
},
"parameter_2": {
"type": "string"
},
"parameter_3": {
"type": "string"
},
"parameter_4": {
"type": "string"
},
"parameter_5": {
"type": "string"
}
},
"annotations": [],
"lastPublishTime": "2021-02-24T03:06:23Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Trigger definition
Under the pipelines section, assign parameter values in the parameters section. You don't need to fill in the information for all parameters, just the ones that will assume trigger metadata values.
{
"name": "trigger1",
"properties": {
"annotations": [],
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "demo_pipeline",
"type": "PipelineReference"
},
"parameters": {
"parameter_1": "@trigger().startTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2021-03-03T04:38:00Z",
"timeZone": "UTC"
}
}
}
}

Use trigger information in pipeline


To use the values in the pipeline, reference the parameters as @pipeline().parameters.parameterName (not the system variable) in the pipeline definition.

Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Connect Data Factory to Azure Purview (Preview)
5/6/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


This article explains how to connect Data Factory to Azure Purview and how to report data lineage for the Azure Data Factory activities Copy data, Data flow, and Execute SSIS package.

Connect Data Factory to Azure Purview


Azure Purview is a cloud service that data users can use to centrally manage data governance across their data estate, spanning cloud and on-premises environments. You can connect your Data Factory to Azure Purview, and the connection allows you to use Azure Purview to capture lineage data for the Copy, Data flow, and Execute SSIS package activities. You have two ways to connect a data factory to Azure Purview:
Register Azure Purview account to Data Factory
1. In the ADF portal, go to Manage -> Azure Purview . Select Connect to a Purview account .

2. You can choose From Azure subscription or Enter manually . With From Azure subscription , you can select an account that you have access to.
3. Once connected, you should be able to see the name of the Purview account in the Purview account tab.
4. You can use the Search bar at the top center of the Azure Data Factory portal to search for data.
If you see a warning in the Azure Data Factory portal after you register the Azure Purview account to Data Factory, follow the steps below to fix the issue:
1. Go to the Azure portal and find your data factory. Choose the "Tags" section and see if there is a tag named catalogUri . If not, disconnect and reconnect the Azure Purview account in the ADF portal.

2. Check if the permission is granted for registering an Azure Purview account to Data Factory. See How to connect Azure Data Factory and Azure Purview.
Register Data Factory in Azure Purview
For how to register Data Factory in Azure Purview, see How to connect Azure Data Factory and Azure Purview.

Report Lineage data to Azure Purview


When customers run the Copy, Data flow, or Execute SSIS package activities in Azure Data Factory, they can get the dependency relationships and have a high-level overview of the whole workflow process among data sources and destinations. For how to collect lineage from Azure Data Factory, see data factory lineage.

Next steps
Catalog lineage user guide
Tutorial: Push Data Factory lineage data to Azure Purview
Discover and explore data in ADF using Purview
5/6/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


In this article, you will register an Azure Purview Account to a Data Factory. That connection allows you to
discover Azure Purview assets and interact with them through ADF capabilities.
You can perform the following tasks in ADF:
Use the search box at the top to find Purview assets based on keywords
Understand the data based on metadata, lineage, annotations
Connect those data to your data factory with linked services or datasets

Prerequisites
Azure Purview account
Data Factory
Connect an Azure Purview Account into Data Factory

Using Azure Purview in Data Factory


Using Azure Purview in Data Factory requires you to have access to that Purview account. Data Factory passes through your Purview permissions. As an example, if you have the curator permission role, you will be able to edit metadata scanned by Azure Purview.
Data discovery: search datasets
To discover data registered and scanned by Azure Purview, you can use the Search bar at the top center of Data
Factory portal. Make sure that you select Azure Purview to search for all of your organization data.

Actions that you can perform over datasets with Data Factory resources
You can directly create Linked Service, Dataset, or dataflow over the data you search by Azure Purview.
Next steps
Register and scan Azure Data Factory assets in Azure Purview
How to Search Data in Azure Purview Data Catalog
Use Azure Data Factory to migrate data from your
data lake or data warehouse to Azure
3/5/2021 • 2 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


If you want to migrate your data lake or enterprise data warehouse (EDW) to Microsoft Azure, consider using
Azure Data Factory. Azure Data Factory is well-suited to the following scenarios:
Big data workload migration from Amazon Simple Storage Service (Amazon S3) or an on-premises Hadoop
Distributed File System (HDFS) to Azure
EDW migration from Oracle Exadata, Netezza, Teradata, or Amazon Redshift to Azure
Azure Data Factory can move petabytes (PB) of data for data lake migration, and tens of terabytes (TB) of data
for data warehouse migration.

Why Azure Data Factory can be used for data migration


Azure Data Factory can easily scale up the amount of processing power to move data in a serverless manner
with high performance, resilience, and scalability. And you pay only for what you use. Also note the following:
Azure Data Factory has no limitations on data volume or on the number of files.
Azure Data Factory can fully use your network and storage bandwidth to achieve the highest volume
of data movement throughput in your environment.
Azure Data Factory uses a pay-as-you-go method, so that you pay only for the time you actually use
to run the data migration to Azure.
Azure Data Factory can perform both a one-time historical load and scheduled incremental loads.
Azure Data Factory uses Azure integration runtime (IR) to move data between publicly accessible data lake
and warehouse endpoints. It can also use self-hosted IR for moving data for data lake and warehouse
endpoints inside Azure Virtual Network (VNet) or behind a firewall.
Azure Data Factory has enterprise-grade security: You can use a managed service identity (MSI) or a service principal for secured service-to-service integration, or use Azure Key Vault for credential management.
Azure Data Factory provides a code-free authoring experience and a rich, built-in monitoring dashboard.

Online vs. offline data migration


Azure Data Factory is a standard online data migration tool that transfers data over a network (the internet, Azure ExpressRoute, or VPN). With offline data migration, by contrast, users physically ship data-transfer devices from their organization to an Azure datacenter.
There are three key considerations when you choose between an online and offline migration approach:
Size of data to be migrated
Network bandwidth
Migration window
For example, assume you plan to use Azure Data Factory to complete your data migration within two weeks
(your migration window ). Notice the pink/blue cut line in the following table. The lowest pink cell for any given
column shows the data size/network bandwidth pairing whose migration window is closest to but less than two
weeks. (Any size/bandwidth pairing in a blue cell has an online migration window of more than two weeks.)
This table helps you determine whether you can meet your intended migration window through online
migration (Azure Data Factory) based on the size of your data and your available network bandwidth. If the
online migration window is more than two weeks, you'll want to use offline migration.

NOTE
By using online migration, you can achieve both historical data loading and incremental feeds end-to-end through a
single tool. Through this approach, your data can be kept synchronized between the existing store and the new store
during the entire migration window. This means you can rebuild your ETL logic on the new store with refreshed data.

Next steps
Migrate data from AWS S3 to Azure
Migrate data from on-premises hadoop cluster to Azure
Migrate data from on-premises Netezza server to Azure
Use Azure Data Factory to migrate data from
Amazon S3 to Azure Storage
3/5/2021 • 8 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory provides a performant, robust, and cost-effective mechanism to migrate data at scale from
Amazon S3 to Azure Blob Storage or Azure Data Lake Storage Gen2. This article provides the following
information for data engineers and developers:
Performance
Copy resilience
Network security
High-level solution architecture
Implementation best practices

Performance
ADF offers a serverless architecture that allows parallelism at different levels, which allows developers to build
pipelines to fully utilize your network bandwidth as well as storage IOPS and bandwidth to maximize data
movement throughput for your environment.
Customers have successfully migrated petabytes of data consisting of hundreds of millions of files from
Amazon S3 to Azure Blob Storage, with a sustained throughput of 2 GBps and higher.

The picture above illustrates how you can achieve great data movement speeds through different levels of
parallelism:
A single copy activity can take advantage of scalable compute resources: when using Azure Integration
Runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using self-
hosted Integration Runtime, you can manually scale up the machine or scale out to multiple machines (up to
4 nodes), and a single copy activity will partition its file set across all nodes.
A single copy activity reads from and writes to the data store using multiple threads.
ADF control flow can start multiple copy activities in parallel, for example using For Each loop.

Resilience
Within a single copy activity run, ADF has built-in retry mechanism so it can handle a certain level of transient
failures in the data stores or in the underlying network.
When doing binary copying from S3 to Blob and from S3 to ADLS Gen2, ADF automatically performs
checkpointing. If a copy activity run has failed or timed out, on a subsequent retry, the copy resumes from the
last failure point instead of starting from the beginning.

Network security
By default, ADF transfers data from Amazon S3 to Azure Blob Storage or Azure Data Lake Storage Gen2 using
encrypted connection over HTTPS protocol. HTTPS provides data encryption in transit and prevents
eavesdropping and man-in-the-middle attacks.
Alternatively, if you do not want data to be transferred over public Internet, you can achieve higher security by
transferring data over a private peering link between AWS Direct Connect and Azure Express Route. Refer to the
solution architecture below on how this can be achieved.

Solution architecture
Migrate data over public Internet:

In this architecture, data is transferred securely using HTTPS over public Internet.
Both the source Amazon S3 as well as the destination Azure Blob Storage or Azure Data Lake Storage Gen2
are configured to allow traffic from all network IP addresses. Refer to the second architecture below on how
you can restrict network access to specific IP range.
You can easily scale up the amount of horsepower in serverless manner to fully utilize your network and
storage bandwidth so that you can get the best throughput for your environment.
Both initial snapshot migration and delta data migration can be achieved using this architecture.
Migrate data over private link:
In this architecture, data migration is done over a private peering link between AWS Direct Connect and
Azure Express Route such that data never traverses over public Internet. It requires use of AWS VPC and
Azure Virtual network.
You need to install ADF self-hosted integration runtime on a Windows VM within your Azure virtual network
to achieve this architecture. You can manually scale up your self-hosted IR VMs or scale out to multiple VMs
(up to 4 nodes) to fully utilize your network and storage IOPS/bandwidth.
If it is acceptable to transfer data over HTTPS but you want to lock down network access to the source S3 to a specific IP range, you can adopt a variation of this architecture by removing the AWS VPC and replacing the private link with HTTPS. You will want to keep the Azure virtual network and the self-hosted IR on the Azure VM so that you have a static, publicly routable IP for filtering purposes.
Both initial snapshot data migration and delta data migration can be achieved using this architecture.

Implementation best practices


Authentication and credential management
To authenticate to Amazon S3 account, you must use access key for IAM account.
Multiple authentication types are supported to connect to Azure Blob Storage. Use of managed identities for Azure resources is highly recommended: built on top of an automatically managed ADF identity in Azure AD, it allows you to configure pipelines without supplying credentials in the Linked Service definition. Alternatively, you can authenticate to Azure Blob Storage using a service principal, a shared access signature, or a storage account key.
Multiple authentication types are also supported to connect to Azure Data Lake Storage Gen2. Use of
managed identities for Azure resources is highly recommended, although service principal or storage
account key can also be used.
When you are not using managed identities for Azure resources, storing the credentials in Azure Key Vault is
highly recommended to make it easier to centrally manage and rotate keys without modifying ADF linked
services. This is also one of the best practices for CI/CD.
Initial snapshot data migration
Data partition is recommended especially when migrating more than 100 TB of data. To partition the data,
leverage the ‘prefix’ setting to filter the folders and files in Amazon S3 by name, and then each ADF copy job can
copy one partition at a time. You can run multiple ADF copy jobs concurrently for better throughput.
If any of the copy jobs fail due to network or data store transient issue, you can rerun the failed copy job to
reload that specific partition again from AWS S3. All other copy jobs loading other partitions will not be
impacted.
Delta data migration
The most performant way to identify new or changed files from AWS S3 is by using time-partitioned naming
convention – when your data in AWS S3 has been time partitioned with time slice information in the file or
folder name (for example, /yyyy/mm/dd/file.csv), then your pipeline can easily identify which files/folders to
copy incrementally.
Alternatively, if your data in AWS S3 is not time partitioned, ADF can identify new or changed files by their LastModifiedDate. The way it works is that ADF will scan all the files from AWS S3, and only copy the new and updated files whose last modified timestamp is greater than a certain value. Be aware that if you have a large number of files in S3, the initial file scanning could take a long time regardless of how many files match the filter condition. In this case, we suggest partitioning the data first, using the same 'prefix' setting as for the initial snapshot migration, so that the file scanning can happen in parallel.
For scenarios that require self-hosted Integration runtime on Azure VM
Whether you are migrating data over private link or you want to allow specific IP range on Amazon S3 firewall,
you need to install self-hosted Integration runtime on Azure Windows VM.
The recommended configuration to start with for each Azure VM is Standard_D32s_v3 with 32 vCPUs and 128 GB of memory. You can keep monitoring the CPU and memory utilization of the IR VM during the data migration to see whether you need to further scale up the VM for better performance or scale down the VM to save cost.
You can also scale out by associating up to 4 VM nodes with a single self-hosted IR. A single copy job running against a self-hosted IR will automatically partition the file set and leverage all VM nodes to copy the files in parallel. For high availability, we recommend starting with 2 VM nodes to avoid a single point of failure during the data migration.
Rate limiting
As a best practice, conduct a performance POC with a representative sample dataset, so that you can determine
an appropriate partition size.
Start with a single partition and a single copy activity with default DIU setting. Gradually increase the DIU setting
until you reach the bandwidth limit of your network or IOPS/bandwidth limit of the data stores, or you have
reached the max 256 DIU allowed on a single copy activity.
Next, gradually increase the number of concurrent copy activities until you reach limits of your environment.
When you encounter throttling errors reported by ADF copy activity, either reduce the concurrency or DIU
setting in ADF, or consider increasing the bandwidth/IOPS limits of the network and data stores.
Estimating Price

NOTE
This is a hypothetical pricing example. Your actual pricing depends on the actual throughput in your environment.

Consider the following pipeline constructed for migrating data from S3 to Azure Blob Storage:

Let us assume the following:


Total data volume is 2 PB
Migrating data over HTTPS using first solution architecture
2 PB is divided into 1 K partitions and each copy moves one partition
Each copy is configured with DIU=256 and achieves 1 GBps throughput
ForEach concurrency is set to 2 and aggregate throughput is 2 GBps
In total, it takes 292 hours to complete the migration
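The 292-hour figure follows directly from the assumed volume and aggregate throughput. A quick arithmetic check, using only the assumptions above (actual duration depends on your environment):

# Rough duration check for the hypothetical example above:
# 2 PB copied at an aggregate throughput of 2 GBps (two copy activities at 1 GBps each).
$totalGB        = 2 * 1024 * 1024     # 2 PB expressed in GB
$throughputGBps = 2                   # aggregate GB per second
$hours          = $totalGB / $throughputGBps / 3600
Write-Output ("Estimated duration: {0:N0} hours" -f $hours)   # roughly 291-292 hours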
Here is the estimated price based on the above assumptions:

Additional references
Amazon Simple Storage Service connector
Azure Blob Storage connector
Azure Data Lake Storage Gen2 connector
Copy activity performance tuning guide
Creating and configuring self-hosted Integration Runtime
Self-hosted integration runtime HA and scalability
Data movement security considerations
Store credentials in Azure Key Vault
Copy file incrementally based on time partitioned file name
Copy new and changed files based on LastModifiedDate
ADF pricing page

Template
Here is the template to start with to migrate petabytes of data consisting of hundreds of millions of files from
Amazon S3 to Azure Data Lake Storage Gen2.

Next steps
Copy files from multiple containers with Azure Data Factory
Use Azure Data Factory to migrate data from an
on-premises Hadoop cluster to Azure Storage
3/5/2021 • 9 minutes to read

APPLIES TO: Azure Data Factory Azure Synapse Analytics


Azure Data Factory provides a performant, robust, and cost-effective mechanism for migrating data at scale
from on-premises HDFS to Azure Blob storage or Azure Data Lake Storage Gen2.
Data Factory offers two basic approaches for migrating data from on-premises HDFS to Azure. You can select
the approach based on your scenario.
Data Factor y DistCp mode (recommended): In Data Factory, you can use DistCp (distributed copy) to copy
files as-is to Azure Blob storage (including staged copy) or Azure Data Lake Store Gen2. Use Data Factory
integrated with DistCp to take advantage of an existing powerful cluster to achieve the best copy throughput.
You also get the benefit of flexible scheduling and a unified monitoring experience from Data Factory.
Depending on your Data Factory configuration, copy activity automatically constructs a DistCp command,
submits the data to your Hadoop cluster, and then monitors the copy status. We recommend Data Factory
DistCp mode for migrating data from an on-premises Hadoop cluster to Azure.
Data Factor y native integration runtime mode : DistCp isn't an option in all scenarios. For example, in
an Azure Virtual Networks environment, the DistCp tool doesn't support Azure ExpressRoute private peering
with an Azure Storage virtual network endpoint. In addition, in some cases, you don't want to use your
existing Hadoop cluster as an engine for migrating data so you don't put heavy loads on your cluster, which
might affect the performance of existing ETL jobs. Instead, you can use the native capability of the Data
Factory integration runtime as the engine that copies data from on-premises HDFS to Azure.
This article provides the following information about both approaches:
Performance
Copy resilience
Network security
High-level solution architecture
Implementation best practices

Performance
In Data Factory DistCp mode, throughput is the same as if you use the DistCp tool independently. Data Factory
DistCp mode maximizes the capacity of your existing Hadoop cluster. You can use DistCp for large inter-cluster
or intra-cluster copying.
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of
files and directories into input for task mapping. Each task copies a file partition that's specified in the source list.
You can use Data Factory integrated with DistCp to build pipelines to fully utilize your network bandwidth,
storage IOPS, and bandwidth to maximize data movement throughput for your environment.
Data Factory native integration runtime mode also allows parallelism at different levels. You can use parallelism
to fully utilize your network bandwidth, storage IOPS, and bandwidth to maximize data movement throughput:
A single copy activity can take advantage of scalable compute resources. With a self-hosted integration
runtime, you can manually scale up the machine or scale out to multiple machines (up to four nodes). A
single copy activity partitions its file set across all nodes.
A single copy activity reads from and writes to the data store by using multiple threads.
Data Factory control flow can start multiple copy activities in parallel. For example, you can use a For Each
loop.
For more information, see the copy activity performance guide.

Resilience
In Data Factory DistCp mode, you can use different DistCp command-line parameters (for example, -i to ignore failures, or -update to write data when the source file and the destination file differ in size) for different levels of resilience.
In the Data Factory native integration runtime mode, in a single copy activity run, Data Factory has a built-in
retry mechanism. It can handle a certain level of transient failures in the data stores or in the underlying
network.
When doing binary copying from on-premises HDFS to Blob storage and from on-premises HDFS to Data Lake
Store Gen2, Data Factory automatically performs checkpointing to a large extent. If a copy activity run fails or
times out, on a subsequent retry (make sure that retry count is > 1), the copy resumes from the last failure point
instead of starting at the beginning.

Network security
By default, Data Factory transfers data from on-premises HDFS to Blob storage or Azure Data Lake Storage
Gen2 by using an encrypted connection over HTTPS protocol. HTTPS provides data encryption in transit and
prevents eavesdropping and man-in-the-middle attacks.
Alternatively, if you don't want data to be transferred over the public internet, for higher security, you can
transfer data over a private peering link via ExpressRoute.

Solution architecture
This image depicts migrating data over the public internet:

In this architecture, data is transferred securely by using HTTPS over the public internet.
We recommend using Data Factory DistCp mode in a public network environment. You can take advantage of a powerful existing cluster to achieve the best copy throughput, and you get the benefit of flexible scheduling and a unified monitoring experience from Data Factory.
For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows machine behind your corporate firewall. The machine submits the DistCp command to your Hadoop cluster and monitors the copy status. Because the machine isn't the engine that moves the data (it serves a control purpose only), its capacity doesn't affect the throughput of data movement.
Existing parameters from the DistCp command are supported.
This image depicts migrating data over a private link:

In this architecture, data is migrated over a private peering link via Azure ExpressRoute. Data never traverses the public internet.
The DistCp tool doesn't support ExpressRoute private peering with an Azure Storage virtual network
endpoint. We recommend that you use Data Factory's native capability via the integration runtime to migrate
the data.
For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows VM in
your Azure virtual network. You can manually scale up your VM or scale out to multiple VMs to fully utilize
your network and storage IOPS or bandwidth.
The recommended configuration to start with for each Azure VM (with the Data Factory self-hosted integration runtime installed) is Standard_D32s_v3 with 32 vCPUs and 128 GB of memory. You can monitor the VM's CPU and memory usage during data migration to see whether you need to scale up the VM for better performance or scale it down to reduce cost.
You can also scale out by associating up to four VM nodes with a single self-hosted integration runtime. A
single copy job running against a self-hosted integration runtime automatically partitions the file set and
makes use of all VM nodes to copy the files in parallel. For high availability, we recommend that you start
with two VM nodes to avoid a single-point-of-failure scenario during data migration.
When you use this architecture, initial snapshot data migration and delta data migration are available to you.

Implementation best practices


We recommend that you follow these best practices when you implement your data migration.
Authentication and credential management
To authenticate to HDFS, you can use either Windows (Kerberos) or Anonymous.
Multiple authentication types are supported for connecting to Azure Blob storage. We highly recommend
using managed identities for Azure resources. Built on top of an automatically managed Data Factory identity
in Azure Active Directory (Azure AD), managed identities allow you to configure pipelines without supplying
credentials in the linked service definition. Alternatively, you can authenticate to Blob storage by using a
service principal, a shared access signature, or a storage account key.
Multiple authentication types also are supported for connecting to Data Lake Storage Gen2. We highly
recommend using managed identities for Azure resources, but you also can use a service principal or a
storage account key.
When you're not using managed identities for Azure resources, we highly recommend storing the credentials
in Azure Key Vault to make it easier to centrally manage and rotate keys without modifying Data Factory
linked services. This is also a best practice for CI/CD.
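To make these recommendations concrete, here's a hypothetical pair of linked service definitions: an HDFS linked service that uses Windows (Kerberos) authentication with its password stored in Key Vault and that connects through a self-hosted integration runtime, and a Data Lake Storage Gen2 linked service that relies on the data factory's managed identity, so no credential appears in the definition. All names, URLs, and the secret name are placeholders:
```json
{
    "name": "OnPremHdfsLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "url": "http://<name node>:50070/webhdfs/v1/",
            "authenticationType": "Windows",
            "userName": "<user>@<domain>.com",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": { "referenceName": "AzureKeyVaultLinkedService", "type": "LinkedServiceReference" },
                "secretName": "HdfsPassword"
            }
        },
        "connectVia": { "referenceName": "SelfHostedIR", "type": "IntegrationRuntimeReference" }
    }
}
```
And the Data Lake Storage Gen2 side:
```json
{
    "name": "AdlsGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage account>.dfs.core.windows.net"
        }
    }
}
```
For the managed identity to be usable, grant the data factory's identity an appropriate role (for example, Storage Blob Data Contributor) on the storage account. For Blob storage, the equivalent linked service type is AzureBlobStorage and the endpoint property is serviceEndpoint.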
Initial snapshot data migration
In Data Factory DistCp mode, you can create one copy activity to submit the DistCp command and use different
parameters to control initial data migration behavior.
In Data Factory native integration runtime mode, we recommend partitioning the data, especially when you migrate more than 10 TB. To partition the data, use the folder names on HDFS. Then each Data Factory copy job can copy one folder partition at a time, and you can run multiple copy jobs concurrently for better throughput.
If any of the copy jobs fail due to network or data store transient issues, you can rerun the failed copy job to
reload that specific partition from HDFS. Other copy jobs that are loading other partitions aren't affected.
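One way to sketch this partition-per-folder pattern is a single pipeline whose ForEach activity fans copy activities out over a list of HDFS folder names. The pipeline, dataset, and parameter names below are hypothetical, and you can just as easily run separate pipelines per partition as described above:
```json
{
    "name": "MigrateHdfsPartitions",
    "properties": {
        "parameters": {
            "partitionList": { "type": "Array", "defaultValue": [ "sales/2021", "sales/2022", "telemetry/raw" ] }
        },
        "activities": [
            {
                "name": "ForEachPartition",
                "type": "ForEach",
                "typeProperties": {
                    "isSequential": false,
                    "batchCount": 4,
                    "items": { "value": "@pipeline().parameters.partitionList", "type": "Expression" },
                    "activities": [
                        {
                            "name": "CopyOnePartition",
                            "type": "Copy",
                            "inputs": [ { "referenceName": "HdfsFolderDataset", "type": "DatasetReference", "parameters": { "folderPath": "@item()" } } ],
                            "outputs": [ { "referenceName": "AdlsGen2FolderDataset", "type": "DatasetReference", "parameters": { "folderPath": "@item()" } } ],
                            "typeProperties": {
                                "source": { "type": "BinarySource", "storeSettings": { "type": "HdfsReadSettings", "recursive": true } },
                                "sink": { "type": "BinarySink", "storeSettings": { "type": "AzureBlobFSWriteSettings" } }
                            },
                            "policy": { "retry": 2, "retryIntervalInSeconds": 120 }
                        }
                    ]
                }
            }
        ]
    }
}
```
batchCount caps how many partitions copy at the same time. If one iteration fails, you can rerun just that partition (for example, by rerunning the pipeline with a one-item list) without touching the partitions that already succeeded.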
Delta data migration
In Data Factory DistCp mode, you can use the DistCp command-line parameter -update, which writes data only when the source file and the destination file differ in size, for delta data migration.
In Data Factory native integration runtime mode, the most performant way to identify new or changed files from HDFS is
by using a time-partitioned naming convention. When your data in HDFS has been time-partitioned with time
slice information in the file or folder name (for example, /yyyy/mm/dd/file.csv), your pipeline can easily identify
which files and folders to copy incrementally.
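As a sketch of that pattern, a parameterized Binary dataset can take its HDFS folder path from the pipeline, which then supplies a time-based expression. The dataset, linked service, and parameter names are assumptions for illustration:
```json
{
    "name": "HdfsTimePartitionedData",
    "properties": {
        "type": "Binary",
        "linkedServiceName": { "referenceName": "OnPremHdfsLinkedService", "type": "LinkedServiceReference" },
        "parameters": { "partitionPath": { "type": "String" } },
        "typeProperties": {
            "location": {
                "type": "HdfsLocation",
                "folderPath": { "value": "@dataset().partitionPath", "type": "Expression" }
            }
        }
    }
}
```
The copy activity can then pass an expression such as @concat('data/', formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd')) as the partitionPath value, so each run copies only that time slice. The tutorial Copy a file incrementally based on a time-partitioned file name, listed in the references below, walks through the complete pattern.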
Alternatively, if your data in HDFS isn't time-partitioned, Data Factory can identify new or changed files by using
their LastModifiedDate value. Data Factory scans all the files from HDFS and copies only new and updated
files that have a last modified timestamp that's greater than a set value.
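In the copy activity source, this filter typically appears as modifiedDatetimeStart and modifiedDatetimeEnd on the HDFS read settings. The literal timestamps below are placeholders; in practice you'd usually supply them from pipeline expressions, such as a stored watermark or a trigger window:
```json
"source": {
    "type": "BinarySource",
    "storeSettings": {
        "type": "HdfsReadSettings",
        "recursive": true,
        "modifiedDatetimeStart": "2021-06-01T00:00:00Z",
        "modifiedDatetimeEnd": "2021-06-02T00:00:00Z"
    }
}
```
The tutorial Copy new and changed files based on LastModifiedDate, listed in the references below, shows how to wire these values to a schedule.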
If you have a large number of files in HDFS, the initial file scanning might take a long time, regardless of how
many files match the filter condition. In this scenario, we recommend that you first partition the data by using
the same partition you used for the initial snapshot migration. Then, file scanning can occur in parallel.
Estimate price
Consider the following pipeline for migrating data from HDFS to Azure Blob storage:

Let's assume the following information:


Total data volume is 1 PB.
You migrate data by using the Data Factory native integration runtime mode.
1 PB is divided into 1,000 partitions and each copy moves one partition.
Each copy activity is configured with one self-hosted integration runtime that's associated with four machines and achieves 500-MBps throughput.
ForEach concurrency is set to 4 and aggregate throughput is 2 GBps.
In total, the migration takes about 146 hours (1 PB ≈ 1,048,576 GB; at an aggregate throughput of 2 GBps, that's roughly 524,288 seconds).
Here's the estimated price based on our assumptions:
NOTE
This is a hypothetical pricing example. Your actual pricing depends on the actual throughput in your environment. The
price for an Azure Windows VM (with self-hosted integration runtime installed) isn't included.

Additional references
HDFS connector
Azure Blob storage connector
Azure Data Lake Storage Gen2 connector
Copy activity performance tuning guide
Create and configure a self-hosted integration runtime
Self-hosted integration runtime high availability and scalability
Data movement security considerations
Store credentials in Azure Key Vault
Copy a file incrementally based on a time-partitioned file name
Copy new and changed files based on LastModifiedDate
Data Factory pricing page

Next steps
Copy files from multiple containers by using Azure Data Factory
Use Azure Data Factory to migrate data fro
